
Spark is not that hard
The beautiful thing about Spark is that it's not that hard. It lets you code in Python, Java, or Scala, so if you're already familiar with Python, you don't have to learn a new language. From a developer's standpoint, it's built around one main concept: the Resilient Distributed Dataset, or RDD. That's the one kind of object you'll encounter in Spark over and over again, and various operations on an RDD let you slice and dice and carve up that data and do what you want with it. So what would take many lines of code and separate functions in a MapReduce job can often be done in just one line, much more quickly and efficiently, using Spark.
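To give you a flavor of that, here's a minimal word count sketch in PySpark. The core logic really is one chained expression of RDD operations; note that the file name book.txt and the local SparkContext setup are just illustrative assumptions, not specifics from the course:

```python
from pyspark import SparkContext

# A SparkContext running locally; typically created once per script.
sc = SparkContext("local", "WordCount")

# Load a text file into an RDD, then chain RDD operations on it:
# split each line into words, map each word to a count of 1,
# and sum the counts per word. An equivalent MapReduce job would
# need separate mapper and reducer functions.
counts = (sc.textFile("book.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Nothing actually executes until an action like collect() is called.
for word, count in counts.collect()[:10]:
    print(word, count)

sc.stop()
```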
If you actually took my MapReduce course, we'll be tackling a lot of the exact same problems in this Spark course, and you might find it interesting how much simpler and easier many of those same problems are to solve in Spark compared to MapReduce.