When starting with Apache Spark, a “lightning-fast cluster computing” engine, it is important to understand how Spark fits into the Hadoop ecosystem. This article provides a brief overview of Spark’s distinctive features and its ties to Hadoop.

Hadoop has been around for about 12 years, and it has dominated the Big Data space by providing reliable distributed processing of large data sets. Hadoop was inspired by, and provides a framework for, MapReduce, a programming pattern invented by Google. To guarantee reliable processing, Hadoop comes with its own flavour of distributed file system, known as HDFS. Arguably one of the most valuable features of HDFS is automatic fault detection and quick recovery from errors. There are other parts to Hadoop’s architecture, but let’s leave them out for now.

Reliable and robust as it is, Hadoop is not designed for speed. That’s where Spark comes into play. Since disk operations result in a slowdown, Spark’s architecture revolves around in-memory processing in the form of Resilient Distributed Datasets (RDDs). When it comes to persistence, Spark cleverly skips a solution of its own and instead exposes a high-level API towards HDFS and other storage systems.
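
To illustrate the in-memory angle, an RDD can be explicitly cached so that repeated computations reuse it instead of re-reading from disk. A minimal sketch, assuming an existing SparkContext named sc (its creation is shown in a later sketch) and a placeholder input path:

    // Cache the RDD in memory so subsequent actions reuse it
    // instead of re-reading the file from disk.
    val logs = sc.textFile("hdfs:///path/to/logs.txt").cache()

    // Both of these reuse the cached data rather than hitting HDFS twice.
    val total  = logs.count()
    val errors = logs.filter(_.contains("ERROR")).count()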

That’s enough of an introduction. Let me quickly run through Spark’s core features, with a few small code sketches along the way, and finally show a complete example to make it all look a little less abstract.

RDDs in a Nutshell

  • are fault-tolerant, as in zero data loss
  • are operated on in parallel across a cluster of workers
  • wrap around the actual data (textual, numeric or custom objects), as shown in the sketch below
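
In practice, an RDD is created either from an in-memory collection or from an external data source. A minimal sketch, assuming a local standalone setup (the file path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Set up a SparkContext running locally on all available cores.
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection across the workers as an RDD.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Wrap a text file stored on HDFS (or any supported storage system).
    val lines = sc.textFile("hdfs:///path/to/input.txt")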

There are two types of operations which can be performed on RDDs: transformations and actions.

Transformations

  • create new RDDs according to transformation rules
  • are lazy, i.e. nothing happens until an action is called (see the sketch after this list)
  • represent the Map part of the MapReduce paradigm
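
To make the laziness concrete, here is a sketch building on the lines RDD from the previous snippet. Nothing is computed when these statements run:

    // Each transformation returns a new RDD describing a computation.
    val words    = lines.flatMap(_.split("\\s+"))   // split lines into words
    val nonEmpty = words.filter(_.nonEmpty)         // drop empty tokens
    val pairs    = nonEmpty.map(word => (word, 1))  // pair each word with a 1

    // No job has run yet: Spark has merely recorded a lineage of
    // transformations to execute once an action is called.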

Actions

  • trigger a transformation workflow
  • return aggregated results back to the driver program (beware of memory consumption; see the sketch below)
  • represent the Reduce part of the MapReduce paradigm
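
Calling an action on the pairs RDD from the previous sketch finally triggers the work. A short illustration using two common actions, count and collect:

    // count forces evaluation of the entire lineage and
    // returns a plain Long to the driver.
    val howMany: Long = pairs.count()

    // collect ships all elements back to the driver: fine for small
    // results, a recipe for OutOfMemoryError on large ones.
    val everything: Array[(String, Int)] = pairs.collect()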

Code time! The word count example is the hello world of the Big Data realm. It parses a file containing random text (lorem ipsum) and produces a lexically ordered set of word counts, such as this one:

(a,9)
(ac,13)
(accumsan,1)
(ad,1)
(adipiscing,1)
(aenean,6)
(aliquam,7)
..
(lorem,5)
(luctus,5)
(maecenas,1)
(magna,8)
..

As Spark leverages the MapReduce pattern, the corresponding code takes only a few lines of Scala:

https://gist.github.com/zezutom/edbdc5b1dd75e4b990a5
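
In case the gist is unavailable, the program boils down to something along these lines. This is a sketch rather than the exact gist contents, and the file names lorem-ipsum.txt and word-counts are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount extends App {
      val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
      val sc = new SparkContext(conf)

      sc.textFile("lorem-ipsum.txt")      // read the input file
        .flatMap(_.split("\\W+"))         // split lines into words
        .map(_.toLowerCase)               // normalise case
        .filter(_.nonEmpty)               // drop empty tokens
        .map(word => (word, 1))           // pair each word with a 1
        .reduceByKey(_ + _)               // sum up the counts per word
        .sortByKey()                      // sort lexically by word
        .saveAsTextFile("word-counts")    // persist the result, e.g. to HDFS

      sc.stop()
    }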

Spark’s transformations take up most of those lines: flatMap, map and filter. Once we are happy with how the text is manipulated, we apply reduceByKey, which sums up the occurrences of each and every word, and sortByKey, which sorts the final set lexically.

You might expect aggregating operations like these to be actions performed on the driver. Interestingly enough, that’s not the case: both reduceByKey and sortByKey are processed in parallel across the cluster and return yet another RDD, which makes them transformations rather than actions. To get to the actual results, a sorted set of word count pairs, we reach for Spark’s high-level API and persist the output into HDFS by calling saveAsTextFile on the RDD, the one genuine action in the pipeline.

There is much more to grasp when beginning with Apache Spark, and I hope this brief write-up gives you an idea of where to start. The source code, along with detailed instructions, can be found on GitHub as part of a project called Spark by Example.
