In my previous post I talked about of RDDs as an abstraction of parallel data processing. Today, I’d like to briefly discuss and set an example for accumulators and broadcast variables.

Accumulators

  • counters or sums that can be reliably used in parallel processing
  • native support for numeric types, extensions possible via API
  • workers can modify state, but cannot read content
  • only a driver program can read the accumulated value

Broadcast Variables

  • allow for an efficient sharing of potentially large data sets
  • intended for workers as reference data
  • cached, transported via broadcast protocol, (de)serialized

Suppose we are assigned with a task to write a text analyser (Github). For instance, we could upload a larger piece of a text (English only for simplicity), such as an e-book, and collect some basic facts: total number of characters and words, as well as a list of the most frequent words.

As the text is broken into smaller chunks processed in parallel, the counting part naturally lends itself to the use of accumulators.

https://gist.github.com/zezutom/f4d214e92e8867a8814b

An effort to “describe” the book by extracting the most frequent words is surely somewhat more exciting than just counting characters. One of the first challenges is with words which are typically overused, but do not carry any significant information, such as articles, prepositions, pronouns and even some verbs. Our application maintains a list of these words and ensures they are ruled out when parsing the book content.

This is when broadcast variables come into play. The list of common words could potentially run long and it would be inefficient to create a copy in a each and every worker. Using a broadcast variable helps performance via caching and reduced network traffic thanks to a specialized broadcast protocol.

https://gist.github.com/zezutom/24c87900224969edc9d3

Finally, we can have some fun and reach out for some genuine master piece, such as 20.000 Leagues under the Sea by Jules Verne. Courtesy of textfiles.com. Here is what the text analyser concluded about the remarkable book:

characters: 568889, words: 101838, the most frequent words:
(captain,564)
(nautilus,493)
(nemo,334)
(ned,283)
(sea,273)

The source code along with detailed instructions can be found on Github as part of a project called Spark by Example.

Categories: Scala

Tomas Zezula

Hello! I'm a technology enthusiast with a knack for solving problems and a passion for making complex concepts accessible. My journey spans across software development, project management, and technical writing. I specialise in transforming rough sketches of ideas to fully launched products, all the while breaking down complex processes into understandable language. I believe a well-designed software development process is key to driving business growth. My focus as a leader and technical writer aims to bridge the tech-business divide, ensuring that intricate concepts are available and understandable to all. As a consultant, I'm eager to bring my versatile skills and extensive experience to help businesses navigate their software integration needs. Whether you're seeking bespoke software solutions, well-coordinated product launches, or easily digestible tech content, I'm here to make it happen. Ready to turn your vision into reality? Let's connect and explore the possibilities together.