Exciting Tools for Big Data: S4, Sawzall and mrjob!

This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.

Yahoo’s S4: Distributed Stream Computing Platform

First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.

S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. As a matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.

Pieces of data, apparently called events, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:

  1. emit another event that will be consumed by another PE, or
  2. publish some result
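To make the two options concrete, here is a minimal sketch in Python. It is not the S4 API (S4 itself is a Java platform); the class names, the dictionary-shaped events and the `run` dispatcher are all hypothetical, made up purely to illustrate the idea of PEs that either emit new events or publish results.

```python
from collections import defaultdict, deque


class WordSplitterPE:
    """Hypothetical PE: consumes a 'line' event, emits one 'word' event per word."""

    def process(self, event, emit, publish):
        for word in event["value"].split():
            emit({"type": "word", "value": word.lower()})


class WordCounterPE:
    """Hypothetical PE: consumes 'word' events, publishes a running count."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, event, emit, publish):
        word = event["value"]
        self.counts[word] += 1
        publish((word, self.counts[word]))


def run(events, routes):
    """Toy single-process dispatcher: route each event to a PE by event type."""
    queue = deque(events)
    published = []
    while queue:
        event = queue.popleft()
        pe = routes[event["type"]]
        # A PE may emit follow-up events (back onto the queue) or publish results.
        pe.process(event, queue.append, published.append)
    return published


routes = {"line": WordSplitterPE(), "word": WordCounterPE()}
results = run([{"type": "line", "value": "to be or not to be"}], routes)
```

The first PE only emits, the second only publishes. A real deployment would spread the PEs across many machines; this toy version keeps everything in a single queue.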

Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, or at what rate. Analysis on streaming data should not rely on storing the data, as the amount of required disk space is unknown. Additionally, the data may arrive faster than a heavyweight analysis can keep up with. Since the data is not stored, special algorithms must be developed for aggregating and analyzing it in a single pass. Neal Richter (@nealrichter) has an excellent list of resources on research on the management and analysis of streaming data.
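As one small example of what such an algorithm looks like, here is a sketch of Welford's method for a running mean and variance. It is my own illustration, not something taken from S4: each value is seen once and then thrown away, and only three numbers are kept in memory. The `RunningStats` name and the sample values are made up.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until at least two values have arrived.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # stand-in for a stream
    stats.update(value)
```

The memory footprint stays constant no matter how long the stream runs, which is exactly the property you need when you cannot know the data volume ahead of time.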

More information can be found at the S4 Wiki and the S4 main site, which contains tutorials, a manual, a cookbook as well as API documentation. The Yahoo project page, which contains very little information, can be found here. The source code is on everybody’s favorite site, GitHub.

S4 is released under the Open Source Apache 2.0 license. It must also be said that S4 is not to be confused with S3! They are two totally different technologies!

Remember, S4 is not Hadoop. As a matter of fact, Bill McColl over at Gigaom has pondered a “NoHadoop” movement…that parallels (see what I did there?) our favorite NoSQL movement.

Google’s Sawzall: Programming Language for Big Data


Google made a contribution of its own. Sawzall is an interpreted, procedural DSL for working with huge amounts of data. Greg Linden (@greglinden) made an interesting comparison, suggesting that Yahoo’s Pig project is similar to Google’s Sawzall project. At Google, Sawzall is used on top of existing systems including Protocol Buffers, the Google File System and MapReduce. A Sawzall program reads one record of data at a time and does not preserve state between reads, so it is useful in the map phase of a map-reduce job. There are also routines for statistical aggregation that can be used in a reduce phase. Users compile Sawzall source using the szl compiler that can be found here.
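To get a feel for that split between stateless per-record code and aggregation, here is a rough Python analogy. This is not Sawzall syntax, and the `SumTable` class and `process_record` function are hypothetical names of my own; the sketch only mirrors the shape described above, where each record is handled on its own and anything that must outlive a record is handed to an aggregator.

```python
class SumTable:
    """Hypothetical stand-in for an aggregator that sums everything emitted to it."""

    def __init__(self):
        self.total = 0.0

    def emit(self, value):
        self.total += value


count = SumTable()
total = SumTable()
sum_of_squares = SumTable()


def process_record(line):
    """Handle one record in isolation; no state is kept between calls."""
    x = float(line)
    count.emit(1)
    total.emit(x)
    sum_of_squares.emit(x * x)


for line in ["1", "2", "3", "4"]:  # stand-in for records arriving one at a time
    process_record(line)

mean = total.total / count.total
```

Because `process_record` never looks at anything but its own input, the records could be processed on many machines in any order, with the aggregators doing the combining afterwards.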

Much more detailed information can be found on the Google Code overview site for the szl project. For programming language buffs, the language specification can be found here.

The research publication discussing this project in more detail is here.

Yelp’s mrjob: Distributed Computing for Everybody

Ok, I’ve ignored Hadoop long enough…

Every time you write a review complaining about the terrible gas the burrito at El Torasco’s gave you, or the amazing buzz you got from their margaritas, Yelp processes it and extracts some type of information from it. Yelp accumulates about 100GB of data per day! Naturally, Yelp analyzes this data using map-reduce, Amazon Elastic MapReduce to be exact.

You see, most companies are building up their Hadoop clusters but Yelp decided to tear theirs down. In May 2010, Yelp engineering moved its data processing to Amazon. mrjob is Yelp’s Python framework for writing map-reduce jobs and interacting with Amazon EMR!

Below is an example from their engineering blog. It is so simple it is beautiful!

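from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes, e.g. "don't"
WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # The input key is ignored; emit (word, 1) for every word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # counts iterates over all the 1s emitted for this word
        yield (word, sum(counts))

if __name__ == '__main__':
    # run() is called on the class, as in the mrjob documentation
    MRWordFreqCount.run()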
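If you want to see what happens between the mapper and the reducer without installing anything, here is a small sketch that imitates the map, shuffle and reduce steps in plain Python. It does not use mrjob at all; `run_local` and the sample sentences are my own invention, and a real run on Hadoop or EMR distributes these steps across machines rather than looping in one process.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Same logic as the job's mapper: one (word, 1) pair per word.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    # Same logic as the job's reducer: add up the 1s for a word.
    yield word, sum(counts)


def run_local(lines):
    """Imitate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for word, total in reducer(key, groups[key]):
            results[word] = total
    return results


word_counts = run_local([
    "The burrito was terrible",
    "the margaritas were amazing",
])
```

The grouping step in the middle is the part the framework normally does for you, which is why the job class itself only needs a mapper and a reducer.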

The mrjob code is available on GitHub, as is the Python documentation.

Oh, and El Torasco’s is a fictional name I use in this post.
