It certainly has been an exciting year to work for IBM. Not only is IBM celebrating its centennial this year, it has also just made headlines with Watson’s victory over us puny humans in Jeopardy. Walking the hallways of the IBM Toronto Lab around the final night was like walking in a strange alternate universe where Watson had already taken over and brainwashed the entire population to worship his shiny avatar face. It was impossible not to see Watson posters on the walls or to hear people talking about Watson, Watson, or even… Watson (gasp!). My co-workers even took on the new assignment of forwarding me articles on how Watson was going to change the world, enslave humanity, save humanity, and blue-screen us into oblivion (not necessarily in that order).
Call me an oddball, but the more faith people placed in Watson, the more skeptical I became. With expectations running extraordinarily high, I couldn’t help but wonder how badly Watson would fail. After all, how could a mere machine, no matter how bloodthirsty intelligent, win? Boy, was I about to be proven wrong.
The final night blew me away. It was simply ridiculous how fast Watson buzzed in, sprouted arms and legs and creepy glowy red eyes and kung-fu fought Ken Jennings and Brad Rutter, answered question after question correctly, and made history as the first computer to win Jeopardy.
So When’s Hadoop Going to Come In?
The day after Watson kicked ass, one of the IBMers who worked on DeepQA (the technology that powered Watson) came in for a Q&A session with us (one of the perks of being an IBMer; as I said, it’s a good time to work for IBM). He walked us through the technologies they used, and one of them was, of course, Hadoop.
A quick recap: Apache Hadoop is a collection of distributed computing technologies that, among other things, provides an excellent MapReduce framework on top of its distributed file system, HDFS. Quite a few departments and projects within IBM (including ours) have been looking at and using Hadoop to process huge datasets in our products. It is a solid toolkit used by almost everyone in the industry, and it has a lot of neat features. However, it is written in Java, and that tends to turn off a lot of programmers. That’s a real pity, since Hadoop lets you do many things that would otherwise be very difficult. So, to clarify: yes, it’s written in Java, but for MapReduce jobs you can totally use Python, Ruby, or any scripting language of your choice!
How you ask? Say hello to Hadoop Streaming.
How does it work?
Hadoop Streaming allows you to use any executable or script as the mapper or reducer of a MapReduce job. It can even package your script files as part of the MR job, so the scripts need not already exist on all the Hadoop nodes. Neat! In short, any executable (such as cat, wc, etc.) or any script (bash, Python, Ruby) that can read STDIN, parse and process text data, and write to STDOUT can serve as the mapper or reducer of an MR job.
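To make that concrete, here is a minimal word-count sketch in pure Python. The `mapper` and `reducer` helper names are mine, not part of any Hadoop API; in a real job each would live in its own script that pipes `sys.stdin` through it and prints the results.

```python
# Sketch of a Hadoop Streaming word count: the map phase emits
# "word<TAB>1" lines, and the reduce phase sums counts per word.
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word on STDIN (the map phase)."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    """Sum counts per word. Hadoop sorts mapper output by key before
    the reduce phase, so identical words arrive on consecutive lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

You would then launch the job with something along the lines of `hadoop jar .../hadoop-streaming-*.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the exact jar path depends on your Hadoop version; the `-file` options are what ship the scripts out to the nodes).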
Michael Noll has a truly superb tutorial on running MR jobs on Hadoop with pure Python. To my knowledge it is the most hassle-free tutorial for Python (i.e. no Jython, no Thrift, and no Hadoop patching like Dumbo or Pydoop require).
For the most hassle-free way of doing Ruby MapReduce on Hadoop, you can follow Michael Noll’s tutorial linked above, replacing the Python scripts with Ruby ones. It should work the same.
If you are also using HBase on top of Hadoop and would like to write a MapReduce job in a scripting language, you can try out wanpark’s hadoop-hbase-streaming package. I’ve used it with pretty good results (hint: you need to add the HBase jar to your HADOOP_PATH as well), though I’ve only tried it with the pure Ruby and Python approaches, not with the more fleshed-out libraries I mentioned above.