Are you new to "Hadoop"? Settle in...

A Brief Overview

For the last year or so I've been hearing about "Hadoop" and all the amazing things it can do for you. With the passing of the seasons, the drumbeat has steadily been getting stronger and louder. This past Friday, October 2nd, 2009, was Hadoopworld NYC, which I attended with the wonderment a fall semester provides a bright-eyed college freshman. The first thing that caught my attention upon entering the conference space was the formidable array of companies throwing their heft behind the "Hadoop" platform. I put Hadoop in quotes because Hadoop is less a single product than an ecosystem of products all working from a central thesis. That thesis is that the way data storage and analysis have been done in the past should officially be regarded as legacy. Traditional RDBMSs will yield to the distributed map/reduce paradigm of Hadoop at a certain threshold, and that threshold will continue to shift as Hadoop becomes more user-friendly, abstract and accessible. Hadoop is the New Way™.

When I say "things" I mean Data Analysis and all the lovelies that flow from it. I capitalize Data Analysis because, as @igrigorik so succinctly puts it, Hadoop is powerful. Formidable enough to power some of the biggest data sets anywhere on the web. From IBM to Facebook, Yahoo! to Amazon, Chase to China Mobile - these are just some of the marquee names heavily invested in the Hadoop ecosystem. What do they know that the rest of us should? Let's rewind the clock. Hadoop has its roots in Nutch, a web search engine co-founded by Doug Cutting. At the time, Cutting et al. were working on a problem that required massive resources, in terms of both storage and computational power, to maintain a searchable index of the web, the search algorithm itself notwithstanding. It just so happens that, as with most things big and interwebby, Google had already cut its teeth in this research sandbox. Two seminal academic papers released by Google researchers, The Google File System in 2003 and MapReduce: Simplified Data Processing on Large Clusters in 2004, led the way in answering the two very important questions Nutch had been trying to answer: how to store massive amounts of data and how to process it. To their credit, Cutting et al. adopted the lessons learned at Google rather than reinvent the wheel. I highly recommend that anyone interested in this space familiarize themselves with the concepts in those two papers as a start.

Briefly, "The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing"  and goes on to enumerate the list of projects comprising the Hadoop ecosystem as noted on the main Hadoop project page.

  • Hadoop Common: The common utilities that support the other Hadoop subprojects.
  • Avro: A data serialization system that provides dynamic integration with scripting languages.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • HDFS: A distributed file system that provides high throughput access to application data.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

If all that leaves you at a loss, let's look at what Hadoop does for users in the real world. At the conference were a number of presenters who fall into the end-user category: corporations employing Hadoop today to solve real-world problems across a number of different industries.

- eHarmony presented (not exact slides but close enough) during Amazon's time in the morning session. They use Hadoop via Amazon Web Services to crunch hundreds of millions of replies to their extensive questionnaire from tens of millions of users. You do the math - lots of data points, lots of permutations. The only way they could go forward with increasingly complex and interesting matching models was to make the move to Hadoop.

- IBM previewed their M2 technology, which is a front end for Hadoop. They demoed the ingestion of raw patent data from the USPTO, rejiggering that data to better suit their needs and extracting meaningful data quickly, enabling their patent lawyers to work faster. Uh ya... that's what we need... more productive patent lawyers.

- StumbleUpon and Streamy spoke about serving their users without a traditional relational database, via HBase, which is part of the Hadoop ecosystem.

- Visa uses Hadoop to speed up their risk score modeling, decreasing compute time from 1 month to 13 minutes.

 

Observations from Hadoopworld

These are my thoughts after a day of immersion in the world of Hadoop. Taking a look at the Hadoopworld agenda will give you an understanding of the caliber of presenters and the breadth of topics covered. While I was both impressed and excited, there was also reason for concern. To be certain, there is a tremendous amount of energy, talent and deep-pocketed corporate backing behind the success of Hadoop. However, users should be aware of the pros and cons before taking the plunge. As indicated in many spheres, Hadoop is powerful in that it allows you to provision resources which, when employed correctly, will accommodate the growth of your particular data problem with near-linear scaling. Moreover, Hadoop provides an open-source solution to problems of both scalability and computation. That said, let's take off the rose-colored glasses for a moment, enumerate some of the problems and challenges Hadoop faces, and discuss the solutions the community is providing.

  • Learning Curve
  • Interoperability
  • Security
  • Governance

Learning Curve

Do not kid yourself into thinking you can press a button and have your data problems melt away with Hadoop. Hadoop has quite a steep learning curve. Not only do you need to become familiar with the dozen-odd projects within the Hadoop ecosystem but, perhaps more fundamentally, you must reorient your thinking away from the traditional RDBMS mindset. By and large, web workers and data dredgers of every stripe have been weaned on a steady diet of relational databases, whether MySQL, PostgreSQL or any number of other systems. Databases and SQL are how generations of knowledge workers have plied their trade. Of course, this is not an apples-to-apples comparison. Where Hadoop provides tremendous scalability and computational agility, RDBMSs provide transactional accuracy and programmatic familiarity. Also, Hadoop in its most basic incarnation is not a drop-in replacement for a database system; for that, the HBase project (based on another Google paper, BigTable: A Distributed Storage System for Structured Data, 2006) gets you a lot closer.
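
To make the mindset shift concrete, here is a minimal sketch of the canonical word-count example against Hadoop's Java org.apache.hadoop.mapreduce API. What a SQL user would write as SELECT word, COUNT(*) ... GROUP BY word becomes a map function that emits (word, 1) pairs and a reduce function that sums them. The class names and the whitespace tokenization are my own, and the driver that wires these classes into a job and points it at input and output paths is omitted.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Map: for each line of input, emit (word, 1) for every token.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: the framework groups all counts for a word; we just sum them.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {
            sum += count.get();
          }
          context.write(word, new IntWritable(sum));
        }
      }
    }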

Highlighted at Hadoopworld were a number of players moving the ball forward to flatten that learning curve. Among them is Cloudera, the organizer of the event, which gives us Cloudera Desktop. Cloudera Desktop brings a number of creature comforts to the Hadoop operator and user, two of which are a file browser for the underlying HDFS that acts very much like Windows Explorer and a cluster health monitor that gives you a bird's-eye view of your systems. From Karmasphere comes Karmasphere Studio, a plug-in for NetBeans that simplifies development by hiding many command-line tools and instructions behind a much simpler GUI. The growing Hadoop user base is attracting fresh efforts to abstract away the difficulty of working in a new technical environment. Closer to the core are projects that add a layer of abstraction on top of raw map/reduce programming, specifically Hive and Pig. Interestingly, their roots can be traced to two early Hadoop adopters: Pig at Yahoo! and Hive at Facebook.

 

Interoperability 

I am not talking about interoperability in the sense of how Windows and Mac used to not get along, but rather Windows vs. Windows or Mac vs. Mac. It turns out that different versions of Hadoop are not compatible with each other. There are a number of different data formats for keeping files in HDFS (the underpinning of the Hadoop ecosystem), which basically means that all of the nodes within a Hadoop cluster need to run the same version. I'm sure I'll be corrected, but as it stands data migration from one Hadoop version to the next appears to be a major concern. The best advice I have read for pulling off a migration is simply to stand up a separate cluster and write a map job that pulls data out of the old system and loads it into the new one. Do not play games with mismatched versions of cluster components. Strange and potentially scary things may happen to you, your pets, your children and/or your data.
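
To make that copy-out/copy-in advice a bit more concrete, here is a rough sketch of the idea in Java, with hypothetical host names and paths. It reads from the old cluster over HFTP, the read-only HTTP interface that does not depend on matching RPC versions, and writes into the new cluster's HDFS. In practice you would reach for distcp, which performs this same copy in parallel as a map-only job.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ClusterCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // HFTP is HTTP-based and read-only, so an older cluster can be read
        // by a newer client even when the HDFS RPC protocols don't match.
        FileSystem oldFs = FileSystem.get(URI.create("hftp://old-namenode:50070"), conf);
        FileSystem newFs = FileSystem.get(URI.create("hdfs://new-namenode:8020"), conf);

        // Copy one directory tree from the old cluster into the new one.
        // distcp parallelizes this same step across a map-only job.
        FileUtil.copy(oldFs, new Path("/data/events"),
                      newFs, new Path("/data/events"),
                      false /* don't delete the source */, conf);
      }
    }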

Doug Cutting, now at Cloudera, is taking a major lead in fixing this situation by spearheading development of the Avro binary data format. Avro is being promoted as the new de facto data format for all things deep down in the Hadoop internals and the heir apparent to Thrift. Prostrations to the new binary prince aside, Avro is currently represented in only a small fraction of production Hadoop logs.
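
For a feel of what Avro brings to the table, below is a small sketch using Avro's generic Java API (the record schema and field names are made up, and this is written against a later Avro release than was current at the time). The key point is that the schema is written into the data file alongside the records, which is what makes the format self-describing and friendlier to readers running a different version, or even a different language, than the writer.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSketch {
      public static void main(String[] args) throws Exception {
        // A record schema defined in JSON; Avro embeds it in the data file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"hits\",\"type\":\"long\"}]}");

        GenericRecord view = new GenericData.Record(schema);
        view.put("url", "http://example.com/");
        view.put("hits", 42L);

        // Because the schema travels with the data, a reader on another
        // Hadoop version (or in another language) can resolve it later.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("pageviews.avro"));
        writer.append(view);
        writer.close();
      }
    }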

Beyond the data format issue is the programmatic API issue. As it stands, programs written against one version of Hadoop may need to be recompiled, and perhaps rewritten, when moving to a new version of Hadoop. "Recompiled" and "rewritten" are two words with dollar signs attached to them. Watch any talk by Owen O'Malley (at Yahoo!, Hadoopworld slides - pdf) on Hadoop and you will see how this is evolving. The API is in flux, but efforts are underway to annotate classes and methods with stability notifications, letting developers know what is stable and what is still a work in progress.
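
As an illustration of what those stability notifications look like, here is a sketch using the audience and stability annotations that grew out of this effort (the annotated class is hypothetical; the annotations themselves live under org.apache.hadoop.classification in later releases):

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    // Hypothetical user-facing class. The annotations tell downstream
    // developers who may depend on it and how likely it is to change.
    @InterfaceAudience.Public      // safe for code outside Hadoop to use
    @InterfaceStability.Evolving   // may still change between minor releases
    public class ClusterStatusReport {
      // ...
    }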

Good luck and Godspeed.

 

Security

Hadoop grew up in an environment of a handful of trusted users running map/reduce jobs against a large set of data. Over time the Hadoop cluster has become an asset to the organization running it, and in short order there are more and more users competing for time on the cluster. One benefit of using a Hadoop cluster is that you can broadly share data sets in your organization. One problem with using a Hadoop cluster is that you can broadly share data sets in your organization. Once the number of users starts to balloon you run into security considerations, and that is where Hadoop as a platform is now. As it stands there are many ways users can interfere with each other's jobs and data, everything from killing running jobs to deleting data. You know, Very Bad Things™. As with migrating data, the current way to deal with this problem is to isolate users and data within separate clusters.

Moving forward, progress is being made on authenticating users via Kerberos at various levels within the cluster. With authenticated users you will be able to secure job creation and scheduling and place a higher level of trust in your HDFS access logs. Also on the drawing board is how to effectively hide data from people you don't want to share it with, which will be critical for getting Hadoop into sensitive industries such as finance and medicine. Sarbanes-Oxley and HIPAA are two regulations you do not want to run afoul of where access control is concerned.
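
To give a flavor of where this is headed, here is a hypothetical sketch of what a Kerberos-authenticated client could look like once the security work lands: the process logs in with a keytab rather than being trusted to report its own user name, and with real identities in place, HDFS permissions on sensitive directories start to mean something. The principal, keytab path and directory below are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");

        // Authenticate this process against the KDC with a keytab instead
        // of relying on whatever user name the client happens to report.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
            "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        // With real identities, HDFS permissions start to matter: lock a
        // sensitive data set down to its owner and group (mode 0750).
        FileSystem fs = FileSystem.get(conf);
        fs.setPermission(new Path("/warehouse/medical-claims"),
                         new FsPermission((short) 0750));
      }
    }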

 

Governance

Throughout this post I have mentioned major contributors to and components of the Hadoop ecosystem and their provenance. From my vantage point there are at least three major players here: Yahoo!, Facebook and Cloudera. Yahoo! takes the role of grandfather, having employed Doug Cutting during the project's metamorphosis into what we now know as Hadoop and being the single largest contributor to and user of Hadoop. Recently, Doug moved on to Cloudera to join his Nutch co-founder, Mike Cafarella. In my world Facebook is the father in this picture, as it is likely the second-largest user and a major contributor, employing a number of Hadoop luminaries like Ashish Thusoo, father of Hive, and Dhruba Borthakur, HDFS expert, all-around nice guy and formerly of Yahoo!. Now to Cloudera, the son. They do not use Hadoop extensively to solve their own internal business problems; rather, they supply Hadoop to the rest of us, and in so doing perhaps become the biggest user of them all.

When asked at the conference about the direction of Hadoop, one of the early members of the Yahoo! Hadoop team told me that up until recently most of the main contributors were a close-knit bunch who all knew each other from their days at Yahoo! and can more or less still get on the phone with one another, even though they may be at different organizations now. As time goes on this will no longer be the case. Between the competing interests all tugging at Hadoop to solve their own business needs, from the three main players to the rest of the field (Amazon, IBM, etc.), it's not hard to imagine a fragmented and forked Hadoop. The one solace I take is that the direction of the open-source project is under the stewardship of the Apache Software Foundation, which has guided the maturation of a number of open-source projects, the Apache web server and the Lucene search library to name but two. The question is: can Hadoop grow its contributor base while maintaining focus in its direction? Who knows. So far so good, in my opinion. One example on the positive side is that the job scheduler was refactored into a pluggable architecture, allowing both the Facebook solution (the Fair Scheduler) and the Yahoo! solution (the Capacity Scheduler) to coexist.

 

Conclusion

If you are still getting started in your career, or have exhausted what a traditional RDBMS can do for you, I would highly recommend getting to know Hadoop. As I noted during the conference, this space is very hot now and for the foreseeable future. All the players I've mentioned are very interested (practically falling over themselves) in hiring people with expertise and working knowledge of Hadoop. With the help of Amazon's Elastic MapReduce, mere mortals may hone their skills by harnessing the power of Hadoop at utility prices against publicly available data sets. With tools like Karmasphere you can even deploy your code against a local installation provided for free by Cloudera.

At the end of the day I have to say the thesis holds true. Hadoop is the New Way. In my humble opinion, Hadoop and the growing ecosystem of products around it will be at the center of the data analysis and heavy data storage and retrieval universe for some time to come (at least until Google unveils successor papers to GFS/MapReduce/BigTable). The benefit of sheer scalability and near-limitless computational horsepower on the cheap is entirely too tantalizing a prospect for most to ignore. The open-source nature of Hadoop will, with the right guidance, ensure that the best technical solutions find their way into the core of the platform; sub-projects will merge or wither (I'm looking at you, Pig and Hive) and all will be well in the number-crunching world.

 

Updates:

- 2009-10-06: slide link for Owen O'Malley's Hadoopworld talk