The debate has raged on for some time now, with some claiming that cloud computing and wholly owned infrastructure don't mix, and others pointing out that applying "on demand," "at scale," and "multitenant" to enterprise IT data centers offers unique advantages to those who have already made that investment. It has been difficult, however, to do an objective comparison of the two approaches--until now.
The announcement on Thursday of Amazon Elastic MapReduce, combined with the availability of Cloudera's Distribution for Hadoop, means that we finally have a reasonable means of watching which direction enterprise IT prefers. Let me explain.
Amazon's service is a simplified, prepackaged Hadoop implementation that can be leveraged by anyone with an Amazon account. The Amazon Web Services blog describes it as follows:
Today we are rolling out Amazon Elastic MapReduce. Using Elastic MapReduce, you can create, run, monitor, and control Hadoop jobs with point-and-click ease.
You don't have to go out and buy scads of hardware. You don't have to rack it, network it, or administer it. You don't have to worry about running out of resources or sharing them with other members of your organization. You don't have to monitor it, tune it, or spend time upgrading the system or application software on it.
You can run world-scale jobs anytime you would like, while remaining focused on your results. Note that I said jobs (plural), not job. Subject to the number of EC2 (Elastic Compute Cloud) instances you are allowed to run, you can start up any number of MapReduce jobs in parallel. You can always request an additional allocation of EC2 instances here.
Processing in Elastic MapReduce is centered around the concept of a Job Flow. Each Job Flow can contain one or more steps. Each step inhales a bunch of data from Amazon S3, distributes it to a specified number of EC2 instances running Hadoop (spinning up the instances if necessary), does all of the work, and then writes the results back to S3.
Each step must reference application-specific "mapper" and/or "reducer" code (Java JARs or scripting code for use via the Streaming model). We've also included the Aggregate Package with built-in support for a number of common operations such as Sum, Min, Max, Histogram, and Count. You can get a lot done before you even start to write code!
Cloudera, on the other hand, provides a Hadoop build that you can deploy wherever you wish:
Cloudera's Distribution for Hadoop is based on the most recent stable version of Apache Hadoop. It includes some useful patches back-ported from future releases, as well as improvements we have developed for our support customers.
Cloudera's Distribution includes everything you need to configure and deploy Hadoop using standard Linux system administration tools.
Here's what I'm thinking: enterprise IT is looking at an entirely new class of applications that take advantage of MapReduce to process very large sets of both structured and unstructured data for things like predictive analysis, sorting/sequencing, and data mining. Both commercial Hadoop offerings meet the demand for a platform to simplify the development and operation of these applications. The primary difference is the where, not so much the what.
That is exactly what will make the competition between the two offerings so compelling to watch. Let me break it down for you:
Will the requirement to own and operate hardware work against Cloudera? What makes the Amazon offering so groundbreaking (and it will prove to be historic, in my opinion) is that it is now possible for anyone with a need to analyze large data sets to do so simply for the cost of data storage plus processing time. (Note that Elastic MapReduce adds a nominal charge on top of the EC2 instances that host the jobs.)
Where "grid computing" was once the playground of large enterprises and academic institutions that could afford the hardware to justify the cost of building them out, Amazon makes it possible for even individuals to run such jobs for a few tens or hundreds of dollars.
Cloudera, on the other hand, requires that the hardware be available to install it on. That either means existing server capacity, new hardware (which greatly adds to the cost, and can only be justified for continuous Hadoop use), or leased capacity. The latter starts to look a lot like Amazon's service.
Will Amazon's requirement to use S3 work against it? I see three reasons why it might:
- The commonly cited concern about data security outside of corporate firewalls. (Even if the perception is wrong, the perception exists.)
- The cost of data transfer to and from the S3 service--currently as high as 17 cents per gigabyte.
- The cost of storage of both the raw data and the aggregate results--currently as high as 15 cents per gigabyte a month.
It should rightly be noted that if you already rely on S3 to store the data sets to be processed, this is a great deal. However, if you have to upload terabytes or even petabytes of data to be combed through by MapReduce, the transfer alone could get quite pricey, and existing infrastructure might serve the purpose well. If you are going to leave the data up there permanently--and update it regularly--the cost of Amazon's service should be weighed against the cost of owning and operating that storage yourself in your existing facilities.
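As a rough illustration of that trade-off, here is a back-of-the-envelope estimate using the per-gigabyte rates cited above. The data-set size and retention period are hypothetical inputs of my own, not Amazon figures.

```python
# Back-of-the-envelope S3 cost estimate using the rates cited in the
# article: 17 cents/GB to transfer data in (one time) and 15 cents
# per GB per month to store it. Example inputs are hypothetical.

TRANSFER_IN_PER_GB = 0.17    # one-time upload cost, USD
STORAGE_PER_GB_MONTH = 0.15  # recurring monthly cost, USD

def s3_cost(gigabytes, months):
    """Cost to upload a data set once and store it for `months` months."""
    upload = gigabytes * TRANSFER_IN_PER_GB
    storage = gigabytes * STORAGE_PER_GB_MONTH * months
    return upload + storage

# A 1 TB (1,024 GB) data set kept for 12 months:
print(f"${s3_cost(1024, 12):,.2f}")  # roughly $2,000
```

Even before any EC2 processing time, a terabyte-scale data set kept current on S3 runs into four figures a year, which is exactly the point at which owning the storage starts to look attractive.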
Will the so-called "barrier of exit" stand up? I'm not even arguing that the choice will be based solely on the comparative costs to the business. In fact, what I am interested in is the extent to which business units and departments will simply bypass IT altogether to build and run their own jobs in Amazon Elastic MapReduce.
If IT maintains a valuable service using existing facilities and computing investments, then Cloudera will likely do fine. If not, then Amazon stands to be the overwhelmingly dominant commercial Hadoop implementation.
I should also note that running a Hadoop instance is not the same thing as cloud computing in and of itself. An internal Cloudera implementation is not necessarily an internal cloud, though if operated "on demand," "at scale," and with multitenancy, it certainly qualifies as a cloud.
I will be watching this space closely for the next year or two. I have a feeling that Amazon will do fine regardless, as there are many workloads that would benefit from a completely public cloud. The real test is probably how much opportunity Cloudera finds within enterprise data centers.
Cloudera also faces much more competition from the free downloads of Hadoop than Amazon does, in my opinion, as it operates in a more traditional open-source competitive landscape.
Is your company looking at MapReduce for a new generation of data-mining applications? If so, what will you choose: the public, external cloud implementation of Hadoop from Amazon Web Services, or the wholly owned, internal implementation of the same from Cloudera?