The open-source Hadoop project for distributed data processing remains one of the most interesting efforts for managing the vast amounts of data being collected and analyzed in a wide variety of scenarios.
Today, Cloudera, a provider of Hadoop data management software and services, is set to ship a major release of its open source software distribution--Cloudera Distribution for Hadoop (CDH), including Apache Hadoop v3.
Cloudera's CDH3 distribution is an integrated set of components and functions that interoperate through standard APIs, with the required component versions and dependencies managed as part of the distribution.
CDH3 is an integrated stack that includes not just the software components but also the associated libraries and testing needed for a smooth experience. Well-integrated software stacks have remained elusive in the open source world, where there is arguably too much choice--so much so that developers can end up tweaking every component to address an issue with just one.
As such, the stack approach can be hugely beneficial for something like Hadoop, which has inherent complexity and many components (this is big data, after all), both for users and for the project itself.
CDH3 includes the following components:
- HBase: Hadoop database for random read/write access
- Hive: SQL-like queries and tables on large datasets
- Pig: dataflow language and compiler
- Sqoop: integrates databases and data warehouses with Hadoop
- Flume: highly reliable, configurable streaming data collection
- Extended security and authentication functions
While Hadoop is readily available on its own, CDH makes it easier to consume and helps people get up and running quickly, especially given the number of sub-projects that have emerged, according to Cloudera CEO Mike Olson.
Olson said the company has thrived because the core Hadoop software has remained open source and a large community has developed, not only to support users but also to extend the platform in ways that no single developer or company could. Additionally, because Cloudera has a large team of Hadoop committers, it has visibility into which features are drawing interest and which problems exist in the software, and so can best address the needs of its customers.