Open-source 'R' gets Hadoop integration

The R programming language is getting a new dose of big data processing by integrating the open-source Hadoop framework.

Dave Rosenberg Co-founder, MuleSource
Dave Rosenberg has more than 15 years of technology and marketing experience that spans from Bell Labs to startup IPOs to open-source and cloud software companies. He is CEO and founder of Nodeable, co-founder of MuleSoft, and managing director for Hardy Way. He is an adviser to DataStax, IT Database, and Puppet Labs.

Lately, you can't talk about business without talking about "big data," which, incidentally, is the focus of the latest package from Revolution Analytics. The company, which commercialized the open-source R statistics language, is working to expand the use of R beyond its academic roots and into business.

On Tuesday, Revolution is expected to add big data analysis to its Revolution R Enterprise software. The new capability comes as an add-on package called RevoScaleR, which provides a framework for fast and efficient multicore processing of large data sets.

According to the company, the new package will allow users to process, visualize, and model terabyte-class data sets in a matter of seconds. For complex statistical analysis, it works with many popular data processing and storage systems, including the Apache Hadoop framework and a range of NoSQL databases.

The RevoScaleR package introduces a number of new features, including:

  • a new binary 'Big Data' file format--XDF--with an interface to the R language that provides high-speed access to arbitrary rows, blocks, and columns of data
  • a collection of the most common statistical algorithms optimized for big data, including high-performance implementations of summary statistics, linear regression, binomial logistic regression, and crosstabs
  • data reading and transformation tools to prepare large data sets for analysis
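
The article does not describe how RevoScaleR implements these features, so the following is only a rough illustration of the general idea behind block-wise ("out-of-core") statistics, written in Python rather than R. It assumes a simple linear regression with one predictor; the function names `read_in_blocks` and `fit_line_out_of_core` are hypothetical and are not part of RevoScaleR or any XDF interface. The point is that a regression can be fit by accumulating a few running sums one block at a time, so the full data set never has to sit in memory.

```python
from typing import Iterable, Iterator, List, Tuple

# One (x, y) observation; a stand-in for two columns of a much larger file.
Row = Tuple[float, float]


def read_in_blocks(rows: Iterable[Row], block_size: int) -> Iterator[List[Row]]:
    """Yield successive blocks of rows so only one block is held in memory."""
    block: List[Row] = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # final, possibly shorter block
        yield block


def fit_line_out_of_core(rows: Iterable[Row], block_size: int = 1000) -> Tuple[float, float]:
    """Fit y = intercept + slope * x using running sums accumulated per block."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for block in read_in_blocks(rows, block_size):
        n += len(block)
        for x, y in block:
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y
    denom = n * sum_xx - sum_x * sum_x
    if n < 2 or denom == 0:
        raise ValueError("need at least two rows with distinct x values")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


if __name__ == "__main__":
    # Synthetic data generated lazily, so the full set is never materialized.
    data = ((float(i), 3.0 + 2.0 * i) for i in range(10_000))
    intercept, slope = fit_line_out_of_core(data, block_size=500)
    print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Because each block contributes only to a handful of sums, the blocks could in principle be processed on separate cores and their sums combined afterward, which is one plausible reason this style of algorithm suits multicore hardware. These textbook sums can lose precision on badly scaled data, so a production implementation would likely use a more numerically stable formulation.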

In a conversation with CNET, Revolution cited a use case that involved the processing of more than 13 GB of FAA data that detailed every commercial flight departure and arrival between 1987 and 2008. In the past, analyzing such a large data set would be an overnight affair that could take upward of 12 hours; with RevoScaleR, the same data set was analyzed in less than one second.