Open-source 'R' gets Hadoop integration
The R programming language is gaining new big data processing capabilities through integration with the open-source Hadoop framework.
Lately, you can't talk about business without talking about "big data," which, incidentally, is the focus of the latest package from Revolution Analytics. Revolution Analytics, which commercialized the open-source R statistics language, emphasizes expanding the use of R beyond its academic roots to business.
On Tuesday, Revolution is expected to add big data analysis to its Revolution R Enterprise software with an add-on package called RevoScaleR, which provides a framework for fast, efficient multicore processing of large data sets.
According to the company, the new package will allow users to process, visualize, and model terabyte-class data sets in a matter of seconds. It also works with a range of data processing and storage systems, including the Apache Hadoop framework and various NoSQL databases, for complex statistical analysis.
The RevoScaleR package introduces a number of new features, including:
- a new binary 'Big Data' file format--XDF--with an interface to the R language that provides high-speed access to arbitrary rows, blocks, and columns of data
- a collection of the most common statistical algorithms optimized for big data, including high-performance implementations of summary statistics, linear regression, binomial logistic regression, and crosstabs
- data reading and transformation tools to prepare large data sets for analysis
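The block-oriented design described in the list above can be illustrated with a short sketch. This is not RevoScaleR code or its API. It is a generic Python illustration, under the assumption that "optimized for big data" means algorithms that read one block of rows at a time and keep only small running totals in memory. The names `RunningStats`, `iter_blocks`, and `fit_line_in_blocks` are hypothetical, and the sample data at the bottom is synthetic.

```python
import math
from dataclasses import dataclass


def iter_blocks(rows, block_size):
    """Yield successive lists of at most block_size rows from any iterable."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


@dataclass
class RunningStats:
    """Summary statistics for one numeric column, updated block by block."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0          # running sum of squared deviations from the mean
    lo: float = math.inf
    hi: float = -math.inf

    def update_block(self, block):
        """Merge one block of values into the running totals."""
        k = len(block)
        if k == 0:
            return
        b_mean = sum(block) / k
        b_m2 = sum((x - b_mean) ** 2 for x in block)
        delta = b_mean - self.mean
        total = self.n + k
        # Pairwise merge of (count, mean, m2) for two groups of values.
        self.m2 += b_m2 + delta * delta * self.n * k / total
        self.mean += delta * k / total
        self.n = total
        self.lo = min(self.lo, min(block))
        self.hi = max(self.hi, max(block))

    @property
    def variance(self):
        """Sample variance; NaN until at least two values have been seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def fit_line_in_blocks(rows, block_size=1000):
    """Least-squares fit of y = slope * x + intercept over (x, y) rows.

    Only five running sums are kept, so memory use does not grow
    with the number of rows.
    """
    n = sx = sy = sxx = sxy = 0.0
    for block in iter_blocks(rows, block_size):
        for x, y in block:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    # Synthetic data for illustration only.
    stats = RunningStats()
    for block in iter_blocks(range(10), block_size=3):
        stats.update_block(block)
    print(stats.n, stats.mean, stats.variance, stats.lo, stats.hi)
    print(fit_line_in_blocks(((x, 2 * x + 1) for x in range(10)), block_size=4))
```

Because each block contributes only a handful of numbers, blocks could in principle be processed on separate cores and merged afterward, which is one plausible reading of the multicore claim. The sketch itself runs on a single core.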
In a conversation with CNET, Revolution cited a use case that involved processing more than 13 GB of FAA data detailing every commercial flight departure and arrival between 1987 and 2008. In the past, analyzing such a large data set would have been an overnight affair that could take upward of 12 hours; with RevoScaleR, the company said, the same data set was analyzed in less than one second.