Big data in context

A few weeks back I attended venture firm Accel Partners' New Data Workshop event and learned quite a bit about the state of what is now commonly called "big data," and about the challenges awaiting vendors that target this new way of slicing and dicing vast amounts of information.

One of the big takeaways for me was that, even with all of the processing power available nowadays, data is growing so rapidly that people are simply looking to cope with the problem rather than facing it head-on.

The issue of processing large amounts of data is not necessarily new--most developers and IT staff can tell you about having too much information to deal with--but the big difference is that there are now new approaches, tools, and technologies that can help alleviate the difficulty of processing it.

Over the last 30 years or so, the way machines process transactions has changed, and so has the amount of data being processed and collected, now with an eye toward real-time analysis of information.

This has led to a number of technologies that allow data processing to be offloaded and managed in both structured and unstructured ways. Examples include open-source projects like Memcached and Hadoop, as well as NoSQL data storage mechanisms like Cassandra. …

Open-source Lustre gets supercomputing nod

A new start-up called Whamcloud is coming out of stealth mode Wednesday with $10 million in private funding and a notion to disrupt the often academic world of supercomputing by leveraging the Lustre open-source project.

According to CEO Brent Gorda, the company is targeting the need for high-performance storage solutions based on the popular combination of Linux and Lustre for application and data storage environments. The company plans to offer support and services initially, with an eye toward eventually offering a turnkey supercomputing setup with hardware and software components.

For those less familiar with supercomputing technologies, Lustre is a …

Rackspace goes open source with cloud platform

Data center and cloud infrastructure service provider Rackspace is expected to announce Monday the release of a new open-source offering that will allow users to build and launch their own internal and hosted clouds.

Dubbed OpenStack, the new Apache-licensed project will feature several cloud infrastructure components, including a fully distributed object store based on Rackspace Cloud Files, the company's highly scalable storage engine.

In addition to the initial offering, a scalable compute-provisioning engine based on the NASA Nebula cloud technology and Rackspace Cloud Servers technology is expected to be available later this year.

Rackspace has been hosting enterprise computing …

Report: Java and MySQL doing fine under Oracle

A new developer survey report from open-source business intelligence vendor Jaspersoft shows that there has been minimal fallout from Oracle's acquisition of Sun Microsystems, and that Java and MySQL seem to be doing just fine in their new home.

These results contrast with the latest developments in the OpenSolaris project, whose Governing Board has, under Oracle's watch, threatened to disband.

MySQL and Java have a strong presence in modern open-source software stacks, both in the enterprise and in Web shops. Interestingly, the survey report suggests that, thanks to Oracle's commitment to Java, as part of …

Free NoSQL and data scalability cheat sheet

NoSQL databases and associated operational-data technologies based on nonrelational approaches to data management and manipulation continue to be top of mind for big Web shops and are slowly starting to make their way into enterprise IT infrastructure.

This means that developers need to get a handle on the latest information about NoSQL and big data in order to stay on top of the trend.

Accordingly, developer site DZone just released a new Getting Started with NoSQL and Data Scalability reference card as part of its cheat-sheet library.

The refcard is a good primer to get you asking all the right …

Cloudera goes enterprise with new Hadoop offering

Cloudera, a provider of support and services around the open-source cloud platform Apache Hadoop, on Tuesday announced Cloudera Enterprise, a suite of subscription-only add-ons to its free distribution.

The core platform, called Cloudera's Distribution for Hadoop (or CDH for short), was first unveiled in March 2009 and is 100 percent open-source software. Now, the company is offering Cloudera Enterprise, a suite of additional tools for monitoring, managing, and administering a cluster in production to complement the core CDH platform--for a fee.

This business model fits into the open-core category, where companies charge for exclusive tools or functions on top …

IBM chief scientist seeks patterns in patterns

Despite what is often considered to be a conservative approach to business, IBM has no shortage of big thinkers who use their skills both internally and externally to influence the way the company thinks about technology and how it applies to business processes.

This week I met with Jeff Jonas, chief scientist, IBM Entity Analytics, to talk about how predictive analytics is moving into new realms of big data and how companies are using software to deal with the deluge of information.

Jonas joined IBM in 2005 when Big Blue acquired SRD, a company he founded to develop so-called extraordinary systems with specific data analysis tasks, such as the facial recognition and analysis systems that casinos use to catch cheating gamblers.

The main thrust of Jonas' research right now is finding ways to take advantage of as much data as possible, as fast as transactions happen, with an eye toward real-time predictive analytics. This is basically pattern detection in real time, based on patterns that may or may not exist already.

Jonas explained that you may not know of a pattern, but you want to find one, and that many patterns might be interesting without actually mattering. In the casino example, bad guys are looking to perform channel separation by mixing and matching people, places, and things, but the casino needs to do channel consolidation to aggregate information and determine an immediate course of action. …

Adobe releasing Puppet code for managing Hadoop

Puppet Labs announced on Thursday that Adobe Systems is publishing code for managing Hadoop on the Puppet Forge community development site. (Disclosure: I am an adviser to Puppet Labs.)

Puppet is an open-source data center automation and configuration management framework that aims to give system administrators a platform for consistent, transparent, and flexible systems management.

The necessity of data center automation and management tools (often grouped into the DevOps category) is becoming ever more apparent, as cloud principles and large-scale systems that process data in a parallel manner continue to emerge.

Case in point: Hadoop is an open-source platform powering hugely …

NorthScale, Zynga team up on NoSQL

The massive amounts of data being created on the Web and the rise of cloud computing together make an ideal environment for alternative database technologies to thrive. And the Web is often proving to be just an entry point for bleeding-edge technology to be tested out before it starts heading into the enterprise.

NoSQL databases and associated operational-data technologies based on nonrelational approaches to data management and manipulation continue to be top of mind for big Web shops and are slowly starting to make their way into enterprise IT infrastructure.

I've spoken with a number of vendors roaming the NoSQL space over the last few months and there seems to be one common thread that they push: traditional relational databases are expensive, bulky, and simply not ideal for this new era of Web technology.

On Wednesday, a new NoSQL database joins the fray: Membase. Launched as an open-source project under the Apache 2.0 license and co-sponsored by NorthScale, Zynga, and NHN (Korea's top online gaming portal), Membase is optimized for storing the data behind interactive Web applications.

Membase says it is 100 percent compatible with Memcached, the de facto standard for distributed object caching behind Web applications. Basically, Membase is as easy to use as Memcached, but it acts as a data store rather than only a cache.
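As a rough illustration of what Memcached compatibility implies at the wire level, the sketch below encodes the `set` and `get` requests of the Memcached text protocol and parses a single-key `get` reply. The helper names (`build_set`, `build_get`, `parse_get_response`) are hypothetical and this is not Membase or NorthScale code; the only assumption is that a compatible server accepts the same bytes a Memcached server does.

```python
from typing import Optional


def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a Memcached text-protocol 'set' request.

    Wire format: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    """
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Encode a Memcached text-protocol 'get' request for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw: bytes) -> Optional[bytes]:
    """Parse the reply to a single-key 'get'.

    A hit looks like: VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n
    A miss is just:   END\r\n
    Returns the stored value, or None on a miss.
    """
    if raw == b"END\r\n":
        return None
    header, _, rest = raw.partition(b"\r\n")
    fields = header.split()  # [b"VALUE", key, flags, byte count]
    size = int(fields[3])
    return rest[:size]
```

Because the requests are plain bytes, an application written against Memcached would not need to change how it builds or reads them when pointed at a protocol-compatible server.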

According to James Phillips, NorthScale co-founder and senior vice president of products, the thousands of organizations that use Memcached (18 of the top 20 most visited Web sites including Twitter, Facebook, and Google) have a demand for a solution that looks like Memcached but acts like a distributed, highly available, high-performance, elastic database technology. …

Cloudera teams up to connect Oracle and Hadoop

This week Cloudera, a provider of software and services for the Apache Hadoop project, is set to announce a partnership with Quest Software to develop, support, and distribute an Oracle connector for Hadoop.

Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. It enables its users to explore complex data, using custom analyses tailored to users' information and questions.
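For readers new to the MapReduce model, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop's API; the function names are hypothetical, and a real cluster would spread each stage across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "big cluster"])` counts "big" twice and the other two words once each. The map and reduce functions are independent per document and per key, which is what lets a framework run them in parallel on very large data sets.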

Code-named "Ora-Oop," the connector will provide connectivity between Cloudera's Hadoop distribution and Oracle through an interface that allows for bidirectional, scalable, and functional data transfer …