A new company called Riptano recently launched to provide support and services for the Apache Cassandra project, a nonrelational open-source database designed for high performance that has a strong presence in Web shops like Twitter, Digg, and Reddit. I recently had the chance to chat with Matt Pfeil, founder of Riptano, and he provided some insight into the project and the new world of NoSQL database approaches.
What exactly is Cassandra and who uses it?
Cassandra is a highly scalable, distributed, open source database. It's a top-level Apache project with committers from Riptano, Rackspace, Digg, Facebook, and others.
Cassandra was designed with performance in mind. It is good at handling huge amounts of data and large numbers of requests, both writes and reads. Most databases are only good at read-mostly workloads, but Cassandra thrives on large write volumes too. Jonathan Ellis posted a quick benchmark back in January, and it is already out of date: 0.6 is about 30 percent faster now.
Cassandra also runs on commodity hardware, which makes it a good fit for the cloud or your own cluster. It doesn't require expensive things like SANs or even SSDs.
Cassandra is seeing rapid adoption because data is getting larger and there's more of it on a daily basis. That's forcing everyone--not just the Googles of the world--to think about scalability.
How does Cassandra compare to MySQL or other traditional relational DBs?
Besides scaling and write performance, Cassandra has the best support for geographically separated data centers in the industry, which is important both for redundancy and for having data local to your users.
Cassandra is also fully distributed, with no single points of failure. This means you don't have to deal with a failover process when machines go down; you just replace them when it's convenient. This is absolutely crucial for reliability, because the reality is that if you have an infrequently performed procedure like failover, it's going to break, and practically by definition, it's going to break at the least-convenient time possible.
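To make the "no single points of failure" idea concrete, here is a minimal Python sketch of a replicated ring. It is a toy model under stated assumptions, not Cassandra's actual code: the `ToyRing` class, the node names, the key `user:42`, and the replication factor of 3 are all hypothetical and chosen only for illustration. Each key is stored on several peers, so losing one machine still leaves live copies and no failover step is needed.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Toy peer-to-peer ring; an illustration, not Cassandra's implementation."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Every node sits at the ring position given by hashing its name.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]
        self.down = set()

    def replicas(self, key):
        """First node clockwise from the key's position, plus its successors."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def live_replicas(self, key):
        """Replicas for the key that are not currently marked as down."""
        return [n for n in self.replicas(key) if n not in self.down]


ring = ToyRing(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("user:42")
ring.down.add(owners[0])  # simulate one machine failing
survivors = ring.live_replicas("user:42")
```

In this sketch no node plays a special coordinating role: any peer can compute `replicas` for a key, and the remaining `survivors` can keep serving it until the failed machine is replaced.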
Now, all engineering is trade-offs, and Cassandra isn't a magic wand either. One of the things you give up with Cassandra is ad-hoc queryability. But that's something you have to give up as you scale, whether you're deploying on Cassandra or on a sharded, ad-hoc architecture based on relational databases. See, for example, eBay's paper, "BASE: An Acid Alternative." eBay uses Oracle, or did in 2006, but their architecture anticipates the principles behind scaling any large system.
So, given that people running into the scaling wall are going to have to give that up anyway, Cassandra brings some really nice properties to the table.
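One way to see why ad-hoc queries get harder at scale is the following minimal Python sketch. It assumes a hypothetical store split into four shards by a hash of the row key; the function names, the shard count, and the sample rows are illustrative and do not correspond to any Cassandra or eBay API. A lookup by key goes to one shard, while a query on an arbitrary field has to visit all of them.

```python
import zlib

NUM_SHARDS = 4  # hypothetical cluster size

# Each shard is a plain dict mapping row key -> row.
shards = [{} for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> int:
    """Pick a shard from the row key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def put(key: str, row: dict) -> None:
    """Store a row on the single shard that owns its key."""
    shards[shard_for(key)][key] = row


def get_by_key(key: str):
    """Key lookup: exactly one shard is consulted."""
    return shards[shard_for(key)].get(key), 1


def find_by_field(field: str, value):
    """Ad-hoc query: nothing says where matches live, so every shard is scanned."""
    matches = [
        row
        for shard in shards
        for row in shard.values()
        if row.get(field) == value
    ]
    return matches, NUM_SHARDS


put("user:1", {"name": "alice", "city": "Austin"})
put("user:2", {"name": "bob", "city": "Boston"})
put("user:3", {"name": "carol", "city": "Austin"})

row, touched_by_key = get_by_key("user:2")
austin, touched_by_scan = find_by_field("city", "Austin")
```

The cost of `get_by_key` stays at one shard no matter how many machines are added, whereas `find_by_field` grows with the cluster. That is the sense in which the trade-off applies to any partitioned design, relational or not.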
Finally, with 0.6, Cassandra added Hadoop support, and thus offers the ability to run your applications and analytics against the same database, instead of exporting to a separate system for that.
Any immediate thoughts on the Web versus enterprise tension in terms of databases?
RedMonk analyst Stephen O'Grady had a great blog post on this, but for the foreseeable future, both enterprise and Web companies want pretty much the same things from Cassandra: more features, better management tools, and so forth.
MySQL is one data point, but PostgreSQL is another, and they've arguably done a better job of offering enterprise features while still remaining accessible to Web companies.
Riptano does see some demand for a stable Cassandra distribution with added features backported from the development branch, which is the same kind of thing you see Red Hat doing with the Linux kernel or Cloudera doing with Hadoop. With that kind of model, we'll continue to contribute features back to the open-source version, while also maintaining a version with those features added that we distribute to our customers--customers who want more stability, controlled change, etc.
What's your take on the whole NoSQL movement?
I firmly believe it's "Not Only SQL"--not "No SQL." SQL is the original DSL for data, and it's really very good at that.
But the world is changing. Data is bigger than ever before, and there's more of it. Relational databases don't handle large amounts of data easily, and Cassandra solves that problem.
There's definitely room for both in the world, and sometimes even in the same application. But there are more tools available today than five years ago. Let's use them.