The new databases

New techniques and technologies are increasingly augmenting the relational database in large-scale distributed computing.

Gordon Haff

Gordon Haff is Red Hat's cloud evangelist although the opinions expressed here are strictly his own. He's focused on enterprise IT, especially cloud computing. However, Gordon writes about a wide range of topics whether they relate to the way too many hours he spends traveling or his longtime interest in photography.

Aug. 5, 2009 2:24 p.m. PT

"Database" has come to be largely synonymous with a relational database management system (RDBMS) or, more specifically, a relational database that is accessed using the SQL query language. Some simpler products run on desktops, but if you are talking about products used for serious business computing on a server, SQL it is. The widespread adoption of open-source products such as MySQL and PostgreSQL only cemented SQL's dominance by making it available to a broad audience that couldn't afford licensing fees for products from Oracle and other large database vendors.

An RDBMS stores data in the form of multiple tables that are related to each other by keys, each of which is unique among all occurrences in a given table. The term "relational database" was coined and defined by IBM's Edgar Codd in a 1970 paper. Products based on this database model came to largely replace a variety of hierarchical and other technology approaches. While the relational model could perform worse than the alternatives, it tended to offer more flexibility in how data could be laid out, added, and accessed.
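To make the idea of tables related by keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are invented for illustration and are not taken from any product mentioned here.

```python
import sqlite3

# In-memory database; the schema and data below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,   -- unique key for each customer row
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO orders (id, customer_id, item) VALUES (?, ?, ?)",
    [(1, 1, "keyboard"), (2, 2, "mouse"), (3, 1, "monitor")],
)
conn.commit()


def orders_for(conn, name):
    """Return the items ordered by a customer, found by joining on the key."""
    rows = conn.execute(
        "SELECT o.item FROM orders o "
        "JOIN customers c ON c.id = o.customer_id "
        "WHERE c.name = ? ORDER BY o.id",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(orders_for(conn, "Ada"))
```

The customer's name is stored once, and each order refers back to it by key. That is the flexibility the relational model offers: new questions can be answered with a new join rather than a new storage layout.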

As computer systems got faster (and SQL RDBMSs were enhanced in many ways), concerns about the performance of the basic approach largely receded into the background. Efforts to displace RDBMSs, such as object databases, have sometimes generated a lot of hype for a time but have stayed very much in the niches.

However, with the advent of truly massive-scale distributed computing infrastructures, we're starting to see significant adoption of technologies that don't necessarily replace RDBMSs, but certainly complement them.

The basic issue is that RDBMSs are architected to process and store all transactions with absolute reliability. (ACID--atomicity, consistency, isolation, and durability--is a set of properties commonly used to describe the requirements.) This is a good thing when we're talking about, say, financial transactions. A bank balance has to immediately reflect a withdrawal; the system has to prevent multiple withdrawals of the same balance from happening simultaneously.
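The bank withdrawal case can be sketched in a few lines, again with Python's sqlite3 module. This is only an illustration under assumed names (the accounts table and the withdraw helper are hypothetical), not how any particular bank or database vendor implements it. The balance check and the update happen in a single statement inside a transaction, so a second withdrawal of the same money fails instead of overdrawing the account.

```python
import sqlite3


def open_bank():
    """Create a hypothetical one-account bank with a balance of 100."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
    conn.commit()
    return conn


def withdraw(conn, account_id, amount):
    """Withdraw only if funds are sufficient; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, account_id, amount),
            )
            if cur.rowcount != 1:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balance(conn, account_id):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


conn = open_bank()
print(withdraw(conn, 1, 80))  # first withdrawal fits within the balance
print(withdraw(conn, 1, 80))  # second one would overdraw and is refused
print(balance(conn, 1))
```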

RDBMSs and their associated infrastructure also tend to reflect the assumption that data will be retained for a significant period. Again, this makes a lot of sense in the context of the traditional role of databases. A business not only wants to keep transaction records for at least several years--in many cases, it's legally required to do so.

However, we're seeing the increased use of alternative approaches in large distributed systems that don't have as stringent consistency requirements or that generate lots of intermediate results that don't need to be stored permanently. In exchange, they can use replication for maximum performance and availability.

One form this takes is "eventual consistency," which Amazon CTO Werner Vogels describes as tolerating inconsistency for "improving read and write performance under highly concurrent conditions and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running." Vogels has written a paper on the topic.

Amazon SimpleDB implements such a model. It "keeps multiple copies of each domain. When data is written or updated (using PutAttributes, DeleteAttributes, CreateDomain or DeleteDomain) and Success is returned, all copies of the data are updated. However, it takes time for the update to propagate to all storage locations. The data will eventually be consistent, but an immediate read might not show the change."
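The behavior described in that quote can be mimicked with a toy model. The sketch below is not the SimpleDB API; the ReplicatedStore class, its methods, and the explicit propagate step are all assumptions made for illustration. A write lands on one copy right away and reaches the other copies only later, so an immediate read from another copy might not show the change.

```python
class ReplicatedStore:
    """Toy model of eventual consistency across several in-memory copies."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet applied: (replica index, key, value)

    def put(self, key, value):
        # The write is acknowledged once the first copy has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica=0):
        # A read may be served by any copy, including a stale one.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Stand-in for the background replication that takes time in practice.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = ReplicatedStore()
store.put("color", "blue")
print(store.get("color", replica=0))  # the copy that took the write
print(store.get("color", replica=2))  # a copy that has not caught up yet
store.propagate()
print(store.get("color", replica=2))  # all copies now agree
```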

We're also seeing products that essentially augment RDBMSs by reducing the volume of transactions that they need to handle. Terracotta is a commercial product that provides distributed caching for Java applications. An example could be a travel reservation application where the actual "books" need to go into an RDBMS but many of the transactions associated with "looks" can be handled in a distributed way without touching the database every time. Terracotta says that it can frequently offload 40 percent to 60 percent of transactions.
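The looks-versus-books split can be sketched as a simple cache sitting in front of the database. This is a generic illustration, not Terracotta's API; the FlightDatabase and CachedLookups classes, the route name, and the seat count are hypothetical. Repeated looks are answered from the cache, while a book always goes to the database and discards the cached value so the next look is fresh.

```python
class FlightDatabase:
    """Stand-in for the RDBMS; counts how often it is actually queried."""

    def __init__(self):
        self.seats = {"BOS-SFO": 12}  # hypothetical route and seat count
        self.queries = 0

    def seats_available(self, route):
        self.queries += 1
        return self.seats[route]

    def book(self, route):
        self.queries += 1
        self.seats[route] -= 1


class CachedLookups:
    """Serves repeated looks from memory and sends books to the database."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def look(self, route):
        if route not in self.cache:
            self.cache[route] = self.db.seats_available(route)
        return self.cache[route]

    def book(self, route):
        self.db.book(route)
        self.cache.pop(route, None)  # invalidate so the next look is fresh


db = FlightDatabase()
app = CachedLookups(db)
for _ in range(5):
    app.look("BOS-SFO")
print(db.queries)  # five looks, but the database was touched far less often
```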

Memcached, an open-source distributed memory caching system, is conceptually similar. It distributes data (together with an associated structure to look up that data) across multiple systems to reduce accesses to external data stores. It is widely used at large Web sites such as Twitter, YouTube, and Wikimedia.
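One simplified way to picture how cached data gets spread across multiple systems is to hash each key and use the result to pick a node. The sketch below is an assumption-laden illustration, not Memcached's client protocol: the ShardedCache class and node names are hypothetical, plain dictionaries stand in for cache servers, and real deployments often use more elaborate schemes such as consistent hashing.

```python
import zlib


class ShardedCache:
    """Spreads keys across several in-memory 'nodes' by hashing the key."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each dict stands in for the memory of one cache server.
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        # The same key always hashes to the same node, so any client
        # can find the data without a central directory.
        index = zlib.crc32(key.encode("utf-8")) % len(self.node_names)
        return self.node_names[index]

    def set(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self.node_for(key)].get(key)


cache = ShardedCache(["cache-a", "cache-b", "cache-c"])
cache.set("user:1", "Ada")
print(cache.node_for("user:1"))  # which node holds the key
print(cache.get("user:1"))
```

The weakness of this simple modulo scheme is that adding or removing a node changes where most keys land, which is one reason more elaborate schemes exist.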

These techniques and technologies don't replace RDBMSs in the way that RDBMSs replaced older technologies such as hierarchical databases. Rather, they trade off characteristics that have been considered non-negotiable must-haves in the realm of database design such as full consistency. As a result, they can't be used instead of RDBMSs for the situations where those characteristics truly are requirements.

However, a lot of software is more asynchronous and read-intensive than traditional business applications. It doesn't have the same constraints, but it does need to scale performance massively across many systems. And for the organizations implementing that software, pairing RDBMSs with distributed data stores of various forms isn't just the right architectural approach; it may be the only way they can get to the scale levels they need at a price point that makes business sense.