3Leaf's modern take on NUMA

3Leaf has developed hardware and software that melds multiple small servers into a single large SMP system.

Gordon Haff
Gordon Haff is Red Hat's cloud evangelist although the opinions expressed here are strictly his own. He's focused on enterprise IT, especially cloud computing. However, Gordon writes about a wide range of topics whether they relate to the way too many hours he spends traveling or his longtime interest in photography.

Over the years, we've seen a variety of approaches intended to meld multiple small servers into a single larger system. 3Leaf Systems is the latest. On November 3, it introduced a Dynamic Data Center Server (DDC-Server) for AMD Opteron processors. The DDC-Server combines a custom DDC-ASIC chip with software to create a symmetric multiprocessing (SMP) server with 32 six-core AMD "Istanbul" processors and 1 terabyte of memory.

The system, together with the InfiniBand switch required to interconnect the server components, 8TB of storage, and 3Leaf's software, is priced at $250,000. A smaller $99,000 version is also available. However, these systems should be thought of primarily as proofs of concept, intended to establish proof points with customers and to give system makers a tangible product. 3Leaf's go-to-market plan is to sign up system original equipment manufacturers (OEMs) and sell them ASIC (application-specific integrated circuit) chips and software, not to be a seller of systems itself.

The basic concept behind 3Leaf's design has quite a few antecedents.

In the 1990s, Data General and Sequent came up with large Unix server designs that connected "standard high volume" (SHV) x86 modules with cables using SCI (Scalable Coherent Interface), a protocol from Dolphin Technology. The component modules were never as standard or high volume as the SHV term implied, but the approach still reduced development costs and increased the flexibility of the system relative to the more monolithic designs that characterized most large SMP servers of the day.

More recently, Virtual Iron developed a distributed hypervisor that could not only subdivide a single server in the vein of server virtualization products like VMware's ESX Server, but could also meld multiple smaller systems into large ones on the fly. (Virtual Iron later abandoned its proprietary hypervisor in favor of Xen and was eventually absorbed by Oracle.)

ScaleMP's vSMP Foundation is probably the current x86 server aggregation product most comparable to 3Leaf's. To date, it has been primarily focused on high-performance computing. The key distinction is that, unlike ScaleMP, 3Leaf uses a custom ASIC in addition to software. Both companies are primarily focused on InfiniBand as their interconnect, although there is nothing architectural to prohibit the use of 10-Gigabit Ethernet over time. From a technical perspective, 3Leaf is essentially layering its own coherency protocol on top of InfiniBand. The current product uses the same socket as the AMD processor. However, 3Leaf also has a license for Intel's QuickPath Interconnect.

3Leaf says that, by developing an ASIC that participates in coherent memory transactions at the cache level, it can deliver better performance across a wider range of workloads than a purely software-based approach can.

Performance has been a stumbling block with this approach historically.

An SMP server, however constructed, is characterized by the fact that it is a shared memory architecture. This means that any processor can directly access any memory in the system. In general, this makes for a simpler programming model than distributed memory architectures, such as clusters, in which much of the work of making sure you're working with the latest data is shifted from hardware to software.
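
A small illustration of what that programming model buys you: in the sketch below (C with POSIX threads, my own example rather than anything 3Leaf-specific), every thread updates the same counter through an ordinary memory access, and the hardware keeps the caches coherent behind the scenes. On a cluster, the equivalent update would require explicit messages between nodes.

```c
// Build with: gcc -O2 -pthread counter.c
#include <pthread.h>
#include <stdio.h>

// One variable, visible to every thread at the same address. The
// mutex orders the updates; the cache-coherence hardware makes sure
// each thread sees the latest value.
static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                      // direct access to shared memory
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, bump, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter); // prints 4000000
    return 0;
}
```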

How quickly a given processor can get to the memory that it needs plays a big part in a system's performance. In fact, for some workloads, such as database transaction processing, memory access time can be the single factor that most affects how fast a system is. As a result, traditional large server designs incorporated expensive hardware such as crossbars to keep memory traffic flowing quickly across the entire system.

Today's small servers have fast, high-bandwidth memory links of their own; indeed, their compact footprint can help reduce latency even further. However, once you combine multiple nodes, the time it takes for a processor to access memory on another node can rise dramatically. The exact numbers depend on many factors, including what else is going on in the system at the time. But, as a rule of thumb, it takes at least twice as long to access memory on another node as it does to access local memory, and it can take several multiples of that. In other words, memory access is non-uniform, hence the term NUMA (non-uniform memory access).
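
You can see the effect directly on a Linux NUMA machine using libnuma. The sketch below is again my own illustration, not anything from 3Leaf; it assumes at least two NUMA nodes, and because it sweeps a buffer it measures streaming bandwidth more than pure latency, but the local-versus-remote gap still shows up. It pins a thread to node 0 and times the same write pass over memory placed locally and remotely.

```c
// Build with: gcc -O2 numa_probe.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (256UL * 1024 * 1024)

// Sweep a buffer allocated on the given node and return the elapsed
// time in seconds. The calling thread stays on node 0 throughout.
static double sweep(int node) {
    char *buf = numa_alloc_onnode(BYTES, node);
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }
    memset(buf, 1, BYTES);              // fault the pages in first
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    memset(buf, 2, BYTES);              // the timed pass
    clock_gettime(CLOCK_MONOTONIC, &end);
    numa_free(buf, BYTES);
    return (end.tv_sec - start.tv_sec) +
           (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "this machine has no NUMA support\n");
        return 1;
    }
    numa_run_on_node(0);                // keep the thread on node 0
    printf("local  (node 0): %.3fs\n", sweep(0));
    printf("remote (node %d): %.3fs\n",
           numa_max_node(), sweep(numa_max_node()));
    return 0;
}
```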

Over time, operating systems have gotten much better at keeping processing and its associated memory physically close together. Certain workloads are also less sensitive to NUMA designs than others. Many HPC, analytics, and business intelligence applications involve fewer of the shared-memory updates that tend to drag down performance on NUMA architectures than typical enterprise online transaction processing does.
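
Applications can help the operating system along, too. Linux, for example, allocates pages by default on the node where they are first touched, so pinning a thread before it initializes its working set tends to keep that memory local. A minimal sketch of the idiom (my own example, assuming CPU 0 sits on node 0):

```c
// Build with: gcc -O2 -pthread firsttouch.c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

static void *worker(void *arg) {
    // Pin this thread to CPU 0 (assumed to live on node 0).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // First touch: because the pinned thread writes the pages first,
    // Linux places them on the node local to CPU 0.
    size_t bytes = 64UL * 1024 * 1024;
    char *buf = malloc(bytes);
    memset(buf, 0, bytes);

    /* ... do the work on buf from this same thread, so the
       accesses stay local ... */

    free(buf);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```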

It's also the case that, today, large SMP is as much about having a large and flexible pool of hardware resources for server virtualization as it is about having a single large SMP image. In many respects, then, large SMP is increasingly about management rather than monolithic application performance, which is one of the reasons we're seeing a general trend toward modularity in all SMP designs.

Thus, the SHV approach to SMP system design arguably sits closer to the mainstream than it has in the past.

As 3Leaf's Shahin Khan told me, the key factors with this approach are that it "had better be low cost and work." Performance has to be acceptable over at least an interesting subset of workloads, and there can't be a significant price premium over the constituent systems and hardware. And ultimately, for 3Leaf, success will come from convincing one or more major system OEMs that the time has arrived to add a system or systems based on this approach.