At Tuesday's Hot Chips conference IBM is scheduled to take the wraps off Power7, its next generation of RISC microprocessor. This is a big deal for IBM because Power is the foundation for its AIX Unix operating system, which has been one of the stars of its server portfolio in recent years. Power also supports the IBM i operating system and can also run Linux either natively or in an x86 binary translation mode that IBM acquired from Transitive. (Transitive is the company that developed the "Rosetta" technology that Apple used for the PowerPC to Intel transition.)
Modern microprocessors are incredibly complex machines. And major iterations, such as Power7, incorporate a multitude of new features, approaches, and techniques that are collectively far beyond the scope of a piece such as this to describe. Therefore, rather than trying to touch on everything, I'm going to touch on just a few aspects of the new processor generation that struck me as particularly noteworthy.
Power7 bumps both the number of cores and the performance per core over its predecessor. Its eight cores each support up to four simultaneous multithreading (SMT4) threads for a total of 32 threads per chip. SMT is a technique that helps make better use of the many execution units within a core by reducing the amount of time that software spends waiting for resources to become free. (One of the big issues with modern processor design is that some parts of the system run much more slowly than others so it's easy for a given thread to effectively create a roadblock while it's waiting; SMT is one way to alleviate this problem.)
This relative focus on multi-threading is a considerable departure from IBM's current Power6 which, at its 2006 unveiling, showed a focus on processor on frequency and hefty individual cores at a time when radical multi-core designs were grabbing the limelight. Despite its core count increase, Power7 continues to pay attention to per-core performance, but it's through techniques other than frequency; IBM says that the Power7 core has higher performance at lower frequency than the Power6 core.
One of these techniques is the aforementioned SMT4, coupled to an increase in the number of execution units per core. Power7 also reaches back to earlier Power playbooks and reintroduces out-of-order (OoO) execution, which was temporarily shelved for the Power6 generation. OoO execution can be thought of as a complementary technique to SMT that lets the processor skip over instructions that aren't ready to be processed because they are waiting on data.
Striking a balance between single-core and total-chip performance is one aspect of a general "balanced design" theme. Another is bandwidth. Each chip has dual-DDR3 memory controllers for a total of 100GB/sec of sustained memory bandwidth per chip. Scalability ports built into each chip are expandable to systems with a total of 32 sockets with 360GB/sec SMP bandwidth per chip.
Another aspect of balance is the design of the cache hierarchy, the memory physically near the processor that keeps frequently and recently used data near the processing units so that they can be accessed faster. Perhaps most notable is that there's a 32MB shared Level 3 (L3) cache in the middle of each chip. In the past, IBM has often implemented L3 caches as a separate die on a multi-chip module (MCM). This provides lots of room for the cache but means that the memory is physically further away (and therefore often slower) and demands lot of pins on the processor package to communicate with it.
Power7 takes a different approach. It's the first major commercial processor to implement an on-chip L3 cache using embedded DRAM (eDRAM). Caches are more typically constructed from static RAM (SRAM), which is faster and doesn't need to be refreshed on an ongoing basis but requires six transistors per device, rather than one for DRAM.
IBM estimates that the eDRAM has a 6:1 Latency improvement for L3 accesses relative to an external L3. Relative to an internal SRAM array, eDRAM takes about one third the space and consumes about one fifth the standby power. As for performance, IBM characterizes it as "almost as fast" and says that it handles the memory refreshes required by DRAM--memory contents have to be periodically written or they will decay--during "windows of opportunity" and generally won't have much of an impact on system performance.
As is increasingly the norm with microprocessor designs, power management also plays a big role in Power7. It's also an area where microprocessor designers are still learning. Power6 placed considerable focus on a feature called power gating, that effectively turned off portions of a core when they weren't being used. Power7's top power-saving mode, sleep, is less aggressive. It drops the voltage to the minimum level required to retain state. With the 45nm process used by Power7, IBM says that this almost eliminates leakage current and provides most of the power benefits of turning off the power entirely while saving a lot of verification complexities.
Finally, as the processing power of chips and the servers built on them grow, so too does the need to provide a level of resiliency against errors both transient and permanent in the literally billions of electronic features that make up these systems. It may be a cliche to note that if you're going to put a lot of eggs in one basket, you need to protect that basket well--but it's no less true for that.
Many of the new Power7 reliability features are focused on memory. For example, there's full X8 "chip-kill" with 64 byte error correcting code (ECC). This means that a full DIMM can die and the data can be steered over a spare device. For system partitions tasked with particularly critical workloads, Power7 can also do selective memory mirroring--think RAID 1 for memory. Power7 also just generally keeps building on error-checking and failover features; this includes the new ability to dynamically fail over if the main oscillator (clock) associated with the chip fails.
At a high level, Power7 shares a number of general design philosophies and directions with microprocessors from other vendors, including those from AMD and Intel that are more associated with scale-out designs and redundancy at the software level. This, in part, reflects that all vendors are fighting the same physical laws and are largely constrained by the same fabrication technologies. We see a general shift toward multi-core, an increased focus on power efficiency, and a need to protect against the inevitable glitches that affect ever smaller transistors--it doesn't matter who the vendor is.
However, that said, Power7 is a different design center than the scale-out standards bearers. While not literally a mainframe, its focus is very much the same sort of resiliency and performance balance at high vertical scale.