How would you like a single-chip microprocessor with more than four times the performance (on some applications) of Intel's best Core i7?
Then consider that up to 32 of these chips can be directly connected to form a single server, achieving four times the built-in scalability of Intel's next-generation Nehalem-EX processor.
That's IBM's widely anticipated Power7, which it described at last week's Hot Chips conference. But if you're interested, you'd better be prepared to spend a lot more than four times as much per chip. IBM isn't talking about pricing, but large Power servers can cost more than $10,000 per processor.
IBM's forthcoming Power7 server processor has eight cores, manages 32 threads, and includes 32MB of on-chip embedded DRAM cache. Power7 also has the highest levels of off-chip bandwidth ever achieved by a microprocessor. (Credit: IBM)
What makes the Power7 so powerful? Each chip has eight cores, and each core supports four-way multithreading. There's 32MB of level-3 cache on the chip, made using embedded DRAM (eDRAM) cells. Most CPUs use SRAM for cache because it's generally easier to combine with high-performance logic, but DRAMs--with only one transistor per bit--offer compelling density advantages. IBM spent years developing a new kind of eDRAM that would work with SOI (silicon on insulator) manufacturing processes, and the Power7 is the most advanced product to use the new technology.
Interestingly, the Power7 cores run much more slowly than those in the Power6 processor, which I wrote about here in 2007 ("Live from Hot Chips 19: Session 1, IBM's Power6"). The Power6 was designed to run very fast using a long CPU pipeline in order to deliver the highest possible performance on each thread of execution.
Maybe that strategy didn't work out as well as IBM hoped, because the Power7 returns to a more traditional microarchitecture with a shorter pipeline and much lower clock rates--though IBM didn't say exactly what those rates would be.
IBM did, however, promise that the Power7 would be roughly four times as fast as the Power6, chip for chip. Since it has four times as many cores, each of the new slower-clocked cores must still deliver about as much performance as those in the previous generation.
Chip-level performance must always be matched by off-chip connections lest the incoming data or outgoing results be bottlenecked by a too-slow channel. Accordingly, the Power7 is equipped with eight I/O channels for DRAM, each of which connects to an off-chip buffering device that splits the channel into two 64-bit DRAM interfaces. Altogether, IBM says the Power7 has 180 GBps of DRAM interconnect that can sustain over 100 GBps of effective memory bandwidth.
There's another 50 GBps of peak I/O bandwidth and a staggering 360 GBps of peak bandwidth used to let each Power7 chip communicate with others. The DRAM connected to each chip is thus shared across larger systems.
Combining these figures, IBM says a single Power7 has 590 GBps of total off-chip bandwidth. This isn't the real number, since many of those bytes are used for error-correcting codes and other overhead, but it's still pretty impressive.
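As a quick check on that total, here is a minimal Python sketch that adds up the three peak figures quoted above and works out the combined width of the DRAM data paths. The dictionary keys and function names are made up for illustration; only the numbers come from IBM's presentation as reported here.

```python
# Peak off-chip bandwidth figures quoted for one Power7 chip, in GBps.
# The key names are illustrative labels, not IBM terminology.
POWER7_PEAK_GBPS = {
    "dram": 180,
    "io": 50,
    "chip_to_chip": 360,
}

# Eight DRAM channels, each split by a buffer chip into two 64-bit interfaces.
DRAM_CHANNELS = 8
INTERFACES_PER_CHANNEL = 2
BITS_PER_INTERFACE = 64


def total_offchip_gbps(links):
    """Sum the peak bandwidth of every off-chip link group."""
    return sum(links.values())


def dram_data_width_bits():
    """Combined width of all DRAM interfaces behind the buffer chips."""
    return DRAM_CHANNELS * INTERFACES_PER_CHANNEL * BITS_PER_INTERFACE


if __name__ == "__main__":
    print(total_offchip_gbps(POWER7_PEAK_GBPS), "GBps peak off-chip")
    print(dram_data_width_bits(), "bits of DRAM data path")
```

These are peak numbers only; the sketch says nothing about how much of that bandwidth goes to error-correcting codes and other overhead.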
So is Power7's die size: 567 square millimeters for 1.2 billion transistors. That's nearly a square inch! IBM says that if the 32MB L3 cache had been manufactured using SRAM, the transistor count would have been 2.7 billion instead.
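The die-size and transistor claims can be roughly sanity-checked with a few lines of Python. This is a back-of-the-envelope sketch, not IBM's method: the six-transistor SRAM cell is my assumption (a common design, but not stated in the presentation), and the estimate counts data bits only, ignoring tags, error correction, and redundancy.

```python
MM2_PER_SQUARE_INCH = 25.4 ** 2  # 645.16 square millimeters

DIE_AREA_MM2 = 567
ACTUAL_TRANSISTORS = 1.2e9

L3_BYTES = 32 * 1024 * 1024
L3_BITS = L3_BYTES * 8

EDRAM_TRANSISTORS_PER_BIT = 1  # from the article: one transistor per bit
SRAM_TRANSISTORS_PER_BIT = 6   # ASSUMPTION: conventional 6T SRAM cell


def die_area_square_inches(area_mm2):
    """Convert a die area from square millimeters to square inches."""
    return area_mm2 / MM2_PER_SQUARE_INCH


def extra_transistors_with_sram(bits):
    """Additional transistors needed if every eDRAM bit became an SRAM bit."""
    return bits * (SRAM_TRANSISTORS_PER_BIT - EDRAM_TRANSISTORS_PER_BIT)


def estimated_sram_total():
    """Estimated chip transistor count with an all-SRAM L3 (data bits only)."""
    return ACTUAL_TRANSISTORS + extra_transistors_with_sram(L3_BITS)


if __name__ == "__main__":
    print(round(die_area_square_inches(DIE_AREA_MM2), 2), "square inches")
    print(round(estimated_sram_total() / 1e9, 2), "billion transistors")
```

Under these assumptions the estimate comes out somewhat below IBM's 2.7 billion figure, which is plausible given that the sketch leaves out everything except the raw data bits.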
Still, Power7 wasn't the only high-end chip talked about at Hot Chips.
Rainbow Falls, a record for core count
Sun Microsystems was there to describe its forthcoming Rainbow Falls chip, which I assume will be marketed as the UltraSparc T3. The chip has 16 cores, each of which is reportedly able to manage 8 threads.
Sun's primary Rainbow Falls presentation focused on details of the chip's internal and external interconnects; a second talk described the cryptographic coprocessors present in each of the chip's cores. These coprocessors--one for modular arithmetic (commonly used in public-key cryptography) and a cipher/hash unit to accelerate bulk ciphers like AES and secure hash algorithms--provide many times the performance of pure software implementations.
Fujitsu was also at Hot Chips to describe its eight-core, 2GHz Sparc64 VIIIfx processor, the latest in a long series of impressive designs from the company. Fujitsu quoted a peak performance figure of 128 GFLOPS (billions of floating-point operations per second) with a typical power consumption of just 58 watts. It did not, however, provide sustained performance or worst-case power consumption figures.
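Peak figures like this are usually the product of core count, clock rate, and floating-point operations per core per cycle. The sketch below works backward from Fujitsu's quoted numbers to the per-cycle figure they imply. That per-cycle value is an inference from the arithmetic, not something stated in this write-up, and the function names are illustrative.

```python
PEAK_GFLOPS = 128      # quoted by Fujitsu
CORES = 8              # quoted by Fujitsu
CLOCK_GHZ = 2.0        # quoted by Fujitsu
TYPICAL_WATTS = 58     # quoted by Fujitsu (typical, not worst case)


def implied_flops_per_core_per_cycle(peak_gflops, cores, clock_ghz):
    """Floating-point operations each core must issue per cycle to hit the peak."""
    return peak_gflops / (cores * clock_ghz)


def peak_gflops_per_watt(peak_gflops, watts):
    """Peak throughput divided by typical power; an optimistic ratio."""
    return peak_gflops / watts


if __name__ == "__main__":
    print(implied_flops_per_core_per_cycle(PEAK_GFLOPS, CORES, CLOCK_GHZ))
    print(round(peak_gflops_per_watt(PEAK_GFLOPS, TYPICAL_WATTS), 2))
```

Because the ratio mixes a peak performance number with a typical power number, it flatters the chip; sustained performance and worst-case power would give a lower figure.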
AMD, Intel vie for high-volume servers
Few of us will have direct exposure to the IBM, Sun, and Fujitsu chips. A pair of presentations from Advanced Micro Devices and Intel described products that will be much more widely available.
AMD launched its six-core Opteron processor code-named "Istanbul" earlier this year (see Brooke Crothers' coverage from June). Next year the company will begin shipping a new Opteron model currently code-named Magny-Cours (after a racetrack in France). Magny-Cours will consist of two Istanbul chips in a single package, with twice as many DRAM interfaces to support the new processor's increased performance.
AMD also teased the audience with another mention of a new processor core design that has been under development there for several years: "Bulldozer," which is now targeted at 32nm process technology. This new core will incorporate new x86 instruction-set extensions which will probably not be adopted by Intel (a strategy that reminds me of AMD's old 3DNow extensions).
But saving the best for last--best, that is, from the perspective of anticipated sales--Intel's talk on Nehalem-EX showed just how far Intel has been able to push the technology envelope for high-volume servers.
Nehalem-EX is an eight-core version of the existing quad-core Nehalem design. The new chip also has 24MB of L3 cache done in old-school SRAM. By my calculations, about 60 percent of the chip's 2.3 billion transistors are in this cache alone.
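For readers who want to see where an estimate like this might come from, here is a rough Python sketch. It is not the calculation behind the 60 percent figure, which isn't shown; it assumes a six-transistor SRAM cell and counts data bits only, so it lands a little lower. Tag arrays, error-correction bits, and redundancy (all ignored here) would push the share upward.

```python
L3_BYTES = 24 * 1024 * 1024
CHIP_TRANSISTORS = 2.3e9
SRAM_TRANSISTORS_PER_BIT = 6  # ASSUMPTION: conventional 6T SRAM cell


def cache_data_transistors(cache_bytes):
    """Transistors used by the cache's data bits alone."""
    return cache_bytes * 8 * SRAM_TRANSISTORS_PER_BIT


def cache_share(cache_bytes, chip_transistors):
    """Fraction of the chip's transistors spent on cache data bits."""
    return cache_data_transistors(cache_bytes) / chip_transistors


if __name__ == "__main__":
    print(f"{cache_share(L3_BYTES, CHIP_TRANSISTORS):.1%} of transistors (data bits only)")
```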
Nehalem-EX provides four links to external DRAM buffer chips supporting two DDR3 DRAM interfaces each (much like the Power7 solution) and four QuickPath Interconnect links that provide direct "glueless" connections for up to eight-processor systems (64 cores, 128 threads). Intel is also working on an external Node Controller chip for systems with up to 2,048 Nehalem-EX processors.
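The core and thread counts for the largest directly connected systems follow from simple multiplication. The Python sketch below uses only figures given in this article: eight cores per chip on both designs, four threads per Power7 core, up to 32 Power7 chips per server, and two threads per Nehalem-EX core (implied by the 64-core, 128-thread figure). The function name is illustrative.

```python
def system_totals(chips, cores_per_chip, threads_per_core):
    """Return (total cores, total hardware threads) for a multi-chip system."""
    cores = chips * cores_per_chip
    return cores, cores * threads_per_core


if __name__ == "__main__":
    # Glueless Nehalem-EX system: 8 chips, 8 cores each, 2 threads per core.
    print("Nehalem-EX:", system_totals(8, 8, 2))
    # Largest directly connected Power7 server: 32 chips, 8 cores, 4 threads.
    print("Power7:", system_totals(32, 8, 4))
```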
The aggregate bandwidth numbers for Nehalem-EX aren't as mind-boggling as those for Power7, but they're still far beyond anything available for PC-architecture servers today. Based on the presentation, I estimate Nehalem-EX could boast over 85 GBps of peak memory bandwidth and 100 GBps of chip-to-chip bandwidth, some of which must be allocated to I/O.
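One way to arrive at a memory figure in that range is sketched below in Python. The link and interface counts come from the description above; the 64-bit interface width and the DDR3-1333 transfer rate are assumptions chosen for illustration, not numbers taken from Intel's presentation, and shipping parts may run memory at a different speed.

```python
LINKS = 4                   # from the article
INTERFACES_PER_LINK = 2     # from the article
BYTES_PER_TRANSFER = 8      # ASSUMPTION: 64-bit DDR3 data path per interface
MEGATRANSFERS_PER_S = 1333  # ASSUMPTION: DDR3-1333


def peak_memory_gbps(links, interfaces_per_link, bytes_per_transfer, mt_per_s):
    """Peak DRAM bandwidth in GBps across all interfaces."""
    interfaces = links * interfaces_per_link
    return interfaces * bytes_per_transfer * mt_per_s / 1000


if __name__ == "__main__":
    gbps = peak_memory_gbps(
        LINKS, INTERFACES_PER_LINK, BYTES_PER_TRANSFER, MEGATRANSFERS_PER_S
    )
    print(round(gbps, 1), "GBps peak")
```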
I expect the raw number-crunching performance of the Nehalem-EX cores to be roughly on the same level as Power7's cores. The lower ratio of bandwidth to processing power for Nehalem-EX reflects a different design target, not a design shortfall--and most importantly, a much lower selling price. There will presumably be versions of Nehalem-EX priced similarly to existing Xeon MP products, which currently top out at $2,301 each in small volumes, but that's a very reasonable price to pay for the market's most advanced x86 server processor.
Apple's Snow Leopard operating system, which hits the streets on Friday, has plenty of new technology--but one of its major new features will soon be available on Microsoft Windows, Linux, and other major platforms.
OpenCL, the Open Computing Language, was originally proposed by Apple to support parallel programming on GPUs. There are other GPU programming languages, such as Nvidia's CUDA (Compute Unified Device Architecture) extensions for C and the Brook stream programming language, which was developed at Stanford University and is included in Advanced Micro Devices' Stream Computing software development kit. Rather than choosing one of these languages, Apple chose to create a new standard independent of the big graphics vendors.
In fact, OpenCL is even independent of Apple. One of the first things Apple did was offer to hand it over to the Khronos Group, the same independent standards organization that manages the OpenGL standard for 3D rendering.
Supporters of the OpenCL standards effort at the Khronos Group include the biggest CPU and GPU makers in the industry. Apple is also involved but not shown here.
The members of the OpenCL working group turned Apple's draft specification into the released version 1.0 spec in just six months (see Brooke Crothers' "OpenCL goes beyond Apple" from last December)--and in the process, they created what may be the best solution so far to the general problem of parallel programming.
See, OpenCL isn't just for GPUs. It was designed from the beginning to get the most out of multicore processors too. After all, if you have a multicore CPU--and you probably do--why let it go to waste? OpenCL is flexible enough to support both CPU-optimized and GPU-optimized code, and smart enough to choose the right code, depending on what hardware is available in the system to run it. Most of the competing parallel-programming languages can't do that.
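The idea of shipping several tuned versions of the same routine and picking one at run time can be pictured with a small Python sketch. To be clear, this is not OpenCL and uses none of its API: every name here is hypothetical, and both "variants" are plain Python stand-ins that compute the same result. It only illustrates the selection step.

```python
def saxpy_cpu(a, x, y):
    """Hypothetical stand-in for a CPU-tuned variant of a*x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_gpu(a, x, y):
    """Hypothetical stand-in for a GPU-tuned variant; same result."""
    return [a * xi + yi for xi, yi in zip(x, y)]


# Hypothetical registry mapping a device type to its tuned variant.
VARIANTS = {"gpu": saxpy_gpu, "cpu": saxpy_cpu}
# Preference order when more than one device type is present.
PREFERENCE = ("gpu", "cpu")


def pick_variant(available_device_types):
    """Return the preferred variant for the device types actually present."""
    for kind in PREFERENCE:
        if kind in available_device_types:
            return VARIANTS[kind]
    raise RuntimeError("no supported device type available")


if __name__ == "__main__":
    kernel = pick_variant({"cpu"})  # a machine with no usable GPU
    print(kernel.__name__, kernel(2.0, [1.0, 2.0], [3.0, 4.0]))
```

The point of the design is that the calling code never changes; only the lookup decides which version runs.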
OpenCL can take advantage of both task-level parallelism (running many tasks at once, whether different tasks or copies of the same task) and data-level parallelism (where a single instruction within a task is applied to multiple data items at once--also known as SIMD). Some parallel-programming languages can't do that, either.
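The difference between the two kinds of parallelism is easier to see in code. The Python sketch below is only a model of the two shapes, not OpenCL: the function names are invented for illustration, and the "data-parallel" function runs sequentially in plain Python, whereas SIMD hardware would process many elements with a single instruction.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(values, factor):
    """Data-parallel shape: the same operation applied to every element."""
    return [v * factor for v in values]


def total(values):
    """A different, independent task that can run alongside scale()."""
    return sum(values)


def run_tasks(data):
    """Task-parallel shape: two different tasks submitted at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        scaled = pool.submit(scale, data, 2.0)
        summed = pool.submit(total, data)
        return scaled.result(), summed.result()


if __name__ == "__main__":
    print(run_tasks([1.0, 2.0, 3.0]))
```

A language that supports both styles lets the same program split work across cores by task and, within each task, across data elements.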
But OpenCL's biggest advantage isn't technical in nature: it's that no other parallel-programming language will be so widely supported. The support starts with Snow Leopard but will go well beyond that. AMD and Nvidia will have OpenCL drivers for their GPUs under Windows and Linux. AMD and Intel will support OpenCL on their CPUs (including Intel's Larrabee). And AMD has already shipped its first OpenCL implementation for its Athlon and Opteron processors.
Implementations for video game consoles and DSPs (digital signal processors) are also under development. I've even heard that future releases of OpenCL may be able to work with less common hardware, such as FPGAs (field-programmable gate arrays).
We had an excellent half-day OpenCL tutorial last weekend at Hot Chips 21. There were also some great OpenCL presentations at Siggraph 2009 earlier this month; if you'd like more detailed information, that's a good place to start.
All this support for OpenCL means that it should become the first choice of academic and commercial developers who want a good cross-platform way to develop parallel code. Expect to see OpenCL used in software for audio and video processing, cryptography, medical imaging, and many other applications--including, of course, gaming.
(Disclosure: I will be writing a technical white paper for Nvidia, one of the companies covered in this story.)
In a story on PC Pro, Nvidia architect John Montrym (whose name was incorrectly spelled "Mottram") quoted my recent blog post on Larrabee as concluding that "the 'large' Larrabee in 2010 will have roughly the same performance as a 2006 GPU from Nvidia or ATI."
Alas, this isn't really what I said or meant.
What I actually described as equating to "the performance of a 2006-vintage...graphics chip" was a performance standard defined by Intel itself--running the game F.E.A.R. at 60 fps in 1,600 x 1,200-pixel resolution with four-sample antialiasing.
Intel used this figure for some comparisons of rendering performance. If Larrabee ran at 1GHz, for example, Intel's figures show that...
Now for the Mobile PC Processors session at Hot Chips. Previous Hot Chips installments covered networking, the Reed Hundt speech, AMD keynote, wireless networking, technology and software, process technology, multicore designs, IBM's Power6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. Comments are welcome!
Alas, there wasn't much ...
On to the networking session at Hot Chips. Previous Hot Chips installments covered the Reed Hundt speech, AMD keynote, wireless networking, technology and software, process technology, multicore designs, IBM's Power6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. Comments are welcome!
After the highly political talk by former FCC Chairman Reed Hundt, the Networking session pulled us sharply back into ...
Yes, I'm still at Hot Chips. This post covers a special presentation by Reed Hundt of Frontline Wireless, who is a former chairman of the FCC. (Michael Kanellos has also blogged about this speech, here.) Previous Hot Chips installments include the AMD keynote, wireless networking, technology and software, process technology, multicore designs, IBM's Power6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. Comments are welcome!
Reed Hundt is best known as a former chairman of the FCC (Federal Communications Commission), where his role in enacting the Telecommunications Act of 1996 generated considerable controversy.
He opened his talk by regaling us with ...
This is the eighth in a series of posts from the Hot Chips conference at Stanford. The previous installments looked at wireless networking, technology and software, process technology, multicore designs, IBM's Power6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. This is sort of an experiment for me; I usually prefer to have time to review my work before I publish it. If you see anything wrong, please leave a comment!
The second keynote here comes from Phil Hester, the chief technical officer at AMD. It's titled "Multicore and Beyond: Evolving the x86 Architecture."
He began by describing the ...
This is the seventh in a series of posts from the Hot Chips conference at Stanford University. The previous installments looked at technology and software, process technology, multicore designs, IBM's Power6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. This is sort of an experiment for me; I usually prefer to have time to review my work before I publish it. If you see anything wrong, please leave a comment!
This session has two presentations--one from SiBeam describing wireless HDTV transmission for home use, the other from Broadcom on new 802.11n Wi-Fi technology.
The SiBeam presentation is easily summarized: It describes a chipset that sends uncompressed HDTV video over ...
This is the sixth in a series of posts from the Hot Chips conference at Stanford University. The previous installments looked at process technology, multicore designs, IBM's Power6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. This is sort of an experiment for me; I usually prefer to have time to review my work before I publish it. If you see anything wrong, please leave a comment!
We began Tuesday morning with a session on assorted technology developments.
The first talk was from Sun Microsystems, about the company's Proximity chip-to-chip interconnect technology. Today, to put multiple chips in a package--a common technique in high-end servers, for example--each chip will be individually connected to the package substrate through conductive ...

