Intel announced on Monday that it will be presenting a paper at Siggraph 2008 about its "many-core" Larrabee architecture, which will be the basis of future Intel graphics processors.
The paper itself, however, has already been published, and I was able to get a copy of it. (Unfortunately, the paper is normally available only to members of the Association for Computing Machinery.)
The paper is a pretty thorough summary of Intel's motives for developing Larrabee and the major features of the new architecture. Basically, Larrabee is about using many simple x86 cores--more than you'd see in the central processor (CPU) of the system--to implement a graphics processor (GPU). This concept has received a lot of attention since Intel first started talking about it last year.
The paper also answers perhaps the biggest unanswered question about Larrabee--what are the cores, and how can Intel put "many" of them on a chip when desktop CPUs are still moving from two to four cores?
Intel describes the Larrabee cores as "derived from the Pentium processor," but I think perhaps this is an oversimplification. The design shown in the paper is only vaguely Pentium-like, with one execution unit for scalar (single-operation) instructions and one primarily for vector (multiple-operation) instructions.
That's the basic answer: Larrabee cores just have less going on. A quad-core desktop processor might have six or more execution units, and a lot of special logic to let it reorder instructions and execute code past conditional branches just in case it can guess the direction of the branch correctly. This complexity is necessary to maximize performance in a lot of desktop software, but it's not needed for linear, predictable code--which is what we usually find in 3D-rendering software.
But the vector unit in Larrabee is much more powerful than anything in older Intel processors--or even in the current Core 2 chips--because 3D rendering needs to do a lot of vector processing. The vector unit can perform 16 single-precision floating-point operations in parallel from a single instruction, which works out to 512 bits wide--great for graphics, though it would be overkill for a general-purpose processor, which is why the vector units in mainstream CPUs are 128 or 256 bits wide at most.
The new vector unit also supports three-operand instructions, probably including the classic "A * B + C" operation that is so common in many applications, including graphics. With three operands and two calculations per instruction, the peak throughput of a single Larrabee core should be 32 operations per cycle, and that's just what the paper claims.
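The arithmetic behind those figures can be sketched in a few lines of Python. The 16 lanes and the two operations per lane come from the description above; the function and constant names are purely illustrative, and the multiply-add is my guess at an operation the paper does not confirm.

```python
# Back-of-the-envelope model of one Larrabee vector instruction.
# All names are illustrative. The 16 single-precision lanes come from the
# article; the multiply-add operation is an assumption, not a confirmed
# part of the instruction set.

LANES = 16           # single-precision operations per vector instruction
BITS_PER_LANE = 32   # width of one single-precision float


def vector_width_bits(lanes=LANES, bits_per_lane=BITS_PER_LANE):
    """Total width of the vector unit in bits."""
    return lanes * bits_per_lane


def multiply_add(a, b, c):
    """Element-wise a * b + c: one multiply and one add in every lane."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three operands must have the same lane count")
    return [x * y + z for x, y, z in zip(a, b, c)]


def peak_ops_per_cycle(lanes=LANES, ops_per_lane=2):
    """Peak operations per cycle if every lane does a multiply and an add."""
    return lanes * ops_per_lane


if __name__ == "__main__":
    a = [float(i) for i in range(LANES)]
    b = [2.0] * LANES
    c = [1.0] * LANES
    print(vector_width_bits())         # 512
    print(peak_ops_per_cycle())        # 32
    print(multiply_add(a, b, c)[:4])   # [1.0, 3.0, 5.0, 7.0]
```

The point of the sketch is only that 16 lanes times two operations per lane gives the 32 operations per cycle quoted in the paper.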
I say "probably" because the Siggraph paper doesn't describe exactly what operations will be implemented in the vector unit, but I suspect this part of the Larrabee design is related to Intel's Advanced Vector Extensions, announced last April. The first implementations of AVX for desktop CPUs will apparently begin with a 256-bit design, another indication of how unusual it is for Larrabee to have a 512-bit vector unit.
The multithreading factor
Intel also built four-way multithreading into the Larrabee cores. Each Larrabee core can save all the register data from four separate threads in hardware, so that most thread-switch operations can be performed almost instantly rather than having to save one set of registers to main memory and load another. This approach is a reasonable compromise for reducing thread-switching overhead, although it probably consumes a significant amount of silicon.
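A toy model makes the trade-off easier to see. This is only a sketch: the class names and the register count are hypothetical, and it counts nothing more than how many register words would have to move to or from memory on each thread switch.

```python
# Toy comparison of hardware thread contexts versus software context
# switching. Class names and REGISTERS_PER_THREAD are hypothetical; the
# paper does not give these details.

REGISTERS_PER_THREAD = 32  # assumed size, for illustration only


class HardwareThreadedCore:
    """Keeps one register set per thread on chip; a switch selects a set."""

    def __init__(self, threads=4):
        self.contexts = [[0] * REGISTERS_PER_THREAD for _ in range(threads)]
        self.active = 0
        self.words_moved = 0  # register words copied to or from memory

    def switch_to(self, thread):
        self.active = thread  # nothing is copied


class SoftwareSwitchedCore:
    """Has a single register set; a switch spills it and reloads another."""

    def __init__(self, threads=4):
        self.registers = [0] * REGISTERS_PER_THREAD
        self.memory = [[0] * REGISTERS_PER_THREAD for _ in range(threads)]
        self.active = 0
        self.words_moved = 0

    def switch_to(self, thread):
        self.memory[self.active] = list(self.registers)  # save outgoing thread
        self.registers = list(self.memory[thread])       # load incoming thread
        self.words_moved += 2 * REGISTERS_PER_THREAD
        self.active = thread


def run_switches(core, schedule):
    """Apply a sequence of thread switches and report the memory traffic."""
    for thread in schedule:
        core.switch_to(thread)
    return core.words_moved


if __name__ == "__main__":
    schedule = [1, 2, 3, 0]
    print(run_switches(HardwareThreadedCore(), schedule))  # 0
    print(run_switches(SoftwareSwitchedCore(), schedule))  # 256
```

The hardware version pays for its zero-copy switches with four full register sets on the die, which is the silicon cost mentioned above.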
Note that this kind of multithreading in Larrabee is very different from the Hyper-Threading technology Intel uses on Pentium 4, Atom, and future Nehalem processors. Hyper-Threading (aka simultaneous multi-threading) allows multiple threads to execute simultaneously on a single core, but this only makes sense when there are many execution units in the core. Larrabee's two execution units are not enough to share this way.
All of these differences prove rather conclusively that Larrabee's cores are not the same as the cores in Intel's Atom processors (also known as Silverthorne). That surprised me; the Atom core seemed fairly appropriate for the Larrabee project. All that really should have been necessary was to graft a wider vector unit onto the Atom design. But now I suppose the Atom and Larrabee projects have been completely independent from one another all along.
Intel won't say how many cores are in the first chip. The paper describes an on-chip ring network that connects the cores. The network is 512 bits wide. Interestingly, the paper mentions that there are two different ring designs--one for Larrabee chips with up to 16 cores, and one for larger chips. That suggests Intel has chips planned with relatively small numbers of cores, possibly as few as four or eight. Such small implementations might be appropriate for Intel's future integrated-graphics chip sets, but as such they will be very slow by comparison with contemporary discrete GPUs, just as Intel's current products are.
Larrabee provides some graphics-specific logic in addition to the CPU cores, but not much. The paper says that many tasks traditionally performed by fixed-function circuits, such as rasterization and blending, are performed in software on Larrabee. This is likely to be a disadvantage for Larrabee, since a software solution will inevitably consume more power than optimized logic--and consume computing resources that could have been used for other purposes. I suspect this was a time-to-market decision: tape out first, write software later.
The paper says Larrabee does provide fixed-function logic for texture filtering because filtering requires steps that don't fit as well into a CPU core. I presume there's other fixed-function logic in Larrabee, but the paper doesn't say.
Larrabee's rendering code uses binning, a technique that has been used in many software and hardware 3D solutions over the years, sometimes under names such as "tiling" and "chunking." Binning divides the screen into regions and identifies which polygons will appear in each region, then renders each region separately. It's a sensible choice for Larrabee, since each region can be assigned to a separate core.
Binning also reduces memory bandwidth, since it's easier for each core to keep track of the lower number of polygons assigned to it. The cores are less likely to need to go out to main memory for additional information.
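For readers who haven't seen binning before, here is a minimal sketch of the idea in Python. It is not Intel's renderer: the tile size, the data layout, and the function name are all my own assumptions, and it uses a simple bounding-box test to decide which regions a triangle might touch.

```python
# Minimal binning sketch: list each triangle under every screen tile that
# its bounding box overlaps. The tile size, data layout, and function name
# are assumptions for illustration, not details from the paper.

from collections import defaultdict


def bin_triangles(triangles, screen_w, screen_h, tile=64):
    """Return {(tile_x, tile_y): [triangle indices]}.

    Each triangle is a sequence of three (x, y) vertices in pixels. The
    bounding-box test is conservative: it may list a triangle in a tile it
    does not actually cover, but it never misses a tile the triangle touches.
    """
    bins = defaultdict(list)
    max_tx = (screen_w - 1) // tile
    max_ty = (screen_h - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Skip triangles that lie entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= screen_w or min(ys) >= screen_h:
            continue
        tx0 = max(int(min(xs)) // tile, 0)
        tx1 = min(int(max(xs)) // tile, max_tx)
        ty0 = max(int(min(ys)) // tile, 0)
        ty1 = min(int(max(ys)) // tile, max_ty)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


if __name__ == "__main__":
    triangles = [
        [(10, 10), (50, 10), (10, 50)],    # fits inside tile (0, 0)
        [(60, 60), (130, 60), (60, 70)],   # spans three columns, two rows
    ]
    bins = bin_triangles(triangles, screen_w=256, screen_h=128)
    print(bins[(0, 0)])   # [0, 1]
    print(len(bins))      # 6
```

Once the bins are built, each tile's list can be handed to a different core, and that core only has to look at the polygons in its own list.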
The numbers crunch
The paper gives some performance numbers, but they're hard to interpret. For example, game benchmarks were constructed by running a scene through a game, then taking only widely separated frames for testing on the Intel design. In the F.E.A.R. game, for example, only every 100th frame was used in the tests. This creates an unusually difficult situation for Larrabee; there's likely to be much less reuse of information from one frame to the next.
But given that limitation of the test procedure, the results don't look very good. To render F.E.A.R. at 60 frames per second--a common definition of good-enough gaming performance--required from 7 to 25 cores, assuming each was running at 1GHz. Although there's a range here depending on the complexity of each frame, good gameplay requires maintaining a high frame rate--so it's possible that F.E.A.R. would, in practice, require at least a 16-core Larrabee processor.
In other words, unless Intel is prepared to make big, hot Larrabee chips, I don't think it's going to be competitive with today's best graphics chips on games.
Intel can certainly do that--no other semiconductor company on Earth can afford to make big chips the way Intel can--but that would ruin Intel's gross margins, which are how Wall Street judges the company. Also, Intel's newest processor fabs are optimized for high-performance logic, like that used in Core 2 processors. Larrabee runs more slowly, suggesting it could be economically manufactured on ASIC product lines... but Intel's ASIC lines are all relatively old, refitted CPU lines.
Nvidia, by comparison, gets around this problem by designing its chips from the beginning to be made in modern ASIC factories, chiefly those run by TSMC. Although these factories are a generation behind Intel's in process technology, they're much less expensive to operate. So this may be a situation where Intel's process edge doesn't mean as much as it does in the CPU business.
The Larrabee programming model also supports nongraphics applications. Since it's fundamentally just a multicore x86 processor, it can do anything a regular CPU can do. Intel's paper even uses Sun Microsystems' term, Throughput Computing, for multicore processing.
The Larrabee cores aren't nearly as powerful as ordinary notebook or desktop processors for most applications. Real Larrabee chips will likely be faster than the 1GHz reference frequency used in the paper, but they still don't have as many execution units for the scalar operations that make up the bulk of operating-system and office software. That means a single Larrabee core could feel slow even when compared with a Pentium III processor at the same frequency, never mind a Core 2 Duo.
But with such a strong vector unit, a Larrabee core could be very good at video encoding and other tasks, especially those that use floating-point math. At 1GHz, a single Larrabee core hits a theoretical 32 GFLOPS (32 billion floating-point operations per second). A 32-core Larrabee chip could exceed a teraflop--roughly the performance of Nvidia's latest GPU, which has 240 (very simple) cores.
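The peak figures are simple multiplication, as this small Python sketch shows. The function name is illustrative, and the 32 operations per cycle and 1GHz clock are the theoretical values quoted from the paper, not measured results.

```python
# Theoretical peak throughput: cores x clock x operations per cycle.
# The function name is illustrative; 32 ops/cycle and the 1GHz reference
# clock are the figures quoted in the article, not benchmark data.

def peak_gflops(cores, clock_ghz=1.0, ops_per_cycle=32):
    """Peak single-precision GFLOPS for a chip with the given core count."""
    return cores * clock_ghz * ops_per_cycle


if __name__ == "__main__":
    print(peak_gflops(1))    # 32.0
    print(peak_gflops(32))   # 1024.0
```

A 32-core part at the reference clock lands at 1,024 GFLOPS, just over the teraflop mark, and only if every vector lane is kept busy with multiply-adds on every cycle.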
But I don't expect to see that kind of performance from the first Larrabee chips. The power consumption of a 32-core design with all the extra overhead required by x86 processing would be very high. Even with Intel's advantages in process technology, such a large Larrabee chip would probably be commercially impractical. Smaller Larrabee designs may find some niche applications, however, acting as number-crunching coprocessors much as IBM's Cell chips do in some systems.
And although a Larrabee chip could, in principle, be exposed to Windows or Mac OS X to act as a collection of additional CPU cores, that wouldn't work very well in the real world and Intel has no intention of using it that way. Instead, Larrabee will be used like a coprocessor. In that application, Larrabee's x86 compatibility isn't worth very much.
The bottom line
So...what's Larrabee good for, and why did Intel bother with it?
I think maybe this was a science project that got out of hand. It came along just as AMD was buying ATI and so positioning itself as a leader in CPU-GPU integration. Intel had (and still has) no competitive GPU technology, but perhaps it saw Larrabee as a way to blur the line distinguishing CPUs from GPUs, allowing Intel to leverage its expertise in CPU design into the GPU space as well.
Intel may have paid too much attention to some of its own researchers, who have been touting ray tracing as a potential alternative to traditional polygon-order rendering. I wrote about this in some depth back in June. But ray tracing merits just one paragraph and one figure in this paper, which establish merely that Larrabee is more efficient at ray tracing than an ordinary Xeon server processor. It falls well short of establishing that ray tracing is a viable option on Larrabee, however.
Future members of the Larrabee family may be good GPUs, but from what I can see in this paper, the first Larrabee products will be too slow, too expensive, and too hot to be commercially competitive. It may be several more years beyond the expected 2009/2010 debut of the first Larrabee parts before we find out just how much of Intel's CPU know-how is transferable to the GPU market.
I'll be at Siggraph again this year, and I'll have more to say after I've read this paper through a few more times and had a chance to speak with some of the folks I know at AMD, Nvidia, and other companies in the graphics market.