
Live from Hot Chips 19: Session 3, Multicore II

Glaskowsky takes stock of GPU designs from AMD and Nvidia, the Trips prototype from UT Austin, and the Tile64 chip from Tilera.

Peter Glaskowsky
Peter N. Glaskowsky is a computer architect in Silicon Valley and a technology analyst for the Envisioneering Group. He has designed chip- and board-level products in the defense and computer industries, managed design teams, and served as editor in chief of the industry newsletter "Microprocessor Report." He is a member of the CNET Blog Network and is not an employee of CNET.

This is the fourth in a series of posts from the Hot Chips conference at Stanford. The previous installments looked at IBM's Power 6 efforts, Vernor Vinge's keynote address, and Nvidia. Other CNET coverage may be found here. This is sort of an experiment for me; I usually prefer to have time to review my work before I publish it. If you see anything wrong, please leave a comment!

The first talk in session 3 is from Advanced Micro Devices, describing the ATI Radeon HD 2900. (I checked, and AMD does still use the ATI brand name for some of its products; this is one of them.)

This is another chip I described briefly in one of my Siggraph 2007 pieces (here). The 2900 has 320 cores (which AMD calls "stream processing units") running at 742MHz--now that's a serious multicore design. AMD claims 475 billion floating-point operations per second from these cores, which appears to be a little lower than the 576 GFLOPS of the Nvidia GeForce 8800 chip I described in my previous Hot Chips post, but in practice the differences between these chips aren't so clear-cut. Each GPU can claim performance advantages on some benchmarks.
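Here's a quick back-of-the-envelope check on those peak numbers. The per-cycle flop counts and the roughly 1.5GHz Nvidia shader clock are my own assumptions about how the vendors do the math, not figures from the talks.

```python
# Back-of-the-envelope peak throughput: units x clock x flops per unit per cycle.
# The per-cycle flop counts below are my assumptions, not vendor-supplied figures.

def peak_gflops(units, clock_ghz, flops_per_unit_per_cycle):
    """Peak FLOPS (in GFLOPS) if every unit issues its maximum every cycle."""
    return units * clock_ghz * flops_per_unit_per_cycle

# Radeon HD 2900: 320 stream processing units at 742MHz, assuming each can
# issue one multiply-add (2 flops) per cycle.
amd = peak_gflops(320, 0.742, 2)       # ~475 GFLOPS

# GeForce 8800: 128 stream processors, assuming a roughly 1.5GHz shader clock
# and a multiply-add plus a multiply (3 flops) per cycle.
nvidia = peak_gflops(128, 1.5, 3)      # ~576 GFLOPS

print(f"AMD ~{amd:.0f} GFLOPS, Nvidia ~{nvidia:.0f} GFLOPS")
```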

Another point of differentiation between the AMD and Nvidia products is in the memory interface. Nvidia went with a fast 384-bit interface; AMD has a wider 512-bit interface that runs at slightly lower speeds. I give this round to AMD, since its design is more scalable and likely easier to implement at the board level, yet achieves the same bandwidth overall. On the other hand, it does require about 33 percent more pins and DRAMs.
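The bandwidth math is simple enough to sketch. The per-pin data rates below are illustrative placeholders I picked to make the point, not the actual board specs; the takeaway is just that a 512-bit bus can run its pins about a third slower and still deliver the same total bandwidth.

```python
# Peak memory bandwidth = (bus width in bits / 8 bits per byte) x per-pin data rate.
# The data rates here are illustrative placeholders, not the boards' real specs.

def bandwidth_gbs(bus_width_bits, data_rate_gtps):
    """Peak bandwidth in GB/s for a given bus width and per-pin transfer rate."""
    return bus_width_bits / 8 * data_rate_gtps

narrow_fast = bandwidth_gbs(384, 2.0)   # Nvidia-style: narrower bus, faster pins
wide_slow   = bandwidth_gbs(512, 1.5)   # AMD-style: wider bus, slower pins

print(narrow_fast, wide_slow)           # both 96.0 GB/s
print(512 / 384)                        # ~1.33: the ~33 percent extra pins and DRAMs
```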

In my previous post, I described how Nvidia requires groups of 8 cores (out of the 128 on the chip) to share a single program at any given moment. AMD took a very different approach, merging groups of 5 cores together to run a predefined combination of five instructions in each clock cycle. This makes for 64 groups of cores that execute independently.
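Here's how I'd caricature the two issue models in code. This is my own simplification, not either company's terminology or microarchitecture.

```python
# A caricature of the two issue models -- my simplification, not vendor terminology.

# Nvidia-style: a group of 8 cores executes the SAME instruction each cycle,
# each on its own data element.
def issue_nvidia_group(instruction, data8):
    return [instruction(x) for x in data8]           # one instruction, 8 lanes

# AMD-style: a group of 5 units executes a compiler-packed bundle of up to
# 5 DIFFERENT instructions each cycle, one per unit.
def issue_amd_group(bundle5, data5):
    return [op(x) for op, x in zip(bundle5, data5)]  # five instructions, 5 lanes

# 128 cores / 8 = 16 Nvidia groups; 320 units / 5 = 64 AMD groups,
# each group scheduled independently of the others.
print(issue_nvidia_group(lambda x: x * 2, list(range(8))))
print(issue_amd_group([abs, lambda x: x + 1, lambda x: x * 3,
                       lambda x: x - 1, lambda x: x * x],
                      [-3, 4, 5, 6, 7]))
```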

One thing that impresses me about GPU design is that radically different designs like these from AMD and Nvidia can still run the same games with similar performance, even with little or no GPU-specific software development from the games companies. The intermediate application-programming interfaces (APIs) such as OpenGL and Microsoft's Direct3D do an excellent job of hiding this complexity from the application developers.

The Intel presentation on its "Teraflops Prototype Processor with 80 Cores" was... well... redundant, to say the least. Intel has given many previous talks on the same subject. If you're interested in the Tera-Scale project, you've probably already heard about it.

This presentation, following those from Nvidia and AMD so closely, invites a comparison. Intel has fewer cores (nominally 80, but in fact Nvidia or AMD might count them as 160, since each of the Intel cores has two math pipelines). Intel's chip runs much faster than any GPU, around 5GHz--that's 3.3 times Nvidia's clock speed. (The Tera-Scale chip is a laboratory demo, so it's hard to say how fast it would be in a production-qualified version.)
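For what it's worth, the same peak-FLOPS arithmetic I used for the GPUs works out to roughly 1.6 teraflops at that clock, if each of the two pipelines per core can retire one multiply-accumulate (two flops) per cycle--my assumption about how the peak gets counted, not a number from the talk.

```python
# Rough peak estimate for the 80-core chip; assumes each core's two math
# pipelines each retire one multiply-accumulate (2 flops) per cycle, which is
# my assumption about how the peak is counted.

def peak_gflops(cores, pipes_per_core, flops_per_pipe_per_cycle, clock_ghz):
    return cores * pipes_per_core * flops_per_pipe_per_cycle * clock_ghz

print(peak_gflops(80, 2, 2, 5.0))   # ~1600 GFLOPS, i.e. ~1.6 teraflops
```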

That all adds up to higher performance than the AMD and Nvidia products, which is enough to justify Intel's pride in the thing, but otherwise it's somewhere between a research project and a publicity stunt.

The Trips project
The next presentation was all research and very little publicity: "TRIPS: A Distributed Explicit Data Graph Execution (EDGE) Microprocessor" from a team at the University of Texas at Austin. The Trips project has been around for several years (I met with the team in Austin in 2003). The project has been making fairly slow progress, but earlier this year the team launched a prototype system.

I suppose I'd better explain what Trips is, since that presentation title is entirely non-obvious. A Trips processor consists of multiple arrays of execution units (two 4x4 arrays in the first chip). Programs are broken down by the compiler to figure out the "data graphs"--that is, how data flows within the program. These graphs are mapped onto the array and executed. In theory, this approach allows general-purpose software--the same stuff you run on your PC today--to run faster than it does on existing PC processors.
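To make the idea concrete, here's a toy dataflow execution of a tiny expression. This is only a sketch of the general data-graph concept; it is not TRIPS's actual block format, ISA, or compiler output.

```python
# Toy data-graph execution of d = (a + b) * (a - c).
# Illustrates the dataflow idea only; not TRIPS's actual block format or ISA.
import operator

# Each node: (operation, names of its inputs). Inputs arrive as named values.
graph = {
    "t1": (operator.add, ("a", "b")),
    "t2": (operator.sub, ("a", "c")),
    "d":  (operator.mul, ("t1", "t2")),
}

def execute(graph, inputs):
    values = dict(inputs)
    remaining = dict(graph)
    while remaining:
        # Fire every node whose inputs are ready; independent nodes like
        # t1 and t2 could run on different execution units in the same cycle.
        ready = [name for name, (_, srcs) in remaining.items()
                 if all(s in values for s in srcs)]
        for name in ready:
            op, srcs = remaining.pop(name)
            values[name] = op(*(values[s] for s in srcs))
    return values

print(execute(graph, {"a": 6, "b": 2, "c": 1})["d"])   # (6 + 2) * (6 - 1) = 40
```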

The key metric here is IPC: instructions per cycle. Even with the ability to execute three or four instructions per cycle under ideal conditions, today's processors average just one or two instructions per cycle due to memory delays. The idea behind Trips is that the compiler can remove some of the memory transactions, schedule the remaining ones better, and achieve substantially higher IPC figures.
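A made-up example shows how quickly memory stalls eat into IPC, even on a wide machine. The miss rate and penalty below are numbers I picked for illustration, not measurements from the talk.

```python
# Why average IPC lands well below the issue width: illustrative numbers only.

def average_ipc(instructions, issue_width, miss_rate, miss_penalty_cycles):
    """IPC after adding memory-stall cycles to the ideal issue time."""
    ideal_cycles = instructions / issue_width
    stall_cycles = instructions * miss_rate * miss_penalty_cycles
    return instructions / (ideal_cycles + stall_cycles)

# A 4-wide machine where 2 percent of instructions miss the cache for 30 cycles:
print(round(average_ipc(1000, 4, 0.02, 30), 2))   # ~1.18 instructions per cycle
```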

This presentation was primarily about the Trips prototype hardware, but it did give some IPC figures. These results are promising but not yet compelling. Compared with an Intel Core 2 Duo processor, the Trips prototype chip can run some real-world software faster...and some slower. The big issue is that the Trips compiler is not as advanced as the team would like. I think we'll have to keep an eye on this project for a while longer before the potential of the Trips architecture becomes clear.

The fourth and final presentation in the session was from Tilera, a Silicon Valley start-up: "The Tile Processor Architecture: Embedded Multicore for Networking and Digital Multimedia." Tilera has received some publicity recently (for example, this CNET article).

For me, Tilera's marketing message is like a blast from the past. I covered quite a few of these "array processors" for Microprocessor Report between 1996 and 2004. They all claimed to break the traditional multicore bottlenecks--communication among the cores and out to memory, programmability, etc.--and yet, few of them really did.

The few exceptions were in graphics (where it's relatively easy to find and exploit parallelism in the software) and networking. The networking market, however, is dominated by highly optimized ASICs (application-specific integrated circuits) and network processors. Multimedia processing also provides a lot of natural parallelism, but in practical terms that market is split between ASICs and graphics chips.

So what is Tilera touting? Networking and multimedia. (I think they're too smart to have any aspirations in the graphics market.) The presentation slide on Tilera's "Key Innovations," in my opinion, identifies nothing that hasn't already been done. Tilera's first chip, the Tile64, may yet find some market opportunities, but they're walking a well-trodden path and few of the pioneers they're following are still alive. I wish Tilera well, but my experience in this field has made me too cynical to be particularly hopeful.

The next session is on embedded systems and video processing. I think I'll give my wrists a rest and resume blogging later this evening during the evening panel session on process technology.