Live from Hot Chips 19: Session 2, Nvidia
Glaskowsky summarizes session 2 at Hot Chips.
Welcome back to the ongoing Speeds and Feeds coverage of Hot Chips 19 at Stanford. They give us comfy chairs and free Wi-Fi, so blogging about it is the least I can do. By the way, Dean Takahashi of the San Jose Mercury News is also blogging from Hot Chips, so his coverage offers another perspective on the event.
Session 2 is the first of two sessions of "Multi-Core and Parallelism" presentations. This one happens to be all about Nvidia. Session 3, up next, will include presentations about AMD's ATI Radeon HD 2900, Intel's 80-core "Tera-Scale" processor, the TRIPS project at the University of Texas at Austin, and the Tile Processor from Tilera.
The first presentation in this session, "The Nvidia GeForce 8800 GPU," is an overview of that chip. As I mentioned in my Siggraph coverage, the 8800 includes 128 processor cores, but there's more to say about it than that.
Unlike the cores in a conventional multicore processor, the cores on a GPU are often all doing the same thing. The 8800 therefore runs a single program on each group of eight cores. The cores in a group can be out of step with each other, which makes the 8800 more flexible than old lock-step SIMD (single instruction, multiple data) designs, but if fewer than eight copies of a given program are needed at a given moment, some of the 8800's 128 cores will sit idle.
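The idle-core effect can be put in numbers with a deliberately simple model. The sketch below assumes, as described above, that cores are handed out in groups of eight per program. The function name `idle_cores` and the all-or-nothing allocation rule are illustrative assumptions, not a description of Nvidia's actual scheduler.

```python
import math

GROUP_SIZE = 8  # cores that share one program, per the presentation


def idle_cores(copies_needed, group_size=GROUP_SIZE):
    """Cores left idle when a program needs `copies_needed` copies.

    Assumption: cores are allocated in whole groups, so the last group
    may be only partly used.
    """
    if copies_needed < 0:
        raise ValueError("copies_needed must be non-negative")
    groups = math.ceil(copies_needed / group_size)
    return groups * group_size - copies_needed


print(idle_cores(5))   # 3: one group of eight, five cores busy
print(idle_cores(16))  # 0: two full groups
```

Under this model, work that arrives in multiples of eight wastes nothing, while a program that needs only one copy leaves seven cores of its group unused.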
For a single chip, the totals are substantial: 576 billion floating-point operations per second in these cores, 104 GB/s of memory bandwidth, and 150W of typical power consumption, all in the service of advanced 3D games and other graphics-hungry applications.
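As a sanity check, the headline figures can be reproduced with back-of-the-envelope arithmetic. The 1.5 GHz core clock, the three floating-point operations per core per clock, and the 384-bit memory interface at 2.16 billion transfers per second are assumptions chosen for illustration. None of them is stated in this summary, and they match the fastest 8800 variant most closely.

```python
def peak_gflops(cores, clock_ghz, flops_per_clock):
    """Peak throughput in billions of floating-point operations per second."""
    return cores * clock_ghz * flops_per_clock


def bandwidth_gb_per_s(bus_bits, gigatransfers_per_s):
    """Peak memory bandwidth: bytes per transfer times transfers per second."""
    return bus_bits / 8 * gigatransfers_per_s


# Assumed values (not from the presentation): 1.5 GHz, 3 flops/clock/core.
print(peak_gflops(128, 1.5, 3))                  # 576.0
# Assumed values: 384-bit bus, 2.16 GT/s effective memory data rate.
print(round(bandwidth_gb_per_s(384, 2.16), 1))   # about 103.7
```

With a lower clock or fewer operations per clock, the same formula gives a proportionally lower peak, which is why the quoted number depends on which board is being described.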
The title of the second presentation is just as self-explanatory: "The Nvidia GPU Parallel Computing Architecture & CUDA Programming Model." CUDA (Compute Unified Device Architecture) supports high-level programming of these complicated chips in the C language, so software developers don't have to manage all the low-level hardware details.
CUDA implements a straightforward multithreaded programming model. Developers write software as if it will run on just one processor at a time. There are some restrictions on data access and data sharing, but most of the GPU complexity is hidden. Complete applications are built by combining many of these single-thread programs--potentially thousands of them--and defining when and how these threads are used, and what data they consume and produce.
The critical achievement of CUDA is that programmers write one program for all GPU sizes, which matters because Nvidia makes versions of the GeForce 8000 family with different numbers of cores. Programs don't even know how many cores they're using. CUDA programs work with the hardware to distribute the running threads across the available cores.
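The idea of writing a single-thread program and letting the system spread it over however many cores exist can be mimicked in a few lines. What follows is a toy Python stand-in, not CUDA code: `scale_and_add`, `launch`, and `run` are hypothetical names, the round-robin assignment is an assumption, and everything actually executes sequentially.

```python
def scale_and_add(i, a, x, y, out):
    """Per-thread program: computes one output element and nothing else."""
    if i < len(out):  # threads past the end of the data do nothing
        out[i] = a * x[i] + y[i]


def launch(kernel, n_threads, block_size, n_cores, *args):
    """Toy scheduler: deal blocks of threads out to cores round-robin.

    Only the block-to-core assignment depends on n_cores; the kernel
    never sees it.
    """
    n_blocks = (n_threads + block_size - 1) // block_size
    assignment = {core: [] for core in range(n_cores)}
    for block in range(n_blocks):
        assignment[block % n_cores].append(block)
    for blocks in assignment.values():
        for block in blocks:
            for t in range(block_size):
                kernel(block * block_size + t, *args)
    return assignment


def run(n_cores):
    """Run the same program on a simulated GPU with n_cores cores."""
    x = [float(i) for i in range(100)]
    y = [1.0] * 100
    out = [0.0] * 100
    launch(scale_and_add, 100, 32, n_cores, 2.0, x, y, out)
    return out


print(run(n_cores=2) == run(n_cores=16))  # True
```

The per-thread function is identical in both runs. Only the launcher knows the core count, which is the property the presentation emphasized.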
The final presentation in the session covers issues that arise when running non-graphics applications on Nvidia GPUs. The title is a mouthful: "Performance Insights on Executing Non-Graphics Applications on CUDA on the Nvidia GeForce 8800 GTX." The lead presenter was Professor Wen-mei Hwu of the University of Illinois at Urbana-Champaign, who has been working with Nvidia in this area.
Nvidia's GPUs are designed to support such apps, and Nvidia even makes a family of boards and systems, Tesla, exclusively for non-graphics use.
Depending on the software, however, the GPU is not necessarily a good platform for non-graphics uses. Apps that are inherently parallel with streaming data flows are good; apps with many serialized operations, especially where conditional tests control the flow of execution, aren't so good.
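One standard way to see why serialized operations hurt so much is Amdahl's law, which the presentation summary does not mention and is brought in here only as an illustration. The sketch assumes an idealized machine in which the parallel part of a program speeds up perfectly with core count and the serial part does not speed up at all.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Hypothetical workloads on a 128-core chip.
for p in (0.50, 0.95, 0.999):
    print(f"{p:.3f} parallel: {amdahl_speedup(p, 128):.1f}x")
```

Even in this idealized model, a program that is 95 percent parallel can never run more than 20 times faster, however many cores are available, because the remaining 5 percent sets the floor on run time.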
The presentation analyzed three sample applications:
- MRI (magnetic resonance imaging) image reconstruction
- Fluid dynamics
- H.264 video encoding
Although all three of these applications are parallelizable to a certain extent, they have different levels of suitability for the GeForce 8800 architecture.
The MRI processing Hwu described runs 416 times faster on an 8800 than on an Athlon 64 2800+ (which I must point out is not a very modern microprocessor; it shipped in 2004).
The fluid-dynamics code is the LBM benchmark from the SPEC CPU2006 suite. This code runs only about 12 times faster on a GPU than on a CPU because of non-ideal memory usage and thread synchronization.
Finally, the H.264 code runs about 20 times faster, but this algorithm is also not well optimized for GPUs.
This wide range of performance, even on inherently parallel applications, shows how sensitive GPUs are to algorithms and implementation details. This situation is likely to improve over time--Hwu himself made specific recommendations about how to improve GPU suitability for these algorithms--but there will likely always be applications that run more efficiently on general-purpose processors than on GPUs. Horses for courses, as they say.