Intel's James Reinders on parallelism: Part 1
Intel exec talks about how to think about performance in a parallel programming environment, and why such environments give developers headaches.
Multicore processors are here to stay and the number of cores that we'll see packed onto a single chip is only going to increase. That's because Moore's Law is only indirectly about performance; it's directly about increasing the number of transistors. And, for a variety of reasons, turning those transistors into performance today largely depends on cranking up the core count.
There's a downside to this approach though. Programs that consist of a single thread of instructions can only run on a single core. This in turn means that they're not going to get much faster no matter how many cores a chip adds. Running faster means going multi-threaded--splitting up the task and working on the different pieces in parallel. The problem is that programming multi-threaded applications introduces complications that don't exist with single-threading.
These complications and ways to overcome them was the topic of my conversation with James Reinders at the Intel Developers Forum in September. Reinders is the director of marketing and business for Intel's Software Development Products. He's an expert on parallelism and his most recent book covered the C++ extensions for parallelism provided by Intel Threaded Building Blocks.
In part 1 of this discussion we talked about how to think about performance in a parallel programming environment, why such environments give developers headaches, and what can be done about it.
Reinders began by noting that developers fall into roughly two groups when it comes to parallel programming: those who are still concerned about ultimate performance even in a parallel world and those who are just looking for a way to deal with it at all.
The challenge is understanding what we're trying to introduce, how to use parallelism, but with programmer efficiency. Because programmers don't need yet another thing to worry about. There's plenty of those out there.
And we need to be a little more relaxed about the performance. The people who start asking me about efficiency in every last cycle used and such--I characterize them as people we need to talk to more about our high-performance computing-oriented tools that give you full control. And other people are "I don't even know how to approach parallelism." I think there is a different set of ways to talk about the problem.
The problems with this second group comes down to the fact that most programmers are used to dealing with something called "sequential semantics." A detailed description of programming semantics is a complex computer science topic but, at a high level, sequential semantics means more or less what it sounds like it sounds; instructions follow one after another and execute in the order that they are written.
If you store the number "1" in variable A, then store the number "2" in variable B, and then add them together in a third instruction, you can be confident that the answer will be "3." It won't depend on timing vagaries that might have caused the addition to happen before the stores. Most people start out programming sequentially using languages designed for that purpose.
Parallel programming, on the other hand, introduces concepts like data races (the answer is dependent on the timing of other events) and deadlocks (in which two threads are each waiting for the other to complete so that neither ever does). Here's Reinders:
If you've ever managed and got a bunch of people working on a project together, one of the headaches you get is coordinating with each other. What did Fred say to Sally? They're doing things out of order or whatever. Parallel programming can give you that same sort of headache.
The programming terminology you'll hear the compiler people use is "sequential semantics." One of the interesting areas is what can we do if we ensure sequential semantics. We recently acquired a team in Massachusetts who were working for a company called Cilk Arts.
Our hope is that Cilk can do a subset of what Threaded Building Blocks [TBB] can but preserve sequential semantics. We think we can do sequential semantics, do a subset of what TBB does, since we're introducing keywords into the compiler--that has some disadvantages because it's not as portable--but we think we might be able to magically give you sequential semantics and not give up performance. That's a big if.
Now why would we invest in that?
Because there are a lot programmers who have been getting along just fine with sequential programming. But when you tell them to add this or that for parallelism, a big thing that trips them up is that you no longer obey sequential semantics; you have more than one thing running around and you get data races, deadlocks, and it doesn't feel comfortable.
Now some people will argue that you need to do these things to get good performance. We have the feeling that in some cases you don't need to take that big of a leap to get pretty good performance.
And no one's going to criticize your app on a quad core for being only 70 percent efficient.
From there we moved on to data parallelism which focuses on distributing data across processing elements. It contrasts with the task parallelism that we commonly associate with the term parallel programming. Pervasive DataRush is one commercial product based on a data parallelism model. APL, the language with the strange symbols (for those with long memories), is often considered the first data parallel language. There have been a variety of others, often extensions to more conventional languages like C and FORTRAN, but none were widely used.
Data parallelism just takes it one step further. Data parallelism is all about the parallelism in the data. So you're talking about the data when you program.
And once you start talking about the data, the tools underneath can move the data around. Leaving the data management up to the programmer [as with Cilk and TBB] turns out to be a terrific headache. This applies equally to a cluster where they don't share memory or a GPU and a CPU in the same system.
But a language like RapidMind or Ct can address that problem. And CUDA and OpenCL can too [frameworks primarily oriented towards heterogeneous processing that uses graphics cores for computing tasks] but RapidMind and Ct are at a much higher level of abstraction which means that we're betting on the idea that we can attract more developers and give up some efficiency.
Part 2 of our conversation will cover cloud computing, functional and dynamic languages, and what needs to happen with respect to programmer education.