Taming the supercomputer

IBM big-iron visionary Tilak Agerwala wants to take the pain out of supercomputing.

Michael Kanellos Staff Writer, CNET News.com

Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas.

Dec. 17, 2003 4:00 a.m. PT

Tilak Agerwala, vice president of systems at IBM's T.J. Watson Research Center, is trying to take some of the pain out of supercomputing.

In a variety of projects with universities and national laboratories, Agerwala is working on ways to reduce the cost not only of installing but also of using supercomputers and high-performance clusters.

In the PERCS (Productive, Easy-to-use, Reliable Computing System) program, for instance, the idea is to build a machine that can optimize itself with a variety of applications, which should cut down the time required to develop the underlying programs needed to conduct complex research projects. It will operate at a petaflop, or one quadrillion calculations per second.

In the TRIPS (Tera-op Reliable Intelligently Adaptive Processing System) project, IBM and the University of Texas are working on a "supercomputer on a chip" that will be capable of running 1 trillion operations a second.

Agerwala spoke with CNET News.com about the challenges system architects face and the next horizon of big problems.

Q: Can you give a quick overview of your job?
A: I am responsible for developing the advanced hardware and software technology for servers, supercomputers and embedded systems.

It is a pretty wide range of disciplines--we go all the way from circuits to design automation tools to microprocessor architecture to operating systems and an on-demand operating environment.

It seems that system architecture, particularly on the hardware side, is more active than normal. Is that the case or just my imagination?
I would say that these are very exciting times. One thing happening in high-performance computing is that it really demands special attention now, because it is playing a central role in contemporary science and engineering that is kind of without precedent.

This is driven by exponential improvement in performance. We are able to consistently solve more complex problems more frequently and at lower cost. It is possible to have three different kinds of impact. One is to solve complex problems to enable economic growth, to advance industry and science and to address the mission-critical, computationally intense problems of the nation.

I think that there is a confluence of events that is saying that the high-performance community can actually address both the national-security kind of issues and accelerate economic growth.

How is that embodied in IBM's supercomputing research?
Again, our goal, really, is threefold. We want to solve complex problems to advance economic growth, we want to advance science and engineering, and we want to address mission-critical, computationally intense problems. This translates into several different elements for us: solving complex problems at lower costs, solving problems that are optimized for the broadest set of applications and working with government and academia.

The first part of our strategy is to aggressively improve our Power-based, high-performance computing product line. The advantage of this approach is that it optimizes existing models and techniques. You are kind of leveraging Moore's Law and leveraging software standards. ASCI Purple is our flagship product in this area.

The second part of our strategy is to develop high-performance clusters that are based on high-volume, low-cost building blocks, like standard processors and interconnects.

The third part of the strategy is to do advanced research and development, with novel approaches and design points.

Would that be like Blue Gene L?
Yes, Blue Gene L, TRIPS and PERCS.

Blue Gene L is actually much more in the category of the next-generation cluster, except that it really pushes the limit of massive parallelism. It combines a new processor design and an innovative network, and it optimizes for performance, compute density and energy consumed. It really is a very cost-effective way of computing, so in that sense, it is very different from the Earth Simulator.

And PERCS?
In PERCS, we are developing the advanced hardware and software technology that will be needed for our mainstream commercial systems in the 2010 time frame. With Blue Gene, we will deliver a third of a petaflop to Lawrence Livermore National Laboratory at the end of next year. PERCS is further out in time. Once again, the commercial viability is a goal.

With PERCS, we are investigating highly adaptable systems that will configure their hardware and software components to match application demand. This adaptability enhances the technical efficiency of the system and its ease of use. The goal here is to accommodate a large set of high-performance computing and commercial workloads.

How should we think of this? As a system in which processors can be switched almost instantaneously to different operating systems or applications?
It is not just about the processor. How do you adapt the cache structure? How do you adapt and reconfigure what is going on in the chip to meet the dynamic requirements of the application? It is a bunch of different things that the teams are going to have to look at and come to some conclusion over the next year or so about what the right approach is.

Ultimately, what sort of problems will this solve?
It is also about reducing the time it takes to come up with (applications). PERCS will include pretty sophisticated compilers and middleware that will be supported by some of the hardware features to automate many phases of the program development process.

The P in HPCS -- the P is for productivity, not performance. (HPCS stands for High Productivity Computing Systems and is part of a $146 million project created by the Defense Advanced Research Projects Agency to develop new supercomputer architectures by 2010.) So our goal here is to have a computer that is able to adapt itself to different kinds of application requirements.

I just want to emphasize that the next phase is an investigation phase. We will be developing these technologies and be able to talk more intelligently about what impact this is going to have. There is an overall blueprint, but there is a lot of investigation still to be done to see how close we can come to that goal. There are some fundamental problems that will have to be addressed.

Are there any major scientific problems you would like to see solved with your systems? We've cracked the genetic code, the origin of the big bang. What's the next big problem?
I've had a long history in this area, and one of the reasons is that there is such a great potential to make advances. This might sound a little cliche, but we can and are having this kind of impact. Weather forecasting is one of them. I am deeply interested in life sciences. The Blue Gene program started out as a way to solve a grand challenge, protein folding. We are also trying to solve problems for the nuclear stewardship program so that we can validate the safety of our nuclear stockpile without having to test the weapons.