Wrapping up Speeds and Feeds, part 2: Reliability

Personal computers aren't as reliable as they could be. Creating truly reliable PCs will take a lot of work and a growing share of the system transistor budget, but it'll be worth the cost.

Peter Glaskowsky
Peter N. Glaskowsky is a computer architect in Silicon Valley and a technology analyst for the Envisioneering Group. He has designed chip- and board-level products in the defense and computer industries, managed design teams, and served as editor in chief of the industry newsletter "Microprocessor Report." He is a member of the CNET Blog Network and is not an employee of CNET.

Personal computers have become much more reliable over the last 10 years or so, mostly due to the introduction of advanced operating systems with memory protection and hardware abstraction. The hardware itself has gotten better too; uncorrectable random errors are rare in PCs and extraordinarily rare in server-class systems.

These and other improvements have largely eliminated machine crashes. Blue-screen errors on Windows and kernel panics on Linux and Mac OS X still occur, but far less often than they once did.

Error-reporting services have become common, helping software developers figure out what went wrong. Most large developers now issue regular patches to fix newly discovered bugs, making systems more reliable between major releases.

All this progress is wonderful, of course, but our PCs still aren't reliable in the way that other consumer products are reliable. Machine crashes are still possible, and any bug can bring down an individual application.

Automobiles, for example, can fail in many ways, but they are still much more reliable than PCs. The risks associated with vehicle failures have been greatly reduced by decades of design refinements. Would you feel safe if PC technology controlled the steering and brakes in your car? Conversely, wouldn't you be more confident in your PC if you knew it was as reliable as your vehicle?

[Image: the Lagoon Nebula] Can you rely on your system to display this 370-megapixel image? Credit: European Southern Observatory (ESO)

PCs are also fragile in response to change. I know I'm always a little nervous the first time I install a new device driver or run a new application. Even without software changes, opening an unusually large image can induce some trepidation. Consider this 370-megapixel image of the Lagoon Nebula available from the European Southern Observatory Web site; how confident are you that all of your image-viewing programs would survive the attempt to open it?

And worst of all, PCs are fragile in response to attack. The kinds of problems that are sometimes created accidentally by software bugs are relatively easy to create on purpose.

Minimizing the frequency and consequences of these problems would require tremendous effort from everyone in the industry. Almost every bit of PC hardware and software would have to change. One part of the solution is an extension of the same techniques that make today's PCs more reliable than older models: more hardware-based isolation of one function from another.

The minimal isolation of today's systems is very convenient for software developers, making it easier to write code and achieve high levels of performance. More isolation means more complexity and more overhead, but it improves reliability.

Developers are taking the first steps in this direction already, for example, with the process isolation features of the Microsoft Internet Explorer 8 and Google Chrome browsers. But there's much more that can be done.
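The idea behind that browser architecture can be sketched in a few lines. This is a minimal illustration, not how IE8 or Chrome is actually implemented: a hypothetical "renderer" runs in its own process, so when it dies, the main program merely notices the failure instead of crashing with it.

```python
import subprocess
import sys

# A stand-in for a browser's rendering engine (hypothetical).
# A "bad" input makes it die abruptly, the way a native-code
# bug would kill a real process.
RENDERER = r"""
import os, sys
data = sys.argv[1]
if data == "bad":
    os._exit(1)  # simulate a hard crash, not a catchable error
"""

def render_isolated(data):
    """Run the renderer in a separate process. If it crashes,
    only that process dies; the caller just sees a bad exit code."""
    result = subprocess.run([sys.executable, "-c", RENDERER, data])
    return result.returncode == 0  # True if the render survived

if __name__ == "__main__":
    print(render_isolated("ok"))   # normal completion
    print(render_isolated("bad"))  # child crashed; parent lives on
```

The trade-off mentioned above is visible even here: launching and talking to a separate process costs far more than an ordinary function call, which is the overhead price paid for the isolation.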

Another way to improve reliability is to verify that data and addresses are consistent in range and format with the original intent of the software developer before they are used by the program. Making these checks in software can help; the incidence of failures related to accidental and deliberate buffer-overflow conditions has been dramatically reduced in this way. There's plenty of room for new hardware to help in this process too.
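As a toy example of the kind of check described above (the names and the flat pixel-array layout are my own invention for illustration), a routine can reject out-of-range coordinates before they ever become a memory address:

```python
def read_pixel(pixels, x, y, width, height):
    """Validate coordinates against the image's declared bounds
    before indexing -- the software form of the range checks
    described above."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} image")
    return pixels[y * width + x]

# A 3x2 "image" stored as a flat list of six pixel values.
pixels = list(range(6))
read_pixel(pixels, 2, 1, 3, 2)   # in range: returns the last pixel
# read_pixel(pixels, 3, 0, 3, 2) would raise ValueError instead of
# silently reading past the end of the buffer
```

Done in software, every access pays for the comparison; hardware that performs the same check in parallel with the access is one way to get the safety without the speed penalty.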

There's also work to be done in making it easier to recover from failures, since true hardware failures are inevitable. This is another area where some high-end systems are way ahead of the PC. Fault-tolerant machine architectures have been around for a long time in the aerospace industry, for example.

Historically, fault tolerance was never practical on the PC because PCs had only one of each critical subsystem: one processor, one bank of memory, one display channel. Today, PC processors and graphics chips have multiple cores and multiple memory interfaces, creating the potential for redundant operation where it's most needed.
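One classic aerospace form of that redundancy is majority voting: run the same computation on multiple units and let agreement mask a single faulty result. A minimal sketch of the voting step (the function name is mine; real implementations do this in hardware):

```python
from collections import Counter

def vote(results):
    """Majority vote over redundant computations, as in triple
    modular redundancy: the value most units agree on wins,
    masking a single faulty unit."""
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority -- uncorrectable fault")
    return winner

# Three cores compute the same value; one misbehaves.
vote([42, 42, 41])  # the two good results outvote the bad one: 42
```

The cost is stark: three cores doing one core's work. That is exactly the kind of expense that only becomes tolerable when transistors are plentiful.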

Recoverability also implies backups--not just of the contents of disk drives, but even of the live data in memory through checkpointing. And disk backups can be improved too, by making the backup process an integral part of all disk I/O. Modern file systems use journaling to increase reliability; this technique can be extended to allow recovery from errors long after they occur.
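The journaling idea reduces to a simple discipline: record each intended update before applying it, so the state can be rebuilt afterwards. A toy sketch of the principle (an in-memory stand-in, not a real file system; all names are hypothetical):

```python
class JournaledStore:
    """Toy write-ahead journal: log every update before applying it,
    so any earlier state can be reconstructed by replaying the log."""

    def __init__(self):
        self.journal = []  # durable record of intended updates
        self.data = {}     # the "disk" contents

    def write(self, key, value):
        self.journal.append((key, value))  # 1. log the intent
        self.data[key] = value             # 2. apply the change

    def replay(self, upto=None):
        """Rebuild the store from the journal. Replaying only a
        prefix recovers the state as it was at that earlier point --
        long after the error occurred."""
        restored = {}
        for key, value in self.journal[:upto]:
            restored[key] = value
        return restored
```

Extending the journal's lifetime, as the paragraph above suggests, turns it from a crash-recovery tool into a rewind mechanism: the same log that repairs an interrupted write can also undo a mistake discovered much later.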

There will be a heavy price to be paid in complexity and performance for all of these techniques, but the currency for this payment is transistors, and Moore's Law gives us more of those in every new process generation. We need to consider how we want to allocate these transistors. Over time, I believe reliability should account for an increasing portion of them.