Speeds and Feeds

November 2, 2009 5:45 AM PST

Tilera's balancing act: 100 cores vs. market realities

by Peter Glaskowsky

While we're all familiar with the steady increase in the number of cores in mainstream PC and server processors, the corresponding progress in the embedded-processor market has been anything but steady.

With mainstream PC microprocessors standardizing on four-core designs such as Intel's Core i7 and leading-edge server chips ranging from 8 to 16 cores, single-core chips are no longer competitive. For embedded systems, however, one core may still be the right answer; if more are needed, the choices range up into the hundreds.

Tilera Tile-Gx100

The Tilera Tile-Gx100 combines 100 independent 64-bit integer processor cores and cryptographic accelerators with memory, network, and PCI Express interfaces.

(Credit: Tilera Corporation)

The latest announcement in the many-core embedded processor market is Tilera's Tile-Gx family, which combines 16 to 100 64-bit integer processor cores with cryptographic accelerators and off-chip interfaces for memory, networking, and PCI Express. I met with Tilera before last week's announcement to discuss the technical and business issues related to the Tile-Gx.

The technical details
San Jose, Calif.-based Tilera is eager to set itself apart from the many other chip companies competing in its target markets. Unlike most embedded processors with high core counts, for example, Tilera's design allows its cores to operate truly independently, even to the extent of running different operating systems if needed. More commonly, groups of tiles will be combined to run a single task that is part of a larger workload. In this way, one chip can operate like a cluster of multiprocessor systems.

Between this distinction and the fact that cores in the Tile-Gx family are a full 64 bits wide, Tilera claims the Tile-Gx100 is the "world's first 100-core processor." Personally, I think that claim is a little too broad, since companies such as ClearSpeed and Xelerated have previously made similar claims for their chips. Even more significantly, the Tile-Gx100 doesn't exist yet. It won't be a real product until early 2011, according to Tilera's current schedule.

Tile-Gx processors aren't something most CNET readers will ever knowingly use, though these chips will likely, eventually, help carry traffic over the public Internet and through larger corporate networks. But they do provide an excellent example of the issues facing PC processor vendors as core counts continue to grow.

Consider the Tile-Gx100 block diagram shown above. It's easy to imagine that this chip can get a lot of work done. Every core can run up to three instructions per cycle at up to 1.5GHz. It has dedicated hardware accelerators for cryptography and network packet processing. The network interfaces can implement up to eight 10Gb Ethernet ports. The chip also has four DDR3 memory interfaces; to reduce DRAM accesses, every core has 320KB of local cache memory. (The total amount of cache memory in the Tile-Gx100, about 32MB, matches that of IBM's Power7 processor, which has only eight cores.)
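As a rough check on these figures, here is a minimal Python sketch that uses only the numbers quoted above. The variable names are illustrative, and the peak issue rate is a theoretical ceiling that assumes every core issues three instructions on every cycle, which real workloads will not sustain.

```python
# Back-of-the-envelope figures for the Tile-Gx100, from numbers quoted in the text.
CORES = 100
ISSUE_WIDTH = 3            # peak instructions per cycle per core
CLOCK_HZ = 1.5e9           # top quoted clock rate
CACHE_PER_CORE_KB = 320

# Theoretical ceiling: every core issues at full width on every cycle.
peak_instr_per_sec = CORES * ISSUE_WIDTH * CLOCK_HZ

total_cache_kb = CORES * CACHE_PER_CORE_KB
total_cache_mb = total_cache_kb / 1024   # binary megabytes

print(f"Peak issue rate: {peak_instr_per_sec / 1e9:.0f} billion instructions/s")
print(f"Total cache: {total_cache_kb} KB, about {total_cache_mb:.2f} MB")
```

The cache total works out to 32,000KB, a little over 31MB in binary units, which is consistent with the "about 32MB" figure.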

The need for balance
It's not so easy to keep all these resources busy, however. The more complicated a chip gets, generally speaking, the more difficult it becomes to make full use of its resources. This is what we often call the balance between hardware and software.

Tilera will offer four products in the Tile-Gx family with 16, 36, 64, and 100 cores and corresponding differences in memory and networking support. This range of products helps meet the needs of different applications, but each product still needs a particular balance of application requirements for maximum efficiency.

So here lies Tilera's great challenge--finding software applications that need a large amount of CPU performance and that also:

1. Are highly parallel, so they can keep many cores busy.
2. Don't need much (if any) floating-point math, since the Tile-Gx doesn't do that.
3. Can benefit from cryptographic acceleration.
4. Consume large amounts of network bandwidth.

Tilera wants customers to think of its chips as "general-purpose" processors, but as this list shows, they're better for some purposes than for others. As PC processors reach higher core counts and integrate more functionality, they too will become more sensitive to application requirements. Integration eventually reaches a point where additional complexity adds no practical value. And the closer PC processor vendors approach that limit, the more difficult it will become to sell their latest, greatest, most complicated chips.

Network processing is the most natural fit for Tilera's capabilities, particularly high-level services like virus scanning as I discussed in September (see "Insatiable demand for mobile data challenges industry"). Internet service providers rarely provide such services for PC users, since PCs can do their own scanning--but mobile phones and other Internet appliances often can't, so these services are seeing increasing demand.

The networking market, unfortunately, is not large enough to support a company like Tilera. Although there is a lot of networking equipment sold each year, each company in the business has its own ideas about how this processing should be done. A single chip design could never capture the majority of this potential demand.

Further, the larger equipment vendors often have policies in place against relying too heavily on individual suppliers, especially small start-ups. They will commonly design different products using different chip-level technology so that the failure of a single supplier--or the purchase of a supplier by a competing equipment vendor--will have only a limited effect on their bottom line.

New business opportunities
Tilera is working to develop new markets for its current TilePro and future Tile-Gx parts. The most significant of these new markets is cloud computing, which may favor solutions like Tilera's that offer higher performance per watt.

That's the metric Tilera touts most heavily for the Tile-Gx, promising 10 times the performance per watt of Intel's Westmere-EP, a six-core 32nm processor aimed at high-efficiency servers that Intel will begin shipping in 2010. (Incidentally, I commend Tilera for making this comparison; Westmere-EP is exactly what they'll be competing against. Too often, chip companies will try to make themselves look better by comparing next year's products with last year's competition.)

Although 10x is a critical multiplier in this business (see my post "The factor factor"), such an advantage doesn't necessarily guarantee success. Tilera has done everything it can to minimize the difficulties associated with software development by adopting industry-standard development tools such as GCC and Eclipse, but its Tile chips still can't run Windows and it just can't match the developer relationships that companies like Advanced Micro Devices and Intel have established.

Fortunately, Tilera is small and relatively efficient for a chip company. Last month, Tilera announced that Quanta Computer invested $10 million in the company based on Quanta's interest in cloud computing. Tilera said it has enough funding to reach cash-flow breakeven in 2011, assuming the Tile-Gx reaches market and achieves the kind of success Tilera predicts.

I remain skeptical, but hopeful. I think there's no question that in the long run, there will be plenty of demand for complex, many-core processors like Tilera's. But will Tilera still be around by that time? And in the long run, once this demand develops, larger companies such as Intel will have their own offerings.

Can Tilera carve out a market niche that it can defend against such strong competition? I just don't know, but I'm always glad to see people trying new ideas.

August 31, 2009 5:35 AM PDT

High-end server chips breaking records

by Peter Glaskowsky

How would you like a single-chip microprocessor with more than four times the performance (on some applications) of Intel's best Core i7?

Then consider that up to 32 of these chips can be directly connected to form a single server, achieving four times the built-in scalability of Intel's next-generation Nehalem-EX processor.

That's IBM's widely anticipated Power7, which it described at last week's Hot Chips conference. But if you're interested, you'd better be prepared to spend a lot more than four times as much per chip. IBM isn't talking about pricing, but large Power servers can cost more than $10,000 per processor.

IBM Power7 die photo

IBM's forthcoming Power7 server processor has eight cores, manages 32 threads, and includes 32MB of on-chip embedded DRAM cache. Power7 also has the highest levels of off-chip bandwidth ever achieved by a microprocessor.

(Credit: IBM)

What makes the Power7 so powerful? Each chip has eight cores, and each core supports four-way multithreading. There's 32MB of level-3 cache on the chip, made using embedded DRAM (eDRAM) cells. Most CPUs use SRAM for cache because it's generally easier to combine with high-performance logic, but DRAMs--with only one transistor per bit--offer compelling density advantages. IBM spent years developing a new kind of eDRAM that would work with SOI (silicon on insulator) manufacturing processes, and the Power7 is the most advanced product to use the new technology.

Interestingly, the Power7 cores run much more slowly than those in the Power6 processor, which I wrote about here in 2007 ("Live from Hot Chips 19: Session 1, IBM's Power6"). The Power6 was designed to run very fast using a long CPU pipeline in order to deliver the highest possible performance on each thread of execution.

Maybe that strategy didn't work out as well as IBM hoped, because the Power7 returns to a more traditional microarchitecture with a shorter pipeline and much lower clock rates--though IBM didn't say exactly what those rates would be.

IBM did, however, promise that the Power7 would be roughly four times as fast as the Power6, chip for chip. Since it has four times as many cores, each of the new slower-clocked cores must still deliver about as much performance as those in the previous generation.

Chip-level performance must always be matched by off-chip connections lest the incoming data or outgoing results be bottlenecked by a too-slow channel. Accordingly, the Power7 is equipped with eight I/O channels for DRAM, each of which connects to an off-chip buffering device that splits the channel into two 64-bit DRAM interfaces. All together, IBM says the Power7 has 180 GBps of DRAM interconnect that can sustain over 100 GBps of effective memory bandwidth.

There's another 50 GBps of peak I/O bandwidth and a staggering 360 GBps of peak bandwidth used to let each Power7 chip communicate with others. The DRAM connected to each chip is thus shared across larger systems.

Combining these figures, IBM says a single Power7 has 590 GBps of total off-chip bandwidth. This isn't the real number, since many of those bytes are used for error-correcting codes and other overhead, but it's still pretty impressive.
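The total is simply the sum of the three peak figures quoted above. A small Python sketch (the dictionary keys are illustrative labels, not IBM's terminology) shows how the 590 GBps splits up:

```python
# Peak off-chip bandwidth figures quoted for Power7, in GBps.
bandwidth_gbps = {
    "DRAM interconnect": 180,
    "I/O": 50,
    "chip-to-chip": 360,
}

total = sum(bandwidth_gbps.values())

# Show each component and its share of the combined figure.
for name, value in bandwidth_gbps.items():
    print(f"{name}: {value} GBps ({value / total:.0%} of total)")
print(f"Total: {total} GBps")
```

Roughly three-fifths of the total is chip-to-chip bandwidth, which reflects how heavily the design is aimed at large multiprocessor systems.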

So is Power7's die size: 567 square millimeters for 1.2 billion transistors. That's nearly a square inch! IBM says that if the 32MB L3 cache had been manufactured using SRAM, the transistor count would have been 2.7 billion instead.
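Both comparisons in this paragraph can be checked with simple arithmetic. The Python sketch below converts the die area to square inches and works out how many transistors per cache bit IBM's two figures imply were saved. It counts only the 32MB of data bits and ignores tags, error correction, and redundancy, so treat the per-bit number as a rough indication rather than a statement about IBM's actual cell design.

```python
MM2_PER_SQ_INCH = 25.4 ** 2          # 645.16 square millimeters per square inch

die_area_mm2 = 567
die_area_sq_in = die_area_mm2 / MM2_PER_SQ_INCH

# Transistor counts quoted by IBM: actual (eDRAM L3) versus hypothetical (SRAM L3).
transistors_edram = 1.2e9
transistors_sram = 2.7e9

# Assumption: count only the 32MB of data bits, ignoring tags and ECC.
l3_bits = 32 * 2**20 * 8

saved_per_bit = (transistors_sram - transistors_edram) / l3_bits

print(f"Die area: {die_area_sq_in:.2f} square inches")
print(f"Transistors saved per data bit: {saved_per_bit:.1f}")
```

Assuming a conventional six-transistor SRAM cell against a one-transistor DRAM cell, the saving on data bits alone would be five transistors per bit. The slightly higher implied figure presumably reflects overhead that this sketch leaves out.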

Still, Power7 wasn't the only high-end chip talked about at Hot Chips.

Rainbow Falls, a record for core count
Sun Microsystems was there to describe its forthcoming Rainbow Falls chip, which I assume will be marketed as the UltraSparc T3. The chip has 16 cores, each of which is reportedly able to manage 8 threads.

Sun's primary Rainbow Falls presentation focused on details of Rainbow Falls' internal and external interconnects; a second talk described the cryptographic coprocessors present in each of the chip's cores. These coprocessors--one for modular arithmetic (commonly used in public-key cryptography) and a cipher/hash unit to accelerate bulk ciphers like AES and secure hash algorithms--provide many times the performance of pure software implementations.

Fujitsu was also at Hot Chips to describe its eight-core, 2GHz Sparc64 VIIIfx processor, the latest in a long series of impressive designs from the company. Fujitsu quoted a peak performance figure of 128 GFLOPS (billions of floating-point operations per second) with a typical power consumption of just 58 watts. It did not, however, provide sustained performance or worst-case power consumption figures.
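Two derived figures follow directly from the numbers Fujitsu quoted. This Python sketch computes peak efficiency and the per-core throughput implied by the peak rating. Both rest on peak performance and typical power, so they are best-case values, not measurements.

```python
# Figures quoted by Fujitsu for the Sparc64 VIIIfx.
PEAK_GFLOPS = 128
TYPICAL_WATTS = 58
CORES = 8
CLOCK_GHZ = 2.0

# Peak efficiency, using typical (not worst-case) power.
gflops_per_watt = PEAK_GFLOPS / TYPICAL_WATTS

# Floating-point operations each core must complete per cycle to hit the peak.
flops_per_cycle_per_core = PEAK_GFLOPS / (CORES * CLOCK_GHZ)

print(f"Peak efficiency: {gflops_per_watt:.1f} GFLOPS per watt")
print(f"Per core: {flops_per_cycle_per_core:.0f} floating-point operations per cycle")
```

The peak rating implies eight floating-point operations per core per cycle, at about 2.2 GFLOPS per watt.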

AMD, Intel vie for high-volume servers
Few of us will have direct exposure to the IBM, Sun, and Fujitsu chips. A pair of presentations from Advanced Micro Devices and Intel described products that will be much more widely available.

AMD launched its six-core Opteron processor code-named "Istanbul" earlier this year (see Brooke Crothers' coverage from June). Next year the company will begin shipping a new Opteron model currently code-named Magny-Cours (after a racetrack in France). Magny-Cours will consist of two Istanbul chips in a single package, with twice as many DRAM interfaces to support the new processor's increased performance.

AMD also teased the audience with another mention of a new processor core design that has been under development there for several years: "Bulldozer," which is now targeted at 32nm process technology. This new core will incorporate new x86 instruction-set extensions which will probably not be adopted by Intel (a strategy that reminds me of AMD's old 3DNow extensions).

But saving the best for last--best, that is, from the perspective of anticipated sales--Intel's talk on Nehalem-EX showed just how far Intel has been able to push the technology envelope for high-volume servers.

Nehalem-EX is an eight-core version of the existing quad-core Nehalem design. The new chip also has 24MB of L3 cache done in old-school SRAM. By my calculations, about 60 percent of the chip's 2.3 billion transistors are in this cache alone.
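One way to arrive at a figure in that range is sketched below in Python. It is not necessarily the calculation behind the estimate above. It assumes a conventional six-transistor SRAM cell and, in the second case, eight error-correction bits for every 64 data bits. Both are assumptions made for illustration, not details from Intel's presentation.

```python
L3_BYTES = 24 * 2**20             # 24MB of L3 cache
TOTAL_TRANSISTORS = 2.3e9         # quoted for the whole chip

TRANSISTORS_PER_SRAM_BIT = 6      # assumption: conventional 6T SRAM cell

data_bits = L3_BYTES * 8
data_only = data_bits * TRANSISTORS_PER_SRAM_BIT

# Assumption: 8 check bits per 64 data bits (a common ECC ratio).
with_ecc = data_only * 72 / 64

print(f"Data bits only: {data_only / TOTAL_TRANSISTORS:.0%} of the chip")
print(f"With assumed ECC: {with_ecc / TOTAL_TRANSISTORS:.0%} of the chip")
```

Data bits alone account for a bit more than half of the chip's transistors. Adding the assumed error-correction overhead brings the share close to 60 percent, before counting tags or control logic.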

Nehalem-EX provides four links to external DRAM buffer chips supporting two DDR3 DRAM interfaces each (much like the Power7 solution) and four QuickPath Interconnect links that provide direct "glueless" connections for up to eight-processor systems (64 cores, 128 threads). Intel is also working on an external Node Controller chip for systems with up to 2,048 Nehalem-EX processors.
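The system-level counts in this paragraph multiply out as follows. This Python sketch uses illustrative names; the two-threads-per-core value is inferred from the 64-core, 128-thread figure rather than stated directly.

```python
CORES_PER_CHIP = 8
THREADS_PER_CORE = 2              # inferred from 64 cores and 128 threads
MAX_GLUELESS_CHIPS = 8

MEMORY_LINKS_PER_CHIP = 4
DDR3_INTERFACES_PER_LINK = 2

cores = MAX_GLUELESS_CHIPS * CORES_PER_CHIP
threads = cores * THREADS_PER_CORE
ddr3_interfaces_per_chip = MEMORY_LINKS_PER_CHIP * DDR3_INTERFACES_PER_LINK

# Largest configuration mentioned, using the external Node Controller.
max_cores_with_node_controller = 2048 * CORES_PER_CHIP

print(f"Glueless system: {cores} cores, {threads} threads")
print(f"DDR3 interfaces per chip: {ddr3_interfaces_per_chip}")
print(f"Node Controller systems: up to {max_cores_with_node_controller} cores")
```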

The aggregate bandwidth numbers for Nehalem-EX aren't as mind-boggling as those for Power7, but they're still far beyond anything available for PC-architecture servers today. Based on the presentation, I estimate Nehalem-EX could boast over 85 GBps of peak memory bandwidth and 100 GBps of chip-to-chip bandwidth, some of which must be allocated to I/O.
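Estimates like these can be reproduced under a plausible set of assumptions. The Python sketch below is one such reconstruction, not a statement of how the figures above were derived. The DDR3-1333 memory speed and the 25.6 GBps per QuickPath link (6.4 GT/s, both directions combined) are assumptions chosen for illustration.

```python
# Memory side: four buffer links, two DDR3 interfaces each.
DDR3_INTERFACES = 4 * 2
TRANSFERS_PER_SEC = 1333e6        # assumption: DDR3-1333
BYTES_PER_TRANSFER = 8            # 64-bit data path per interface

memory_gbps = DDR3_INTERFACES * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9

# Chip-to-chip side: four QuickPath Interconnect links.
QPI_LINKS = 4
QPI_GBPS_PER_LINK = 25.6          # assumption: 6.4 GT/s, both directions combined

chip_to_chip_gbps = QPI_LINKS * QPI_GBPS_PER_LINK

print(f"Peak memory bandwidth: {memory_gbps:.1f} GBps")
print(f"Peak chip-to-chip bandwidth: {chip_to_chip_gbps:.1f} GBps")
```

Under these assumptions both results land close to the estimates in the text. Slower memory or slower links would lower them proportionally.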

I expect the raw number-crunching performance of the Nehalem-EX cores to be roughly on the same level as Power7's cores. The lower ratio of bandwidth to processing power for Nehalem-EX reflects a different design target, not a design shortfall--and most importantly, a much lower selling price. There will presumably be versions of Nehalem-EX priced similarly to existing Xeon MP products, which currently top out at $2,301 each in small volumes, but that's a very reasonable price to pay for the market's most advanced x86 server processor.


About Speeds and Feeds

Silicon Valley-based computer architect and chip analyst Peter N. Glaskowsky attends a variety of industry conferences throughout the year to meet with industry thought leaders and dig into the future of computing technology. In Speeds and Feeds, he analyzes trends in system architecture and interface design, as well as market and political pressures surrounding those trends. He is a member of the CNET Blog Network and is not an employee of CNET.