How Yahoo is betting its cloud will pay off
With Yahoo counting on cloud computing to accelerate development, a behind-the-scenes IT guy like Shelton Shugar moves into the spotlight.
There was a day when information technology personnel toiled behind the scenes to make their corporate computing infrastructure work.
But in the Internet era, those experts increasingly are getting starring roles in corporate computing leadership rather than being supporting cast members. Such is the case for Shelton Shugar, Yahoo's senior vice president of cloud computing.
"It becomes more a topic at cocktail parties," he said of his present job, which he took shortly after Yahoo formed the group in June 2008. "I was at a wine tasting, and an acquaintance said, 'I did a search on your name and found cloud computing. What's that all about?'"
Shugar isn't running any publicly available cloud computing services, be they nuts and bolts like Amazon Web Services or full-on applications such as Google Docs. Instead, he's in charge of building a computing infrastructure crucial to Yahoo's ability to operate at large scale and improve its services rapidly. Those are essential for the company's attempt to fend off Google, which arguably is nimble for its size, and start-ups such as Twitter or Facebook that can change course a bit more easily.
"Most Yahoo properties you interact with use the cloud to some extent," said Shugar, who came to Yahoo from eBay and who's a headliner at the Cloud Computing Expo in November. "Over time, that percentage will continue increasing. Almost anything you touch uses some of it."
Most telling about his role: although rebuilding Yahoo on its own cloud-computing foundation is expected to save some money, the primary motivation is to liberate the company's programmers from the difficulties and drudgery of coding for a gargantuan audience on the Internet.
"If we have a thousand developers who no longer have to build a lot of infrastructure, and who (instead) work on products and features, that puts us way farther ahead than squeezing a few nickels out of the infrastructure," Shugar said.
It's a mammoth chore. For example, when it comes to background data processing that underlies search results, behavioral ad targeting, site trend spotting, spam filtering, and Yahoo.com content selection, Yahoo uses open-source software called Hadoop.
"I've got 25,000 machines running Hadoop," Shugar said. They're divided into several "grids," the largest with about 4,000 servers. "It's a fascinating activity...watching the different dynamics of usage push things in different ways. Some tasks are very heavy on computation, some are heavy disk input-output. It's pretty complex."
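Hadoop implements the MapReduce programming model, in which a large job is split into a map step that runs independently on each piece of input and a reduce step that combines the results. The sketch below illustrates that pattern on a single machine in plain Python; it does not use Hadoop's actual API, and the function names and sample queries are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for a log of search queries.
    queries = ["yahoo mail", "yahoo finance", "mail login"]
    print(run_job(queries))
```

On a real cluster the map and reduce steps would run in parallel across many servers, which is why some jobs end up limited by computation and others by disk input-output.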
Yahoo's 'private cloud'
Cloud computing is a popular buzzword these days, and as a result it means many things to many people. It generally refers to moving services to the Internet, which in innumerable technical diagrams over the years has been represented by, in fact, a cloud.
Cloud computing has grown less ephemeral in recent years, though. The Amazon Web Services suite offers a highly specific set of interfaces that Internet operations can use for everything from data storage to computing capacity. At a higher level, Google Docs lets people perform word processing and spreadsheet calculations through a Web browser. Microsoft is spanning that spectrum, working on its Azure foundation for generic Windows Server chores and on a Web-based version of Office.
One somewhat controversial concept is that of a "private cloud"--computing services that embrace some of the principles of publicly available clouds but that are used just by one company. If it's in-house only, what makes it any different from just the IT department's computers?
Yahoo makes a reasonable case that it's got a cloud of its own, though. It operates on a scale larger than many public cloud companies, and it embraces some of the principles of cloud computing in that infrastructure.
For example, it's got a variety of interfaces that many Yahoo services can use--a concept often called multitenancy--so they don't have to build them on their own. For another, it's global, handling thorny issues such as operating at large scale and replicating data for reliability and responsiveness. And it's got a degree of elasticity built in, so the infrastructure can expand, contract, or otherwise adjust to changing workload demands.
This computing foundation is designed to ease the pain of developing Yahoo services. Some features and projects are easy to build at a small scale but hard to expand.
"It's like the cat going up the tree. It looks good at the beginning," Shugar said.
Yahoo's cloud services
Yahoo has four cloud services in varying degrees of availability:
Operational storage, for housing data such as e-mail attachments or a user's social connections.
Batch processing, for crunching oceans of data to sort Web search data, tailor content and ads for individual users, and figure out who's spamming Yahoo Mail.
Edge content, for presenting Web pages, balancing the load from many users, and caching data for fast use in many areas. This service is already in widespread use.
Online serving, a flexible foundation for designing and housing complicated applications.
This last service is under development, due to arrive in 2010. "We're working with a few Yahoo properties as anchor tenants," Shugar said, declining to say which exactly.
The online serving interface will be based on Red Hat Enterprise Linux running atop the open-source Xen virtualization software offered by Citrix, Shugar said. "Virtual machines are going to get more and more and more important."
Though open-source software is freely available for the do-it-yourself crowd, Yahoo is in negotiations with some outside parties that will involve commercial relationships, he added.
Linux and virtualization
With virtualization, one server can house multiple operating systems at the same time, and operating system instances can be shuffled from one physical machine to another to adjust to changing demands. Yahoo has worked with virtualization leader VMware, but the company wants the nitty-gritty control enabled by open-source software.
"We need to be able to tweak it quite a bit for performance, to match it with our hardware," Shugar said.
Open-source software helps with Hadoop, too. Yahoo is the primary contributor to the project at the Apache Software Foundation, but others' participation helps ensure Hadoop stays generally useful.
Most people aren't going to wire up their own Hadoop computing cluster, of course. But "data center" no longer is a term understood just by server administrators and CIOs.
"There are these big data centers behind the scenes supporting all these searching, social networking, information-gathering activities," Shugar said. "Before, it was a business function inside a company. Now the awareness has increased as a result of people using the Web. It becomes a topic of general interest."