Turns out it's hard to run a social network with 2.2 billion people.
Maybe you've heard of Facebook's old engineering mantra: move fast and break things. The company dumped the "break things" part years ago, but today it's moving faster than ever.
At its own Systems@Scale conference Thursday, Facebook engineers detailed several parts of a computing infrastructure massive enough to serve the 2.2 billion people who use Facebook. One of those details: Facebook now updates its service's core software at least 10 times more frequently than it did about a decade ago.
"When I joined Facebook in 2009, we pushed [an update to] that main application tier ... once a day. That was an epic thing," said Jay Parikh, Facebook's head of engineering and infrastructure. Now, he said, the site "is getting pushed maybe every one or two hours."
And updates come faster even though Facebook has more than 10 times as many servers in its data centers, 20 times the engineers updating its software and more than 10 times the users it did a decade ago, Parikh said. Oh, and it's got more than a billion people using Instagram, WhatsApp and Facebook Messenger now, too.
The glimpse into Facebook's inner workings is unusual. In other industries -- say, banking or railroads or automaking -- this kind of operational detail can be information tightly protected to keep competitors from getting an edge. But in the tech industry, it can actually help a company get ahead.
Opening up helps the technology ecosystem -- hardware, software and the people who put it all together -- keep up better with Facebook's needs. The problems Facebook finds are likely to be the ones others in the industry encounter as they grow, too.
Facebook has had to work hard to speed things up, Parikh said, because the natural tendency of organizations is to slow down, guarding against the growing risks of change as projects get larger. To get there, Facebook's operations mission is now "move fast with stable infra."
At the conference, engineers from Facebook and other tech companies -- Amazon, Shopify, Lyft, Google and Yahoo among them -- gave talks and asked questions of their peers. These are folks for whom operating a data center packed with thousands of servers is last decade's challenge. Today's difficulties span multiple data centers around the globe -- how do you synchronize data, or get a second data center to take over when there's a problem with the first?
"You're building something billions of people are going to be impacted by on a daily basis. That is cool, but equally scary," Parikh said.
Frequent updates are key to fix problems, add new features and run experiments to see what works best. Facebook has to make the changes without disrupting operations at colossal scale: 65 billion messages and 2 billion minutes of voice and video chats per day on WhatsApp, 8 billion Facebook Messenger messages per day between businesses and their customers, and more than 10 million Facebook Live videos on New Year's Eve.
One Facebook achievement, reached in April 2017, is called continuous push. Two decades ago, tech companies would issue updates months or even years apart. With continuous push, Facebook programmers issue unceasing updates, said programmer Anca Agape. Each is tested automatically on gradually larger groups -- Facebook's own employees, then 2 percent of users, then the remaining 98 percent -- and if there aren't problems, the change is accepted. The result: updates reach Facebook's entire user base in three hours on average.
"This is pretty impressive," she said. "The site is always changing."
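The staged rollout Agape describes can be sketched roughly as follows. This is an illustrative sketch only: the stage fractions mirror the article, but the `health_check` function and error-rate threshold are stand-ins for Facebook's actual automated monitoring, which the article doesn't detail.

```python
# Illustrative staged-rollout sketch: push an update to progressively
# larger audiences, halting if a health check fails at any stage.
# Stage sizes follow the article (employees, 2 percent, then everyone).

STAGES = [
    ("employees", 0.0005),  # Facebook's own staff first (fraction is a guess)
    ("canary", 0.02),       # then 2 percent of users
    ("full", 1.0),          # then the remaining user base
]

def health_check(stage_name, error_rate):
    """Stand-in for automated monitoring: accept if errors stay low."""
    return error_rate < 0.01

def rollout(update_id, observed_error_rates):
    """Push an update through each stage; abort on the first failure.

    observed_error_rates maps stage name -> error rate seen there.
    Returns the list of stages the update successfully passed.
    """
    passed = []
    for name, fraction in STAGES:
        rate = observed_error_rates.get(name, 0.0)
        if not health_check(name, rate):
            print(f"update {update_id}: rolled back at stage {name!r}")
            return passed
        passed.append(name)
        print(f"update {update_id}: {fraction:.2%} of users on new code")
    return passed

rollout("D1234", {"employees": 0.001, "canary": 0.002, "full": 0.003})
```

The key property is that a bad change is caught while it affects only a small slice of users, which is what makes pushing every hour or two tolerable.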
The audience was hungry for answers, from Facebook and from others who spoke.
"Do you run containers directly on the bare metal or on the virtual machines?" one asked Facebook. And another: "Do you guys disable swap on the host machines?" These are folks who live in the world of tools like Spanner, Chef, OpenCensus, Kubernetes, MySQL, Kafka, Canopy and btrfs.
And Facebook added a little more jargon to the mix Thursday. It announced two projects: load-aware distribution, which improves how updates are sent to millions of servers, and OOMD, a utility that responds more gracefully to computers running out of memory.
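The idea behind a userspace out-of-memory daemon like OOMD is to watch memory-pressure telemetry and act before the kernel's own last-resort OOM killer does. The sketch below is an assumption-laden illustration, not OOMD's actual policy: the input format matches Linux's `/proc/pressure/memory` PSI lines, but the threshold and decision rule are invented for the example.

```python
# Illustrative sketch of an out-of-memory daemon's core decision:
# parse memory-pressure telemetry (PSI-style lines, as exposed by
# Linux's /proc/pressure/memory) and decide whether to intervene.
# The 40 percent avg10 threshold is an assumption for illustration.

def parse_psi_line(line):
    """Parse one PSI line, e.g.
    'full avg10=1.23 avg60=0.80 avg300=0.40 total=123456'."""
    fields = line.split()
    kind, pairs = fields[0], fields[1:]
    values = dict(pair.split("=") for pair in pairs)
    return kind, {k: float(v) for k, v in values.items()}

def should_intervene(psi_text, avg10_threshold=40.0):
    """Intervene when 'full' memory stalls (all tasks blocked on
    memory) exceed the threshold over the last 10 seconds."""
    for line in psi_text.strip().splitlines():
        kind, values = parse_psi_line(line)
        if kind == "full" and values.get("avg10", 0.0) > avg10_threshold:
            return True
    return False

sample = """\
some avg10=55.00 avg60=30.12 avg300=10.00 total=1000000
full avg10=45.50 avg60=20.00 avg300=5.00 total=500000
"""
print(should_intervene(sample))  # sustained 'full' pressure -> True
```

Acting on sustained pressure rather than waiting for allocation failure lets a daemon kill a misbehaving workload cleanly instead of letting the whole host thrash.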
The profusion of management tools shows how complex it is to run suites of services on hundreds or thousands of servers. Over and over, engineers spoke of completely overhauling their technology every few years as massive growth overwhelmed the earlier system.
Increasingly sophisticated tools spotlight problems and help people trace their origins, said Google site reliability engineer Liz Fong-Jones. And seemingly rare one-in-a-million problems actually become common when, as in the case of Oath's Yahoo Mail, your system handles 120 billion transactions per day, said Jeff Bonforte, the senior vice president in charge of the communications products.
Under Chief Executive Mark Zuckerberg, Facebook got its start with a few servers tucked into racks of computing gear hosted by data center specialists. By 2009, Facebook was buying off-the-shelf servers from companies like Dell and Hewlett-Packard. But the mainstream technology approach couldn't keep pace with Facebook's challenges, so Facebook decided to build its own technology.
"We're designing infrastructure from the dirt on up," Parikh said, with 14 or 15 data centers dotted around the world and hundreds of smaller sites closer to all of us who use Facebook's services.
"This system is ever-growing, with things I'd never thought we'd have to do, like building our cable systems in the ocean and the ground for connecting our infrastructure," Parikh said. The number of companies that build their own long-haul fiber-optic links is small -- Google just announced this week that it's building its own transatlantic cable -- but the investments can pay off for big enough companies.
"We're pushing the boundaries of things that help us advance our infrastructure," Parikh said.
First published July 19, 2:22 p.m. PT.
Update, 8:20 p.m.: Added comments from Google, Yahoo Mail and Facebook's Anca Agape.