The man who keeps Facebook humming (Q&A)
Jay Parikh is a key person responsible for making sure Facebook remains up and running each day.
Jay Parikh is happy to never get a call from Mark Zuckerberg. Why? It means he's doing his job well. As Facebook's vice president of infrastructure engineering, Parikh oversees a team of 600 and is charged with the enormous task of keeping the machines that run Facebook operating with as few hiccups as possible. As Facebook now approaches 1 billion users, and continues to roll out more features that connect people every which way, that challenge grows. Which is why Parikh, who this morning gave the keynote at the Velocity conference in Silicon Valley, has been hard at work building out Facebook's back-end technology and data centers.
I met up with Parikh at Facebook's headquarters in Menlo Park, Calif., and discussed a range of topics. Here's an edited version:
Q: Twitter had an outage last week. You had an
Parikh: First and foremost, we take this stuff extremely seriously. We want to be fast, and we always want to be up and running. That's our No. 1, No. 2, No. 3 priority. When we do have a misstep, we spend a lot of time internally really understanding what happened, how we're not going to let it happen to us again.
I run a meeting every week where we go through all the issues that happened. We go over the timeline of what happened, the impact to users, what the root cause was, how we fixed it. And we spend the bulk of the time going over what do we need to do to correct it.... A lot of companies come up with great ideas and they go into a folder that no one ever sees again. We've very diligent on the follow up.
The important thing here is we do emphasize needing to focus on impact and moving fast. Because of this, we're going to take on some risk. For us, moving fast is the most important thing, and we just try to minimize or buffer ourselves from the risk of what that means. This doesn't mean that just because we have 900 million users that we need to slow down or we're going to take less risk.
But is there a number -- a goal, a metric -- that you aim for?
Parikh: We do have one, but it's not something we share. It's a goal that we metric ourselves on.... Each of the critical components in the stack has a separate goal, so there are really many, many goals that help us deliver the site reliably. And we measure each part of the stack with very fine-grained metrics.
Take us into the weeds please.
Parikh: It's best to talk about that in context of everything that we do that helps us move fast. The things that help us move fast are a varied set of things. We use a lot of open-source software. We've done a lot of work over the years to make that system scale. When Mark built the site in 2004, it was built on PHP and mySQL, and we've added lots and lots of stuff over time.
But the main stack is pretty much the same. We've got a Web front end, which runs PHP code. Over the years, we've then basically written our own compiler and our own runtime system because we found that we needed to do this for efficiency. We wanted to keep our infrastructure as efficient and as cost-effective as possible, so we built a thing called HipHop for PHP, which we open-sourced a little over two years ago. We push a new version of that front end to 900 million users every day of the week.
In the back end, there are lots of different things. You have a Web front end, but they talk to hundreds of different back end servers, talking and getting and looking at tens of thousands of different objects of data. We have to process it, sort it, rank it, privacy check it, filter it, do all this analysis on it. Then we have to compose a page and serve it back to your desktop or phone in less than a second. At any given time, there may be tens of thousands of different experiments running in production.... We iterate on these ideas without losing momentum.
You're the guy people come to when they have a new product or feature that is in the works that requires changes for handling capacity. How does that work?
Parikh: We have a capacity and performance team that is responsible for working with all the engineering teams in trying to understand their needs. Now, this team isn't just a bunch of analysts and business people pushing spreadsheets around. They're actual engineers who know code, who know performance, who know distributed systems and architectures. So they really have a thorough understanding of how a system is working and how a system is supposed to be working.
You don't just come to this team when you need more servers. You can get more capacity, but that doesn't always mean getting more servers. We had a funny story where one of our engineer teams came to the capacity team and said, "We need 8,000 servers." The capacity team said, "OK, why?" and then spent some time understanding what the engineer team was trying to do. After they got through a series of discussions and review, [it turned out] that they only needed 16 servers instead of 8,000.
Parikh: I can't talk much about where we're going from a product perspective.
From an infrastructure perspective, we continue to try to build infrastructure that's efficient and cost effective. But what we want to do is be able to manage and harness all the data we're collecting. We have a big investment in [Apache] Hadoop.... Our largest cluster in Hadoop today is over 100 petabytes. So for us it's about how do we continue to scale out and handle this ever-increasing amount of data.... 100 petabytes becomes exabytes not too far off in the future. 100 petabytes is going to be boring for us a couple of years from now.
How do you plan for that?
Parikh: We're literally figuring this out every day.... As the needs of the product change, as we approach a billion and beyond users, and as mobile adoption across the world takes off, this is something that we're zigging and zagging and sort of adjusting every day.
How much energy does Facebook consume?
Parikh: We have not released that.
Efficiency is something you're always working on with your data centers.
Parikh: We brought on back in 2010, 2011, in Prineville, Ore., We've talked a lot about that in terms of it being a data center that's far more power-efficient [and] cost-effective using our own servers. We use outside air-cooling. We use a simpler, way more efficient electrical system inside the building all the way down to the power supplies on the servers, so we use a lot less energy.
The industry gauges around this PuE number -- power usage effectiveness -- with 1 being ideal, so that every watt you bring off the grid is put on a computer somewhere. The industry average is around 1.5, which means 50 percent of the energy that's pulled off the grid is thrown off as waste. Our data center in Prineville came in at an average PUE of 1.07. We on average waste about 7 percent of the electricity that we pull in, versus the industry average of 50 percent.
Parikh: We got Forest City up and running in 10 months -- two months shorter [than Prineville]. We've pretty much beat any industry metric we found out there in terms of the time we put a shovel on the ground to the time we had the data center operational. So we're really racing ourselves. But that's not the full story.
Parikh: We ended up changing everything inside the data center, and ended up revamping the entire network topology. We ended up bringing in and building an entire new generation of servers for that data center. And so we made significant changes from just months before what we thought was going to be the right thing to go do again. Most companies would -- and do -- take it conservative and take fewer risks.
You can ask, "That sounds insane, why did you do that?" The main motivation was we were seeing this new wave of products coming down the pike for us with things we launched last year -- Timeline, open graph, to mention two of them -- that we felt were going to have dramatically different impact on our infrastructure in terms of network consumption, the way that requests get processed inside of all these servers. So we're trying to prepare ourselves for this change in demand from our new products.
What's an example of a change you made inside the data center?
Parikh: At the hardware level, we made some big changes. We moved the drive. It was in the back and we moved it to the front (see photo above). This dramatically reduced the time technicians spent swapping drives and doing repairs in the data center.... It's huge. Little stuff like this makes a big difference for us. And it's part of the open compute project.
How do you measure success?
Parikh: For us, success is about really being able to move fast. I think once we have that tempo -- that cadence -- then we can do anything we want in terms of features.
How does Zuckerberg measure your success?
Parikh: Usually by telling us we're all behind.