A few weeks ago I talked with Jonathan Heiliger, vice president of technical operations at Facebook, about the challenge of innovating quickly and building stable infrastructure while 250,000 new members are added to the social network every day. Check out the video on ZDNet.
Q: You've been at Facebook, I think, for about a year and it's been quite a ride I guess, scaling up from zero in 2004 to over 80 million today. How do you keep up with that hyper growth?
Heiliger: You're absolutely right--we've had a lot of growth. We add over 250,000 users every day, and that means a lot of infrastructure, a lot of servers, and constantly looking at new processes and looking at how we're doing things and ensuring that we're doing things the most efficient way possible, not just for delivering all the content to our users but to stay on top of what it costs to run the site.
How do you stay on top of the cost in terms of the kind of equipment you buy and how you work with the vendors? How do you prioritize those things?
Heiliger: One of the things we recently did was we ran an RFP process for the servers we buy from vendors and essentially did a bake-off with a number of different people looking at building servers on our own. What we concluded from that process was to continue to buy servers from a couple of major OEMs (original equipment manufacturers), but through that process we were able to lock in prices today and carry those prices forward as all the commodity components costs drop.
When you're buying those servers, and I assume you're doing just a huge scale out of commodity servers, what do they look like? How are they configured?
Heiliger: We're pretty lucky in that we run a wide variety of applications, literally tens of applications on our own and hundreds of applications for our platform developers that use Facebook as a distribution mechanism, as a way of interacting with their users. But one of the reasons we're very lucky is our engineering team has selected to use PHP as the primary development language. That allows us to use a fairly generic server type. So we, with a couple of exceptions, have three main server types and run a fairly homogeneous environment, which allows us to then consolidate our buying power.
You're different from Google in the kinds of applications that you run. They are mostly running search queries, and you're running all kinds of queries and bringing back all kinds of data from the social graph. How is it different in terms of the way you build out your data center from the inside?
Heiliger: Google has a tremendous amount of information that they index and archive and present to users, but fundamentally if you go to Google and type in a search for a "tiger" and I go to Google and type in a search for a "tiger" we're going to see generally the same results, so they're presenting that same information to both of us. Facebook is a little different in that the context for our data is all social. When you look at your friends and their status updates and their photos and the notes they may have written, you're going to see one set of data versus if I look at my friends and their photos and their notes and status updates, and those tend to be non-intersecting sets of data.
So it's much more dynamic?
Heiliger: Much more dynamic data set--and what that means is it's caused us to do a bunch of different things relative to caching and relative to federating all of that data up amongst thousands of different databases so that as a user requests all of that information we're not using one particular server every time for different data.
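As an illustration of what federating user data across many databases can look like, here is a minimal sketch. It is not Facebook's actual scheme, which the interview does not describe. It assumes each user's data lives on one shard chosen by a stable hash of the user ID. The names `NUM_SHARDS`, `shard_for_user`, and `group_by_shard` are hypothetical.

```python
import hashlib

# Hypothetical shard count chosen for illustration; not a figure from the interview.
NUM_SHARDS = 1024


def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash.

    A cryptographic digest is used only because it is stable across
    processes and machines, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(user_ids, num_shards: int = NUM_SHARDS) -> dict:
    """Group user IDs (for example, a member's friends) by owning shard,
    so that each shard can be queried once rather than once per friend."""
    groups = {}
    for uid in user_ids:
        groups.setdefault(shard_for_user(uid, num_shards), []).append(uid)
    return groups
```

Because each member's friend list is different, a single page view under this kind of scheme touches a different set of shards for each viewer, which is one reason caching matters so much for a social data set.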
You recently introduced a chat application on Facebook, and it seems like it took a lot of time to test it to make sure it could scale having all those simultaneous conversations going on. Could you give us a little background and color on how that came to be?
Heiliger: Chat is actually one of our most recent launches. It started as a hack-a-thon project, which is one of the things we do about every other month. People get together and work all night and pick a project they don't necessarily have time to do during the day. From the time it really germinated as an idea to the time it launched and was available for our entire user base, it became a more formal development project. One of the things we did as part of that was actually build a new back-end service to be able to deal with all of the millions of simultaneous connections that we persist for users.
One other thing: I was reading up on some of the work you've been doing, and you say that clouds don't solve single points of failure in your stack. What are those single points of failure?
Heiliger: Interesting question. The notion you are referring to was part of a talk I give, which is that cloud computing isn't a panacea and, for a start-up or even a more mature start-up like Facebook, isn't the answer to solving failure points in an application. By that I mean the underlying infrastructure that powers an application is typically the result of, or the outcome of, how the application is originally designed and how users interact with that application. If an application is poorly designed or designed to constantly reference a single set of data, the underlying infrastructure is going to be the victim of that. Guys like myself in the infrastructure world have to figure out how to best make that work.
As someone who is in operations, how much impact do you have on application development to make sure that once an application gets into the data center it can work properly and scale and not have the kind of failures we're seeing with some of the new applications?
Heiliger: I think it's a constant challenge in any organization, particularly a fast-moving one like Facebook, where we want to iterate quickly and get product out in our customers' hands so we can get feedback on that product and continue to tweak and enhance it over time. We have one force that's moving in that direction, and we have another force that says we want to keep the site up, we want the site to be reliable, and we want the site to be fast.
So there's a fine balancing act, where everyone in management and everyone in both the engineering and operations departments constantly just sort of works, interacts, and goes back and forth, and figures out just how to make those trade-offs. Sometimes we err too aggressively on the side of innovation and iteration, and put things out on the site in perhaps a small quantity that may break the site or cause the site to slow temporarily. Other times we err on the side of conservatism, of not releasing new functionality or new features, and that then delays the sort of user gratification of having that feature or fixing that bug.
What are the challenges that you see--let's say you're at 80 million unique users per month, 250,000 being added per day and 50,000 transactions per second. What happens when you get to 500 million or a billion if you ever get there?
Heiliger: Hopefully, tremendous things. I think we can only look forward to those days.
But what are some of the bottlenecks or barriers you have to overcome to get to that kind of scale?
Heiliger: Some of the bottlenecks we're facing are how we scale this extremely distributed set of data. One of the challenges we have is figuring out how to make that replicated such that it can exist in multiple places around the world and we don't also have to bring users back to the U.S. or back to one of our data centers. I think it's a challenge that most Web sites tend to face as they scale, which is you start in one location with a single database and then you have to figure out how to grow from there, primarily driven by the amount of latency or the amount of time it takes to reach the site and interact with the site. Being able to replicate the data across multiple data centers and across multiple geographies allows users to not just read their data from a local version but write that data as well. That is one of our key challenges over the next 12 months.
As you learn more about building up this very large-scale infrastructure, do you ever see the possibility that Facebook could be a service provider?
Heiliger: What do you mean by service provider?
In the sense that right now you're just running the Facebook application but what if a developer or user wanted to do something similar to what Amazon is doing, using your infrastructure to run their applications in the cloud?
Heiliger: Gotcha. So one of the values of Facebook is the Facebook platform. We have over 100,000 developers and several hundred applications that have over a million users using them. We've talked about perhaps opening up or further opening up the platform by offering compute power for those application developers. One of the steps we've already taken to improve that development environment and improve the experience for our developers is just to open-source our platform, which we announced just a couple of weeks ago as well.