The BBC has blamed a combination of behind-the-scenes problems for crashing iPlayer and other online services over the weekend. "It was a unique experience," a BBC spokesperson told CNET.
Both the BBC Online website and video-streaming catch-up service iPlayer were affected throughout Saturday and Sunday, causing viewers to miss coverage of the German Grand Prix and the new series of "Dragons' Den" among other radio and TV shows.
The failure occurred when the database of metadata -- the information that identifies each video stream -- fell over on Saturday morning. A second, separate problem occurred at almost the same time, causing iPlayer and other services to fail.
iPlayer is the online service that lets you watch TV shows and films that have been shown on BBC TV channels in past weeks and months, along with some new episodes and short films that only appear online. You can catch up on programmes you've missed online through the iPlayer website, or on an iPlayer app available on most phones, tablets, smart TVs and games consoles.
The BBC reported via the BBCiPlayer Twitter account at 3.30pm on Monday that the service was "back up and running." Normal service was restored to the online iPlayer and apps first, but the Beeb admitted that TVs and games consoles were "still very patchy."
Richard Cooper, the BBC's Controller of Digital Distribution for BBC Future Media, explains exactly what went wrong in a blog post.
Here's what happened.
We have a system comprising 58 application servers and 10 database servers that provides programme and clip metadata. This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online. This system is split across two data centres in a "hot-hot" configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres.
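As a rough illustration of what a "hot-hot" setup means for a client, the sketch below tries each data centre in turn and returns the first successful answer. It is not the BBC's code: the function name `fetch_from_any` and the idea of passing each data centre in as a plain callable are assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def fetch_from_any(sites: Sequence[Callable[[str], dict]], programme_id: str) -> dict:
    """Ask each data centre in order and return the first successful answer.

    `sites` is a hypothetical list of callables, one per data centre, each
    taking a programme ID and returning its metadata or raising on failure.
    """
    last_error: Optional[Exception] = None
    for site in sites:
        try:
            return site(programme_id)
        except Exception as exc:  # this data centre failed, try the next one
            last_error = exc
    # Every data centre failed, so there is nothing left to fall back on.
    raise RuntimeError("all data centres failed") from last_error
```

Because both sites are live at once, either can serve any request. That only helps when the fault is confined to one of them.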
At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail.
The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.
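The difference between the two kinds of product can be sketched in a few lines. In this minimal example, a cache keeps serving its last known copy when a refresh fails, while a product that calls the metadata service directly would simply see the error. The class name `StaleOnErrorCache`, the `fetch` callable and the TTL value are all assumptions for illustration, not details from the BBC's systems.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve fresh entries from memory; on refresh failure, fall back to stale data."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call to the metadata service
        self._ttl = ttl_seconds      # how long an entry counts as fresh
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]          # still fresh, no backend call needed
        try:
            value = self._fetch(key)  # attempt to revalidate against the backend
        except Exception:
            if entry is not None:
                return entry[0]      # backend is failing: serve the stale copy
            raise                    # nothing cached, so the failure is visible
        self._entries[key] = (value, now)
        return value
```

A product wrapped this way degrades gracefully during an outage, but only for items it has already fetched at least once.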
At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.
Our first priority was to restore the caching layer. The failure was a complex one (we're still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode ("Due to technical problems, we are displaying a simplified version of the BBC Homepage"). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.
Restoring the metadata service was complex. Isolating the source of the additional load proved to be far from straightforward, and restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems). Performance of the system remained sufficiently poor that in the end we decided to do some significant remedial work on Saturday afternoon, which ran on until the evening. During that period, BBC iPlayer was effectively not useable.
After that work was complete we were in a walking wounded state that allowed close to normal operation for much of the site, though BBC iPlayer remained down on a number of devices. We chose to run it in this mode throughout the rest of the weekend while planning a full restoration of the service. By the time we were ready to do that we were entering the peak period on Sunday evening, so rather than risk the service further, we chose instead to do it on Monday morning.
Everything appears to be working properly again, and iPlayer now includes a new round of one-off "Comedy Feeds," online-only episodes from up-and-coming comics.