Skype has pinned the blame in part on a buggy version of its software for Windows.
In a blog post published today, Chief Information Officer Lars Rabbe explained the house of cards that took the service down on the morning of Wednesday, December 22, and kept it down until the following day.
On December 22, a number of support servers that handle offline instant messaging became overloaded, according to Rabbe. Because of that, some Skype clients didn't receive responses as quickly as usual. A bug in one particular Skype client for Windows (version 5.0.0.152) prevented it from processing those delayed server responses, causing the client software to crash.
Since Skype is a peer-to-peer network, any PC running the client software can act as a node to route and process traffic. But PCs can also be tapped to serve as supernodes, which help maintain connections for multiple users.
Since about half of all Skype customers around the world were running the buggy client version, the resulting wave of crashes triggered failures in 25-30 percent of Skype's supernodes. That put extra strain on the rest of the supernodes, causing them to start failing. Despite the efforts of the tech folks at Skype to disable the overloaded servers and stop the client requests, the entire Skype network eventually shut down.
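The knock-on effect Rabbe describes, where an initial wave of crashes pushes extra load onto the surviving supernodes until they fail too, can be illustrated with a toy model. The Python sketch below is not Skype's actual architecture: the function name, the capacity figures, and the assumption that load is split evenly among surviving supernodes are all hypothetical, chosen only to show how losing roughly 30 percent of nodes can tip an otherwise stable network into total collapse.

```python
def simulate_cascade(capacities, total_load, initial_failures):
    """Return how many supernodes are still alive once the cascade settles.

    capacities: maximum load each supernode can carry (hypothetical units)
    total_load: client load, assumed to be shared equally by survivors
    initial_failures: number of supernodes that crash at the outset
    """
    # Assumption: the first `initial_failures` nodes are the ones that crash.
    alive = list(capacities[initial_failures:])
    while alive:
        share = total_load / len(alive)
        # A node survives this round only if it can carry its equal share.
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # nobody else failed, so the network is stable
        alive = survivors
    return len(alive)


# Hypothetical network: 100 supernodes with capacities 200..299,
# carrying a total load of 18,000 (180 per node when all are up).
capacities = [200 + i for i in range(100)]
total_load = 18000

healthy = simulate_cascade(capacities, total_load, 0)     # 100 survive
small_hit = simulate_cascade(capacities, total_load, 10)  # 90 survive
large_hit = simulate_cascade(capacities, total_load, 30)  # 0 survive
```

In this made-up network, losing 10 nodes is absorbed by the remaining 90, but losing 30 raises each survivor's share past what the weaker nodes can handle, and each round of failures makes the next one worse until nothing is left.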
"Regrettably, as a result of the confluence of events--server overload, a bug in Skype for Windows clients (version 5.0.0.152), and the decline in available supernodes--Skype's functionality became unavailable to many of our users for approximately 24 hours," wrote Rabbe.
To get the service up and running again, Skype engineers spent that Wednesday introducing more and more instances of the Skype client software (the non-buggy version) into the network to generate more and more supernodes. That helped the network gradually recover, allowing the majority of Skype users to get back online by Thursday.
What is Skype doing to make sure an outage like this won't happen again?
First, Rabbe says the company had provided a fix (version 5.0.0.156) for the buggy software before the outage occurred, but many people hadn't yet installed it. As such, Skype will be reviewing its process for automatic updates. Second, the company will look into ways of detecting and recovering from such problems much faster. And third, it will evaluate its testing processes to better find and avoid bugs that could take down the entire system.
Rabbe also acknowledged the company's failure to prevent the outage and its lack of communication when the service was down.
"Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base," Rabbe wrote in closing.