Google apologizes for this week's Docs outage
The company says that a "memory management bug" is to blame for the hourlong outage and that it's taking steps to prevent a repeat.
Google has officially apologized for this week's Google Docs outage.
On Wednesday, Google Docs--the search giant's productivity suite, featuring a word processor, spreadsheet, presentation app, and drawing service--went down. In a statement, Google said that it was "aware" of the problem and was working on a resolution. About an hour later, the service was brought back up.
Writing on the company's blog, Google engineering director Alan Warren said that the company was "very sorry," adding that the service was hit by a "memory management bug" that was exposed following an update to Docs' real-time collaboration feature.
Warren explains it this way:
"Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn't recycle their memory properly after each lookup, causing them to eventually run out of memory and restart."
Warren went on to say that when those machines restarted, more trouble ensued, causing the service's servers to improperly "process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday."
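For readers who want to see the failure mode in miniature, the following Python sketch mimics the kind of leak Warren describes: a machine that fails to release memory after each lookup eventually exceeds its budget and restarts. This is an illustration only, not Google's code. The `LookupMachine` class, the `capacity` and `leak_per_lookup` parameters, and all of the numbers are hypothetical, chosen simply to make the pattern visible.

```python
from dataclasses import dataclass


@dataclass
class LookupMachine:
    """Toy model of a lookup machine (hypothetical, not Google's design)."""
    capacity: int      # assumed memory budget, in arbitrary units
    used: int = 0      # memory currently held
    restarts: int = 0  # times the machine ran out of memory

    def lookup(self, leak_per_lookup: int) -> None:
        # With the bug, each lookup leaves some memory unreleased.
        self.used += leak_per_lookup
        if self.used > self.capacity:
            # Out of memory: the machine restarts with a clean slate.
            self.used = 0
            self.restarts += 1


def simulate(lookups: int, leak_per_lookup: int, capacity: int = 100) -> LookupMachine:
    """Run a number of lookups against one machine and return its final state."""
    machine = LookupMachine(capacity=capacity)
    for _ in range(lookups):
        machine.lookup(leak_per_lookup)
    return machine


healthy = simulate(lookups=1000, leak_per_lookup=0)  # memory recycled properly
buggy = simulate(lookups=1000, leak_per_lookup=1)    # one unit leaked per lookup
print(healthy.restarts, buggy.restarts)  # prints: 0 9
```

In this toy model the healthy machine never restarts, while the leaky one restarts every 101 lookups. The sketch says nothing about what happened after the real machines restarted, which Warren describes only in general terms.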
To keep such a lengthy downtime from hitting Google Docs again, Warren said that the search giant has come up with a "list of steps" it will use in the future. Those steps, he said, are designed to "reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope which any single problem can affect."
CNET's Rafe Needleman described the outage, brief as it was, as at least a brightly lit reminder that safety nets are in order:
Yes, it is very true that Google's engineers brought the system back up in fairly short order, probably faster than any understaffed IT department would have been able to react to a similar outage on a local system. And, as far as we can tell, there was no data loss. But if it's your job to worry about a company's productivity, you have to think about a worse case than this--and about not being able to do anything when, say, 10,000 workers are suddenly idled by a single tech outage. Is it worth it?
Google wasn't alone this week in seeing its online services hit with an outage.
Last night, Microsoft's Office 365, Hotmail, SkyDrive, and Windows Live services also went down. Microsoft reported that the downtime was due to a Domain Name System (DNS) issue.
Update at 11:01 a.m. PT to include more details.