Google apologizes for this week's Docs outage

The company says that a "memory management bug" is to blame for the hourlong outage and that it's taking steps to prevent a repeat.

Don Reisinger
The error screen displayed when Google Docs went down. Screenshot by Rafe Needleman/CNET

Google has officially apologized for this week's Google Docs outage.

On Wednesday, Google Docs--the search giant's productivity suite, featuring a word processor, spreadsheet, presentation app, and drawing service--went down. In a statement, Google said that it was "aware" of the problem and was working on a resolution. About an hour later, the service was brought back up.

Writing on the company's blog, Google engineering director Alan Warren said that the company was "very sorry," adding that the service was hit by a "memory management bug" that was exposed by an update to Docs' real-time collaboration feature.

Warren explains it this way:

Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn't recycle their memory properly after each lookup, causing them to eventually run out of memory and restart.

Warren went on to say that when those machines restarted, more trouble ensued, causing the service's servers to improperly "process a large fraction of the requests to access document lists, documents, drawings, and scripts which led to the outage you saw on Wednesday."

To keep a similar outage from hitting Google Docs again, Warren said that the search giant has come up with a "list of steps" it will take. Those steps, he said, are designed to "reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope which any single problem can affect."

CNET's Rafe Needleman described the outage, brief as it was, as a blow to the growth of cloud computing, or at least a brightly lit reminder that safety nets are in order:

Yes, it is very true that Google's engineers brought the system back up in fairly short order, probably faster than any understaffed IT department would have been able to react to a similar outage on a local system. And, as far as we can tell, there was no data loss. But if it's your job to worry about a company's productivity, you have to think about a worse case than this--and about not being able to do anything when, say, 10,000 workers are suddenly idled by a single tech outage. Is it worth it?

Google wasn't alone this week in seeing its online services hit with an outage.

Last night, Microsoft's Office 365, Hotmail, SkyDrive, and Windows Live services were down for three hours. Microsoft reported that the downtime was due to a Domain Name System (DNS) issue.

Related stories:
Was brief Google Docs outage a tremor or a tsunami?
Microsoft's online services hit by outage
Amazon cloud outage downs Netflix, Quora
Yahoo Mail suffers outage; users react

Update at 11:01 a.m. PT to include more details.