Google comes to HP's aid

Engineers help dust off and release an old HP Labs optical character recognition engine.

Stefanie Olsen Staff writer, CNET News

Stefanie Olsen covers technology and science.

Sept. 5, 2006 4:39 p.m. PT

Ever heard of bit rot?

Google engineers apparently have, in their work reviving an old optical character recognition engine developed and then left to rust by Hewlett-Packard.

The search giant announced that it has helped fix software bugs in the two-decade-old Tesseract, an optical character recognition (OCR) engine originally built by HP Labs and retired in 1995. HP released the code to the open-source community in recent months.

Why is Google interested in OCR? According to the company, which posted the news Thursday on its code page: "In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing."

The project dovetails with Google's overall goal to index and organize the world's information--everything from campy high school videos to academic papers that have yet to be digitized. With open-source technology like Tesseract, other engineers or institutions could help digitize more information in the form of papers.

Google helped with the project at the behest of engineers at the University of Nevada at Las Vegas, who have been working with HP for the last two years to clear the dust off Tesseract. UNLV turned to Google to help fix several bugs in the old software, which in its day was one of the most accurate character recognition engines.

Tesseract was judged to be highly accurate in reading paper documents in a UNLV contest in 1995, before HP retreated from the OCR business and put the software into storage.

"Fortunately some of our esteemed HP colleagues realized a year or two ago that rather than sit on this engine, it would be better for the world if they brought it back to life by open sourcing it," Google said.

For the record, bit rot is typically computing jargon for the gradual decay of storage media or of software that grows buggy over time, according to Wikipedia. In literal terms, there's no rust involved.