IBM helping Europe scan historical documents

Big Blue is working with the EU to help libraries, universities, and companies digitize documents using new scanning tools that provide greater accuracy.

Lance Whitney Contributing Writer

Lance Whitney is a freelance technology writer and trainer and a former IT professional. He's written for Time, CNET, PCMag, and several other publications. He's the author of two tech books--one on Windows and another on LinkedIn.

Aug. 26, 2010 8:16 a.m. PT

IBM and the European Union are teaming up to offer a better way to scan the massive collection of Europe's treasured historical documents.

Expanding on an existing collaboration project, Big Blue and the EU will now be working with more than two dozen libraries, research institutes, universities, and companies across Europe to help them digitize their rare books and documents.

The project, known as Impact (Improving Access to Text), is using new tools and tapping into crowdsourcing to speed up the mass digitization process and ensure that the scanned documents are as accurate as possible. Impact will also play a role in making those scans available and searchable online, so that researchers and other people who can't access the actual documents will be able to view the scanned versions via the Internet.

On their own, libraries and other organizations have already spent the past couple of decades scanning their documents and converting them to text via OCR (optical character recognition). But the faded text and old-style fonts used in these historical documents have proved a challenge to traditional scanning and OCR software, rendering the process slow and the results inaccurate.

By combining new OCR technologies with "crowd computing," IBM said it believes Impact will greatly improve the quality and efficiency of the process. Big Blue's new Web-enabled OCR software will provide between 25 percent and 50 percent greater accuracy than standard OCR programs, according to IBM. The system will also be able to learn from its mistakes to better recognize specific fonts and character sets.

Beyond the OCR component, though, the Impact project is also relying on the skills and expertise of the crowd, namely a large group of dedicated volunteers who will review each scan online to verify its accuracy. The volunteers will be able to identify any mistakes in the scanned text and quickly choose the right character from a list of suggestions.
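To make the review step concrete, here is a minimal Python sketch of how a volunteer-driven correction could work in principle. It is not IBM's implementation, which the company has not detailed here. The function names, the majority-vote rule, and the three-vote minimum are all assumptions chosen for illustration.

```python
from collections import Counter


def resolve_character(suggestions, votes, min_votes=3):
    """Pick the character most volunteers chose from the suggestion list.

    Hypothetical rule (an assumption, not from IBM): at least `min_votes`
    valid votes and a strict majority are needed; otherwise return None.
    """
    # Ignore any vote that is not one of the offered suggestions.
    valid = [v for v in votes if v in suggestions]
    if len(valid) < min_votes:
        return None
    char, count = Counter(valid).most_common(1)[0]
    return char if count * 2 > len(valid) else None


def apply_corrections(ocr_text, corrections):
    """Return a copy of ocr_text with characters replaced at given positions."""
    chars = list(ocr_text)
    for position, char in corrections.items():
        chars[position] = char
    return "".join(chars)


# An old-style long "s" is easily misread as "f" by OCR software.
suggestions = ["s", "f", "l"]
votes = ["s", "s", "f", "s"]  # choices from four hypothetical volunteers
chosen = resolve_character(suggestions, votes)
fixed = apply_corrections("Congrefs", {6: chosen})
print(fixed)
```

In this sketch, a character is changed only once enough volunteers agree, which is one simple way to keep a single mistaken click from altering the text.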

Though Impact is backed by IBM's group in Haifa, Israel, the project will allow the libraries, universities, and other institutions to scan their documents independently on an ongoing basis. An IBM spokesperson told CNET that the overall project could very well involve tens of thousands of documents.

"Impact is remarkable in that it not only allows these prominent centers of culture to ultimately bring people closer to perhaps never before seen historically significant texts of heritage--but because it actually allows these people to become part of the preservation process," Tal Drory, manager of the document processing group at IBM Research in Haifa, said in a statement. "Impact offers the first digitization system that combines the power of crowd computing with an adaptive optical character recognition (OCR) correction solution that can achieve excellent recognition rates across all kinds of documents--from the 15th century right up through the 19th century."