Grant funds open-source challenge to Google library

Internet Archive receives $1 million from charitable trust to help boost its open-source alternative to book-scanning efforts.

Candace Lombardi
In a software-driven world, it's easy to forget about the nuts and bolts. Whether it's cars, robots, personal gadgetry or industrial machines, Candace Lombardi examines the moving parts that keep our world rotating. A journalist who divides her time between the United States and the United Kingdom, Lombardi has written about technology for the sites of The New York Times, CNET, USA Today, MSN, ZDNet, Silicon.com, and GameSpot. She is a member of the CNET Blog Network and is not a current employee of CNET.
The nonprofit Internet Archive announced Wednesday it has received $1 million from the Alfred P. Sloan Foundation to continue its effort to scan public domain works for open online accessibility.

The archiving organization's Open-Access Text Archive is an open-source alternative to book-scanning efforts like the ones from Google and Microsoft. Internet Archive--perhaps best known for its Wayback Machine archive of Web pages by date--is also an online digital library of text, audio, software, images and video content.

"Brewster Kahle and the Internet Archive are pioneers in this exciting and historic opportunity to create a universal digital library that is both open-access and non-proprietary," said Doron Weber, who oversees public understanding of science and technology at the Sloan Foundation, in a statement.

Kahle was one of the inventors of Wide Area Information Servers (WAIS), a text-based search system that searched database indexes on remote servers before there were Internet search engines. After WAIS was sold to AOL in 1995 for several million dollars, Kahle founded the Internet Archive, which works closely with the Open Content Alliance (OCA). The OCA developed a set of principles dedicated to a "permanent archive of multilingual digitized text and multimedia content" for free and open access.

The grant from the Sloan charitable trust will enable Internet Archive and the OCA to scan collections from several major institutions, including the entire collection of publications from the Metropolitan Museum of Art as well as several thousand images from the museum; John Adams' personal library of over 3,800 works at the Boston Public Library; and other collections from The Getty Research Institute, Johns Hopkins University and the University of California, Berkeley.

The announcement comes just after the San Francisco-based Internet Archive reached the milestone of scanning 100,000 books. That may not sound like a lot compared to Google Book Search's claim of millions within a decade, but the OCA has ramped up its scanning recently to about 12,000 books a month. According to its own statistics, the organization has also archived 65 billion pages from 50 million Web sites.

"Google is so good at the media being their PR machine, that you would not know there was an alternative out there," Kahle said. "We have brand name institutions going open and foundations like the Sloan are funding (us). It shows that the Open Content Alliance is viable, that there is support for public interest. We don't have to privatize the library system."

Google has begun to offer full-text, printable PDFs of public domain works with plans to add more as it scans more books. But its platform is closed, and its PDF pages have a "Digitized by Google" watermark. The company is not planning to share its scanned material with the OCA or Internet Archive, according to Kahle.

"We think they (Google) are doing great stuff. If the materials would be made available for broad public search and educational use we'd be all for it, but in my discussion with the founders (Google co-founders Larry Page and Sergey Brin) they aren't going to," said Kahle.

Google did not respond to requests for comment about its book scanning project.

Google scans and indexes both public domain and copyright works, an issue that has raised legal concerns. The Google Book Search engine restricts full access to copyright works while still offering snippet views, instead of excluding the work from its search feature altogether, according to the Google Book Search Web site.

"This whole Google Book Search looks like Amazon's Search Inside the Book," said Kahle. "Let's go open with these collections...These are beautiful books."

Yahoo is a supporter of the OCA and has helped the OCA index some of the scanned content, but its project is smaller than those of Google and Microsoft, according to Gregory Crane, a classics professor and digital library expert at Tufts University.

Microsoft was an early supporter of the OCA and in June worked with it on a project scanning and indexing materials from the University of California and the University of Toronto libraries as part of its Windows Live Book Search project. But Microsoft has become more proprietary in recent months, Kahle said.

"We continue to work with Microsoft, but the results going forward are not strictly OCA principles," Kahle later added in an e-mail. "To their credit, they are interested in helping get more scanning done in the open, of course because they can use the books as well, but still, this is more than other projects."

Jay Girotto, who heads Microsoft's Live Book Search selection team, further explained his company's position.

"We support the fundamental mission of the OCA, and hope that many more partners like the Sloan Foundation will step forward and contribute significant resources to scan public-domain materials under the OCA principles," he said in a statement.

Research impacts

Tufts' Crane thinks the companies are reluctant to share for fear of helping the competition.

"My impression is that both Microsoft and Google don't want the other benefiting from their investment," he wrote in an e-mail. "Now each is hoarding. Ideally, each would split the cost of digitizing content and then make the public domain material available in the OCA. At the moment, Google is well ahead, and I would think that they would feel that Microsoft would benefit too much."

A lack of open-source access, Crane explained, impedes research that requires access to multiple groups of works in bulk, and prevents researchers from applying more nuanced OCR (optical character recognition) searches to those texts.

"We are evaluating OCR on classical Greek. Google runs OCR on all its texts--that's how it generates searchable OCR. The Google OCR, though, doesn't know Greek and produces no usable text as far as we can tell. Google says that you have to get permission to run OCR, etc...on its PDF books," Crane said, further explaining, "Even if the PDF books are good enough quality to support OCR--they might be lower than the archival resolution.

"I am sure that Google would be open to us doing this work, but that means (for each academic project) getting their attention, writing letters, and a lot of hassle," Crane said. "I think it's easier and better in the long run to open the library up and let the world have at it."