Culture

An open-source rival to Google's book project

Digitizing the world's books is too important a job to leave to private ventures, say backers of the Internet Archive.

Stefanie Olsen Staff writer, CNET News

Stefanie Olsen covers technology and science.

Oct. 26, 2005 2:12 p.m. PT

SAN FRANCISCO--When it comes to digitizing books, two stories appear to be unfolding: One is about open source, and the other, Google.

Or so it seemed at a party held by the Internet Archive on Tuesday evening, when the nonprofit foundation and a parade of partners, including the Smithsonian Institution, Hewlett-Packard, Yahoo and Microsoft's MSN, rallied around a collective open-source initiative to digitize all the world's books and make them universally available.

Google was noticeably absent from the cadre of partners, considering that the search behemoth has a high-profile project of its own to scan library books and add them to its searchable index.

Some supporters of the Internet Archive, a San Francisco-based nonprofit, took the opportunity to criticize such private ventures.

"We want to digitize all human knowledge...and we can't risk having it privatized," said Doron Weber, an executive of the Alfred P. Sloan Foundation, a philanthropic organization that has contributed more than $3 million to the Internet Archive since 2003. Citing the importance of an open library for educational purposes, he called on private companies to "rein in their impulses" while urging libraries to "embrace the future."

Still, a Google executive in attendance downplayed the perceived rivalry.

"I think (the project) is great," said Alexander Macgillivray, Google's senior product counsel, following a presentation on the book-scanning effort. "It's a shame it's being portrayed as a battle between the two projects because the efforts are complementary."

Digitizing books has become a focus in recent years as people try to make otherwise analog information available on the Internet. Academic research, music from classical to pop and video are all being digitized, and now books are in technology's path.

Google put its own far-reaching digitization project in the spotlight 10 months ago, when it announced partnerships with Harvard University, Stanford University and others to digitize collections of copyrighted and out-of-copyright books. In 2004, Amazon.com also opened up a digital book collection on its Web site and announced its efforts to scan popular works in partnership with publishers. Amazon visitors can "search inside the book" as a result.

Still, making the millions of books in the world available online is a Herculean task. Issues of publisher copyrights, data storage and backup, and labor costs must still be hashed out. It would take 6 petabytes to digitally store just 1 million books, according to the Internet Archive. By comparison, Google reportedly has stored nearly 10 billion Web documents, requiring between 1.7 and 5 petabytes of storage.
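To put the Internet Archive's storage estimate in per-book terms, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 petabyte = 10^15 bytes), and the function name is made up for illustration.

```python
# Average storage per book implied by the Internet Archive's estimate
# of 6 petabytes for 1 million books.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 GB = 10**9 bytes).
PETABYTE = 10**15
GIGABYTE = 10**9


def bytes_per_book(total_petabytes: int, book_count: int) -> int:
    """Return the average number of bytes needed per book."""
    return total_petabytes * PETABYTE // book_count


per_book = bytes_per_book(6, 1_000_000)
print(f"{per_book / GIGABYTE:.0f} GB per book")  # prints "6 GB per book"
```

By that estimate, each scanned book would take up roughly 6 gigabytes.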

One thorny issue has already reached the courts. Google faces lawsuits from publishers and authors that claim it is violating their copyrights and overstepping the boundaries of fair use laws. Google has made scanning books an "opt out" program for publishers, meaning they must actively tell the search company not to scan their books to stay out of the company's Web index.

The Internet Archive only plans to scan books that are in the public domain and those that copyright holders have given the green light for scanning.

Though it has been working on the effort for years, the Internet Archive recently jump-started its effort by introducing the Open Content Alliance. Members include Adobe Systems, Columbia University, the European Archive, the Biodiversity Heritage Library and Smithsonian Institution Libraries.

Yahoo and MSN Search are also notable members, given their investments in Web search and driving traffic to their proprietary services. The two companies boasted of the openness of the project Tuesday night, but their allegiance to the open-source project surely is a strategic counterbalance to Google's project. In the end, the open-source library will also be searchable using MSN Search and Yahoo.

Their support means donating money. MSN Search, for example, has committed approximately $5 million to ensure 150,000 books are scanned and added to the collection over the next year.

Last week, the Internet Archive launched Open Library, a Web site that will eventually house all the world's books, according to the nonprofit. It now demonstrates the project with 15 digitized works. The Web site's interface is modeled after that of the British Library in the United Kingdom.

The foundation will digitize 18,000 works of fiction chosen from the University of California archive project that are no longer bound by copyright.

For now, people can download 15 demonstration books from the Open Library site and print them for free at home. Visitors can also purchase bound copies from Lulu.com for $8 each. The service even lets people create their own book covers and art, and then have the books printed with them. Users can search inside the works and see tabs on pages where the terms occurred. With the move of a cursor, visitors can see which page they will turn to before clicking on it.

Volunteers from LibriVox, an open-source effort trying to make books freely available in audio, have also made audio recordings of the books so that people can listen to them via the Open Library Web site.

In addition, the Internet Archive started "bookmobile" tours around the country to promote on-demand printing of the books. It has vans equipped with printers, binders and computers so that it can print books on demand for children across the country.

How it works
While Google has released few details of its scanning project (the search company has nondisclosure agreements with its library partners), the Internet Archive had a display of its technology at the Tuesday night event.

The Internet Archive has built a specialized scanning machine and written open-source software called Scribe for the specific purpose of digitizing books. The "machine" is an assembly of a standard PC with the Scribe software installed, two Canon EOS cameras, a pedal-operated glass and metal stand to hold and secure books at an angle, along with a table and chair. The machine looks much like a photo or voting booth, with black cloth covering a box frame and shielding the books and computer gear from ambient light.

The chair seats one person, who operates the computer program and turns book pages by hand. During the scanning process, the book sits at a 90-degree angle under glass, which protects it from the camera light and causes the least amount of damage to its pages, according to the Internet Archive. The operator pushes a pedal under the table to release the book from under the glass, then turns the page before the machine takes another picture.

Once a picture is taken, both pages of the book appear on a computer screen in their original form. The Scribe software then finds the center of the page, adjusts the picture's angle and ensures that it's cropped properly. It also cleans up any poor coloring and makes it uniform.
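The article does not describe how Scribe locates the center of a two-page image. One simple way to picture the idea, offered here only as an illustrative sketch and not as Scribe's actual method, is to look for the darkest vertical strip near the middle of the image (the shadow where the pages meet) and split there. The image format (rows of grayscale values) and the function names below are assumptions.

```python
from typing import List, Tuple

# Hypothetical image format: rows of grayscale pixels, 0 = black, 255 = white.
Image = List[List[int]]


def find_gutter(image: Image) -> int:
    """Return the index of the darkest column in the middle third of the image."""
    width = len(image[0])
    start, end = width // 3, width - width // 3
    # Sum brightness down each candidate column; the gutter shadow is darkest.
    column_sums = [sum(row[x] for row in image) for x in range(start, end)]
    return start + column_sums.index(min(column_sums))


def split_pages(image: Image) -> Tuple[Image, Image]:
    """Split a two-page spread into left and right pages at the gutter column."""
    gutter = find_gutter(image)
    left = [row[:gutter] for row in image]
    right = [row[gutter + 1:] for row in image]
    return left, right


# Toy 2-row "spread" with a dark gutter at column 4.
spread = [
    [255, 250, 240, 200, 40, 200, 240, 250, 255],
    [255, 250, 240, 200, 30, 200, 240, 250, 255],
]
left, right = split_pages(spread)
print(find_gutter(spread), len(left[0]), len(right[0]))  # prints "4 4 4"
```

Real page images would also need the angle correction and color cleanup the article mentions, which this sketch leaves out.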

The operator enters some metadata about the book--its author, title and publication date. Once the book is scanned, it is saved to the system and cataloged. Scribe takes the metadata from the book and matches it with data from existing card catalogs in order to prevent duplication. The work is then added to the digital record.
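How Scribe performs that matching is not spelled out. As a rough illustration of the general idea, and not the project's actual code, duplicate detection can be sketched as normalizing the author, title and publication year into a key and checking whether the catalog already holds it. The class, function names and record IDs below are hypothetical.

```python
import re
import unicodedata
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def record_key(author: str, title: str, year: int) -> Key:
    """Build a comparison key from the three metadata fields the operator enters."""
    return (normalize(author), normalize(title), year)


class Catalog:
    """Hypothetical in-memory stand-in for an existing card catalog."""

    def __init__(self) -> None:
        self._records: Dict[Key, str] = {}

    def add(self, author: str, title: str, year: int, record_id: str) -> Optional[str]:
        """Add a record; return the existing record ID if it is a duplicate, else None."""
        key = record_key(author, title, year)
        existing = self._records.get(key)
        if existing is not None:
            return existing
        self._records[key] = record_id
        return None


catalog = Catalog()
first = catalog.add("Brontë, Charlotte", "Jane Eyre", 1847, "card-001")
second = catalog.add("bronte charlotte", "JANE EYRE.", 1847, "scan-042")
print(first, second)  # prints "None card-001"
```

An exact-match key like this would miss variant titles and editions, so a production system would presumably need fuzzier matching.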

It takes roughly one hour to scan two 300-page books. And it costs an estimated 10 cents a page, split among data storage, labor and equipment and administration fees, according to Brewster Kahle, the project's leader. The cost does not take into account libraries' fees for getting the book to the scanners.

Daniel Greenstein of the University of California's archive project said that his group has donated $500,000 to assess the ultimate costs of scanning from the libraries' perspective.

The Internet Archive currently has 10 scanning machines, but it is ramping up to build 10 more in the next year.
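The figures quoted above allow a rough estimate of what a large scanning run would involve. The sketch below is an approximation under stated assumptions: every book is 300 pages, each machine scans two books an hour without downtime, and the 150,000-book total is the MSN Search commitment mentioned earlier. The function names are made up for illustration.

```python
CENTS_PER_PAGE = 10          # Kahle's estimated cost per page
PAGES_PER_BOOK = 300         # assumption: every book matches the 300-page example
BOOKS_PER_MACHINE_HOUR = 2   # roughly one hour to scan two 300-page books


def scanning_cost_dollars(books: int) -> float:
    """Estimated scanning cost in dollars, excluding libraries' delivery fees."""
    return books * PAGES_PER_BOOK * CENTS_PER_PAGE / 100


def hours_per_machine(books: int, machines: int) -> float:
    """Hours each machine must run if the work is split evenly, with no downtime."""
    return books / (BOOKS_PER_MACHINE_HOUR * machines)


print(scanning_cost_dollars(1))         # prints 30.0
print(scanning_cost_dollars(150_000))   # prints 4500000.0
print(hours_per_machine(150_000, 10))   # prints 7500.0
print(hours_per_machine(150_000, 20))   # prints 3750.0
```

On those assumptions, 150,000 books would cost about $4.5 million to scan, in line with MSN Search's commitment of approximately $5 million, and would keep 10 machines busy for about 7,500 hours each.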

"This is one of the great things we've ever done," said Kahle. "It's up there with the Library of Alexandria and putting a man on the moon."

CNET News.com's Elinor Mills contributed to this report.