
New tool screens spam, digitizes books

The ReCaptcha service turns annoying spam-avoidance chores into a productive project to digitize books.

Stephen Shankland Former Principal Writer
A group of Carnegie Mellon University programmers has launched a service called ReCaptcha that can help cut down on spam while letting people digitize books.

The project is a variation of the widely used "Captcha" technique to weed out computer abuse such as sending spam e-mail or posting spam in blog comments. Captchas require users to pass small pattern-recognition tests, most commonly reading distorted or obscured words.

ReCaptcha turns this chore into a productive task by letting users digitize scanned images of words that computers couldn't figure out.

"Not only can you solve your problems with spam, you can help preserve mankind's written history into the digital age," said Ben Maurer, the project's chief architect and a Carnegie Mellon University undergraduate, announcing the project on his blog on Wednesday.

Since the project launched Tuesday, 150 Web sites have begun using it, said Luis von Ahn, a Carnegie Mellon assistant professor and ReCaptcha's "executive producer." In just the first half of Thursday, the project had digitized 8,000 words, he said.

It's a new example of how the Internet can harness the collective energies of large numbers of people. Other examples include news sites such as Digg and Slashdot, which give prominence to content that users rate highly, and stock photography seller iStockphoto, which is beta testing an Image Fight site to rate photo quality.

ReCaptcha has the potential to digitize vast quantities of words. Von Ahn estimates that people perform 60 million Captcha (Completely Automated Public Turing test to tell Computers and Humans Apart) tests daily.

Image: ReCaptcha kills two birds with two words

The service presents users with two words: one from a conventional Captcha test, and one unknown word that optical character recognition (OCR) software couldn't figure out. If the user correctly identifies the known word, he or she is presumed to have decoded the unknown one as well. Currently, ReCaptcha requires three separate people to transcribe the unknown word identically before it accepts that reading as correct, von Ahn said.
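The two-word check can be sketched in a few lines of Python. This is a minimal illustration, not ReCaptcha's actual code: the class and function names, the lowercase normalization, and the per-user vote tracking are all assumptions made for the example. Only the known-word check and the three-person agreement rule come from von Ahn's description.

```python
from dataclasses import dataclass, field

# From the article: three separate people must give the same reading.
AGREEMENT_THRESHOLD = 3


@dataclass
class UnknownWord:
    """A scanned word that OCR could not read (hypothetical structure)."""
    image_id: str
    # Maps each proposed reading to the set of user IDs who typed it.
    votes: dict = field(default_factory=dict)

    def add_vote(self, user_id, reading):
        # A set ensures the same user is counted only once per reading.
        self.votes.setdefault(reading, set()).add(user_id)

    def accepted_reading(self):
        """Return the reading enough distinct users agree on, else None."""
        for reading, users in self.votes.items():
            if len(users) >= AGREEMENT_THRESHOLD:
                return reading
        return None


def normalize(text):
    # Assumption: answers are compared ignoring case and outer whitespace.
    return text.strip().lower()


def submit_challenge(user_id, known_answer, known_guess, unknown, unknown_guess):
    """Grade the known word; on a pass, record the unknown-word guess as a vote.

    Returns True if the user passes the Captcha, False otherwise.
    """
    if normalize(known_guess) != normalize(known_answer):
        return False  # Failed the control word, so the other guess is ignored.
    unknown.add_vote(user_id, normalize(unknown_guess))
    return True


if __name__ == "__main__":
    word = UnknownWord("scan-0001")
    submit_challenge("alice", "morning", "morning", word, "overlooks")
    submit_challenge("bob", "morning", "Morning", word, "overlooks")
    submit_challenge("dave", "morning", "mourning", word, "overlooks")  # fails
    print(word.accepted_reading())  # None: only two valid votes so far
    submit_challenge("carol", "morning", "morning", word, "overlooks")
    print(word.accepted_reading())  # overlooks
```

The user never learns which of the two words is the control, so a person who wants to pass has to make an honest attempt at both.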

Von Ahn was a member of the Carnegie Mellon team that developed Captcha in response to a Yahoo request for technology to keep computers from registering for bogus e-mail accounts, according to Carnegie Mellon. He's a recipient of a MacArthur Foundation "genius" grant, which funded some ReCaptcha work.

Digital libraries
The ReCaptcha project is digitizing books in the Internet Archive, a project that is building a digital library of cultural materials and that operates the Wayback Machine of historical Web site snapshots.

Among the first books being digitized is Psychology by philosopher John Dewey, von Ahn said. The project is considering other book archives, too, he added.

The ReCaptcha service is available now through an application programming interface (API) that people can integrate into their Web sites. Plug-ins for using the API are open-source software packages hosted at Google Code.

ReCaptcha also can be used to shield e-mail addresses from computers that harvest them for spam mailing lists.

Von Ahn's specialty is what he calls "human computation," which he defines as "novel techniques for utilizing the computational abilities (or 'cycles') of humans."

Microsoft Research has its own philanthropic variation of Captcha technology: a project called Asirra that shows pictures of cats and dogs rather than text. Computers do a lousy job telling the animals apart, but people can. To get a supply of constantly refreshed pet images, Microsoft pulls photos--and "adopt me" links--from the PetFinder Web site.

Two of von Ahn's higher-profile projects were online games, ESP Game and Peekaboom, that rely on crowds to label images. Like reading obfuscated text, it's a task at which computers are lousy.

Google licensed the ESP Game technology and offers it as its Google Image Labeler to improve its own image search technology.

Carnegie Mellon is hosting the ReCaptcha service on $30,000 worth of servers donated by Intel, von Ahn said. Other sponsors include Novell, which contributed support subscriptions for its Suse Linux Enterprise Server, and Carnegie Mellon.