ZURICH, Switzerland--Chances are that if you've solved one of those distorted-word tests to secure an account with Facebook, Craigslist, or Ticketmaster, you've helped The New York Times inch a little closer to digitizing its entire print newspaper archive from 1851 to 1980.
How have you unwittingly helped the Gray Lady by wasting 10 seconds on a computer-generated word challenge? It's thanks to a year-old initiative called ReCaptcha, a play on the antispam tests known as Captchas (Completely Automated Public Turing Test To Tell Computers and Humans Apart), which are tests that people can pass, but machines cannot.
People typically fill out Captchas so Web sites can verify that a human, rather than a spam bot, is behind the request for a new e-mail address, log-in, or membership. But with ReCaptchas, which are double-word tests, humans are also helping machines better recognize faded-ink or blurry words that have been digitally scanned from old newspapers or books--text that's difficult for a computer to recognize optically. That way, people will eventually be able to sift through print archives with a more intelligent search engine.
In the last year, as many as 600 million people have completed at least one ReCaptcha on sites such as Twitter, LastFM, and Ticketmaster, which use the technology for free, according to ReCaptcha creator and Carnegie Mellon University assistant professor Luis von Ahn.
With all those helping hands, von Ahn expects that The New York Times digitization project will be finished by the end of 2009, at the latest. (About five months ago, The New York Times paid an undisclosed sum to von Ahn's CMU team to complete its project.)
"We're reusing wasted human cycles," von Ahn, 28, said while speaking at a robotics conference here recently.
The venture involves putting millions of eyes on words printed in roughly 47,000 newspapers of varying page counts. For example, before the turn of the century, The New York Times was about one-fourth the breadth it is today. It's doubled in size about every 50 years or so since its beginning in the 1850s, when it was published every day except Sunday. (The New York Times did not immediately respond to a request for comment for this story.)
Von Ahn's team is also helping the Internet Archive with the digitization of books through ReCaptcha, but it's doing that project gratis.
In fact, von Ahn, a recipient of the MacArthur Fellowship (or "genius award") in 2006 for his work as a computer scientist, only wants to aid projects that work for the good of humanity. His main work-related guilt, it seems, is that he helped invent Captchas in the first place (in 2000, so that Yahoo could fend off spammers). And that's only because he's calculated how much time people have wasted on the four- to six-character tests. He's estimated that people type 200 million Captchas every day around the world, which adds up to roughly 500,000 collective man hours a day (at 10 seconds per puzzle).
But that lost time is nothing compared with the amount spent on games--another key focus for von Ahn. By the time the average American has turned 21, researchers estimate that he or she has spent about 10,000 hours playing video games--that's the equivalent of holding down a full-time job for five years. In 2003, players collectively spent 9 billion human hours on the game Solitaire. In contrast, building the Empire State Building took only 7 million human hours, or the equivalent of about 6.8 hours of collective Solitaire play.
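The figures in the last two paragraphs can be checked with some quick arithmetic. The sketch below uses only the numbers quoted in the article; the one added assumption, flagged in the code, is that Solitaire play was spread evenly across the year.

```python
# Back-of-the-envelope check of the time estimates quoted in the article.
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

# Captchas: 200 million a day at 10 seconds each.
captchas_per_day = 200_000_000
seconds_per_captcha = 10
captcha_hours_per_day = captchas_per_day * seconds_per_captcha / SECONDS_PER_HOUR

# Solitaire: 9 billion human hours in 2003 versus 7 million for the Empire State Building.
solitaire_hours_2003 = 9_000_000_000
empire_state_hours = 7_000_000

# Assumption (not from the article): Solitaire play was spread evenly over the year.
solitaire_hours_per_clock_hour = solitaire_hours_2003 / HOURS_PER_YEAR
empire_state_in_solitaire_hours = empire_state_hours / solitaire_hours_per_clock_hour

print(f"Captcha time per day: {captcha_hours_per_day:,.0f} hours")
print(f"Empire State Building: {empire_state_in_solitaire_hours:.1f} hours of Solitaire play")
```

By this reckoning the Captcha total comes out slightly above the rounded 500,000 hours, and the Empire State Building works out to a little under seven hours of the world's Solitaire habit.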
Such thoughts spurred von Ahn to create Games with a Purpose, or Gwap.com, a project designed to harness the time people spend having fun to solve bigger computational problems. (The field is known as human computation.) He developed the first of those games, the ESP Game, several years ago to tackle image labeling to improve Web search. The game asks two randomly paired people (on different computers) to describe the same image without any way to communicate. Within a time limit, the players must come up with the same word for an image before moving on to another image.
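The agreement rule at the heart of the ESP Game can be sketched in a few lines. This is an illustration only, not von Ahn's implementation: the function names are hypothetical, the players are assumed to type in alternating turns, and the time limit is left out.

```python
from itertools import zip_longest
from typing import Iterable, Optional


def normalize(word: str) -> str:
    """Compare guesses case-insensitively and ignore stray whitespace."""
    return word.strip().lower()


def first_agreed_label(guesses_a: Iterable[str], guesses_b: Iterable[str]) -> Optional[str]:
    """Return the first word that both players have typed, or None if they never agree.

    Hypothetical model: the players alternate guesses and cannot see each other's words.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = normalize(a)
            seen_a.add(a)
            if a in seen_b:  # player A just typed something B already said
                return a
        if b is not None:
            b = normalize(b)
            seen_b.add(b)
            if b in seen_a:  # player B just typed something A already said
                return b
    return None


label = first_agreed_label(["dog", "puppy", "grass"], ["animal", "Puppy "])
print(label)
```

Because neither player can see the other's guesses, a word that both arrive at independently is likely to be a fair description of the image, which is what makes the match usable as a search label.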
It's infectious. As many as 200,000 players have provided 50 million labels for images since the game was created, according to von Ahn. Some people play as much as 20 hours a week.
Normally, companies like Google or Yahoo would need to hire people to label the millions of images in their archives. But with only 5,000 people playing the ESP Game simultaneously, they could label all of Google's image archive within two months, he said. That must be why Google licensed the ESP Game from von Ahn and Carnegie Mellon University in 2006 to label its images.
Even though it would seem Google has completed its image labeling, it's really a never-ending project because of a constant influx of photos and people's changing perceptions.
For example, people's perceptions of celebrities like Britney Spears or political figures like George Bush morph over time. Just two years ago, labels for Britney Spears were as simple as "Britney" and "hot." But recently, they turned into "crazy," "shaved head," and "rehab." President Bush's tags have gone from "George" and "President," to "dumb" and "yuck."
Thanks in part to the success of the ESP Game, von Ahn and a team of 10 computer scientists at CMU have launched four new games to solve different artificial-intelligence problems. Gwap.com, introduced in May, is the umbrella site for all five games, which include the new Verbosity, Tag a Tune, Squigl, and Matchin. Since May, the site has attracted about 85,000 registered users.
Tag a Tune, for example, is much like the ESP Game, but for audio recordings. A player must figure out if he or she is listening to the same song as an opposing player by watching their descriptive guesses and making guesses of their own.
There's a 50 percent chance players are listening to the same song. That game would help describe the contents of audio recordings so that someone could eventually ask a search engine for a "happy song about rainy days," rather than using the exact song title. Squigl asks players to outline an object they see in a photo--a task meant to eventually further the field of computer vision.
Next up: von Ahn plans within the next three months to introduce a game that deals with labeling video clips. That way, the system would improve search over video archives. The team currently doesn't have any other licensees for its games, although it's easy to see a host of interested parties for audio, music, and video labels.
In a bit of procrastination of his own, von Ahn had been thinking about how not to waste time with games, and then with Captchas, for at least two years before he acted on a project to recoup the energy spent on word tests. He's certainly seen some weird things since he helped get Captchas started on Yahoo in 2000.
HotorNot.com, for example, has shown prospective account holders images of nine women and asked them to pick the three who are "hot." Von Ahn said that a man met his wife on the site through this exercise.
Spammers have also created so-called Captcha sweatshops to get around the tests. He said that they will hire people for an hourly wage of $2.50 and the average worker will solve about six word puzzles per minute. Even though Captcha sweatshops generate new jobs, von Ahn said he would rather put people's time to better use.
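Those two figures imply a price per solved puzzle. The quick calculation below uses only the wage and solving rate von Ahn cited; it ignores any overhead a sweatshop operator would add.

```python
# Cost of a human-solved Captcha, from the wage and rate quoted by von Ahn.
hourly_wage_usd = 2.50
puzzles_per_minute = 6

puzzles_per_hour = puzzles_per_minute * 60
cost_per_puzzle_usd = hourly_wage_usd / puzzles_per_hour
cost_per_thousand_usd = cost_per_puzzle_usd * 1000

print(f"{puzzles_per_hour} puzzles per worker per hour")
print(f"${cost_per_thousand_usd:.2f} per 1,000 puzzles")
```

At well under a cent per puzzle, paying people to solve the tests is cheap enough to explain why spammers bother.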
"I started thinking about how you could direct people's efforts in a way that's good for humanity," he said.
Last year, von Ahn introduced the ReCaptcha free antispam system with a double-word test (six to eight characters each), which, it turns out, doesn't take people any longer than solving many single-word tests that mix characters, he said. With two words, the system can develop a confidence rating for the human by serving up one word the computer doesn't know, with another it does know.
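The two-word idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual ReCaptcha system: the class names, the vote counting, and the `votes_needed` threshold are all hypothetical. The known word decides whether the person passes; only then is the answer for the unknown word counted.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Challenge:
    control_answer: str  # the word the computer already knows
    unknown_id: str      # hypothetical ID of a scanned word the computer could not read


@dataclass
class TranscriptionStore:
    """Collects human guesses for unread words (hypothetical design)."""
    votes: dict = field(default_factory=dict)
    votes_needed: int = 3  # assumed threshold, not a figure from the article

    def record(self, unknown_id: str, guess: str) -> None:
        self.votes.setdefault(unknown_id, Counter())[guess] += 1

    def accepted(self, unknown_id: str) -> Optional[str]:
        """Return the leading guess once enough people agree on it, else None."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        word, count = counts.most_common(1)[0]
        return word if count >= self.votes_needed else None


def check_response(challenge: Challenge, control_guess: str,
                   unknown_guess: str, store: TranscriptionStore) -> bool:
    """Pass the user if the known word is right; only then trust the other word."""
    passed = control_guess.strip().lower() == challenge.control_answer.lower()
    if passed:
        store.record(challenge.unknown_id, unknown_guess.strip().lower())
    return passed


store = TranscriptionStore(votes_needed=2)
challenge = Challenge(control_answer="morning", unknown_id="scan-001")
check_response(challenge, "Morning", "upon", store)    # passes, one vote for "upon"
check_response(challenge, "mourning", "spam", store)   # fails, guess is discarded
check_response(challenge, "morning", "upon", store)    # passes, second vote
print(store.accepted("scan-001"))
```

Since the person does not know which of the two words is the control, answering the known one correctly is a reasonable signal that the other answer was typed in good faith.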
Digitizing books or old newsprint is a worthy chore for von Ahn. Typically, if you print something, then scan it, the computer's optical character recognition (OCR) would be able to "see" the text with 100 percent accuracy. But for older works, with faded ink or warped letters, OCR cannot read the words accurately. ReCaptcha, which literally shows words scanned from old New York Times newsprint or books in the queue for the Internet Archive, uses people's intelligence in this process.
Through blogs like WordPress and sites like Craigslist, ReCaptcha is digitizing between 15 million and 16 million words a day. Sometimes, however, the automated system generates offbeat combinations of words, such as "bad" and "Christians," or "damn" and "liberal."
As for clients other than The New York Times? Von Ahn said he's been approached by at least one bank that wanted to digitize checks, but he turned that offer down.
"We want to do stuff with the preservation of important material," he said.