Next-generation search tools to refine results

The vast corpus of human knowledge could soon be published on the Internet. The problem now is how to wade through it.

Michael Kanellos Staff Writer, CNET News.com

Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas.

Aug. 9, 2004 12:29 p.m. PT

SAN JOSE, Calif.--The vast corpus of human knowledge could soon be published on the Internet. The problem now is how to wade through it.

Although search engines have greatly enhanced access to information, and storage technology has made it cheap to digitize nearly everything, search tools need to be refined to make it easier to digest information or conduct queries. That was the word from researchers and speakers at the New Paradigms for Using Computers Conference, held at IBM's Almaden research lab here last week.

News.context

What's new:
Scientists are working on next-generation search engines and tools so users will be able to pick through the data on their hard drives and the Web.

Bottom line:
The amount of digital information is exploding, and unless inventions bubble up, we could get lost in the morass.

"We live in a world with lots of information but also lots of interruptions. It is a teriyaki of information. The question is, 'How do we survive in the marinade?'" joked Dan Russell, senior manager of user sciences and experience research at IBM Almaden.

Early attempts to better locate the world's information are already under way. The University of California at Berkeley, for example, used the conference to show off a prototype of a search engine called Flamenco that makes it easier to search for works of art or antiques. Santa Clara, Calif.-based Inxight, meanwhile, has created software that attempts to graphically represent latent connections between people or institutions by studying where and how they get mentioned on the Web.

On the desktop, companies such as Ingenuity Software, founded by former Apple Computer developer Bruce Horn, are creating tools designed to make it easier for people to index their photos and documents for subsequent Google-like searches on their hard drive.

These research efforts are in addition to new operating systems under development that will include better search tools.

Microsoft plans to add better search features to a future version of Windows, code-named Longhorn, due sometime around 2006 or 2007. The software giant last week demonstrated a more general Web search "service" that's also in development.

And Apple's Tiger, a new version of the company's Mac OS X operating system that's due next year, will include a new systemwide search engine called Spotlight that will allow Mac users to quickly search and find any file, Apple says.

How many books?
One of the surprises that has emerged from the Internet Archive, which is intended to become a repository of everything ever published, is that the body of public works can probably be corralled, said Brewster Kahle, founder of the organization.

About 100 million different books have been published in history, Kahle said, citing estimates from professor Raj Reddy at Carnegie Mellon University. About 28 million sit in the Library of Congress. On average, a book can be condensed to a megabyte in Microsoft Word. Thus, the books in the Library of Congress could fit into a 28-terabyte storage system.
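
The arithmetic behind that figure is easy to check. The short Python sketch below reruns it using only the numbers quoted above; the constant and function names are illustrative, and the terabyte is taken as 1 trillion bytes, the definition the article itself uses.

```python
# Back-of-the-envelope check of the storage estimate quoted in the article.
BOOKS_IN_LIBRARY_OF_CONGRESS = 28_000_000  # figure quoted in the article
BOOKS_EVER_PUBLISHED = 100_000_000         # estimate quoted in the article
BYTES_PER_BOOK = 1_000_000                 # "a megabyte" per book (decimal units assumed)
BYTES_PER_TERABYTE = 1_000_000_000_000     # 1 trillion bytes


def terabytes_needed(book_count, bytes_per_book=BYTES_PER_BOOK):
    """Return the storage, in terabytes, needed for book_count books."""
    return book_count * bytes_per_book / BYTES_PER_TERABYTE


library_tb = terabytes_needed(BOOKS_IN_LIBRARY_OF_CONGRESS)
all_books_tb = terabytes_needed(BOOKS_EVER_PUBLISHED)

print(f"Library of Congress: {library_tb:.0f} TB")
print(f"Every book ever published: {all_books_tb:.0f} TB")
```

By the same reasoning, the full estimate of 100 million books would come to roughly 100 terabytes, assuming the one-megabyte average holds across all of them.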

"For the cost of a house, you could have the Library of Congress," Reddy said, adding that mass book-scanning projects are currently under way in India and China.

Only about 2 million to 3 million audio recordings--mostly music--have ever been published for public consumption. The Internet Archive has begun to store digitized recordings of concerts as well and has about 15,000 shows in its database to date. Between 100,000 and 200,000 theatrical movies exist--half of them from India--and TV broadcasts add about 20 terabytes a month. The Web grows by about 20 terabytes of compressed data a month as well. (One terabyte equals 1 trillion bytes.) Since 1984, about 50,000 software titles, including CD-ROMs, have emerged.

Though the legal issues around storing and viewing all this information remain thorny, storing it is doable.

"Universal access to all human knowledge is within our grasp," Kahle said. "It could be one of the greatest achievements of all time."

Still, that's a lot to grasp. Similarly, individuals will experience an explosion in their personal catalogs of data. In the MyLifeBits project under way at Microsoft Research, noted scientist Gordon Bell is attempting to digitally capture all of the books, movies, TV shows, music and other media he has experienced in his life. He's up to 44GB of data so far.

E-mails, phone messages, photographs and personal video will also add to an individual's data trove. In another experiment, doctors in Cambridge, England, have equipped patients suffering from severe memory loss with a Microsoft SenseCam, a wearable camera that takes pictures when a person moves. One man is currently using it so he can show his wife, who has memory problems, a diary of the day, said Ken Wood, who works on the project.

Microsoft has also entered a three-year alliance with the Edinburgh International Festival in Scotland. In a likely experiment, attendees will wander about the arts fest with SenseCams around their necks, snapping shots.

Hide and seek
One approach to mastering data overload lies in developing search engines specialized for certain topics and data sets. That's the tack taken by Berkeley's Flamenco project.

In Flamenco, a Yahoo-like interface categorizes artworks drawn from museum collections around the world by content (animals, heaven and earth, shapes and colors, and so on), century, artist, medium (such as painting, furniture, sculpture) and other identifiers. By going up and down the tree, users can browse through all the animal pictures found in the database, or they can zero in on, say, the years 1700 to 1709 and discover that the period, at least as represented by the database, produced only four paintings of hoofed mammals.

The search engine does not search on the visual information contained in the picture, said Kevin Li, a student on the project. Instead, searches are conducted on descriptive text submitted by the museums that digitize their artwork for such databases.
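
This kind of category-driven browsing can be sketched in a few lines of Python. The example below is not Flamenco's code. The records, field names and helper functions are all hypothetical, chosen only to show how filtering on descriptive metadata (rather than on the image itself) lets a user narrow a collection by subject, medium and decade.

```python
from collections import Counter

# Hypothetical records; a real collection would draw these fields from
# the descriptive text supplied by museums.
ARTWORKS = [
    {"title": "Horse in a Landscape", "subject": "hoofed mammals", "medium": "painting", "year": 1704},
    {"title": "Stag at Rest", "subject": "hoofed mammals", "medium": "painting", "year": 1708},
    {"title": "Carved Deer", "subject": "hoofed mammals", "medium": "sculpture", "year": 1702},
    {"title": "Heron by a Pond", "subject": "birds", "medium": "painting", "year": 1705},
    {"title": "Ox Cart", "subject": "hoofed mammals", "medium": "painting", "year": 1731},
]


def facet_value(record, name):
    """Look up a facet; 'decade' is derived from the year field."""
    if name == "decade":
        return record["year"] - record["year"] % 10
    return record[name]


def filter_by_facets(records, **facets):
    """Keep only the records that match every selected facet value."""
    return [
        r for r in records
        if all(facet_value(r, name) == wanted for name, wanted in facets.items())
    ]


def facet_counts(records, name):
    """Count how many records fall under each value of one facet."""
    return Counter(facet_value(r, name) for r in records)


matches = filter_by_facets(
    ARTWORKS, subject="hoofed mammals", medium="painting", decade=1700
)
print([r["title"] for r in matches])
print(facet_counts(ARTWORKS, "medium"))
```

Showing the counts beside each category is what tells a browsing user, before clicking, how many items sit behind each choice.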

Other tools, such as Inxight and GeoFusion, produce graphical representations of data obtained through searches. GeoFusion, which makes software that can extrapolate from geographic data, was able to render a map of the movements of a tagged tuna.

By contrast, Inxight's software creates a map of relationships between names and topics. A search on the White House and business showed that Halliburton is the corporation linked most often to the White House. In a similar fashion, IBM's own WebFountain project is used to test how cohesive certain blogging communities are by how quickly and in unison they react to news events.

File systems will likely begin to disappear as search gains popularity. One of the phenomena that Microsoft researchers are finding in MyLifeBits is that files are largely ad hoc categories that become outdated, said Jim Gemmell at Microsoft Research.

Instead, data should be tagged so that if people remember a name or part of a name, they can find their way back to documents or pictures involving that person, or they can find documents created on the same day that they had a phone conversation with the person, even if the discussion involved something unrelated.
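
A rough Python sketch of that idea follows. It is not drawn from MyLifeBits. The class, method names and sample items are hypothetical, and it assumes every item carries a date and an optional list of people. Items are indexed by person, so a partial name is enough to find them, and a second lookup returns whatever else was recorded on the day of a call with that person.

```python
from collections import defaultdict
from datetime import date


class PersonalArchive:
    """Toy store that indexes items by tagged person and keeps their dates."""

    def __init__(self):
        self._items = []
        self._by_person = defaultdict(list)

    def add(self, name, kind, day, people=()):
        """Store one item and index it under each tagged person."""
        item = {"name": name, "kind": kind, "day": day, "people": tuple(people)}
        self._items.append(item)
        for person in people:
            self._by_person[person.lower()].append(item)
        return item

    def find_by_person(self, fragment):
        """Items tagged with anyone whose name contains the fragment."""
        fragment = fragment.lower()
        found = []
        for person, items in self._by_person.items():
            if fragment in person:
                found.extend(i for i in items if i not in found)
        return found

    def find_same_day_as_calls(self, fragment):
        """Non-call items recorded on days with a call to a matching person."""
        call_days = {
            i["day"] for i in self.find_by_person(fragment) if i["kind"] == "call"
        }
        return [
            i for i in self._items
            if i["day"] in call_days and i["kind"] != "call"
        ]


archive = PersonalArchive()
archive.add("call-0412", "call", date(2004, 8, 2), people=["Pat Example"])
archive.add("budget.xls", "document", date(2004, 8, 2))
archive.add("picnic.jpg", "photo", date(2004, 7, 18), people=["Pat Example"])
archive.add("notes.txt", "document", date(2004, 8, 5))

print([i["name"] for i in archive.find_by_person("pat")])
print([i["name"] for i in archive.find_same_day_as_calls("example")])
```

In this sketch the spreadsheet turns up in the second search even though it is not tagged with the person, which is the "something unrelated" case described above.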

"The problem is not that we keep too much with MyLifeBits. The problem is how to use it," Gemmell said.

Poorer nations will also be able to take advantage of these advances, even without an electrical grid. The Internet Archive has created bookmobiles in conjunction with Hewlett-Packard and others. The bookmobiles contain a printer hooked up to a satellite feed, which can print books for kids. Two are in operation in India, while another in rural Uganda prints about 1,500 books a week. The entire bookmobile, including the cost of the used van, is $15,000, and 100-page books cost about $1 to print and bind in the van.

"It takes about 12 to 15 minutes to make a book," Kahle said. "It is cheaper for a library in the United States to print and give away a book than retrieve it."