Microsoft Research seeks better search

Scientists in the software giant's labs are plugging away at one of the growing dilemmas in computing: so much data, so little time.

Michael Kanellos Staff Writer, CNET News.com
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas.

Microsoft Research is plugging away at one of the growing dilemmas in computing: so much data, so little time.

Scientists in the Redmond, Wash.-based software giant's labs are experimenting with new types of search and user interface technology that will let individuals and businesses tap into the vast amounts of data on the Internet, or inside their own computers, that will otherwise become increasingly impractical or impossible to find.

A prototype application called "Stuff I've Seen," for instance, will store every screen that has popped up on a given computer monitor for a year. Another prototype called "Ask MSR" allows users to pose queries using the natural flow of language, asking "Where is Saddam Hussein?" for example.

Over time, new varieties of search technology will become a common feature, like spell checkers, in desktop software.

Search tools "will start to show up in certain applications such as photo editing," Rick Rashid, senior vice president of Microsoft Research, said during an interview at an open house at Microsoft's Silicon Valley facilities. "Long term, it becomes a more central feature" of applications like word processing, he added.

Separately, Microsoft has said it plans to invest more in developing its own search technologies, an area where the software giant has lagged. The plan is aimed primarily at creating paid Web-based search similar to commercial search provider Overture Services, which provides advertiser-purchased links for Microsoft's MSN services. The investment could include acquisitions and research and development and may involve corporate search applications.

While search tools exist today, a major focus of Microsoft's research will be to allow for a freer flow of associations between data and to expand how searches can take place. Currently, data on computers is largely stored in a hierarchical fashion: A picture or document gets a file name and is stuffed into a folder. To find a document, people largely hunt and peck, a technique that also gets used on search engines.

People, however, don't think that way, Rashid said. To find a vacation shot from Australia using newer tools, for example, a person could ask a computer to pull up pictures that feature an ocean background or family members. A search engine inside an application would then comb through the visual images to get matches.

"The problem with hierarchies is this conceit that all knowledge has a place, but no single thing fits in one space," he said. "They become very cumbersome."

Microsoft's "Sapphire," another lab experiment, exemplifies the difference. The application lists associations with a word in a document. Scroll over a person's e-mail address, and Sapphire will pop up a balloon listing the person's instant message address, work title, recent publications, and lists of e-mail exchanges and meetings you've had with this person.

Storage glut
The explosion in storage capacity is adding urgency to the research. Currently, a terabyte of disk space costs about $1,600. In two to three years, it will only cost $400 and, consequently, become increasingly common.

A terabyte, however, can hold one person's entire conversations from a lifetime, or all the video recorded if someone wore a camera on his or her head for six months. More stored data and a vaster storage space make finding something all the more difficult.

"No one is going to search through that," Rashid said.

In the same vein, researchers in the lab have produced a spam tool, called "No Spam at Any (CPU) Speed," that attempts to cut data clutter by reducing spam. Currently, most spam tools depend on the challenge and response method: An e-mail comes in, and the recipient's computer sends a message back that forces the sender to prove the message isn't spam.

The problem with many challenge systems is that actual humans who want to get a message through have to identify themselves manually.

With No Spam, the sender's computer has to solve a cryptographic puzzle with its own processor to get its message into a recipient's in-box, said Cynthia Dwork, one of the architects of the software. If the sender doesn't have the puzzle generator, the recipient's computer will send a message to the sender with a link for downloading one.

The key is that the puzzle takes about 10 seconds to solve. "There are only 80,000 seconds in a day, so a computer can only send 8,000 messages in a single day," she said. As a result, machines that send out millions of spam messages a day would be substantially throttled.
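The article does not say what kind of puzzle No Spam uses, so the following is only a rough sketch of the general idea: the sender burns CPU time finding an answer that the recipient can check almost instantly. It assumes a hash-based puzzle (finding a number that makes a SHA-256 digest fall below a target), which is a stand-in chosen for illustration and not Microsoft's actual design. The function names and the `difficulty_bits` parameter are hypothetical. The last function reproduces the throughput arithmetic in Dwork's quote, which rounds the 86,400 seconds in a day down to 80,000.

```python
import hashlib
from itertools import count


def _digest_value(message: str, nonce: int) -> int:
    """Hash the message together with a candidate nonce and return it as an integer."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve_puzzle(message: str, difficulty_bits: int) -> int:
    """Sender side: try nonces until the hash has `difficulty_bits` leading zero bits.

    Expected work roughly doubles with each extra bit (an assumed tuning knob).
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _digest_value(message, nonce) < target:
            return nonce


def verify_puzzle(message: str, nonce: int, difficulty_bits: int) -> bool:
    """Recipient side: a single hash is enough to check the sender's work."""
    return _digest_value(message, nonce) < (1 << (256 - difficulty_bits))


def max_messages_per_day(seconds_per_puzzle: float, seconds_per_day: int = 86_400) -> int:
    """Upper bound on messages one processor can send if each costs one puzzle."""
    return int(seconds_per_day // seconds_per_puzzle)


if __name__ == "__main__":
    nonce = solve_puzzle("to:alice@example.com", difficulty_bits=12)
    print("puzzle verified:", verify_puzzle("to:alice@example.com", nonce, 12))
    print("daily cap at 10 s per puzzle:", max_messages_per_day(10))
```

With the exact 86,400 seconds in a day the cap works out to 8,640 messages, close to the 8,000 figure quoted; either way, a machine trying to send millions of messages a day would fall far short.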

Another project, called PageTurner, seeks to make software agents that retrieve updates from Web pages automatically more efficient. By observing the changes in 151 million Web pages each week over an 11-week period, Microsoft Research made a surprising discovery about the pace of change on the Web. It's slow. Nearly 65 percent of the pages don't change at all one week to the next.

"Ninety percent change less than a quarter, and 85 percent change less than 10 percent," said Marc Najork, one of the project leaders.

What does this imply for software agents? Designers may be able to devise ways to have them troll less and still obtain all the relevant data.
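One way a designer might act on that finding is to let an agent check a page less often the longer it stays unchanged. The sketch below is an assumption for illustration only, not a description of PageTurner or of any Microsoft agent: the function names, the doubling and halving rule, and the one-day and 28-day limits are all hypothetical.

```python
def next_interval(current_days: float, changed: bool,
                  min_days: float = 1.0, max_days: float = 28.0) -> float:
    """Pick the wait before the next fetch of a page.

    Halve the wait when the page changed since the last visit,
    double it when it did not, and stay within the assumed limits.
    """
    if changed:
        return max(min_days, current_days / 2)
    return min(max_days, current_days * 2)


def simulate(change_history, start_days: float = 7.0):
    """Return the sequence of wait times produced by a series of fetch results."""
    interval = start_days
    schedule = []
    for changed in change_history:
        interval = next_interval(interval, changed)
        schedule.append(interval)
    return schedule


if __name__ == "__main__":
    # A page that is unchanged twice, changes once, then is unchanged again.
    print(simulate([False, False, True, False]))
```

Under a rule like this, the roughly 65 percent of pages that do not change from week to week would quickly drift toward the longest interval, while frequently updated pages would keep being fetched often.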

Along the way, the group also discovered a pornographic Web server in Germany with 115,000 host aliases that was serving up an outrageous number of freshly minted Web pages. The technique allowed the server to spoof search engines and make them think these ephemerally produced pages were heavily trafficked. Since the discovery of that server, Najork said, the group has found 80 more doing the same thing.

"It is a very good way to fool search engines," he said. "Five percent of the data from Germany was polluted in this study."

Computerized language translation also is being improved. Microsoft recently used computers to translate 125,000 articles from its Knowledge Base, a compendium of technical how-to articles, from English to Spanish. Next up are Japanese, German and French. Using machines instead of humans for translation is expected to cut costs.

"Microsoft is one of the biggest consumers of translation services," Rashid said. "For us, it is an enormous expense."

New interfaces
The company also is working on new types of interfaces. In GWindows, a person can scroll through files or move windows through a combination of voice commands and hand gestures, said Andy Wilson, the project's designer. Special gloves are not needed.

"It looks for moving parts of your body," he said. "With this, you might be able to have speech recognition work on an open microphone."

The project in part relies on Bayesian mathematics, sources said, which is influencing other interface and artificial intelligence projects at Microsoft.

While the system currently only recognizes a few simple gestures, it will expand. Volume in a future version of Windows Media Player could be adjusted, hypothetically, with a wrist-twisting motion, Wilson said. He added that a friend advised the special effects team on "Minority Report" on building Tom Cruise's gesture-driven PC screens.

A similar project, called Whiteboard, records the entire audio and visual history of a presentation on a whiteboard. As a result, people who missed the meeting can go back and look at the entire presentation or parts of it.