The software maker is exploring ways to make search systems a greater part of its Windows operating system and to make searches better tailored to individual needs.
The Redmond, Wash., software giant is experimenting with different search technologies that will, among other tasks, conduct Google-like searches on an individual's hard drive or categorize query results in different ways intended to make the data easier to digest.
The experimentation may make search a greater part of the Windows operating system, and the results could appear in the forthcoming Longhorn OS. The research could also spell competition for Google.
Implicit Query, an experimental application that was put together a few weeks ago, for example, retrieves links, music files, e-mails and other materials that relate to applications running in the foreground, according to the company.
"We analyze whatever text you are working on and then pull out words that are important and query on those automatically," said Susan Dumais, a senior researcher in the Adaptive Systems and Interactive Group at Microsoft Research. "The idea is to retrieve a bunch of things without you explicitly searching for them."
Microsoft is also looking at integrating these tools directly into operating systems and applications. "I don't want to stop everything I am doing. Bring the search results to me," Dumais said. "People spent a lot of time essentially acting as a file clerk."
Building a search system that links the many incompatible files has long been an elusive goal for Microsoft, and a pet project of Chairman Bill Gates.
With Longhorn, Microsoft intends to finally deliver software that can link the documents, e-mail messages and Web pages that exist in separate, largely incompatible software silos. Longhorn will include an underlying technology called WinFS, derived in part from Microsoft SQL Server, that will allow applications to pull data from a unified database.
Right now, the kind of application dictates how data is stored. Databases are typically used for more numerically oriented applications, such as storing bank account information, while file systems are usually used for document-centric applications with unstructured data types. The problem is that retrieving information from different storage systems is a challenge, at best.
WinFS seeks to bridge the worlds of unstructured documents and data stored in relational databases with a common storage and look-up mechanism. If Microsoft is successful, the net result would likely be greater data interoperability and much improved viewing and searching.
The tools could also permit Microsoft to undermine the utility of commercial search engines such as Google by making its own software the easiest place to initiate an investigation. Spell-checkers, after all, were once independent applications too.
"They don't want to rely on someone else's technology," said Matt Rosoff, an analyst at Directions on Microsoft. "Microsoft's point of view is that it has the right to include pretty much what it wants to in Windows, and they look at search as one of those things people do with computers."
Dumais declined to comment on whether or when the search tools developed by Microsoft Research would be included in shipping products, noting that many of the ideas have just been devised.
Still, some of the work is already being tested fairly extensively. Over 1,000 internal users at Microsoft are already using "Stuff I've Seen," a research project that conducts hard-drive searches, and Dumais' group is conducting interviews with these beta users to determine how people actually use search.
Search, in the Microsoft view, is ubiquitous, but not very efficient. A fairly simple query can generate 20 or more screens of results. The results are also generally not well tailored to an individual's taste or the context of their needs.
"Search in many ways is brute force," Dumais said. "If the two of us type in a query, we get the same thing back, and that is just brain dead. There is no way an intelligent human being would tell us the same thing about the same topic."
Personalization was one of the big buzzwords of the early years of the dot-com era, but many of the efforts to deliver individualized content failed. Software developers, however, are becoming increasingly adept at using Bayesian models and other probabilistic techniques to insert intelligence into software.
Although the underlying calculation in these models is complex, the overriding concept is fairly simple. Software keeps tabs on an individual's Web surfing habits, interests, acquaintances, work and travel history, work projects, and other data. It also constructs a model that tries to anticipate what a person finds important and what will be irrelevant.
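To make the concept concrete, here is a minimal sketch of one common Bayesian technique, a naive Bayes classifier with add-one smoothing, applied to deciding whether an item is likely to matter to a user. The article does not say which models Microsoft uses, so the class name `NaiveBayes`, the feature names, and the labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny naive Bayes classifier over categorical features (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()               # how often each label was seen
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()                     # every feature seen in training

    def train(self, features, label):
        """Record one labeled example."""
        self.label_counts[label] += 1
        for f in features:
            self.feature_counts[label][f] += 1
            self.vocabulary.add(f)

    def log_score(self, features, label):
        """log P(label) + sum of log P(feature | label), with add-one smoothing."""
        total = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total)
        denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
        for f in features:
            score += math.log((self.feature_counts[label][f] + 1) / denom)
        return score

    def predict(self, features):
        """Return the label with the highest score for these features."""
        return max(self.label_counts, key=lambda lbl: self.log_score(features, lbl))
```

Trained on a handful of past meetings tagged with made-up features such as `"recurring"` or `"executive"`, the model would score a new one-off meeting with a senior executive as more likely to be important than a weekly team meeting.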
"I have the same meeting every week with the same people. Maybe that isn't so important," Dumais said. "I have a meeting with Bill G. (Gates) He's pretty high on the org chart. Maybe that one is important."
Microsoft's experiments differ from commercial search engines in that the universe of data searched consists of data found on an individual's hard drive. Although a smaller universe, it's a well-traveled one. Studies cited by the company suggest that up to 81 percent of Web pages accessed are repeat visits. Hence, the links someone wants to see are likely on his or her hard drive.
There is also no theoretical reason the scope of these types of searches couldn't be extended, which would allow Longhorn or other search-enhanced applications to compete with commercial search engines. Dumais pointed out that search queries could take into account the geographic location of the PC used in the search.
In demonstrating Implicit Query, Dumais began to type an e-mail asking a colleague about a set of slides for an upcoming conference. Before the message was complete, the program--which appears in a window on the side of the screen--pulled up e-mails, slide decks and Word documents containing the name of the conference and the future recipient. Each hit came with a brief summary of the internal content, date, the type of software the file was written in, and its potential relevance, among other information.
By incorporating this functionality into existing applications, users could more easily obtain attachments. Dumais recalled once writing a note to inform a colleague that a link on one of her group's sites was broken. Before she sent it, Implicit Query showed her an unopened e-mail in her in-box that contained a fix.
Stuff I've Seen essentially conducts the same type of searches, but it doesn't work automatically. Some commercially available systems, such as Apple Computer's Sherlock and Microsoft's own Windows file search, already perform some limited indexing functions, but Stuff I've Seen covers a broader array of files, including earlier accessed Web links and e-mails, according to Microsoft's published papers.
Memory Landmarks, meanwhile, is a mnemonic device recently developed at Microsoft Research. The application examines a chronological list of search results and then inserts landmarks that might help individuals more rapidly pinpoint the results they seek.
If a major election took place in November, for example, or an individual downloaded an inordinate number of pictures in December and put them in a file entitled "Vacation," small windows noting these significant events appear on the side of the search results. Lines connect the graphic to a point in the results, sort of like a display of tree-ring dating in a natural history museum.
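One way to picture the landmark idea is as a merge of two dated lists: the search results and the significant events. The sketch below interleaves them into a single timeline. The function name `annotate_with_landmarks`, the tuple layout, and the sample dates are assumptions for illustration; the article does not describe how Memory Landmarks is implemented.

```python
from datetime import date

def annotate_with_landmarks(results, landmarks):
    """Interleave landmark events with search results in chronological order.

    Both arguments are lists of (date, label) tuples. Each entry in the
    returned timeline is (date, kind, label), where kind is "landmark"
    or "result".
    """
    timeline = [(d, "result", label) for d, label in results]
    timeline += [(d, "landmark", label) for d, label in landmarks]
    # Sort by date; on a shared date, "landmark" sorts before "result".
    return sorted(timeline, key=lambda item: (item[0], item[1]))

# Hypothetical sample data.
results = [(date(2003, 11, 2), "budget.doc"), (date(2003, 12, 20), "beach.jpg")]
landmarks = [(date(2003, 11, 4), "Election"), (date(2003, 12, 15), "Vacation photos")]
timeline = annotate_with_landmarks(results, landmarks)
```

A display layer could then draw each landmark beside the results that fall nearest to it in the timeline.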
Work in progress
Of the three applications, Stuff I've Seen is by far the most advanced, but work remains. For the application to function, all the data on a given hard drive has to be indexed, which can be a drag on performance.
Dumais' group, for instance, recently discerned through user interviews that people generally want to view documents sorted by date rather than by rank of importance. Names are big in queries.
"Date", however, is an evolving concept. In e-mail, the operative date is the day the message was sent. In meetings, it's the day the meeting took place. An early version that categorized meetings by the date the meetings were originally scheduled generated complaints about perceived bugs in the first few hours it went out.
One of the current projects is studying a group of individuals who have become acclimated to using Stuff I've Seen to the point that they don't store documents in segmented files anymore.
"We're calling it Flatland," Dumais said. "We are now working with these people to try to understand what it is to live (there)."