Making the real-time Web relevant

With an explosion in tweets, blogs, and other instantly published content flooding the Web, search engines are scrambling to organize information that's happening right now.

Tom Krazit Former Staff writer, CNET News
Tom Krazit writes about the ever-expanding world of Google, as the most prominent company on the Internet defends its search juggernaut while expanding into nearly anything it thinks possible. He has previously written about Apple, the traditional PC industry, and chip companies.

If there is perhaps one universal truth about the Web, it's that people want it now.

During the past 15 years, our expectations for how quickly information should be delivered to us over the Internet have changed. Now a delay of minutes on a breaking news story is unacceptable, as we saw during the frantic search for information in the hours after Michael Jackson died last year.

Enter real-time search. Search has been our gateway to the Web for almost as long as it has existed, and the big search players of the day are gearing up to handle a new challenge: how can the explosion of instant content produced by news organizations, blogs, and social-media users be organized in a relevant fashion, sorting through one of the worst signal-to-noise ratios in modern communication? Oh, and by the way, those results have to be displayed instantly.

Google's Amit Singhal announces the company's real-time search strategy in December. Stephen Shankland/CNET

"If information was generated seconds ago that's relevant to what I am looking for, it should be available to me in one place," said Amit Singhal, a Google Fellow and a legend in the search industry who is responsible for Google's real-time search project. "It's awfully hard."

It's been about four months since Google integrated real-time results into its pages, and a bit longer since Google and Microsoft cut deals with Twitter to bring that service's "firehose" feed directly into those companies. Real-time search today is in its infancy, but it's the next stage in the evolution of Internet search.

Time to get real
So, what is "real-time" content? There are nearly as many definitions as there are companies scrambling to get their names associated with one of the more hyped developments in Internet publishing.

Most people agree it centers on the concept of microblogging, or instant publishing of content to the open Web from social-media services. But in practice, "real-time search is still primarily Twitter search," said Danny Sullivan, editor of Search Engine Land.

Microsoft's Paul Yiu, one of Bing's leading real-time search experts, agreed. Bing has centered almost all of its real-time search efforts on its Bing.com/twitter page. The 140-character service is the undisputed king of "what's happening now" status updates, and it continues to grow on the strength of high-profile moments such as the uprisings in Iran and the landing of a jetliner in the Hudson River.

Beyond Twitter, however, Yiu thinks there are two components to real-time information: the actual content of the status update or post, and the link that is being shared within that update. Both parts are relevant to a searcher's query, Yiu said.

Tobias Peggs, president of start-up OneRiot, has built an entire company on the premise that the link being shared within the status update is more relevant than the message itself. When you search for a topic with the intent of finding out what's happening with, say, the bombings in Moscow last week, OneRiot analyzes the links being shared within status updates and user-controlled sites like Digg to determine the most relevant pieces of content being shared at a given moment.
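
As a rough illustration of ranking by shared links rather than by message text, the sketch below counts how many status updates carry each URL and lists the most-shared links first. It is a minimal sketch, not OneRiot's actual method, which the company does not detail here; the function name `rank_shared_links`, the URL pattern, and the sample updates are all assumptions made for the example.

```python
import re
from collections import Counter

# Simplified URL matcher (an assumption; real feeds need more careful parsing).
URL_PATTERN = re.compile(r"https?://\S+")

def rank_shared_links(updates, top_n=3):
    """Count how many updates share each URL and return the most shared first."""
    counts = Counter()
    for text in updates:
        # Count a link once per update so one post cannot inflate its total.
        for url in set(URL_PATTERN.findall(text)):
            counts[url] += 1
    return counts.most_common(top_n)

# Hypothetical status updates, invented for illustration.
updates = [
    "OMG, those Moscow bombings are really bad",
    "Full report: http://example.com/moscow-story",
    "RT Full report: http://example.com/moscow-story",
    "Must read http://example.com/moscow-story",
    "Another take http://example.org/blog-post",
]

ranked = rank_shared_links(updates)
for url, shares in ranked:
    print(shares, url)
```

The update with no link never enters the ranking, while the story shared three times rises to the top. A production system would presumably also weigh who is sharing and how recently.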

"We filter through that real-time social noise and extract the useful signal," Peggs said, surfacing the definitive Los Angeles Times story about the bombings being retweeted by thousands of users as opposed to a tweet that says "OMG, those Moscow bombings are really bad."

Relevant to my interests
Real-time search starts by determining that something important is happening in, well, real time.

The major search players have the luxury of comparing spikes in their search query logs with spikes in certain topics from the feeds they receive from real-time information sources like Twitter. When activity around the same topic is spiking on both search query traffic and real-time publishing platforms, the search companies know something is happening.
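
The double-spike idea can be sketched in a few lines: flag a topic only when both the search query counts and the real-time feed counts for it jump well above their recent baseline. This is a hedged illustration, not how Google or Bing actually do it. The z-score test, the threshold of 3, and the function names `is_spiking` and `something_is_happening` are assumptions, and the counts are made up.

```python
from statistics import mean, pstdev

def is_spiking(counts, threshold=3.0):
    """True if the latest count is more than `threshold` standard deviations
    above the mean of the earlier counts."""
    *history, latest = counts
    baseline = mean(history)
    spread = pstdev(history) or 1.0  # fall back to 1.0 when history is flat
    return (latest - baseline) / spread > threshold

def something_is_happening(query_counts, feed_counts):
    """Require a spike in BOTH search queries and real-time posts."""
    return is_spiking(query_counts) and is_spiking(feed_counts)

# Hypothetical per-minute counts for the topic "earthquake".
queries = [12, 9, 11, 10, 13, 240]
posts = [30, 28, 35, 31, 29, 900]
quiet_posts = [30, 28, 35, 31, 29, 33]

print(something_is_happening(queries, posts))        # both series jump
print(something_is_happening(queries, quiet_posts))  # only queries jump
```

Requiring agreement between the two signals is the point: a burst of queries alone, or a burst of posts alone, is weaker evidence than both at once.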

Sunday's 7.2-magnitude earthquake shook the U.S.-Mexican border near San Diego and triggered Google's real-time search results box. Screenshot by Tom Krazit/CNET

"Earthquake" is the classic Silicon Valley example, and the 7.2-magnitude quake that rattled San Diego and Northern Mexico put the system to the test. But celebrity deaths, political events like the passage of the health care bill, and major sporting events will trigger Google's scrolling real-time results box, Singhal said.

At that point, Google starts evaluating the relevance of its real-time content sources in order to determine what to surface in that box. Three things count: quality, or the spam-or-real question; the authority of the author of the content, determined by a PageRank-like algorithm that gets beyond mere follower counts to evaluate the quality of one's followers; and semantic evaluation, using Google's language data to filter status updates that may share characters but are unrelated ("gm cars" as distinct from "gm foods").
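
To make the second of those signals concrete, here is a minimal sketch of a PageRank-style authority score over a follower graph. It is not Google's algorithm, whose details are not public. It is the textbook PageRank iteration applied to "who follows whom," with a hypothetical graph, a hypothetical function name `author_authority`, and the conventional damping factor of 0.85 as assumptions.

```python
def author_authority(follows, damping=0.85, iterations=50):
    """Textbook PageRank over a follower graph.

    `follows` maps each account to the list of accounts it follows, so
    authority flows from a follower to the accounts it follows.
    """
    accounts = set(follows)
    for followed in follows.values():
        accounts.update(followed)
    n = len(accounts)
    rank = {a: 1.0 / n for a in accounts}
    for _ in range(iterations):
        # Accounts that follow nobody spread their score evenly over everyone.
        dangling = sum(rank[a] for a in accounts if not follows.get(a))
        new_rank = {a: (1 - damping) / n + damping * dangling / n for a in accounts}
        for follower, followed in follows.items():
            for target in followed:
                new_rank[target] += damping * rank[follower] / len(followed)
        rank = new_rank
    return rank

# Hypothetical graph: "spammer" has three followers that nobody follows,
# while "expert" has one follower, "newsdesk", which is itself widely followed.
follows = {
    "bot1": ["spammer"], "bot2": ["spammer"], "bot3": ["spammer"],
    "reader1": ["newsdesk"], "reader2": ["newsdesk"],
    "reader3": ["newsdesk"], "reader4": ["newsdesk"],
    "newsdesk": ["expert"],
}

scores = author_authority(follows)
print(scores["expert"] > scores["spammer"])
```

In this toy graph the account with a single well-followed follower outscores the account with three obscure ones, which is the sense in which such a score "gets beyond mere follower counts."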

It's not an exact science at this point. Anyone who has watched a real-time search results stream during a breaking news event (try it tonight during the Duke-Butler NCAA championship game) will see lots of off-topic chatter from Twitter users with 10 followers and blog posts repurposing blog posts that repurpose blog posts.

"You don't have the opportunity to do the same kind of relevance and ranking on information that's coming in in real time than you would have the opportunity to do otherwise," said Shashi Seth, senior vice president for search at Yahoo.

Feeding the beast
Everyone in the business of real-time search takes great pains to count a wide range of sources as part of their curation of the real-time world. But, as noted above, we're really talking about Twitter.

"Twitter is an amazing story in that they are one of the few companies that has gotten Google to cough up money for content," Sullivan said. Google is usually loath to pay for content, outside of a few deals with the Associated Press and others, but it eagerly forked over what is believed to be several million dollars for the right to access Twitter's "firehose." Business Week reported in December that Twitter cleared $25 million from its deals with Google and Microsoft, a figure that has not been confirmed.

Why spend the money? It's simply too difficult to crawl Twitter the way traditional search engines crawl the Web. All three major search engines at this point have inked deals to have Twitter push its content directly to them, saving those companies (and Twitter) time, energy, and money.

But that raises a question: if the quality of real-time search is so dependent on the willingness of private companies to license their content, could real-time search be fragmented by business concerns? Purely as an example, could Google lock up Twitter's content and Microsoft lock up Facebook's content, each in an effort to out-real-time the other?

For the time being, it seems Twitter is taking the high road by offering to license its content based on what users are able to pay, recently turning on access to several smaller developers. And many expect the search community to eventually settle on standards for real-time streaming such as the Google-backed PubSubHubbub.

Still, if publishers like Rupert Murdoch ever follow through on threats to sign exclusive deals with particular search engines, it's not a stretch to imagine that real-time publishers will also come to understand the true value of their content. Twitter is today's real-time darling, but it's not unimaginable that it could be the Myspace of 2015's SXSWi.

This is for real
There's no going back to a delayed publishing model for media companies: deadlines are dead in the real-time world. And more and more regular people realize every day that there is an audience for the thoughts, rants, and banal moments in their day-to-day lives.

The result is a content explosion, the likes of which crushed Google CEO Eric Schmidt's dream of one day indexing the entire Web. That may have been a pipe dream to begin with, but it's definitely not going to happen now.

So if search engines are to remain relevant themselves, they'll need to make sense of this content. And unless social-media networks are able to make their content discoverable, they won't turn into the types of content-discovery engines that their public-relations people like to imagine are already here.

Expect the importance of real-time search to only grow over the next several years. For example, Yahoo's search deal with Microsoft does not include real-time indexing and ranking efforts, as the company believes that it's too important to give away.

"We think of (real-time search) as a very strategic and important asset, and we are going to continue to invest in it in a big way," Seth said.