Microsoft's voice platform to get a 'brain'

Microsoft's voice team says it's on the cusp of getting its voice platform to understand not just what you're saying, but what you actually mean.

Josh Lowensohn Former Senior Writer

Dec. 8, 2010 4:00 a.m. PT

TellMe voice search on Windows Phone 7. (Credit: Josh Lowensohn/CNET)

Microsoft wants to make its voice platform a little more decisive.

Over the years, Microsoft's speech technology has grown increasingly capable of figuring out what people are saying, and of letting them run voice-powered searches and commands on devices besides the phone. What's been missing is the second part of the equation: a deeper understanding of what those words mean and the context behind them.

To that end, Microsoft is in the process of building what it's calling "conversational understanding" (CU), which mixes speech, a dictionary, grammatical structures, and machine learning to better figure out what users are saying so that the system can spit out an answer that takes into account all those things.

While there's not yet a Microsoft-created product or a service available that does this, the vision for CU is coming together, Zig Serafin, the general manager of Microsoft's speech group, told CNET.

"Everything that we've been doing up to this point has been knowing what people are saying," Serafin said. "If you use the analogy of a human, it's like having a really good ear. Did I hear what you were saying while you were out on the go while you were on the corner of Market and San Francisco, and did I hear it well enough to be able to give the response you wanted?"

The next step, Serafin explained, was to get those words to do more than start a Web search, make a phone call, or launch an app.

"Where things are going, and where we're right on the cusp of moving into is the brain element of the system. And that is understanding meaning," Serafin said. Making that a reality has meant getting the various pieces of Microsoft's speech technology to work together.

That infrastructure is made up of a handful of technologies, both consumer and enterprise. Names you might recognize include TellMe, Bing's 411 service and its iPhone app, the voice search on Windows Phone 7, and Sync in the car. More recently, the technology has popped up on the Xbox 360 as part of Kinect, Microsoft's first implementation of an always-on microphone system that keeps an ear open for voice commands instead of requiring a button press.

Voice recognition on the Xbox 360 is done through Kinect's built-in microphones, and uses the system's audio processing to cancel out noise from games and applications. (Credit: Microsoft)

Most of these systems revolve around finding out what users are saying, then feeding that back into the cloud. In some cases, though, commands are simple enough that they don't need to phone home. For instance, something like "play (song name)" or "call mom" can be processed locally, but anything that falls outside that short list of commands gets sent to Microsoft for an answer.
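To make the local-versus-cloud split concrete, here is a minimal sketch of how such routing could work. It is an illustration only, not Microsoft's implementation: the command patterns, the `route_utterance` function, and the action names are all hypothetical, and the input is assumed to be text that a speech recognizer has already produced.

```python
import re

# Hypothetical on-device grammar: a short list of commands simple enough
# to handle without a network round trip. These patterns are assumptions
# made for illustration, not an actual Microsoft command set.
LOCAL_COMMANDS = [
    (re.compile(r"^play (?P<song>.+)$", re.IGNORECASE), "play_song"),
    (re.compile(r"^call (?P<contact>.+)$", re.IGNORECASE), "call_contact"),
]


def route_utterance(text):
    """Decide where a recognized utterance should be handled.

    Returns ("local", action, slots) when the text matches the short
    on-device command list, otherwise ("cloud", None, {"query": text})
    to signal that the request should be sent to a remote service.
    """
    cleaned = text.strip()
    for pattern, action in LOCAL_COMMANDS:
        match = pattern.match(cleaned)
        if match:
            return ("local", action, match.groupdict())
    return ("cloud", None, {"query": cleaned})


if __name__ == "__main__":
    for phrase in ["call mom", "play Yesterday", "find dinner for two tonight"]:
        print(phrase, "->", route_utterance(phrase))
```

In this sketch, "call mom" and "play Yesterday" match the on-device list, while the dinner request falls through and is marked for the cloud.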

The idea behind CU is to take all this one big step further by hooking into buckets of data--be it third-party sites or private data feeds--to add context to user queries and figure out what the user is trying to do. To that end, it's not all just about search.

"[For] the application of conversational understanding, certainly search is one, but it's much, much broader," said Ilya Bukshteyn, Microsoft's senior director of marketing for TellMe, the voice company Microsoft bought in 2007 and later folded into its speech group. "Understanding intent on search is going to be key to actually helping you complete your task instead of just finding data," he said.

Bukshteyn detailed a system in which Microsoft would be able to take a task like planning dinner for two people and break it down into queries that draw on data from various places, such as calendars, restaurant ratings, and location.

"All of that data is actually available in different places," Bukshteyn said. "So having an engine and a service that can look in all those places--looking around your calendar, your past history, places you have in common that you may have been to, and then can assist you by giving you a few places to choose from, and then finalize that reservation we think is going to be of tremendous value."
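As a rough illustration of the dinner example, the sketch below combines three kinds of data (two calendars, restaurant ratings with distance, and shared history) into a short list of suggestions. Everything here is a hypothetical stand-in: the `Restaurant` type, the `free_evening_hours` and `suggest` functions, the 5 km cutoff, and the ranking rule are assumptions for the sake of the example, since Microsoft has not described how its system would actually work.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    """Hypothetical record merged from a ratings feed and location data."""
    name: str
    rating: float          # assumed 0-5 scale from a ratings source
    distance_km: float     # assumed distance from the user's location
    visited_by_both: bool  # assumed flag derived from both diners' history


def free_evening_hours(busy_a, busy_b, candidates=(18, 19, 20, 21)):
    """Return the candidate hours (24h clock) free on both calendars."""
    return [h for h in candidates if h not in busy_a and h not in busy_b]


def suggest(restaurants, max_km=5.0, top_n=3):
    """Pick a few nearby places, favoring shared favorites, then rating."""
    nearby = [r for r in restaurants if r.distance_km <= max_km]
    ranked = sorted(
        nearby,
        key=lambda r: (r.visited_by_both, r.rating),
        reverse=True,
    )
    return [r.name for r in ranked[:top_n]]


if __name__ == "__main__":
    places = [
        Restaurant("A", 4.5, 2.0, False),
        Restaurant("B", 4.0, 1.0, True),
        Restaurant("C", 5.0, 10.0, False),
    ]
    print("Free hours:", free_evening_hours({18}, {20}))
    print("Suggestions:", suggest(places))
```

The point of the sketch is the shape of the task rather than the ranking itself: a single spoken request fans out into several lookups whose results are merged before the user is asked to choose.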

The secret, of course, is getting that process started by telling your phone you simply want to go out to dinner that evening. "This is effectively where Microsoft's speech tools are headed," Serafin said.

Serafin's outline echoed comments made last month by Yusuf Mehdi, Microsoft's senior vice president of Online Audience Business, about the company's goal of getting Bing to consolidate multistep tasks into one action. The system Serafin described would make the number of apps users have installed on their phones, and the need to use them all, less critical.

"This area where you're actually able to complete tasks that may have taken you multiple keystrokes, may have taken you multiple apps...In this world of understanding, you actually get into an environment where you can assist the user in what they'd like to get done," he said.

As for when all this is coming, Serafin wouldn't say. "There's implementation that we're building on this basis, and you'll see more forthcoming on it," he said. "What we're highlighting is the strategy behind it, and how it actually makes use of what we've built up until this point."