Things work--as long as you speak slowly. What's more, the extent of a conversation is constrained by the limited vocabulary a computer understands. Another problem: At up to $1 million for an average installation, the current technology remains beyond the reach of many businesses.
But steady advances in technology have made researchers more confident about a coming breakthrough in speech recognition and activation. Microsoft earlier this summer released a beta version of a product due out in 2004 that aims to enable servers to better handle oral commands.
Kai-Fu Lee, a computer scientist whom Microsoft hired away from Carnegie Mellon University to run its speech technologies group, known as the Natural Interactive Services Division, or NISD, recently spoke with CNET News.com about the industry's progress.
Q: What's the level of speech recognition quality users should expect to see over the next six to 12 months?
A: Speech technology has made great progress. Every year, it's 10 percent to 15 percent better than the year before. So, I think that over the next seven years, we will reach or exceed human-level performance.
How good will be good?
We'll have a person sitting in front of a machine and another person who'll be asked to speak in a natural voice.
What about the science behind all this? What's the methodology computer scientists will be using to teach computers to better recognize language?
Think about building a statistical model of every sound imaginable in every language in the world, collecting lots of data spoken by a lot of people and teaching what it actually sounds like. The more examples to which a machine is exposed, the more it can generalize beyond that.
This is very similar to the way people learn. It's just like a baby. A baby hears "Mom" and "Dad" a lot, and those are the first words it learns. Then, they generalize to maybe a name and then to similar-sounding words.
For the name "Mary," the baby generalizes the "M" in "Mary" to "Mom." The more they hear it, the more they build that generalization. We're doing that with machines. It's just that machines need more guidance; babies don't need to be told it's a man or woman speaking.
On a practical level, it seems that the speech recognition applications that exist today are better than before but still are not there yet.
I'm not saying it will take another seven years for systems to be good enough. What we're doing today is not bad, though it's not yet useful unless you have a repetitive stress injury. But there are many applications for which speech is actually better than people. Directory assistance is a good example. Microsoft has about 50,000 names in its directory, and machines can do a better job than people for 50,000 name recognition.
Where, then, are the roadblocks? Is it a matter of the available technology?
Technology development is not anything I worry about. I don't think there's any bottleneck or constraint. A problem today is this: Where are the applications that will drive developers to build applications? And what are the business imperatives that will drive speech?
Is Microsoft Speech Server going to become an integrated part of Windows?
No. It is an independent product like SQL or Exchange. It fully utilizes Windows Server 2003 capabilities and leverages all Windows functions fully. But these are separate product functions.
What's the top item on your agenda?
Job No. 1 is to build affordable speech solutions for enterprises and developers. We'll have another updated beta this year and ship a finished product in the early spring of 2004 with very attractive price points. We'll have a scalable solution that's easy to maintain and manage.
You've talked about the viability of the social interface that was planned as a part of Microsoft Bob (with an active agent responding to user commands). Do you expect to see that vision realized sometime soon?
Not in the near term. It's a long-term thing. I strongly believe in the vision behind Bob, which was way before its time, but perhaps, it did not address customers with the right technology. You'll have the user describing the goal he or she wants to accomplish and the computer figuring out the complexity.
So I could say, "Cancel the next meeting with Charlie today." Speaking with Outlook, you'd then say, "Open Outlook, open calendar, find Charlie, cancel." The computer has to have the smarts to break down this delegation to executable steps.
We have to do a number of key technology revolutions. First is that data on the PC has to become structuralized, and it has to be in XML that can be quickly retrieved and reasoned with. Second, there has to be a way to do more than retrieval, and there have to be more verbs the system can understand.
People also have to be taught to talk to the computer. This will happen. It will just take time. I don't see that happening within the next five years but definitely in the next 10. The human social interface is by speech, not typing. If you want a natural, social interface, you have to talk--and people will expect a more anthropomorphic response on the other side.
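Editor's illustration: the delegation Lee describes in this answer, where a single spoken goal is broken down into executable steps, can be sketched roughly as follows. This is a toy example, not Microsoft's design. The function name `plan_steps` and the pattern it matches are hypothetical, and the step strings are simply the ones quoted in the interview rather than real Outlook commands.

```python
import re


def plan_steps(command: str) -> list[str]:
    """Translate one high-level request into the low-level steps from the interview.

    Hypothetical sketch: it recognizes a single sentence pattern and returns
    plain-text step descriptions. It does not call any real Outlook API.
    """
    match = re.fullmatch(
        r"cancel the next meeting with (\w+) today\.?",
        command.strip(),
        flags=re.IGNORECASE,
    )
    if match is None:
        # Anything outside the one known pattern is rejected, not guessed at.
        raise ValueError(f"unsupported command: {command!r}")
    person = match.group(1)
    return ["open Outlook", "open calendar", f"find {person}", "cancel"]


print(plan_steps("Cancel the next meeting with Charlie today."))
```

A real system would need far more than one hard-coded pattern, which is the point Lee makes about structured data and a richer set of verbs.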
And will the systems be 100 percent conversant?
I wouldn't say 100 percent. I would use human as a bar. So, talking to Outlook, for example, a human would expect it to work as well as a human assistant. Humans make errors, and you can't expect machines to make fewer errors. We're comparing to humans in a domain-specific way. I'm saying the human assistant within Outlook will be as good as your assistant working with Outlook.