Sorting the ABCs of speech recognition

A new standards battle may be brewing, but SpeechWorks CEO Stuart Patterson says the inevitable confusion--and looming mess--will not derail the spreading use of speech recognition.

Charles Cooper Former Executive Editor / News
Charles Cooper was an executive editor at CNET News. He has covered technology and business for more than 25 years, working at CBSNews.com, the Associated Press, Computer & Software News, Computer Shopper, PC Week, and ZDNet.
Speech recognition remains one of the most deferred dreams of the computer industry. But with more businesses, communications carriers and Internet portals offering automated, speech-activated services over the telephone and on personal digital devices, SpeechWorks CEO Stuart Patterson says this Holy Grail may soon be within reach.

To be sure, the technology has received its fair share of attention, though the claims have sometimes raced ahead of reality. Perhaps the biggest potential obstacle to faster adoption is the looming split between advocates of different industry standards. One approach is supported by the likes of Cisco Systems, Intel and Microsoft, as well as by SpeechWorks, while the other is favored by IBM. If this standards confusion follows the pattern of other spirited battles in the technology industry, developers may again be forced to pick sides.

Until the mess gets sorted out, SpeechWorks is holding its own in this still-developing market. For instance, it held a bigger share of the U.S. automated speech-recognition software market last year than either Nuance or IBM, according to market researcher Frost & Sullivan.

CNET News.com recently spoke with Patterson about the future of voice and speech recognition as well as the possibility of a standards battle.

Q: Speech recognition is one of the most deferred dreams of the computer industry. What needs to happen before it really becomes a practical consumer tool to access information, and how long before we reach that point?
A: I think it's already a practical tool. People talk about technology risks. I think we've proven that the technology risk is no longer the brake on the market.

So what is?
I think some of it is the natural speed of things that haven't reached fad or vogue status. The emergence of technologies doesn't happen in three or five years. It can take 10 or 15 years to happen, and up until two or three years ago, speech recognition didn't have much traction except among the early adopters. My view is that it's just part of the natural order.

I came across an old quote where one industry analyst said that speech-interface technology would be ready for Christmas 2001, which he predicted would be the coming-out party for voice-enabled e-commerce. That was one of the bigger misses. What happened?
I don't think it's that far away...At most, the guy was off by six to 18 months. The thing that's missing is really the follow-me behavior--the feeling that enough people in my vertical market are doing it, that I must do it also. If you look at CRM (customer relationship management) or ERP (enterprise resource planning) or even the Web itself, that was the behavior that kicked in at some moment. I don't think we're that far from it in speech (technology).

What do you think needs to happen before Web voice-access software and services take off?
How do you describe the amorphous moment? I don't know how we define it. I see voice in employee productivity, where companies are saving on the amount of money they spend on 800 numbers and don't have to use live customer operators; that's happening now. If you're talking about more speculative areas, then maybe a year or two...Just like the Web, this will take more experimentation. It will be rolled out by carriers, and they move more slowly.

Voice portals had a lot of promise, but you don't hear that much about it anymore. What's your assessment of that business?
I think the voice portal is a very unproven model as a stand-alone business. I think it's a wonderful extension to options the user has for getting connected. But as a stand-alone, I think it just didn't prove out.

But how many commercial applications will you see incorporating voice? It won't be a browsing medium, for instance, or will it?
Voice won't be a browsing tool for some time. However, it will be very large-scale in 2002 and 2003 in business-to-business and business-to-consumer applications for customer service and employee productivity. The economics of that market are just obvious now to many, many people.

Natural language capability is already in the market, but how long before you'll have servers able to process big, intricate sentences?
It's a complex question. We can process intricate sentences now. If you call a brokerage and say, "I want to buy 100 shares of IBM at market," there are a billion permutations; you can throw in all kinds of garbage words, and it will still get it. I think it will take many years before we break out of that domain. If you call a travel agent, you're not going to talk stock; you're going to talk travel. And we can provide a lot of natural language in that context.

You don't usually spit out things in streams of consciousness when you call up. So, I don't think the lack of completely context-free, mixed language will be a barrier to very friendly, conversational systems. It is important that callers feel they're having a conversation with the system and that they don't feel it was stilted or unfriendly.

You've got the first spec of SALT (Speech Application Language Tags) out there. After you collect comments from developers, the next step is what--to submit it to the W3C?
Yes. I think the W3C is the most likely standards body to submit it to. We'll wait until 1.0 before turning it over.

When do you expect that to take place?
In the next couple of months.

And then what?
Then we hope it gets some serious traction and backing. There is going to be a lot of discussion about whether and how we might be able to bring VoiceXML and SALT together. We might have two standards out there.

Standards are always the key--in this case, a common way to build software to offer Web information and services over the phone. But you've also got VoiceXML out there (an effort led by a group of companies including IBM, Motorola, AT&T and Lucent Technologies). I don't want to sound like an old skeptic, but this sounds like deja vu all over again, what with the countless other standards battles that have marked the computer industry. Am I missing something?
I think that's true. It is not as clear as everyone might like at this point in time and may not get clearer right away. But I think there are some clear choices that distinguish VoiceXML from SALT. VoiceXML is in version 2.0, and 2.1 is a relatively firm, solid spec--whereas SALT is about to be submitted to the W3C. There are already somewhere between five and 25 companies that can offer well-behaved VoiceXML environments. With Microsoft announcing support for SALT, I think there'll be others.

Do you think that's an obstacle? After all, developers will be forced to pick sides.
The standards confusion is regrettable, but it's not going to slow down the market.

At the Telephony Voice User Interface conference, you announced an alliance with Microsoft and agreed to configure your speech technologies to integrate with the Microsoft speech platform. Why'd you decide to go with Microsoft?
First and foremost, because they asked us to. It's in the interests of SpeechWorks to be part of this. If and when the Microsoft platform succeeds, there's a significant upturn for us.

How did the Lernout & Hauspie flameout affect the way outsiders view what's happening in the speech-recognition market?
One thing is that it started to clarify who was really driving what part of the market. Some people hear voice and they lump things together. I think Lernout & Hauspie was trying to be all things to all people. I don't think there's one company that brings all of speech together. You have to look at different companies in different places. Lernout & Hauspie also contributed to the perception that the market was further along than it really was. But from our point of view, the voice market is much better off than where it was a couple of years ago.

Let's talk applications. Is there much work happening in military or security circles when it comes to applying this technology?
Since Sept. 11, that's been a focus. Also, there's incredible stuff going on in enterprises where they're using speech recognition to offer price quotes and customer service and information services. There are also financial applications that are more personalized than ever before. We can play two voices side by side and challenge you to pick which one is real and which is synthesized. That shows a real advance. But you couldn't have done that six months ago.

Also, there's multimodal. I'm sitting in my car, which has these navigational CDs, but I can't use them when driving because the interface is this little knob. But if I could give it current location and where I'm going and then say, "Zoom in," I'd use it a heck of a lot more. I personally think there'll be lots of sweet spots for multimodal design starting in 2003.