Talking computers nearing reality

The technical kinks, high costs and application misfires that have held back the acceptance of speech recognition and activation are being ironed out.

Michael Kanellos Staff Writer, CNET News.com

Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas.

July 10, 2003 9:43 a.m. PT

Machines that listen and talk like humans are becoming a reality, researchers and tech executives say.

The technical kinks, high costs and application misfires that have held back the acceptance of speech recognition and activation are being ironed out, they say. As a result, companies are coming out with a variety of products that will let consumers access databases using voice commands, for example, or transform e-mails into one- or two-way verbal exchanges.

Microsoft on Wednesday released the first public beta of its Speech Server, which will let servers better handle oral commands. It also released the third beta of its Speech Application software developer kit. A partner program has begun to encourage third-party developers to promote Speech Server, which will debut in the first half of 2004.

Speech Server, formerly .Net Speech Platform, will attempt to reduce the cost of creating automated phone response systems and coincides with other phone-computer efforts at Microsoft. Automated response systems such as those used by many airlines can cost as much as $1 million--too expensive for the bulk of the business market, said Kai-Fu Lee, vice president of Microsoft's speech technologies group.

"Only a very small percentage of the call center opportunity has been realized to date," Lee said.

IBM, meanwhile, is using its research labs and services divisions to create showcase applications for large corporations. Financial services firm T. Rowe Price, for example, has installed an account management system from Big Blue that lets its customers conduct transactions through common speech requests.

"You can say, 'I'd like to make a trade,' and it will say 'What kind?'" said Eugene Cox, director of mobile solutions in IBM's pervasive computing unit.

Computers that can facilitate conversations between two people speaking different languages--a kiosk for dispensing information to tourists who speak English and to tour guides who speak only Chinese, for example--will also emerge from IBM's labs by year's end, according to the company.

"During the past three to four years we have made very good progress in understanding the elements of a sentence," said David Nahamoo, director of Human Factors Technologies at IBM Research. "The market is now responding positively to the technology. We have crossed the threshold where users will accept it."

By 2010, through its "Super Human Speech Recognition Project," IBM hopes to develop commercially viable systems that can transcribe speech into written text more accurately than humans. Today, machines have an error rate that is five to 10 times higher than that of human transcribers, according to various estimates. IBM expects automated translation to improve greatly as well.

The dream of conversational computers has been around since the beginning of the digital age, and it's typically been a fitful one due to the inherent complexities. The Turing test--building a machine that can respond like a human via typed messages--was posed by World War II-era computing pioneer Alan Turing. No machine has yet passed it.

Deciphering our babble

One challenge is that humans typically don't follow rigid rules when speaking. "Yes," "yep," "ya," "uh-huh" and "that's a fact, Jack" all mean the same thing to people but present bewildering choices to machines programmed to accept rigidly defined input. When speaking quickly, people tend to use different grammar, making machine transcription even more difficult.
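
The gap between rigid input and everyday speech can be sketched in a few lines of Python. This is an illustration only, not code from any vendor named here: the `AFFIRMATIVES` set and the `normalize` and `is_affirmative` helpers are hypothetical, and the input is assumed to be text that has already been transcribed.

```python
import string

# Hypothetical list of phrases a system has been told to treat as "yes".
AFFIRMATIVES = {"yes", "yep", "ya", "uh-huh", "that's a fact jack"}


def normalize(utterance):
    """Lowercase the text and drop punctuation, keeping apostrophes and hyphens."""
    drop = string.punctuation.replace("'", "").replace("-", "")
    cleaned = utterance.lower().translate(str.maketrans("", "", drop))
    return " ".join(cleaned.split())


def is_affirmative(utterance):
    """Return True only if the utterance matches a listed phrase exactly."""
    return normalize(utterance) in AFFIRMATIVES


print(is_affirmative("That's a fact, Jack!"))  # listed phrase
print(is_affirmative("Yeah"))                  # unlisted variant
```

The second call comes back false because "yeah" was never listed, which is the brittleness the researchers describe: every new way of saying yes has to be anticipated in advance.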

Background noise and filtering have been persistent challenges as well.

Compounding the problem, speech proponents have made their own miscalculations. In earlier decades, researchers studied human syntax and tried to develop machines that could comprehend it, resulting in computers that spoke their own version of "broken" English.

Companies also tried to promote speech for the PC, where keyboards, mice and screens were already doing an adequate job.

"It is still a niche, just like a lot of features in the security market, like retina scan," said Laura DiDio, a Yankee Group analyst. To date, voice recognition has made the most inroads in computing devices for people with mental or physical challenges, including epilepsy and carpal-tunnel syndrome.

Now the directions of both research and marketing have changed. Rather than developing a machine that can converse, researchers are creating computers that can understand speech as a function of probability, the basis of much of Microsoft's artificial intelligence work.

Yoda, a speech-to-text engine under development at Microsoft, can turn spoken word into coherent text e-mail messages by studying a person's habits, said Alex Acero, manager of the speech research group at Microsoft.

Yoda doesn't look for an object to follow a verb, but it knows that a particular sound pattern ("meet") will likely be followed by one of a limited number of sound patterns it has already heard from you ("in the conference room" or "tomorrow").
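
One common way to capture that kind of expectation is a bigram model, which counts which word tends to follow which in a person's past messages. The sketch below is a simplified stand-in, not Microsoft's method: it works on words rather than sound patterns, and the function names and sample `history` are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows another in past messages."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def next_word_probabilities(follows, prev):
    """Turn the raw counts after `prev` into probabilities."""
    counts = follows.get(prev.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def pick_candidate(follows, prev, candidates):
    """Among similar-sounding candidates, pick the likeliest follower of `prev`."""
    probs = next_word_probabilities(follows, prev)
    return max(candidates, key=lambda w: probs.get(w.lower(), 0.0))


# Hypothetical message history for one user.
history = [
    "meet in the conference room",
    "meet tomorrow at noon",
    "meet in the lobby",
    "meet tomorrow",
]
model = train_bigrams(history)
print(next_word_probabilities(model, "meet"))
print(pick_candidate(model, "meet", ["tomato", "tomorrow"]))
```

With this history, "meet" is followed by "in" half the time and "tomorrow" the other half, so a muddled sound that could be "tomato" or "tomorrow" resolves to "tomorrow".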

The topics for discussion have to be circumscribed; these applications can't follow tangents or new topics. Still, it's progress.

"The way we are trying to teach machines to speak is very different than the way humans do it," Acero said. "It is still very primitive, but it is more intelligent than current applications."

Better hardware helps as well. If a computer has access to video of the speaker, error rates drop by 80 percent or more in noisy environments where the sound feed can be choppy, said Chalapathy Neti, manager of audio-visual speech technologies at IBM. In these systems, the computer cross-checks the speech input against a catalog of lip movements and facial tics.

"When you speak, there is a lot of visual information," Neti said. As a result, many of these new systems will likely include cameras.

It's the application, stupid

Rather than incorporating speech technology into PCs, companies are now looking at cell phones, pagers and other hardware devices where keyboards don't work as well. Not only are these growth markets, but speech advocates also predict that consumers, who are starting to use phones for data reception, will find a need for different types of input devices.

"It isn't like you couldn't put a keyboard and a display in your car. There is enough room. It's just not the right place," Cox said.

One of the more promising devices will likely be the standard phone. So far, most phone-server systems require that people punch in commands or passwords using a 12-button keypad. Some can handle basic verbal commands but require the person to make numerous choices. Automated phone systems are expensive, and getting a return on investment could take years for smaller companies, Microsoft's Lee said.

On the client side, Microsoft is working on projects such as Athens, a PC equipped with video and telephony, but the bigger profits will come from selling the back-end software for these systems, such as Speech Server.

Most of these back-end systems consist of three parts: a speech-to-text engine for turning oral commands into something a computer can understand; a prompt engine, or pre-recorded set of responses to guide the caller; and a text-to-speech engine, which allows a computer to orally send back a response or ask a question that isn't covered by the pre-recorded prompts.
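
The three-part design described above can be sketched as a single call-handling function. Everything in this Python example is a placeholder: the function names are hypothetical, "audio" is represented as plain text, and the speech engines are stubs rather than the APIs of Speech Server or any other product.

```python
# Hypothetical pre-recorded prompts, keyed by a word the caller might say.
PROMPTS = {"trade": "What kind?", "balance": "Which account?"}


def speech_to_text(audio):
    """Stub recognizer: a real engine would decode audio; here it is already text."""
    return audio.strip().lower()


def prompt_engine(prompts, text):
    """Return the pre-recorded prompt whose keyword appears in the text, if any."""
    for keyword, prompt in prompts.items():
        if keyword in text:
            return prompt
    return None


def text_to_speech(text):
    """Stub synthesizer: tag the text to show it would be spoken by the machine."""
    return "<synthesized>" + text + "</synthesized>"


def handle_turn(audio, prompts):
    """Run one caller utterance through all three parts of the back end."""
    text = speech_to_text(audio)
    prompt = prompt_engine(prompts, text)
    if prompt is not None:
        return ("recorded", prompt)
    # Nothing pre-recorded fits, so fall back to a synthesized reply.
    return ("synthesized", text_to_speech("Sorry, could you rephrase that?"))


print(handle_turn("I'd like to make a trade", PROMPTS))
print(handle_turn("Tell me a joke", PROMPTS))
```

The first request matches a pre-recorded prompt, while the second falls through to the text-to-speech stub, the case the pre-recorded set does not cover.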

"Now there are separate applications for voice and data," said James Masten, director of marketing in the Microsoft speech technology group. "We want to convert the telephony and the data side of the house."

These phone systems add a level of complexity, however, because speech that is transmogrified into text must get transformed back into speech.

"The more verbose you are in the prompt, the greater chance for error," Nahamoo said. "If you give too much freedom, the (users) will give you a mouthful of a response."

Most of these new applications are written around various standards or standards proposals, such as VoiceXML; X + V, which is xHTML plus VoiceXML; and SALT, which stands for Speech Application Language Tags.

Cross-company licensing is also speeding development. The text-to-speech engine in Speech Server, for example, comes from SpeechWorks. Microsoft will also include a Telephony Interface Manager from Intel and Intervoice for integrating the server into communications hardware.

"It is all going the right way for call centers--for automated service centers this will be absolutely vital," said Yankee Group's DiDio.