Appliance Science: How Alexa learns about you

How do services like Amazon's Alexa understand your voice? Appliance Science looks at the role of machine learning and natural language processing in your voice-controlled future.

Richard Baguley
Richard Baguley has been writing about technology for over 20 years. He has written for publications such as Wired, Macworld, USA Today, Reviewed.com, Amiga Format and many others.
Colin McDonald
Essentially born with a camera in hand, Colin West McDonald has been passionately creating video all his life. A native of Columbus, Ohio, Colin founded his own production company, Stoker Motion Pictures, and recently wrote and directed his first feature film. Colin handled photography and video production for CNET's Appliance Reviews team.

In my last column, I looked at how Amazon's Echo device and the Alexa voice service allow you to control things with your voice. You speak, it understands and obeys. Alexa is just part of a new wave of services that let you control things by speaking, from cell phones to intercoms and thermostats. You can now even ask her to start your car. So, how do these listening devices transform your mellifluous voice into computer commands? The answer lies in two fields of computer science, called machine learning and natural language processing.

Colin McDonald/CNET

First, the system has to take the audio of your voice that was captured by the Echo or other device and translate it into a command that it can understand, a problem called speech recognition. When they were first created, speech recognition systems relied on hand-coded sets of rules for how to detect words. Each word would have a particular structure in the recorded audio, and the system would try to match this structure to convert speech into a command. But as these services became more complex, these rules couldn't keep up. People have different voices and say things in different ways: you can't create a rigid set of rules that covers every possibility.

So, these systems switched to a new approach called machine learning. With a machine-learning approach, the rules are written and constantly revised by the system itself. The programmer feeds the system samples of speech along with what was actually said, and it learns by figuring out how the samples relate to the results. With enough examples, the system can figure out the rules that link the two, creating its own set of more-flexible rules for converting speech to text. The system isn't searching for an exact match to a preprogrammed rule, but is instead looking for the closest match from what it knows, then revising the match as it learns. It can, effectively, make an informed guess. The programmer can help the process by telling the system when it was right or wrong (a process called reinforcement, like giving a dog a treat when it sits on command). The more speech it analyzes, the better the system gets at analyzing speech.
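To make the "closest match" idea concrete, here is a minimal sketch in Python. It is not how Alexa works: real systems use far richer audio features and models. The two-number feature vectors, the word labels and the function names `train` and `recognize` are all made up for illustration. The sketch averages the labeled samples for each word, then guesses the word whose average is nearest to a new sample.

```python
import math

def train(samples):
    """Average the feature vectors for each word into one 'centroid' per word.

    `samples` is a list of (features, word) pairs. The features here are
    hypothetical stand-ins for measurements taken from recorded audio.
    """
    sums, counts = {}, {}
    for features, word in samples:
        if word not in sums:
            sums[word] = [0.0] * len(features)
            counts[word] = 0
        sums[word] = [s + f for s, f in zip(sums[word], features)]
        counts[word] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}

def recognize(centroids, features):
    """Return the word whose learned centroid is closest to the new sample."""
    return min(centroids, key=lambda w: math.dist(centroids[w], features))

# Toy training data: speech samples paired with what was actually said.
samples = [
    ([0.9, 0.1], "lights"),
    ([0.8, 0.2], "lights"),
    ([0.1, 0.9], "music"),
    ([0.2, 0.8], "music"),
]
centroids = train(samples)

# A new sample that matches no training example exactly still gets a guess.
guess = recognize(centroids, [0.7, 0.4])
print(guess)
```

No rule for the new sample was ever written by hand; the guess comes from whichever learned average it sits nearest to, and adding more labeled samples shifts those averages.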

What machine learning needs is a lot of computing power. Fortunately, companies like Amazon and Google have computing power to spare in their cloud-processing services, huge distributed computers that run tasks like Alexa. These distributed computers can crunch through millions of pieces of speech an hour, with each conversion making the system more accurate.

The second area of computer science that made these services possible is called natural language processing (NLP). By breaking down the rules of language and applying new ways to process sentences, NLP allows a computer to figure out the meaning of a sentence once it has been converted.

Natural language processing uses techniques such as Markov chains. This technique provides a statistical analysis of language: by analyzing a large amount of text, called the corpus, the system can figure out what the next most likely word is in a sentence. The analysis produces probabilities for what, based on the previous word or words, the next word in a sentence is likely to be. If, for instance, the first few words are "the quick brown", the next word is likely to be "fox". This can be used for fun stuff like generating random movie titles, but it also helps when converting speech to text, because it offers a way to fill in a word that can't be converted by supplying the most likely option, based on the corpus. So, if a bit of recognized speech has a word in it that can't be easily converted, the system can use this to make an educated guess as to what it might be.
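The next-word idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production language model: the three-sentence corpus is invented, the function names are hypothetical, and the chain looks back only one word.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word in the corpus, which words follow it."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def most_likely_next(follows, prev_word):
    """Return the most frequent follower of prev_word, or None if unseen."""
    counts = follows.get(prev_word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def next_word_probability(follows, prev_word, candidate):
    """Estimate P(candidate | prev_word) from the counts."""
    counts = follows.get(prev_word.lower())
    if not counts:
        return 0.0
    return counts[candidate.lower()] / sum(counts.values())

# A tiny made-up corpus; a real one would hold millions of sentences.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox runs away",
    "a brown bear sleeps",
]
model = train_bigrams(corpus)

print(most_likely_next(model, "brown"))              # fox
print(next_word_probability(model, "brown", "fox"))  # 2 of the 3 followers
```

In this corpus "brown" is followed by "fox" twice and "bear" once, so a recognizer that heard "the quick brown" followed by something unintelligible could fall back on "fox" as its best guess.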

This is only scratching the surface of this fast-developing field of computer science. NLP is a huge area of research: companies like Microsoft and Google have created special NLP research groups, and universities like Stanford and MIT have created labs that are working on this complex problem, too. It is the focus of thousands of researchers around the world, all of whom are looking to free computers from the tyranny of the keyboard and usher in a new world of speech control.

All of this heralds a new era for the appliances in your kitchen, one in which you may be able to control them with speech. A dishwasher that can load itself may be a few years away, but soon you might be able to ask Alexa to start the dishwasher, and have her tell you when it is done.