Google's AI sounds more human than you, even when it's speaking nonsense
Soon you may not be able to tell the difference between a human voice and a robot.
Google's AI can dream up its own surreal images and beat a human champion at the ancient game of Go. Now it can realistically mimic human speech, including the nonspeech sounds the mouth and respiratory system make when a human talks. The system is called WaveNet, a neural network that generates raw audio waveforms, and it's uncannily lifelike.
We already have text-to-speech generators, and they're very useful, particularly for blind people. However, they're usually built by recording one person speaking a large inventory of sounds, then stitching fragments of those recordings together to match the input text. This approach, called concatenative TTS, sounds glaringly artificial. The alternative, parametric TTS, generates audio with vocoders, synthesisers that analyse and reproduce speech, but its output also sounds unnatural and robotic.
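The core idea behind concatenative TTS can be sketched in a few lines: pre-recorded sound units are looked up and joined in sequence to match the text. This is only an illustration; the unit bank below is hypothetical (tiny arrays standing in for recorded waveforms), and real systems store thousands of units and select among candidates to minimise audible joins.

```python
import numpy as np

# Hypothetical unit bank: phoneme label -> recorded waveform fragment.
# Real concatenative systems record one speaker producing many such units.
unit_bank = {
    "HH": np.array([0.0, 0.1, 0.2]),
    "AY": np.array([0.3, 0.2, 0.1, 0.0]),
}

def synthesize(phonemes):
    """Stitch together the recorded unit for each phoneme, in order."""
    return np.concatenate([unit_bank[p] for p in phonemes])

audio = synthesize(["HH", "AY"])  # the word "hi" as two phoneme units
```

The glaring artificiality the article describes comes from exactly these joins: the fragments were recorded in isolation, so the seams between them rarely sound like continuous natural speech.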
WaveNet is a neural network, so it learns from data. The researchers trained it on samples of recorded human speech, which lets WaveNet model the raw waveform directly, at over 16,000 samples per second, and build a predictive model that generates each new sound based on the sounds that came before.
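That sample-by-sample loop, where each new waveform value is predicted from the values before it, is the autoregressive idea at WaveNet's heart. A minimal sketch follows; note that WaveNet itself uses a deep stack of dilated causal convolutions, whereas this toy substitutes a simple two-tap linear predictor purely to show the generation loop.

```python
import numpy as np

def generate(seed, weights, n_samples):
    """Extend `seed` by n_samples, predicting each new sample from the
    previous len(weights) samples (toy stand-in for WaveNet's deep net)."""
    context = len(weights)
    out = list(seed)
    for _ in range(n_samples):
        past = np.array(out[-context:])
        out.append(float(np.dot(weights, past)))  # predicted next sample
    return np.array(out)

sr = 16000                            # 16,000 samples per second, as in the article
t = np.arange(64) / sr
seed = np.sin(2 * np.pi * 440 * t)    # a 440 Hz seed waveform

# A 2-tap linear predictor continues a sinusoid exactly:
# x[n] = 2*cos(w)*x[n-1] - x[n-2], with w = 2*pi*440/sr
w = 2 * np.pi * 440 / sr
weights = np.array([-1.0, 2 * np.cos(w)])

audio = generate(seed, weights, 100)  # 100 new samples, one at a time
```

The difference in the real system is the predictor: instead of a fixed filter, WaveNet's network learned from human speech what the next sample should be, which is why its output carries the breaths and mouth noises of a real speaker.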
The results are almost shockingly humanlike, even (or especially) when WaveNet is tasked with generating its own sounds.
You can read more about it on the DeepMind blog.
WaveNet compared to concatenative and parametric TTS.
WaveNet making nonsense sounds.