Google's AI sounds more human than you, even when it's speaking nonsense
Soon you may not be able to tell the difference between a human voice and a robot.
Google's AI can dream up its own surreal images and beat a human champion at the ancient game of Go. Now it can realistically mimic human speech, including the nonspeech sounds the mouth and respiratory system make when a human talks. The system is called WaveNet, a neural network that generates raw audio waveforms, and it's uncannily lifelike.
We already have text-to-speech generators, and they're very useful, particularly for blind people. However, they're usually built by recording one person speaking a large inventory of sounds, then stitching fragments of those recordings together to match the input text. This approach, called concatenative TTS, sounds glaringly artificial. The alternative, parametric TTS, generates audio with vocoders, synthesisers that analyse and reproduce speech, but its output also sounds unnatural and robotic.
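The core idea behind concatenative TTS can be sketched in a few lines: pre-recorded sound units are looked up and joined in sequence to match the text. This is only an illustration; the unit bank below is hypothetical (tiny arrays standing in for recorded waveforms), and real systems store thousands of units and select among candidates to minimise audible joins.

```python
import numpy as np

# Hypothetical unit bank: phoneme label -> recorded waveform fragment.
# Real concatenative systems record one speaker producing many such units.
unit_bank = {
    "HH": np.array([0.0, 0.1, 0.2]),
    "AY": np.array([0.3, 0.2, 0.1, 0.0]),
}

def synthesize(phonemes):
    """Stitch together the recorded unit for each phoneme, in order."""
    return np.concatenate([unit_bank[p] for p in phonemes])

audio = synthesize(["HH", "AY"])  # the word "hi" as two phoneme units
```

The glaring artificiality the article describes comes from exactly these joins: the fragments were recorded in isolation, so the seams between them rarely sound like continuous natural speech.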
WaveNet is a neural network, so it learns from data. The researchers trained it on samples of recorded human speech, which lets WaveNet model the raw waveform directly, at over 16,000 samples per second, and build a predictive model that generates each new sound based on the sounds that came before.
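That sample-by-sample loop, where each new waveform value is predicted from the values before it, is the autoregressive idea at WaveNet's heart. A minimal sketch follows; note that WaveNet itself uses a deep stack of dilated causal convolutions, whereas this toy substitutes a simple two-tap linear predictor purely to show the generation loop.

```python
import numpy as np

def generate(seed, weights, n_samples):
    """Extend `seed` by n_samples, predicting each new sample from the
    previous len(weights) samples (toy stand-in for WaveNet's deep net)."""
    context = len(weights)
    out = list(seed)
    for _ in range(n_samples):
        past = np.array(out[-context:])
        out.append(float(np.dot(weights, past)))  # predicted next sample
    return np.array(out)

sr = 16000                            # 16,000 samples per second, as in the article
t = np.arange(64) / sr
seed = np.sin(2 * np.pi * 440 * t)    # a 440 Hz seed waveform

# A 2-tap linear predictor continues a sinusoid exactly:
# x[n] = 2*cos(w)*x[n-1] - x[n-2], with w = 2*pi*440/sr
w = 2 * np.pi * 440 / sr
weights = np.array([-1.0, 2 * np.cos(w)])

audio = generate(seed, weights, 100)  # 100 new samples, one at a time
```

The difference in the real system is the predictor: instead of a fixed filter, WaveNet's network learned from human speech what the next sample should be, which is why its output carries the breaths and mouth noises of a real speaker.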
The results are almost shockingly humanlike, even (or especially) when WaveNet is tasked with generating its own sounds.
You can read more about it on the DeepMind blog.
WaveNet compared to concatenative and parametric TTS.
WaveNet making nonsense sounds.