Google's AI can dream up its own surreal images and beat a human champion at the ancient game of Go. Now it can realistically mimic human speech, including the nonspeech sounds the mouth and respiratory system make when a human talks. The system is called WaveNet, a neural network that generates raw audio waveforms, and it's uncannily lifelike.
Text-to-speech generators already exist, and they're very useful, particularly for blind people. But they're usually built by recording one person speaking a large inventory of sounds, then stitching combinations of those recordings together to match the text. This is concatenative TTS, and it sounds glaringly artificial. Parametric TTS instead generates audio with vocoders, synthesisers that analyse and reproduce speech, but the result is also unnatural and robotic.
WaveNet is a neural network, which means it learns from data. The researchers trained it on samples of recorded human speech, which lets WaveNet model the raw waveform directly -- more than 16,000 samples per second -- and build a predictive model that generates each new sound based on the sounds that came before it.
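The autoregressive idea in that paragraph can be sketched in a few lines of Python. The predictor below is a hypothetical stand-in, not WaveNet's actual network (which uses a deep stack of dilated causal convolutions and outputs a probability distribution over each next sample); it only illustrates the loop of generating one sample at a time from the preceding samples.

```python
# Toy sketch of autoregressive audio generation, in the spirit of the
# article's description. NOT Google's actual model: the real WaveNet
# predicts a distribution over the next sample with a deep neural
# network; here a simple two-sample recurrence stands in for it.

SAMPLE_RATE = 16000  # samples per second, as described in the article


def predict_next(context):
    """Stand-in predictor: a damped oscillation driven by the last two
    samples. A real model would be learned from recorded speech."""
    return 0.9 * context[-1] - 0.4 * context[-2]


def generate(seed, n_samples):
    """Generate n_samples one at a time, each conditioned on all the
    samples generated so far -- the autoregressive loop."""
    waveform = list(seed)
    for _ in range(n_samples):
        waveform.append(predict_next(waveform))
    return waveform


audio = generate([0.0, 0.1], 100)
print(len(audio))  # 102 samples: the 2 seed values plus 100 generated
```

The key point the sketch captures is that nothing is pre-recorded: every output sample is computed from the waveform's own history, which is why WaveNet can produce sounds no human ever spoke.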
The results are almost shockingly humanlike, even -- or especially -- when WaveNet is tasked with generating its own sounds.
Audio sample: WaveNet compared to concatenative and parametric TTS.
Audio sample: WaveNet making nonsense sounds.