Google's Translatotron translates speech directly to speech

The first-of-its-kind translation tool taps AI to directly convert speech from one language into another while retaining the voice of the original speaker, Google says.

Corinne Reichert Senior Editor

Translatotron skips the usual step of translating speech to text and then back to speech again.

James Martin/CNET

Google has announced Translatotron, an "experimental new system" that it says will translate speech directly into speech, removing the need for any text.

"Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language," a Google AI blog post on Wednesday said.

Google said there are three stages of today's translation systems: automatic speech recognition, which transcribes speech as text; machine translation, which translates this text into another language; and text-to-speech synthesis, which uses this text to generate speech.
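To make the three-stage cascade concrete, here is a minimal Python sketch of how such a pipeline chains together. It is an illustration only, not Google's implementation: the function names are hypothetical, each stage is a toy stub rather than a trained model, and "audio" is modeled as bytes that simply hold the spoken words.

```python
# Toy sketch of a cascaded speech translation pipeline.
# All names are hypothetical; real systems use trained models at each stage.

def recognize_speech(audio: bytes) -> str:
    """Stage 1 (automatic speech recognition): audio -> source-language text.
    Assumption: the toy 'audio' is just UTF-8 bytes of the spoken words."""
    return audio.decode("utf-8")

def translate_text(text: str) -> str:
    """Stage 2 (machine translation): source text -> target text.
    Assumption: a tiny word-for-word lookup table stands in for a model."""
    toy_dictionary = {"hola": "hello", "mundo": "world"}
    return " ".join(toy_dictionary.get(word, word) for word in text.split())

def synthesize_speech(text: str) -> bytes:
    """Stage 3 (text-to-speech): target text -> audio.
    Assumption: the toy 'audio' is again just UTF-8 bytes."""
    return text.encode("utf-8")

def cascade_translate(audio: bytes) -> bytes:
    """Run the three stages in sequence; text is the intermediate form."""
    source_text = recognize_speech(audio)
    target_text = translate_text(source_text)
    return synthesize_speech(target_text)
```

Because each stage consumes the previous stage's output, a mistake made in the first stage is carried through the rest, which is the compounding that a single end-to-end model aims to avoid.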

Cascading these steps led to services like Google Translate, but the tech giant now says it will use a single model without the need for text.


"Dubbed Translatotron, this system avoids dividing the task into separate stages," the blog post by Google AI software engineers Ye Jia and Ron Weiss said.

This will mean faster translation and fewer compounding errors, according to Google.

The system takes spectrograms as input and generates spectrograms as output. It also relies on a neural vocoder and a speaker encoder, which lets it retain the speaker's vocal characteristics in the translated speech.
