Every day we crawl a little closer to Douglas Adams' famous and prescient Babel fish. A new research project from Google takes in spoken sentences in one language and outputs spoken words in another – but unlike most translation techniques, it uses no intermediate text, working only with the audio. This makes it quick, but more importantly lets it better reflect the cadence and tone of the speaker's voice.
Translatotron, as the project is called, is the culmination of years of related work, but it is still very much an experiment. Google's researchers and others have looked into the possibility of direct speech-to-speech translation for years, but only recently have those efforts borne fruit worth harvesting.
Translating speech is usually done by breaking the problem down into smaller sequential ones: turning the source speech into text (speech-to-text, or STT), turning text in one language into text in another (machine translation), and then turning the resulting text back into speech (text-to-speech, or TTS). This works pretty well, but it isn't perfect; each step has the types of errors it is prone to, and these can compound.
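The cascade described above can be sketched as three stages chained together. This is a minimal toy illustration, not a real API: the function names and the word-lookup "translator" are stand-ins for the actual STT, MT, and TTS models.

```python
# Toy sketch of the traditional three-stage cascade. Each function is a
# crude placeholder for a real model; the names are illustrative only.

def speech_to_text(audio):
    # Stage 1 (STT): pretend the audio decodes directly to a transcript.
    return audio.decode("utf-8")

def machine_translate(text, lexicon):
    # Stage 2 (MT): word-by-word dictionary lookup, a stand-in for real MT.
    return " ".join(lexicon.get(word, word) for word in text.split())

def text_to_speech(text):
    # Stage 3 (TTS): pretend synthesis just re-encodes the text as audio.
    return text.encode("utf-8")

def cascade_translate(audio, lexicon):
    # Each stage feeds the next, which is why errors compound: a word
    # mis-transcribed in stage 1 is mis-translated in stage 2 and then
    # spoken aloud incorrectly in stage 3.
    transcript = speech_to_text(audio)
    translation = machine_translate(transcript, lexicon)
    return text_to_speech(translation)

es_en = {"hola": "hello", "mundo": "world"}
print(cascade_translate(b"hola mundo", es_en))  # b'hello world'
```

Note how nothing about the speaker's voice survives the chain: the moment stage 1 reduces audio to text, tone and cadence are discarded.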
It is also not really how multilingual people translate in their own heads, as testimony about their own thought processes suggests. Exactly how it works is impossible to say with certainty, but few would say they break the speech down into text, mentally convert it to a new language, then read the new text aloud. Human cognition is frequently a guide for how to advance machine learning algorithms.
To that end, the researchers began investigating the conversion of spectrograms, detailed frequency breakdowns of audio, of speech in one language directly into spectrograms in another. This is a very different process from the three-step one, and it has its own weaknesses, but it also has advantages.
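A spectrogram, the representation Translatotron maps between, is just a short-time Fourier transform of the waveform: the frequency content of the signal, frame by frame. Here is a minimal sketch using only NumPy; the frame and hop sizes are illustrative, not the paper's settings.

```python
import numpy as np

def spectrogram(waveform, frame_len=256, hop=128):
    # Slice the signal into overlapping frames of frame_len samples,
    # advancing hop samples each time.
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Taper each frame with a Hann window, then take the magnitude of
    # its FFT: one row per time frame, one column per frequency bin.
    window = np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames * window, axis=1))

# One second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (time frames, frequency bins)
```

A speech-to-speech model like Translatotron takes an array like `spec` as input and emits another such array in the target language, which a vocoder then turns back into a waveform; no text representation ever appears in between.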
One is that, complex as it is, it is essentially a single-step process rather than several, which means Translatotron can run faster given sufficient processing power. But more importantly for many, the process makes it easy to retain the character of the source voice, so the translation doesn't come out robotically, but with the tone and cadence of the original sentence.
Naturally this has a tremendous impact on expression, and someone who relies on translation or voice synthesis regularly will appreciate that not only what they say comes through, but how they say it. It is hard to overstate how important this is to everyday users of synthetic speech.
The accuracy of the translation, the researchers admit, is not as good as that of traditional systems, which have had more time to hone their accuracy. But many of the resulting translations are (at least in part) quite good, and being able to include expression is too great an advantage to pass up. In the end, the team modestly describes their work as a starting point demonstrating the feasibility of the approach, but it is easy to see that it is also an important advance in an important domain.
The paper describing the new technique was published on Arxiv, and you can browse samples of speech, from source to traditional translation to Translatotron, on this page. Just be aware that these are not all selected for the quality of their translation, but serve more as examples of how the system retains expression while getting the gist of the sentence.