Today, automatic speech-to-speech translation is typically performed by cascading multiple speech and language technologies. Such pipelines begin with automatic speech recognition (ASR) to produce a transcript of the source audio, followed by neural machine translation (NMT) to turn that transcript into text in the target language. Text-to-speech (TTS) technology then converts the translated text into natural-sounding synthesized audio. This functionality can be found in our AppTek Speech Translate app.
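To make the cascade concrete, here is a minimal Python sketch of the three stages chained together. It is only an illustration under stated assumptions: the class name `CascadedSpeechTranslator` and the `toy_*` functions are hypothetical stand-ins invented for this example, not AppTek's engines or API, and the "audio" is simply bytes that carry text.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real ASR, NMT and TTS engines would replace them.
Recognizer = Callable[[bytes], str]    # source audio -> source-language transcript
Translator = Callable[[str], str]      # source-language text -> target-language text
Synthesizer = Callable[[str], bytes]   # target-language text -> target audio


@dataclass
class CascadedSpeechTranslator:
    """Chains ASR, NMT and TTS; text is the interface between every stage."""
    asr: Recognizer
    nmt: Translator
    tts: Synthesizer

    def translate(self, source_audio: bytes) -> bytes:
        transcript = self.asr(source_audio)   # step 1: recognize the speech
        translation = self.nmt(transcript)    # step 2: translate the transcript
        return self.tts(translation)          # step 3: synthesize target speech


# Toy stand-ins (assumptions for illustration only): audio is UTF-8 text in bytes.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_nmt(text: str) -> str:
    lexicon = {"hello": "hallo", "world": "welt"}
    return " ".join(lexicon.get(word, word) for word in text.split())


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedSpeechTranslator(asr=toy_asr, nmt=toy_nmt, tts=toy_tts)
target_audio = pipeline.translate(b"hello world")
```

Because each stage only sees the previous stage's text output, a recognition error in step 1 is passed on to translation and synthesis unchanged, which is one motivation for the end-to-end systems discussed below.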
Native textless speech translation, on the other hand, attempts to produce the same results without any intermediate textual representation in either the source or the target language. For a long time, this was considered something akin to science fiction. For the past three years, however, Parnia Bahar and her colleagues at AppTek have been developing exactly that.
In the first step toward true speech-to-speech translation, AppTek’s scientists trained and released a neural speech-to-text translation system that translates, for example, English audio straight into German text. This system is trained end-to-end, which means that some automatic speech recognition (ASR) errors can be avoided with the help of the downstream translation constraints.
The challenge in training such a system has been finding suitable training data: triplets of source-language audio recordings, their transcriptions in the same language, and their translations into the target language. Such data is rare and expensive. With data augmentation techniques, it is also possible to leverage other, more accessible types of data, such as audio plus transcripts in the source language only. Automatic translations of the transcripts can be generated with a high-quality text translation system and then used as synthetic training data for the speech-to-text translation system. This is done with NMT systems that are adapted to spoken content both in terms of domain (e.g., media and entertainment) and in terms of the form of the text (written text is converted to its spoken form, with numbers, dates, etc., spelled out), which can also help in the second step.
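The augmentation idea can be sketched in a few lines of Python: start from (audio, transcript) pairs, machine-translate each transcript, convert the result to spoken form, and store it as a synthetic triplet. Everything named here (`Triplet`, `build_synthetic_triplets`, `toy_translate`, `toy_spoken_form`, the file name) is a hypothetical placeholder for illustration, and the toy spoken-form step spells out single digits only; it is not a description of AppTek's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    audio_path: str    # source-language recording
    transcript: str    # source-language transcription
    translation: str   # target-language translation
    synthetic: bool    # True if the translation was machine-generated


def build_synthetic_triplets(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    to_spoken_form: Callable[[str], str],
) -> List[Triplet]:
    """Turn (audio, transcript) pairs into triplets with machine translations."""
    return [
        Triplet(audio_path, transcript, to_spoken_form(translate(transcript)), True)
        for audio_path, transcript in pairs
    ]


# Toy stand-ins (assumptions): a word lexicon instead of a real NMT system,
# and a spoken-form converter that only spells out single digits in German.
def toy_translate(text: str) -> str:
    lexicon = {"i": "ich", "have": "habe", "cats": "katzen"}
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


GERMAN_DIGITS = {
    "0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}


def toy_spoken_form(text: str) -> str:
    return " ".join(GERMAN_DIGITS.get(token, token) for token in text.split())


pairs = [("clip_001.wav", "I have 2 cats")]
triplets = build_synthetic_triplets(pairs, toy_translate, toy_spoken_form)
```

Keeping a `synthetic` flag on each triplet is a design choice of this sketch: it would let a training setup weight or filter machine-generated translations separately from human ones.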
In the second step, AppTek uses neural text-to-speech (TTS) technology to convert the target language translation to a phoneme sequence and then pronounce it using natural-sounding synthetic voices.
So far, the approach has been textless with regard to any transcript of the spoken utterance in the source language, but the cascaded target language text was still necessary. The next challenge is to train an end-to-end system that goes from the source speech signal to the target (artificially created) speech. Once this is achieved, speech could be generated directly from speech, with no text in between.
Scientists at Facebook are taking a stab at this challenging task, starting with a speech generator that produces a continuation of an initial spoken utterance, much as the famous GPT-3 model generates fluent and sometimes even meaningful text given an initial “seed” sentence. Such neural speech-language models may become helpful for speech translation in the future. The challenge here is that neural speech generators require large amounts of training data, which may not be available for low-resource languages. Google researchers recently presented work on Translatotron 2, which uses only an intermediate phoneme representation and no explicit text representation in the source or target language.
AppTek is now also taking on this challenge. Our end-to-end spoken translation system achieved top ranking at the 2021 IWSLT workshop, outperforming others in the end-to-end categories. With the ideas expressed in our patent application [5], as well as by extending current research [1,2,3,4], we are confident that textless speech translation can work and can help people communicate when text is either hard to obtain or not available, or when the users of the translation system are in a situation where they are not able to read or refer to text. We will share more details on this topic as our research progresses, but, in short, we see a promising future ahead for textless speech translation.
1. “Start-Before-End and End-to-End: Neural Speech Translation by AppTek and RWTH Aachen University”. Parnia Bahar, Patrick Wilken, Tamer Alkhouli, Andreas Guta, Pavel Golik, Evgeny Matusov, Christian Herold. In Proceedings of the 17th International Workshop on Spoken Language Translation (IWSLT), July 2020. https://www.aclweb.org/anthology/2020.iwslt-1.3/
2. “On Using SpecAugment for End-to-End Speech Translation”. Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney. In Proceedings of IWSLT, 2019. https://arxiv.org/pdf/1911.08876
3. “Tight Integrated End-to-End Training for Cascaded Speech Translation”. Parnia Bahar, Tobias Bieschke, Ralf Schlüter, Hermann Ney. 2021 IEEE Spoken Language Technology Workshop (SLT). https://ieeexplore.ieee.org/abstract/document/9383462/
4. “Without Further Ado: Direct and Simultaneous Speech Translation by AppTek in 2021”. Parnia Bahar, Patrick Wilken, Mattia di Gangi, Evgeny Matusov. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT), 2021.
5. “System and method for direct speech translation system”. Evgeny Matusov, Jintao Jiang, Mudar Yaghi, 2020, US Patent Application 16/741,477. https://patentimages.storage.googleapis.com/b2/72/d4/1d734665c8fe3c/US20200226327A1.pdf
AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.