A: I come from the west part of Germany, so I didn’t have to go very far to study computer science in Aachen when the time came. My story is rather mundane in this respect: I focused on math and physics at school, I had an early interest in computer science on account of a random programming book that my father gave me and, like most teenagers, I was interested in computer games and wanted to make my own.
It wasn’t until I started studying computer science at RWTH Aachen University that I realized there was a lot more to it. I got into human language processing almost by accident: I wanted to participate in an elective seminar on computer graphics, but instead was offered a seat in the one on human language technologies (HLT). As fate would have it, this is what I’ve been working on ever since.
I also decided to work as a student researcher so as to better understand the topic of neural networks, which I found fascinating. I started with machine translation (MT), which is what my bachelor’s thesis was on. I then switched to data filtering for MT, and during my postgraduate studies I moved to automatic speech recognition (ASR) and decided to specialize in speech synthesis. For my master’s thesis I focused on how to generate synthetic data for ASR using TTS, and this is what I’ve been working on in my PhD as well. Specifically, I work on ways we can use recent TTS advances to make ASR systems better, for example by integrating TTS reconstructions to achieve better audio-to-text alignment.
A: Apart from computers, I love music and I play the guitar. What I find interesting about languages is the pronunciation aspect, which affects how easy a language is to synthesize. The differences from language to language are wide.
Let me give you an example. There is a piece of singing-synthesis software called Vocaloid, which was developed by a Japanese company more than ten years ago for the field of music production – it’s a digital toolkit for people to make music. When I tried it in English and Japanese, there was a significant difference in quality between the two, although the same technology was used. The reason is that some languages are much easier to synthesize than others because their pronunciation is a lot clearer. In Japanese, most syllables are spoken as they are written, as is the case in Spanish and Italian: they basically consist of a consonant and a vowel, so it is very easy to write and annotate corpora in this language and to digitize it. The result is exceptional, and the singing-synthesis software is used for music production in Japanese; some artists are exclusively digital and have even become famous. However, it did not work as well with English: the quality was worse, and it sounded a lot less natural.
A: Working at AppTek came as a well-timed coincidence. As soon as I had started working on speech synthesis, an opportunity presented itself to join the AppTek team as a TTS scientist on a specific project. I jumped at it, as it felt like a great way to practice what I was studying, and I continued working at the company after the project was over.
Today, my work at AppTek focuses on implementing the TTS systems that are developed by another senior scientist on the TTS team, Alex Perez. A lot of my effort goes into developing the company’s automatic dubbing pipeline: putting all the components together – the ASR, MT and TTS systems, plus a speech-enhancement system – the result of which you can see in this demo.
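To make the pipeline idea concrete, here is a minimal sketch of how those components compose. This is purely illustrative – the function names and data formats are hypothetical stand-ins, not AppTek’s actual APIs:

```python
# Minimal sketch of an automatic dubbing pipeline: ASR -> MT -> TTS,
# followed by speech enhancement. All components here are hypothetical
# placeholders passed in as callables.

def dub(audio_segments, asr, mt, tts, enhance):
    """Run each source-audio segment through the full dubbing chain."""
    dubbed = []
    for segment in audio_segments:
        transcript = asr(segment)          # speech -> source-language text
        translation = mt(transcript)       # source text -> target-language text
        synthetic = tts(translation)       # target text -> synthetic speech
        dubbed.append(enhance(synthetic))  # clean up the synthesized audio
    return dubbed

# Toy stand-ins to show the data flow end to end:
out = dub(
    ["<audio: 'Guten Tag'>"],
    asr=lambda a: "Guten Tag",
    mt=lambda t: "Good day",
    tts=lambda t: f"<speech: '{t}'>",
    enhance=lambda s: s,
)
print(out)  # ["<speech: 'Good day'>"]
```

The design point is that each stage only consumes the previous stage’s output, which is what makes it possible to swap in improved components independently.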
A: Work on dubbing automation involves staff from all of AppTek’s cross-functional teams, such as the MT team. For example, the MT system needs to be capable of producing output with dynamic length. In this respect, it helps that I have also worked on MT and have seen how things work in ASR. My minor was in electronic engineering with a focus on signal transmission and signal theory, which is also useful when working on dubbing automation.
The same goes for prosodic alignment: it is not just a TTS problem but also part of the MT process. MT systems are based on Transformer architectures, which have many layers with different attention heads looking at different locations in the source text. This means you can no longer make a direct one-to-one word mapping; you need to combine neural approaches with traditional methods. This is one of the advantages AppTek brings to the industry – having all of these unique teams and technologies working in tandem with one another.
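A toy example may help show why attention weights don’t give a one-to-one word mapping: each target token attends softly to all source tokens, so a hard alignment has to be derived, for instance by taking the argmax per target token. The weights below are made up for illustration:

```python
# Soft attention matrix: one row per target token, one column per source token.
# The numbers are invented for illustration only.
attention = [
    # source:  "das"  "ist"  "gut"
    [0.70, 0.20, 0.10],  # target "this"
    [0.15, 0.60, 0.25],  # target "is"
    [0.05, 0.25, 0.70],  # target "good"
]

def hard_alignment(attn):
    """Collapse soft attention into one source index per target token."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn]

print(hard_alignment(attention))  # [0, 1, 2]
```

In a real Transformer there are many such matrices (one per head and layer), which is exactly why traditional alignment methods are still combined with the neural weights.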
A: With neural net technology, the audio quality is very close to natural voices if clean studio data has been used when training a voice. Although the audio quality as such is sufficient, issues come up with respect to pronunciation, prosody, emotion and so on. Speed and pitch control is now a default part of the model architecture in virtually all TTS systems, and users can play around with such settings as long as the user interface supports it.
In other words, synthetic voices can be improved with some post-processing effort, i.e. with human intervention. A pronunciation issue, for instance, can be fixed by someone who is not a speech scientist with the help of the right interface: if you type a sequence of words and one of them is mispronounced, you can type in the correct phonetic sequence by hand so that the machine then pronounces it correctly.
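The workflow described above can be sketched as a simple lexicon override applied before synthesis. The lexicon entries and phone symbols below are illustrative (ARPAbet-style), not taken from any specific TTS system:

```python
# Sketch of the pronunciation-fix workflow: a hand-typed phonetic sequence
# overrides the system's default pronunciation for a problem word before
# synthesis. The default lexicon here is a toy example.

DEFAULT_LEXICON = {
    "read": "R IY D",  # default guess; wrong for the past tense
}

def to_phonemes(words, overrides=None):
    """Map each word to a phoneme string, letting user overrides win."""
    overrides = overrides or {}
    return [overrides.get(w, DEFAULT_LEXICON.get(w, w)) for w in words]

# The user notices "read" is mispronounced and types the phones by hand:
print(to_phonemes(["read"], overrides={"read": "R EH D"}))  # ['R EH D']
```

Production systems typically expose the same idea through a markup layer (e.g. an SSML-style phoneme tag) rather than a raw dictionary, but the principle is the same: the hand-typed phones take precedence over the model’s default.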
A: The difference between custom and adaptive voices is significant, as both the technology and the quality of the output are very different.
Custom-made voices mean that the system has been trained on data from the target voice. In other words, you have studio recordings of, say, 30 minutes or one hour of the target voice, and you use them to train the system so that it performs well for this custom voice, even though the data for it is limited.
An adaptive system, on the other hand, can synthesize any voice from even a single utterance of that voice. The target voice data is not part of the training in this case; the system simply looks at a voice and tries to mimic it. It uses its own database of voices to come up with an appropriate neural net mixture in its attempt to synthesize the target voice. As the system has never seen the target voice in training, the result naturally comes with a strong reduction in quality.
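One simplified way to picture the "mixture of known voices" idea is as a speaker embedding for the unseen target being approximated by a similarity-weighted blend of training-voice embeddings. The 3-dimensional vectors and the weighting scheme below are toy illustrations, not how any particular adaptive TTS system actually works:

```python
import math

# Toy illustration of zero-shot voice adaptation: the target voice was never
# seen in training, so its embedding is approximated as a similarity-weighted
# mixture of known training voices. Real speaker embeddings are hundreds of
# dimensions; these 3-d vectors are invented for illustration.

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mix_embedding(target, known_voices):
    """Blend known-voice embeddings, weighted by similarity to the target."""
    sims = {name: cosine(target, emb) for name, emb in known_voices.items()}
    total = sum(sims.values())
    weights = {name: s / total for name, s in sims.items()}
    mixed = [
        sum(weights[name] * emb[i] for name, emb in known_voices.items())
        for i in range(len(target))
    ]
    return weights, mixed

known = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
weights, mixed = mix_embedding([0.8, 0.6, 0.0], known)
# "alice" gets the larger weight because the target is more similar to her
```

The quality gap the answer describes follows directly from this setup: the mixture can only express combinations of voices the system already knows, so a genuinely novel target voice is approximated rather than reproduced.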
A: Voice cloning is just another term for custom voice creation. You train the system to use a specific voice to say new, arbitrary things. When, however, an actor is used to deliver the speech and the technology is there to mask that speech with the target voice, i.e. when you go from one speech signal to another speech signal without going through text or prosody or other speech information, that’s voice conversion.
The technology used in these two cases is very different: the neural networks involved look and are trained differently, and the errors that come out are different too. In conversion, you already know that the prosody and speech exist, since the speaker is already speaking in the correct language; you only need to change the voice characteristics. In proper voice cloning, however, when you use a TTS system to deliver new speech in a synthesized voice instead of using an actor, you can get pronunciation errors, speed errors, etc. – the quality deteriorates. Most of the high-quality videos we see online that resemble the voices of famous people are typically the output of voice conversion rather than voice cloning.
A: I think there will be gradual progress in the quality of the TTS output. We already know what we need to work on to develop the technology further. From a practical perspective, there are so many things on the list that we could easily be busy for the next 5-10 years. The real milestone will be when the first movie is localized in its entirety with speech synthesis. Such a scenario would be plausible in the case of a low-visibility title, where perhaps the cost of dubbing in a certain language would be prohibitive. A human would still be involved in the process in terms of making manual adjustments to the synthetic voices to correct errors, nevertheless the entire film would be voiced by machines. Once the technology is made available to a broad audience in such a manner, this is when we will have reached the next milestone in my opinion. I am personally looking forward to that!
AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.