Spotlight on Automatic Dubbing, an Interview with Mattia Di Gangi

October 6, 2022

Automatic dubbing is the latest hot research topic in AI-enabled language technologies. The level of expertise and the complexity involved in the dubbing process are legendary in media localization workflows, which have remained the same since dubbing was first used as a localization strategy in the 1930s, right after the ‘talkies’ became the dominant film production method in the market. The pandemic spurred a lot of experimentation in the field, with dubbing workflows moving to the cloud and new software appearing for dubbing script writing and online recording.

At the same time, scientists at Amazon, Google and Meta have been experimenting on the dubbing workflow with the use of Automatic Speech Recognition (ASR), Machine Translation (MT) and Text-To-Speech (TTS) technologies in a cascaded pipeline, aiming to automate the process fully or partially to make audio tracks in one language easily accessible in as many audio languages as possible. We spoke to AppTek’s Lead for Automatic Dubbing, Dr. Mattia Di Gangi, to find out more about how he got involved in the field, the current state of the art and AppTek’s award-winning solutions.

Q: Tell us a bit about yourself – where did you grow up, what languages do you speak?

A: I grew up in Palermo, Sicily, where there is not much happening in terms of software development, yet I had an interest in programming growing up – a typical case of playing a lot of video games as a kid and becoming curious about how they are developed. My first encounter with programming was working on simple scripts for webpages, and at high school I took a technical direction in which programming was part of the curriculum. It was then that I decided this is what I wanted to do as a job. I spent a year in environmental engineering at first, but chemistry was not my thing, so I enrolled in a BSc in computer science at the University of Palermo. I completed an MSc in computer science there as well, with a focus on machine learning (ML). This is when I had the opportunity to spend a semester in France at a university famous for Java development, which I enjoyed a lot, as that was the first time I took structured exams on software development and learned a lot about developing software in an operational way.

This semester in France also meant that on top of my English, I had to learn French very quickly in order to take my exams, which was not that hard given my native Italian. I can also speak basic Russian and, as I now live in Germany, I am studying German, but I am finding it more difficult than I expected.

Q: Are your hobbies related to language?

A: Most of my hobbies are completely unrelated to languages. I like reading a lot about software development and engineering, as this is paradoxically understudied at university though very important in my opinion. I reuse a good part of what I read in my daily work, and I also blog about software development and speech translation on Medium, as I like to popularize knowledge in these two domains.

I enjoy reading about history, economics and politics too, and since the pandemic I have taken up watching many TV shows. I do a bit of sport, and since I moved to Aachen I appreciate traveling by car in Europe, which I did not have the opportunity to do much of in Sicily, it being an island. Since the anti-Covid restrictions eased, I had been looking forward to traveling again to attend conferences. I started by taking a trip in early June to Ghent, Belgium, to attend EAMT 2022, where I was the recipient of the Best Thesis Award for 2020.

Q: How did you get involved in machine translation?

A: I was fortunate that the deep learning revolution happened when I was about to embark on a doctoral degree. I remember seeing a tool for the automatic captioning of images, which was unthinkable until then but became possible with deep learning. I was very impressed by it and decided to focus on deep learning in my PhD. I applied to the best programs in Italy and was accepted at the University of Trento with a scholarship from the Fondazione Bruno Kessler (FBK), which is a research institute with a history in artificial intelligence. I took up a position there in the MT group, under the direction of Marcello Federico, who later became principal research scientist at Amazon Web Services.

Q: Why did you choose to work on speech translation over any other MT topic?

A: Shortly after I started working at FBK, the research group decided to take a new direction in MT and develop models that would take audio as input and produce text translations as output, with no transcription in between, a technology called direct, or end-to-end, speech translation. I was very interested in this as other MT topics were not that exciting to me. There is some creativity in improving MT for focused use cases, but when you take a model that is already of good quality and fix the few remaining errors, the improvements yield diminishing returns. Direct speech translation was not only an interesting topic to work on, but it also left room for a lot of advancement, because when I started working in the field the technology was more or less unusable.

I ended up working in direct speech translation under the direction of Marco Turchi. At the time, there was only one public implementation for speech translation, by the University of Grenoble, which was functioning but slow to run. We then decided to write our own implementation by adapting an MT tool based on PyTorch, which was much faster, but the real problem was that it was very difficult to train speech translation systems. The translation output, when you start from audio in another language, was very poor compared to text translation output – the BLEU score was approximately half.

We needed more training data for the task but there was little available. As a result, we started working on MuST-C, a multilingual speech translation corpus, which became the largest corpus available for speech translation. It consisted of TED talks with high-quality transcriptions and translations by volunteers, which we cleaned up and delivered in 2019 in eight target languages out of English. The impact of this training set was huge. It has been used at the International Conference on Spoken Language Translation (IWSLT) ever since and has been expanded upon. It is because of this training set that I became quite well-known in the scientific community – deciding to work on speech translation was a major turning point in my career.

Q: How did you end up working for AppTek in Germany?

A: I finished my PhD in April 2020, in the middle of the first full lockdown in Italy. I had already completed an internship at Amazon in 2019, in Marcello Federico’s group in speech translation and, though I did have an offer to go back, I was hesitant because of the pandemic and opted to stay in Europe, closer to home. I knew I wanted to continue working on speech translation, and AppTek is one of the few companies in Europe that had both speech recognition and machine translation teams. It is also a regular participant at IWSLT, specifically in the direct speech translation track, so I was well aware of their work. AppTek’s Lead Science Architect in MT, Evgeny Matusov, was on the examiners’ board of my PhD, so I had come across the company and its staff on various occasions and thought highly of the group. As a result, it did not take much for me to be persuaded to move to Germany to oversee the company’s automatic dubbing pipeline.

Q: What do you think about the current state of speech translation and its future?

A: The first big step in terms of speech translation is the very fact that the pipeline has become usable for real use cases. This is not something obvious or simple. There are many ML models involved, which are imperfect if you think about it, and the errors they produce compound by the nature of the pipeline itself. So having speech translation output that sounds good and makes sense is a feat in itself. This is all very recent – it was only in 2019 that the first papers proposing a full dubbing pipeline came out.
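As a rough illustration of how errors compound in a cascade, the sketch below assumes that each stage handles a sentence correctly with some probability and that errors are independent across stages. Both the independence assumption and the accuracy figures are hypothetical and chosen only for illustration; they are not measurements of any real system.

```python
from math import prod


def pipeline_success_rate(stage_accuracies):
    """Chance that every stage gets a sentence right, assuming independent errors."""
    return prod(stage_accuracies)


# Hypothetical per-sentence accuracies for the three cascaded stages.
stages = {"asr": 0.9, "mt": 0.9, "tts": 0.9}

overall = pipeline_success_rate(stages.values())
print(f"Overall success rate: {overall:.3f}")
```

Under these assumptions, three stages that are each right 90% of the time leave only about 73% of sentences untouched by any error, which is why a usable end-to-end result is harder to reach than the quality of each individual model suggests.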

To be more precise, the systems that have been built so far are more about revoicing than dubbing. Dubbing is a specific type of revoicing which involves lip synchronization as well. The research community is primarily focusing on isochronous MT so far, not full lip synchrony, aside from some limited experimentation. The translations produced are based on measuring the length of the translated text, something that is also useful for subtitling, but for dubbing things are more complicated. AppTek has actually won first place at IWSLT’s isochronous speech translation task for the English to German language pair.

To achieve lip synchrony in dubbing, the focus needs to be on linguistically motivated approaches, such as counting syllables or phonemes, rather than counting characters, which is a human artefact. Predicting how long it takes to utter a sentence at a natural pace is also very difficult, as it is not a linear mapping from text to voice due to cultural specificities and strings of words that are typically uttered together rather than separately. I believe achieving synchrony on the basis of linguistic approaches will be a major milestone towards higher quality in automatic dubbing going forward.
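The difference between counting characters and counting syllables can be seen in a small sketch. The estimator below is a deliberately naive heuristic for English (it counts groups of consecutive vowel letters) and is only an assumption made for illustration; it is not the method used by AppTek or in the IWSLT task, and the function names are hypothetical.

```python
import re


def estimate_syllables(text: str) -> int:
    """Naive English estimate: one syllable per group of consecutive vowel letters."""
    words = re.findall(r"[a-z]+", text.lower())
    # Every word counts for at least one syllable, even without a vowel letter.
    return sum(max(1, len(re.findall(r"[aeiouy]+", word))) for word in words)


def char_length(text: str) -> int:
    """Length in characters, ignoring spaces."""
    return len(text.replace(" ", ""))


for word in ("strengths", "banana"):
    print(word, char_length(word), estimate_syllables(word))
```

Here "strengths" is the longer word by characters (9 versus 6), yet the heuristic gives it one syllable against three for "banana", so a character budget and a syllable budget can rank the same candidates in opposite order.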

Q: What are you working on at AppTek now and what are you planning on rolling out next?

A: My main focus at AppTek is to ensure we have a working automatic dubbing service by the end of the year. There are some aspects of the pipeline that need to be improved before then, which I am coordinating.

One aspect we’ve been working on is improving our speech placement algorithm, which decides the start and end points of each synthesized sentence. In voiceover such timing constraints are more relaxed than in lip sync dubbing, but you also need to take the action on the screen into account.

Another thing is improving the isochrony from English into languages with non-Latin scripts, such as Arabic or Chinese, in which the characters may have different granularities. Arabic sentences take longer to utter than their English counterparts and the TTS system ends up speaking faster to accommodate them, so the speech does not sound as natural as it could. Now our MT systems have the ability to generate sentences of the appropriate length, taking into account the different granularities of the scripts.
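The trade-off described here can be pictured with a small calculation: if a translated sentence would take longer to speak at a natural pace than the time slot of the original utterance allows, the synthesized voice has to be sped up by the ratio of the two durations. The function name, parameters and figures below are hypothetical, chosen for illustration only.

```python
def required_speedup(natural_duration_s: float, slot_duration_s: float) -> float:
    """Factor by which speech must be sped up to fit its slot (1.0 means no speed-up)."""
    if natural_duration_s <= 0 or slot_duration_s <= 0:
        raise ValueError("durations must be positive")
    # A sentence that already fits needs no speed-up, so the factor never drops below 1.0.
    return max(1.0, natural_duration_s / slot_duration_s)


# Hypothetical example: the translation needs 5.0 s at a natural pace,
# but the original utterance only lasted 4.0 s.
print(required_speedup(5.0, 4.0))
```

A translation whose natural duration is already close to the slot keeps this factor near 1.0, which is the motivation for controlling length at the MT stage rather than leaving the correction to the TTS system.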

We are also working on making speaker adaptation more robust to different recording conditions. This is related to speaker diarization too, since it uses the same speaker encoder models. If background audio and the microphone can affect speaker clustering, then they will affect the quality of synthesized voices as well.

But the main thing we all look forward to is rolling out the automatic dubbing production pipeline by the end of the year, not only because our clients are asking for it, but also for ourselves, so we can perform more experiments and gather data points, to better understand where things go wrong so we can find solutions to improve the output.

Q: Has AppTek’s speech translation service been used in real life so far?

A: We completed our first real-life project back in May. In collaboration with aiXplain, we provided a service for the ACL 2022 conference as part of the special Diversity & Inclusion Initiative 60-60, which was organized on the occasion of ACL’s 60th anniversary with the aim of making the conference accessible in 60 languages.

As part of this initiative, AppTek offered a 24-hour delayed speech translation service from English into Spanish for all conference plenary events (keynotes, panels, etc.) and videos. All of the talks during the conference were recorded, the videos were uploaded to a server together with automatically generated subtitles, and they were then automatically dubbed from English into Spanish.

Since then, the pipeline has been improved further and we have tried it with various language pair combinations. Several demos were showcased at IBC 2022.
