Video dubbing is the media localization process of revoicing a video from the source audio language into a new target language while offering a viewing experience equivalent to that of the original video. The workflow involves translating the script into the target language; the translation is then used to generate speech that reproduces the emotions of the source speaker, stays coherent with the body language of the actors on screen and matches their lip movements. Dubbing is renowned in the industry for its complexity, long turnaround times and high costs. Automatic video dubbing aims to revoice videos automatically, making them more easily accessible to audiences in other languages at a fraction of the cost.
“It is very exciting to work on a complex task such as automatic dubbing,” said Mattia Di Gangi, Lead for Automatic Dubbing at AppTek. “It is a very new research area, which means many developments are possible. The automatic dubbing pipeline itself did not become useable until 2019, which is not something obvious or simple. Having output that sounds lifelike and makes sense is a feat in itself. Now we need to do some more research to improve it further and capture the full range of emotion in speech.”
AppTek’s automatic dubbing pipeline uses a cascade of automatic speech recognition (ASR), neural machine translation (NMT) and text-to-speech (TTS) technologies. The NMT component is augmented with metadata features for style adaptation and output length control. Additionally, a speaker-adaptive TTS system reproduces the voice features of the original speaker for each given segment in the new target language.
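To make the cascade concrete, here is a minimal sketch of how the three stages could be chained per segment. It is an illustration under stated assumptions, not AppTek's implementation: the `SourceSegment` and `DubbedSegment` types, the `dub_segments` function and the `asr`, `nmt` and `tts` callables are all hypothetical names, and passing the style label and the segment duration to the translation step is only one possible way of representing the metadata features mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SourceSegment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # speaker label for this segment
    audio: bytes   # source audio for this segment


@dataclass
class DubbedSegment:
    start: float
    end: float
    speaker: str
    source_text: str   # ASR output
    target_text: str   # NMT output
    audio: bytes       # TTS output in the target language


def dub_segments(
    segments: List[SourceSegment],
    asr: Callable[[bytes], str],
    nmt: Callable[[str, str, float], str],
    tts: Callable[[str, str], bytes],
    style: str = "neutral",
) -> List[DubbedSegment]:
    """Run a hypothetical ASR -> NMT -> TTS cascade over each segment."""
    dubbed = []
    for seg in segments:
        # 1. Recognize the source speech.
        source_text = asr(seg.audio)
        # 2. Translate, passing style and target duration as metadata
        #    (an assumed interface for style and length control).
        target_text = nmt(source_text, style, seg.end - seg.start)
        # 3. Synthesize speech conditioned on the original speaker.
        audio = tts(target_text, seg.speaker)
        dubbed.append(
            DubbedSegment(seg.start, seg.end, seg.speaker,
                          source_text, target_text, audio)
        )
    return dubbed
```

Because each stage is passed in as a callable, any component can be swapped or stubbed out for testing without changing the cascade itself.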
Today, the system translates video content and dubs the output videos in a voice-over style: the original audio is dipped into the background while a new voice track is rendered over it at a natural volume, usually with a delay of a few frames. Demos are available here.
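The voice-over mix described here can be sketched in a few lines. This is a simplified illustration, not the production mixer: the function name `mix_voice_over`, the `duck_gain` value and the choice to dip the original only while the dubbed voice is active are assumptions, and the delay is expressed in audio samples rather than video frames. Audio is represented as plain lists of floats in the range [-1.0, 1.0].

```python
from typing import List


def mix_voice_over(
    original: List[float],
    dubbed: List[float],
    delay: int = 0,
    duck_gain: float = 0.2,
) -> List[float]:
    """Overlay `dubbed` on `original`, starting `delay` samples in.

    While the dubbed voice is active, the original is attenuated by
    `duck_gain` (assumed value); elsewhere it plays at full volume.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    length = max(len(original), delay + len(dubbed))
    mixed = []
    for i in range(length):
        # Treat anything past the end of the original as silence.
        background = original[i] if i < len(original) else 0.0
        j = i - delay  # index into the dubbed track
        if 0 <= j < len(dubbed):
            sample = background * duck_gain + dubbed[j]
        else:
            sample = background
        # Hard-clip to the valid range to avoid overflow after summing.
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed
```

A real mixer would typically fade the gain in and out rather than switch it abruptly, which avoids audible clicks at segment boundaries.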
The AppTek science team is currently working on improving the speaker diarization component, which partitions audio streams into homogeneous segments according to speaker identity. The team is also improving the voice quality and timing of the generated speech, and fine-tuning the NMT output length control so that the translation better matches the length of the source speech and the synthesized speech, in turn, better matches the original audio segment. Research is also focusing on detecting emotions in the source audio, matching them with the recognized text and annotating that text with emotion tags as part of the supporting metadata. Annotating the output text with emotions as well, at the word level, will allow the TTS system to use this metadata to generate emotion-aware output.
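As an illustration of what word-level emotion annotation could look like, the sketch below projects time-stamped emotion detections onto time-stamped recognized words. It assumes, hypothetically, that the ASR output provides word timings and that an emotion detector returns labeled time spans; the `tag_words` function and the midpoint rule are illustrative choices, not a description of AppTek's research system.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]          # (token, start_sec, end_sec)
EmotionSpan = Tuple[float, float, str]   # (start_sec, end_sec, label)


def tag_words(
    words: List[Word],
    spans: List[EmotionSpan],
    default: str = "neutral",
) -> List[Tuple[str, str]]:
    """Assign each word the label of the first span containing its midpoint.

    Words whose midpoint falls in no span receive `default`.
    """
    tagged = []
    for token, start, end in words:
        midpoint = (start + end) / 2
        label = default
        for span_start, span_end, emotion in spans:
            if span_start <= midpoint < span_end:
                label = emotion
                break
        tagged.append((token, label))
    return tagged
```

For example, `tag_words([("I", 0.0, 0.2), ("won", 0.2, 0.6)], [(0.15, 0.7, "joy")])` tags "I" with the default label and "won" with "joy". Tags of this kind would still have to be carried across to the translated text before a TTS system could use them.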
The last step for a fully-fledged automatic dubbing system would be to add lip synchronization to the generated output using technologies seen in modern deepfake videos, as lip sync is a strict requirement for close-ups in the dubbing workflow.
“AppTek’s automatic dubbing pipeline promises to be a complete game changer in media localization,” said Kyle Maddock, AppTek’s SVP of Marketing. “It is a very ambitious pioneering project that aims to make dubbing available not only to the media and entertainment industry but to many other use cases as well. Our immediate product releases will support voice-over mode, and we will also offer a human-in-the-loop approach to allow users more control over the output.”
AppTek’s submission to this year’s EAMT conference was presented by Mattia Di Gangi, who was also the recipient of the Best Thesis Award for his work on Neural Speech Translation.
View the full technical paper submitted to the EAMT here.
AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS). The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.