The International Conference on Spoken Language Translation (IWSLT) is an annual scientific conference dedicated to all aspects of spoken language translation. For the past 20 years, it has organized and published key evaluation campaigns on the topic, which have included the creation of data sets, benchmarks, and metrics, and which have helped define the progress made in the field. Given the rising visibility of, usage of, and interest in subtitle translation in both live and offline environments, this year the conference included, for the first time, a shared task on subtitle translation in addition to its simultaneous and consecutive translation tasks.
AppTek is a regular participant in the competitive evaluations at IWSLT and a leader in the field of human language technologies (HLT) for media localization. As such, the company was asked to contribute to the setup of a subtitle track for the 2023 conference. This was made possible by the contribution of original video and subtitle data from ITV Studios and Peloton, which was used for the automatic subtitling track alongside other publicly available data. Conference participants were asked to generate subtitles for audiovisual documents belonging to different domains with increasing levels of complexity. AppTek also successfully participated in the task of controlling the formality level of machine translation output.
The task of automatic subtitling is multi-faceted: starting from speech, not only does the translation have to be generated, but it must also be segmented into subtitles compliant with constraints that ensure a high-quality user experience, such as an appropriate reading speed, synchronicity with the dialogue, a maximum number of subtitle lines and characters per line, etc.
The IWSLT evaluation task was to produce subtitles that comply with a maximum reading speed of 21 characters per second, with no more than 42 characters per line and at most two lines per subtitle. Participants were expected to use only the audio track from the provided videos (dev and test sets); the video track was provided primarily to verify synchronicity and the appropriate display of subtitles on the screen.
Participants were asked to automatically create subtitles in German and/or Spanish for audio-visual content in English, collected from the following sources:

- TED talks
- EPTV interviews
- fitness videos from Peloton
- TV series from ITV Studios
There were two conditions: a constrained one, in which only a pre-defined list of training data resources was allowed, and an unconstrained one, without any data restrictions.
The evaluation of the output was based on assessing subtitle quality, translation quality, and subtitle compliance. The generated subtitles were compared against reference subtitles created by professional translators using automatic measures such as SubER (subtitle edit rate) and the subtitle quality metric Sigma.
Translation quality alone was measured with the established metrics BLEU (surface similarity with the reference translation) and BLEURT (semantic similarity based on sentence embeddings).
Subtitle compliance was measured by computing the rate of subtitles with a reading speed equal to or lower than 21 characters per second (CPS), the rate of lines not exceeding 42 characters (CPL), and the rate of subtitles with at most two lines (LPB).
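To make the three compliance rates concrete, here is a simplified sketch of how such a check could be computed. This is illustrative only, not the official IWSLT scoring code, and the subtitle representation (a tuple of start time, end time, and a list of lines) is an assumption:

```python
def compliance_rates(subtitles, max_cps=21.0, max_cpl=42, max_lpb=2):
    """Compute CPS/CPL/LPB compliance rates for a list of subtitles.

    Each subtitle is assumed to be a (start_sec, end_sec, [line, ...]) tuple.
    CPS and LPB are fractions of compliant subtitles; CPL is the fraction
    of compliant lines.
    """
    cps_ok = lpb_ok = 0
    total_lines = cpl_ok_lines = 0
    for start, end, lines in subtitles:
        n_chars = sum(len(line) for line in lines)
        duration = max(end - start, 1e-6)  # guard against zero duration
        if n_chars / duration <= max_cps:  # reading speed <= 21 chars/sec
            cps_ok += 1
        if len(lines) <= max_lpb:          # at most two lines per subtitle
            lpb_ok += 1
        for line in lines:
            total_lines += 1
            if len(line) <= max_cpl:       # at most 42 characters per line
                cpl_ok_lines += 1
    n = len(subtitles)
    return {
        "CPS": cps_ok / n,
        "CPL": cpl_ok_lines / total_lines,
        "LPB": lpb_ok / n,
    }
```

A compliant system would score close to 1.0 on all three rates; the task's "almost perfect subtitle compliance" corresponds to exactly that.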
AppTek participated in both the constrained and unconstrained conditions. In the former, AppTek submitted runs for both the English-to-German and English-to-Spanish language pairs, primarily using a cascade architecture composed of the following modules: a neural encoder-decoder ASR model; a neural machine translation (MT) model trained on the data allowed in the constrained track, with the source (English) side lowercased and normalized to resemble raw ASR output, and adapted to the IWSLT subtitling domains; and a subtitle line segmentation model (AppTek's Intelligent Line Segmentation (Matusov et al., 2019)).
A contrastive run was generated for the English-to-German language pair only, using a direct speech translation system that translates English speech directly into target-language text. The challenge in this scenario was predicting the start and end times of the subtitles in the target language; this was solved with a CTC model that generates word-level timestamps, followed by AppTek's Intelligent Line Segmentation model. For the English-to-German language pair, AppTek also submitted a run in the unconstrained setup, employing a cascade architecture consisting of: a neural encoder-decoder CTC ASR model, a neural punctuation prediction model with inverse text normalization, an MT model adapted to the IWSLT domains, and AppTek's Intelligent Line Segmentation.
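The staging of such a cascade can be sketched schematically. Every function below is a trivial placeholder, an assumption for illustration only, standing in for the actual neural ASR, punctuation/ITN, MT, and line segmentation models:

```python
def asr(audio_words):
    # Stand-in for neural ASR: lowercase, unpunctuated tokens with timestamps.
    return [(w.lower().strip(".,"), t) for w, t in audio_words]

def punctuate_and_itn(tokens):
    # Stand-in for punctuation prediction + inverse text normalization.
    words = [w for w, _ in tokens]
    return " ".join(words).capitalize() + "."

def translate(sentence):
    # Stand-in for the domain-adapted MT model (identity mapping here).
    return sentence

def segment(sentence, max_cpl=42):
    # Greedy stand-in for intelligent line segmentation: wrap at 42 chars.
    lines, current = [], ""
    for word in sentence.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > max_cpl and current:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

def cascade(audio_words):
    # The full pipeline: ASR -> punctuation/ITN -> MT -> line segmentation.
    return segment(translate(punctuate_and_itn(asr(audio_words))))
```

The key design point of a cascade is exactly this composability: each stage can be trained, adapted, and replaced independently, whereas a direct end-to-end model must learn segmentation and timing jointly with translation.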
Two other teams participated in the subtitling track of the IWSLT 2023 evaluation. FBK (Fondazione Bruno Kessler from Trento, Italy) submitted primary runs for the two language pairs, generated by a direct neural speech translation model trained in the constrained setup. MateSub submitted primary runs for the two language pairs, automatically generated by the back-end subtitling pipeline of MateSub, a web-based tool that supports professionals in the creation of high-quality subtitles. This pipeline is also based on a cascade architecture, composed of ASR, text segmentation, and MT neural models, and covers any pair from about 60 languages and their variants, including the two language pairs of the task. Since MateSub is production software, its neural models are trained on more resources than those allowed for the constrained condition; its submissions therefore fall into the unconstrained setup.
The results of the evaluation on the blind test data showed that, averaged over the 4 domains, AppTek achieved the lowest SubER scores with its primary submissions for English-to-German in both the constrained and unconstrained conditions, with the overall best results for the latter. For English-to-Spanish, MateSub obtained the overall lowest SubER with its unconstrained system, whereas AppTek was the winner in the constrained condition.
The team observed that, in terms of domain difficulty, the TV series (from ITV) posed the most challenges for automatic subtitling. This has to do with the diverse acoustic conditions in which speech is found in movies and series: background music, noises, shouts, and crosstalk. All of this makes recognizing the speech quite challenging, which results in error accumulation in the downstream components. AppTek's unconstrained systems performed significantly better on this domain, which shows the importance of training on additional data that is more representative of real-life content.
The second-hardest domain was the fitness videos from Peloton. Here, despite generally clear single-speaker audio with little background noise, the challenge lies in the MT: some of the fitness- and sports-specific terminology and slang is difficult to translate into its German and Spanish equivalents.
Surprisingly, even the EPTV interviews presented significant challenges for subtitling, although the topics discussed in them are found in abundance in the speech-to-text and text-to-text parallel data allowed for the constrained condition (e.g., transcriptions and translations of the speeches and debates of the European Parliament). Here, issues such as spontaneous speech with many pauses, as well as speaker separation, may have caused some of the errors.
The TED talks, which have been the main domain for IWSLT evaluations in past years, were the easiest to subtitle automatically.
Whereas automatic subtitles for TED talks may currently require only minimal human correction, or can even be shown unedited on the screen, for the other three domains the automatic subtitles will require post-editing. This shows the importance of running evaluations not only under very controlled conditions, as in the case of TED talks, but on a variety of real-life content where multiple research challenges in speech translation are yet to be overcome.
This year's direct speech translation systems seem too weak to compete with cascaded approaches. In particular, a full end-to-end approach that directly generates subtitle boundaries is currently inferior to systems that adopt a dedicated solution for segmenting the text (such as AppTek's Intelligent Line Segmentation). Such dedicated solutions lead to almost perfect subtitle compliance. But even in terms of pure speech translation quality, as measured, e.g., with BLEU and BLEURT, the cascaded systems currently provide better translations, even under constrained training data conditions.
Regarding the automatic metrics used in the evaluation, we observed that the subtitle quality metric Sigma, but also the pure MT quality metrics, exhibit some discrepancies in how the performance of the same system is ranked across the four domains. This ranking sometimes differs depending on whether BLEU, ChrF, or BLEURT is chosen as the "primary" metric. These discrepancies highlight the importance of human evaluation, which we did not conduct this time. One reason is that in most prior research, automatic subtitling quality is evaluated in post-editing scenarios, which are too expensive to run on significant amounts of data, as they require professional subtitle translators. On the other hand, as mentioned above, for 3 out of 4 domains the quality of the automatically generated subtitle translations is low, so an evaluation of the user experience when watching subtitles would also be challenging, especially if users had to assign evaluation scores to individual subtitles or sentences. With all of this in mind, human evaluation was postponed to the next edition of the subtitling track at IWSLT.
Overall, this first edition of the subtitling track emphasized the crucial role of the following speech processing components: noise reduction and/or speech separation, speaker diarization, and sentence segmentation. So far, these have been underestimated in speech translation research, and current automatic solutions do not reach the level of quality necessary for subtitling. At AppTek, we will conduct further and deeper research in these areas, for which subtitle translation is a good test case.
Different languages encode formality distinctions in different ways, including the use of honorifics, grammatical registers, verb agreement, pronouns, and lexical choices. While MT systems typically produce a single generic translation for each input segment, spoken language translation requires the translation output to be appropriate to the context of communication and target audience. The formality control track of IWSLT challenges machine translation systems to generate translations of different formality levels.
Given a source text in English, and a target formality level of either “formal” or “informal”, the goal in formality-sensitive MT is to generate a translation in the target language that accurately preserves the meaning of the source text and conforms to the desired formality level.
This year, AppTek participated in the formality track for the first time, for translations from English (EN) into European Portuguese (PT) and Russian (RU), under the "zero-shot" condition, i.e., without using any training data specific to the task. In fact, AppTek's production neural MT systems for several language pairs already support formality control as one of many metadata-based controls, alongside gender, dialect information, and others (Matusov et al., 2020). The formality level in these systems is encoded with a pseudo-token at the beginning of each training source sentence, with one of three values: formal, informal, or no style. AppTek's systems participating in the evaluation were thus unconstrained, since they were trained on large amounts of public parallel data (millions of sentence pairs) that were not part of the provided data for the constrained condition.
AppTek trains its systems to support formality control by partitioning the training data into the three formality classes mentioned above. This is done with regular expressions for the target language that match the specific forms of second-person pronouns typical of formal or informal speech, as well as the corresponding verb forms. Additional smoothing and data balancing methods are applied during training to ensure that each of the three formality classes is trained on enough examples.
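A minimal sketch of such regex-based partitioning for Russian might look as follows. The patterns below are simplified assumptions for illustration (the production rules also cover verb agreement and many more pronoun forms), and the pseudo-token format is likewise hypothetical:

```python
import re

# Capitalized second-person pronouns (Вы, Вас, ...) conventionally mark
# formal address in written Russian; lowercase ты-forms mark informal address.
FORMAL_RU = re.compile(r"\b(Вы|Вас|Вам|Вами|Ваш\w*)\b")
INFORMAL_RU = re.compile(r"\b(ты|тебя|тебе|тобой|твой|твоя|твоё|твои)\b",
                         re.IGNORECASE)

def formality_class(target_sentence):
    """Assign one of the three formality classes to a target sentence."""
    formal = bool(FORMAL_RU.search(target_sentence))
    informal = bool(INFORMAL_RU.search(target_sentence))
    if formal and not informal:
        return "formal"
    if informal and not formal:
        return "informal"
    return "no_style"  # ambiguous, or no second-person markers at all

def tag_source(source_sentence, target_sentence):
    # Prepend the pseudo-token that encodes the formality class, so the MT
    # model can be conditioned on it at inference time.
    return f"<{formality_class(target_sentence)}> {source_sentence}"
```

As the Portuguese results above suggest, the weak point of this approach is pronoun ambiguity: when a regex cannot reliably separate formal from informal forms, sentences land in the wrong class and the control signal becomes noisy.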
For the IWSLT evaluation, we did not perform any experiments, but simply set AppTek's MT API parameter "style" to "formal" or "informal" and translated the blind evaluation data with AppTek's production systems, trained as described above.
Among the 5 participants in the unconstrained condition, we obtained the best results for English-to-Russian in terms of the MT evaluation metrics BLEU and COMET (higher scores are better), while producing the correct formality level for more than 98% of the sentences. The second-best competitor system obtained a formality accuracy of 100%, but scored 1.7% BLEU absolute lower for the formal class and 0.9% for the informal class.
For English-to-Portuguese, our system scored second in terms of automatic MT quality metrics and correctly produced the formal style for 99% of the sentences in the evaluation data. However, when the informal style was requested, our system could generate it in only 64% of the cases. We attribute this low score to the imperfect regular expressions we defined for informal Portuguese pronouns and corresponding verb forms, since some of them are ambiguous. At AppTek, we strive to continuously improve our NMT systems, and competitive evaluations such as IWSLT help us to find areas for improvement.
AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek's solutions cover a wide array of languages/dialects, channels, domains and demographics.