Since their market introduction at the turn of the century, voice writing (also known as respeaking) workflows for live captioning have taken off. Their growing adoption has been driven by steadily improving quality, now reported to match stenocaptioner accuracy levels, a finding confirmed by Dr. Pablo Romero-Fresco, a leading authority on live captioning.
While the quality of voice writing today is impressive, speech misrecognitions have been a frequent source of jokes since the workflow was introduced into live captioning. Reports of awkward errors swiftly make the headlines, as the transparent nature of closed captioning leaves it open to widespread criticism.
Live captioning quality has been the focus of a recent FCC petition for rulemaking, filed by consumer groups representing the interests of the deaf and hard-of-hearing (HOH) community. Dr. Romero-Fresco submitted his own comments to this petition, stating that “discussions on quality in captioning would benefit from referring to models of assessment that can actually give us a good reflection of the experience of the viewers.”
To that end, Dr. Romero-Fresco developed the NER model with Juan Martínez for assessing captioning quality. The model is based on idea units as opposed to the focus on words that is central to various other models currently used in the US.
The NER model distinguishes two types of errors, arising from the editing and the recognition process respectively, which are given different weights depending on their severity.
• Edition errors are typically caused by the captioner misjudging how to edit the text, since it cannot all be reproduced verbatim.
• Recognition errors concern the relationship between the captioner and the software, whether in a stenocaptioning or a voice writing workflow.
The NER achievement threshold is 98% for captions to be considered good enough for broadcast, the equivalent of a 5 on a 1-10 scale. Some captioners reach scores in the 99% range.
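The arithmetic behind a NER score can be sketched in a few lines. The formula is NER = (N − E − R) / N × 100, where N is the number of words, E the weighted sum of edition errors, and R the weighted sum of recognition errors; the severity weights below (0.25 minor, 0.5 standard, 1.0 serious) follow the published NER model, but the specific error counts are invented for illustration.

```python
# Illustrative NER calculation. Severity weights per the published model:
# minor = 0.25, standard = 0.5, serious = 1.0. The sample errors are made up.

def ner_score(n_words, edition_errors, recognition_errors):
    """Return NER = (N - E - R) / N * 100.

    `edition_errors` and `recognition_errors` are lists of severity
    weights, one entry per error.
    """
    e = sum(edition_errors)
    r = sum(recognition_errors)
    return (n_words - e - r) / n_words * 100

# Example: a 1000-word sample with one serious and one standard edition
# error, plus one standard recognition error.
score = ner_score(1000, edition_errors=[1.0, 0.5], recognition_errors=[0.5])
print(f"NER score: {score:.2f}%")          # NER score: 99.80%
print("Broadcast quality:", score >= 98)   # Broadcast quality: True
```

Because errors are weighted rather than merely counted, a handful of minor slips costs far less than a single serious, meaning-altering error, which is what ties the score back to viewer comprehension.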
Because voice writing is still a relatively recent practice, extensive data are not yet available to compare quality advances in live captions over time. However, since the NER model was implemented in the UK, quality has noticeably improved, for two reasons: the ASR itself got better, and captioners were trained alongside the model. “By focusing on user comprehension in order to train NER evaluators,” Dr. Romero-Fresco explains, “we are also encouraging captioners to think about what will be understood by the viewers.” This resulted in a reduction in the number of serious edition errors made by captioners during the NER model’s two-year trial with all UK public broadcasters.
With regard to automating captions, Dr. Romero-Fresco reports being surprised by the high quality scores he and his research group GALMA have found in the samples they analyzed in Europe. Some of these captions came close to the 98% NER threshold when errors related to the lack of punctuation were not factored in. When auto-punctuation was applied and the relevant errors were factored into the NER formula, the overall quality was significantly lower than what, in his experience, a professional would produce.
Punctuation restoration is indeed a relatively small research field in the scientific community. But it has been a focus area for ASR improvement at AppTek, who have trained their punctuation models on billions of sentences, as opposed to the millions that are the market standard. In doing so, AppTek has achieved significant improvements in its punctuation and capitalization models, integrating them for 23 languages to date. For each language, AppTek offers a fast model for real-time captioning and a slower, more accurate model for offline processing.
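Punctuation restoration is commonly framed as sequence labeling: for each token in the raw ASR output, a trained model predicts the punctuation mark (if any) that follows it, and capitalization is then restored at sentence starts. The sketch below shows only the tag-application step, with hand-written tags standing in for a model's predictions; it does not represent AppTek's actual implementation or API.

```python
# Apply predicted punctuation tags to a raw ASR token stream.
# Each tag is one of "" (no punctuation), ",", ".", or "?";
# in a real system a trained sequence model would supply them.

def restore(tokens, tags):
    out = []
    capitalize_next = True  # capitalize the first word of each sentence
    for word, tag in zip(tokens, tags):
        if capitalize_next:
            word = word.capitalize()
            capitalize_next = False
        out.append(word + tag)
        if tag in {".", "?"}:  # sentence-final marks start a new sentence
            capitalize_next = True
    return " ".join(out)

tokens = ["good", "evening", "here", "are", "tonight's", "headlines"]
tags   = ["",     ",",       "",     "",    "",          "."]
print(restore(tokens, tags))
# Good evening, here are tonight's headlines.
```

The hard part, of course, is predicting the tags, which is where large training corpora matter; the application step itself is trivial, which is why punctuation quality depends almost entirely on the underlying model.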
AppTek’s US English live automated captions have also recently undergone NER evaluation by a Canadian broadcaster that uses the company’s ASR services. Four tests were carried out during October 2019 on local news programming. The resulting scores were all above 97%, just missing the 98% NER threshold. This is a significant outcome given that the evaluation included punctuation errors.
Clearly, the quality of automated captions varies with the broadcast. Programs with fewer speakers, scripted delivery, and little background noise make things a lot easier for ASR. Dr. Romero-Fresco explains that another issue with automated captions is the lack of control over the severity of the errors the software may produce, or of a way to correct them after they’ve been broadcast, as voice writers would do. There is a risk that the software may produce serious errors which could cause misinformation; so much so that the deaf community typically refers to such errors as “lies.” This risk needs to be mitigated.
Fully automated captions are increasingly used by broadcasters around the world, Dr. Romero-Fresco points out. In some countries, broadcasters opt to analyze the quality of such automated captions first, compare them to the output of professional captioners, and then decide whether to use them. In other cases the captions go on air first and are analyzed later; ideally this happens only after the public has been properly informed of the automated nature of the captioning service and the possibility of errors, so as to avoid potential misinformation.
Dr. Romero-Fresco is certain that the role of ASR will become even greater in the future. He believes that there will still be situations where voice writers and stenocaptioners will be needed, and there may be alternative workflows where ASR output is edited live. An exciting prospect for live captioners is the provision of interlingual live captions, which in Europe is offered through voice writers who listen to the audio in one language and translate it into captions in another. He also believes that his role as a researcher is to use models like NER that can assess the quality of captions and ensure that whatever the generated output, automatic or not, “the ultimate goal is to provide high-quality access for the viewers.”
With ASR technology continuously improving, a lot of attention is being given to the quality of automated captions. Broadcasters see automation as a godsend given the increasing volumes of programming that require captioning without corresponding increases in production budgets.
ASR is already proving to be a reliable backup solution in cases of captioner unavailability, and many believe that such use could be further extended. In our next post, we will speak to Pavel Golik, AppTek’s senior speech scientist, about the current state of the art in ASR, its breakthroughs and limitations, and the work underway to adapt it to the needs of the captioning market.
AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.