For this and another soon-to-be-published post, we had the pleasure of speaking with Dr. Pablo Romero-Fresco, a leading expert, trainer, academic and author on live subtitling and captioning, to get his insights on the state of live captioning today. He shares some interesting and highly relevant behind-the-scenes information on voice writing, an important methodology for producing live captions.
But first, let’s start with understanding what voice writing is.
Voice writing, or respeaking as it is called in Europe, was introduced in live subtitling and captioning workflows around the turn of the 21st century as an alternative to stenocaptioning.
Voice writing’s origins can be traced back to the stenomasking technique developed for court reporters in the 1940s. The concept has seen a renaissance thanks to digital recording systems and higher-quality Automatic Speech Recognition (ASR) technology. Unlike stenography, which requires much more extensive training, voice writing training can be completed in just 6-12 months. As a result, the practice has become popular in recent years in the courts and in state judicial procedures that allow for it.
Voice writing has enjoyed similar success among medical transcriptionists and live captioners. Introduced to live broadcast captioning in the USA in 2004, it is now one of the mainstream methods for producing those captions. Ms Darlene Parker, Steno Captioning and Realtime Relations Director at the National Captioning Institute (NCI), estimates that the split between stenocaptioners and real-time voice writers now employed in the US captioning industry may be approaching 50-50. Voice writing has made even more headway abroad. In the UK, for example, respeaking accounts for as much as 90% of use, compared with just 10% for stenotyping.
Dr. Romero-Fresco explains that while voice writing quality is assessed through several factors, the two key interconnected issues are accuracy and latency: “These captions are likely to have both errors and delay.” One then has to decide where to strike the balance between the two and adopt a workflow that supports it. He notes that “You work towards minimizing errors or latency, but it’s tricky to achieve both.”
For instance, in France the focus is on quality. Three to four people are involved in the process, including a respeaker, a corrector and a whisperer who suggests corrections that might have been missed. This brings quality close to 100%, but the latency can easily reach or even exceed 10 seconds. In the UK, US and Canada, the workflow involves a single person who self-corrects, but only after the errors have been broadcast. The latency in such cases is much lower, between 4 and 6 seconds.
This is important because latency can prove very disturbing to audiences, and it tends to be higher with voice writing than with stenocaptioning. Unfortunately, there is currently not enough data available to determine what latency the deaf community considers acceptable.
In the voice writing workflow, a latency of 3 seconds is the absolute minimum when speaking at a normal pace. That breaks down into 1 second to listen, 1 to respeak (rarely the case, given that pauses often need to be introduced even when there aren’t any in the original speech), and 1 for the software to output the caption. Some broadcasters are implementing methods to circumvent this issue by giving the voice writer access to the audio feed earlier than the viewers. That reduces the delay by an extremely valuable 3-5 seconds, and can make latency completely disappear at the viewer end.
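The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the figures quoted in this post: the names `LatencyBudget` and `viewer_latency` are hypothetical, and the model assumes the three stages simply add up and that any head start on the audio feed is subtracted from the total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Per-stage delays in seconds (defaults are the 1 + 1 + 1 minimum quoted above)."""
    listen_s: float = 1.0   # time to hear the original speech
    respeak_s: float = 1.0  # time to respeak it (often longer in practice)
    output_s: float = 1.0   # time for the software to output the caption

    def total(self) -> float:
        # Assumption: the stages run one after another, so their delays add up.
        return self.listen_s + self.respeak_s + self.output_s


def viewer_latency(budget: LatencyBudget, early_feed_s: float = 0.0) -> float:
    """Delay seen by the viewer when the voice writer hears the audio early.

    Assumption: the head start is subtracted from the total, floored at zero.
    """
    return max(0.0, budget.total() - early_feed_s)


if __name__ == "__main__":
    minimum = LatencyBudget()
    print(minimum.total())               # theoretical minimum
    print(viewer_latency(minimum, 3.0))  # with a 3-second head start
```

With the default budget, a 3-second head start on the audio feed is enough to cancel the theoretical minimum; a slower respeaking stage leaves some residual delay unless the head start is longer.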
Dr. Romero-Fresco notes that what is most important when it comes to live captioning is not just how much can be captioned but how much viewers can read while still being able to see the images.
For instance, an interesting discussion is taking place in Canada about whether play-by-play in sports needs to be captioned at all. Delivery is typically very fast, and the captions are so fast that the viewers end up reading a description of what’s being seen instead of being able to actually see it. In an ironic twist, he comments that “it’s almost like we are turning people who are deaf into blind.”
His research into a new, engagement-based, ability-driven approach instead focuses on captioners learning to look for solutions that make the most of the viewer’s abilities. That may mean captioning less and letting a viewer see and experience as much as possible in real-time.
When it comes to live captioning, clearly there is still progress to be made in facilitating the deaf/hard-of-hearing viewers’ engagement and giving them access on par with those who can hear. Fortunately, technological advances are enabling voice writers and broadcasters to move steadily toward that balance.
In our next article, we’ll continue our discussion with Dr. Romero-Fresco about live captioning quality, including his NER model for caption quality measurement and technology advances supporting automated captions. Stay tuned!
Dr. Pablo Romero-Fresco is a distinguished, internationally known live captioning expert whose qualifications and credentials are among the best in the world. He has led many studies involving members of the deaf community, and advises regulators in various countries on caption standards. He developed, along with Juan Martínez, the NER model for measuring live captioning quality. He currently runs certification programs for NER evaluators (including deaf evaluators) who provide input that helps refine the model further. He also leads the EU-funded ILSA project on interlingual respeaking, and currently teaches at the University of Vigo in Spain, after several years at universities in the UK.
On top of that outstanding resume, Dr. Romero-Fresco’s other passion is filmmaking. He is heavily involved in accessible filmmaking, an effort to incorporate localization and accessibility ‘upstream’ in the planning and filming process, which makes films easier to localize and make accessible later on. He was recently invited to present on this model to Netflix, who are beginning to integrate accessible filmmaking into their productions.
AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS). The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.