The State of Live Captioning Today ― An Expert’s Perspective | Accessibility Series Part 4

February 13, 2020

Voice writing, or respeaking, is “a technique in which a respeaker listens to the original sound of a live program or event and respeaks it, including punctuation marks and some specific features for the deaf and hard of hearing audience, into a speech recognition software, which turns the recognized utterances into subtitles displayed on the screen with the shortest possible delay.”

Pablo Romero-Fresco, 2011, Subtitling Through Speech Recognition: Respeaking

For this and another soon-to-be-published post, we had the pleasure of speaking with Dr. Pablo Romero-Fresco, a leading expert, trainer, academic and author on live subtitling and captioning, to get his insights on the state of live captioning today. He shares some interesting and highly relevant behind-the-scenes information on voice writing, an important methodology for producing live captions.

But first, let’s start with understanding what voice writing is.

Voice writing, or respeaking as it is called in Europe, was introduced in live subtitling and captioning workflows around the turn of the 21st century as an alternative to stenocaptioning.

Voice writing’s origins can be traced back to the stenomasking technique developed for court reporters in the 1940s. The concept has seen a renaissance thanks to digital recording systems and higher-quality Automatic Speech Recognition (ASR) technology. Whereas stenographers require much more extensive training, voice writing training can be completed in just 6 to 12 months. As a result, the practice has become very popular in recent years in the courts and in state judicial procedures that allow for it.

Voice writing has enjoyed similar success among medical transcriptionists and live captioners. Introduced in 2004, it is today one of the mainstream methods for producing live broadcast captions in the USA. Ms. Darlene Parker, Steno Captioning and Realtime Relations Director at the National Captioning Institute (NCI), estimates that the split between stenocaptioners and real-time voice writers now employed in the US captioning industry may be approaching an even 50-50. Voice writing has made much more headway abroad. In the UK, for example, respeaking is used as much as 90% of the time, versus just 10% for stenotyping.

A Balancing Act

Dr. Romero-Fresco explains that while voice writing quality is assessed through several factors, the two key, interconnected issues are accuracy and latency: “These captions are likely to have both errors and delay.” One then has to decide where to strike the balance between the two and use a workflow that supports it. He notes, “You work towards minimizing errors or latency, but it’s tricky to achieve both.”

For instance, in France the focus is on quality. Three to four people are involved in the process, including a respeaker, a corrector and a whisperer who suggests corrections that might have been missed. This guarantees close to 100% quality, but the latency can easily reach or even exceed 10 seconds. In the UK, US and Canada, the workflow involves a single person who self-corrects, but only after the errors have been broadcast. The latency in such cases is much lower, between 4 and 6 seconds.

This is important because latency can prove very disturbing to audiences, and it tends to be higher with voice writing than with stenocaptioning. Unfortunately, there is currently not enough data available to determine what latency the deaf community considers acceptable.

In the voice writing workflow, a latency of 3 seconds is the absolute minimum when speaking normally: 1 second to listen, 1 to respeak (rarely that short, given that pauses often need to be introduced even when there aren’t any in the original speech), and 1 for the software to output the caption. Some broadcasters work around this by giving the voice writer access to the audio feed earlier than the viewers. That reduces the delay by an extremely valuable 3 to 5 seconds, and can make latency disappear completely at the viewer end.
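The latency arithmetic above can be sketched in a few lines of Python. This is only an illustration: the one-second stage durations are the figures quoted in the article, while the function names and the `early_feed_s` parameter are hypothetical and do not describe any broadcaster's actual system.

```python
# Illustrative latency budget for a voice writing workflow.
# Stage durations (1 s each) are the figures quoted in the article;
# all names below are hypothetical.

def workflow_latency(listen_s: float = 1.0,
                     respeak_s: float = 1.0,
                     software_s: float = 1.0) -> float:
    """Delay in seconds between the original speech and the caption output."""
    return listen_s + respeak_s + software_s


def viewer_latency(workflow_s: float, early_feed_s: float = 0.0) -> float:
    """Delay seen by the viewer when the voice writer hears the audio
    early_feed_s seconds before broadcast. Never drops below zero."""
    return max(0.0, workflow_s - early_feed_s)


minimum = workflow_latency()
print(minimum)                                    # 3.0
print(viewer_latency(minimum, early_feed_s=3.0))  # 0.0
print(viewer_latency(5.0, early_feed_s=3.0))      # 2.0
```

With the 3-second minimum and a 3-second head start, the viewer-side delay is zero; a slower 5-second workflow with the same head start still leaves 2 seconds of visible delay.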

Training for Voice Writers

Voice writers reach a maximum speed of 180 words per minute, significantly lower than what stenocaptioners can achieve. Consequently, paraphrasing and editing down are key techniques in voice writing training.

Ms. Parker describes stenocaptioners as being like athletes who can train hard and become much faster and more accurate in their output, even reaching speeds of 300 words per minute. Voice writers, Dr. Romero-Fresco explains, are constrained by the capabilities of the particular software they use, which requires them to keep their respeaking below a certain pace and to add pauses. Otherwise, they risk introducing recognition errors.
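To see why editing down matters, the speed ceiling can be turned into a rough estimate of how much of a fast speaker's output can be respoken word for word. The sketch below is a simplification: the 180 words-per-minute ceiling is the figure from the article, but the function name and the example speaker rates are hypothetical, and real respeaking also loses time to pauses and corrections.

```python
# Rough estimate of how much of the source speech fits under the
# respeaking ceiling. 180 wpm is the figure quoted in the article;
# the speaker rates used below are hypothetical examples.

MAX_RESPEAK_WPM = 180.0


def verbatim_fraction(source_wpm: float,
                      max_wpm: float = MAX_RESPEAK_WPM) -> float:
    """Fraction of the source words that can be respoken verbatim.
    Anything above the ceiling has to be paraphrased or edited down."""
    if source_wpm <= 0:
        raise ValueError("source_wpm must be positive")
    return min(1.0, max_wpm / source_wpm)


print(verbatim_fraction(150))  # 1.0
print(verbatim_fraction(240))  # 0.75
```

Under these assumptions, a speaker at 240 words per minute forces the voice writer to drop or condense about a quarter of the words.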

Training is provided one skill at a time. The first is learning to dictate, then to respeak, i.e. to dictate while listening. At a third stage, trainees are asked to process what’s happening in the on-screen images, observe the captioning errors they make, and decide if they need to correct them or not. Correcting an error is the most challenging task because, as Dr. Romero-Fresco describes, you are “correcting the past (already spoken soundtrack) as you keep listening to the present audio to respeak it in the immediate future – so we are dealing with three moments in time simultaneously.” On top of this, a voice writer can’t increase their speed because of the risk of introducing more errors, making it difficult to catch up.

Dr. Romero-Fresco has been training voice writers for years, both for same-language (intralingual) captioning and for translation into another language. “Training of intralingual voice writers can take as little as 3 months, especially if you have good candidates,” he explains. “The right skills for the job lie somewhere between subtitling and interpreting, the ability to talk while listening, good grammar, knowledge of world affairs, ability to cope with pressure, and awareness of deafness of the end users.”

In his current research on interlingual respeaking in several European languages, the voice writer not only has to do all of those things, but also needs to translate from one language to another at the same time. That adds a third dimension to the split attention that voice writers experience.

Reading and Seeing

Dr. Romero-Fresco notes that what is most important when it comes to live captioning is not just how much can be captioned but how much viewers can read while still being able to see the images.

For instance, an interesting discussion is taking place in Canada about whether play-by-play in sports needs to be captioned at all. Delivery is typically very fast, and the captions are so fast that the viewers end up reading a description of what’s being seen instead of being able to actually see it. In an ironic twist, he comments that “it’s almost like we are turning people who are deaf into blind.”

His research into a new, engagement-based, ability-driven approach instead focuses on captioners learning to look for solutions that make the most of the viewer’s abilities. That may mean captioning less and letting a viewer see and experience as much as possible in real-time.

Captioning’s Next Steps

When it comes to live captioning, clearly there is still progress to be made in facilitating the deaf/hard-of-hearing viewers’ engagement and giving them access on par with those who can hear. Fortunately, technological advances are enabling voice writers and broadcasters to move steadily toward that balance.

In our next article, we’ll continue our discussion with Dr. Romero-Fresco about live captioning quality, including his NER model for caption quality measurement and technology advances supporting automated captions. Stay tuned!


More about Dr. Pablo Romero-Fresco

Dr. Pablo Romero-Fresco is a distinguished, internationally known live captioning expert whose qualifications and credentials are among the best in the world. He has led many studies involving members of the deaf community, and advises regulators in various countries on caption standards. He developed, along with Juan Martínez, the NER model for measuring live captioning quality. He currently runs certification programs for NER evaluators (including deaf evaluators) who provide input that helps refine the model further. He also leads the EU-funded ILSA project on interlingual respeaking, and currently teaches at the University of Vigo in Spain, after several years at universities in the UK.

Beyond that outstanding resume, Dr. Romero-Fresco has another passion: filmmaking. He is heavily involved in accessible filmmaking, an effort to incorporate localization and accessibility ‘upstream’ in the planning and filming process. That facilitates subsequent localization and makes films accessible. He was recently invited to present this model to Netflix, which is beginning to integrate accessible filmmaking into its productions.


AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/ dialects, channels, domains and demographics.
