To this point in our Accessibility series, we’ve talked about the history of captioning and government policy around its adoption. Now it’s time to make it more personal. This post describes what captioning means in an experiential way for those who rely on it.
It’s a reality that many readers may never have considered. For those of us with intact hearing, talking to a friend is as simple as picking up the phone. We can instantly understand warnings about severe weather or other risks, ‘get’ the onscreen joke, follow the banter of sports commentary, or reach a 911 emergency operator the moment we need one. That’s not the case for those who are deaf or hard of hearing (HOH).
Featured Expert: Dr. Christian Vogler
To better understand, we’re honored to share perspectives from Dr. Christian Vogler, a distinguished expert on accessible technologies for the deaf and HOH and a member of the U.S. Federal Communications Commission (FCC) Disability Advisory Council. With a background in computer science, specifically in sign language automation, Dr. Vogler teaches communication studies at Gallaudet University, the premier American institution of learning, teaching and research for deaf/HOH students. He also heads Gallaudet’s highly innovative Technology Access Program, which conducts research on communication technologies and services for the deaf/HOH to promote equality in communications.
Growing up deaf in Germany, Dr. Vogler recalls being frustrated by the few technology options available for aiding deaf people, which were limited to the teletypewriter (TTY) for making calls and the fax machine, then an early precursor to text messaging. Offline subtitles were available for a limited number of broadcasts via teletext, the European counterpart to closed captioning. However, no technology existed for captioning live broadcasts, so subtitles for prime-time news were prepared in advance, resulting in a considerable mismatch with the live audio feed. This gap served as early motivation for his work.
By the time he relocated to the U.S. in the early 1990s, the internet and email were mainstream. The ability to communicate online was “the biggest equalizer” in the history of communication access. News and shopping websites were among the first Dr. Vogler remembers using, and he quickly adopted email for communication until instant messaging came about and allowed for more interactive conversations. Progress was propelled by continuing advances in technology, public demand and, first and foremost, legislation passed by the FCC, covered in our previous post. In his view, the 21st Century Communications and Video Accessibility Act of 2010 was probably the most important, as it laid the groundwork for the majority of captioning available today, even where such captioning is not mandated, because people take advantage of the mechanisms the Act put in place.
Aside from the appreciation Dr. Vogler expresses for being able to actively engage and communicate on a par with hearing viewers thanks to captioning, he also notes that several aspects of captioning output still frustrate deaf/HOH viewers, especially when it comes to live programming. Imagine not seeing an important game call until seconds after it happens, or not catching the punchline of a perfectly timed joke on your favorite sitcom until after the scene has changed.
Another challenge is caption accuracy, especially because viewers often cannot tell when captions are inaccurate. Speed is frequently at the root of the issue: people in broadcasts may speak much faster than captioners can output captions, so the spoken information must be summarized to an appropriate reading speed for viewers. A viewer who can’t hear has no way of knowing what was left out in that editing process, though even cursory lip reading can indicate that things aren’t lining up. Beyond viewer frustration, this can erode trust in the broadcaster.
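The speed problem can be made concrete with a little arithmetic. The sketch below flags captions whose display time is too short for their word count; the 180 wpm threshold is an illustrative, commonly cited comfort limit, not a regulatory figure:

```python
# Illustrative caption-rate check. The 180 wpm threshold is an example of a
# commonly cited comfortable reading speed, not an FCC-mandated value.
MAX_WPM = 180

def caption_rate_wpm(text: str, duration_seconds: float) -> float:
    """Reading speed (words per minute) implied by a caption's duration."""
    return len(text.split()) / (duration_seconds / 60.0)

def needs_editing(text: str, duration_seconds: float) -> bool:
    """True if the verbatim text would have to be summarized to stay readable."""
    return caption_rate_wpm(text, duration_seconds) > MAX_WPM

# Twelve words displayed for three seconds imply 240 wpm -- too fast to read,
# so a captioner would have to condense the wording.
caption = "this is a twelve word caption that stays on screen three seconds"
```

This is why verbatim live captions at broadcast speaking rates are often impossible without either summarizing or letting the captions fall further behind.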
Caption readability is an issue in itself. Caption presentation does not always follow a set rhythm: captions may appear faster or slower, on top of the delay inherent in live scenarios, which adds cognitive load for viewers who cannot read at a constant pace. Readability can be an issue for pre-recorded content as well. Aside from mistakes in the text, caption presentation can be uneven, with text bunched onto a single long line that is hard to read.
Caption placement can also be a challenge when captions are displayed over important visual elements on the screen, such as characters’ faces or live action in sporting events. While viewing would in theory be greatly enhanced by placing captions below the picture, current regulations are grounded in older technologies that didn’t allow for this. Modern ‘smart’ TVs offer extensive control over the appearance and placement of captions, but the standards would need to catch up with what is technically feasible today for this issue to be solved.
Dr. Vogler feels a lot of these challenges are rooted in inconsistent or poor quality control practices combined with a lack of understanding about how deaf/HOH people consume captions. The result is an ongoing tension between consumers, who find the quality of captioning unacceptable, and broadcasters who think it is fine – or that it at least meets FCC requirements. Dr. Vogler, together with his research partners at Gallaudet University, Rochester Institute of Technology and AppTek, is hoping to change that.
The 1996 Telecommunications Act focused on how much programming had to be captioned, but not on the quality of the captions themselves. Consumer advocacy groups representing the deaf/HOH filed a petition to correct that after the turn of the century, but the issue was not tackled by the FCC until much later, as change in policy always takes time. It was not until Commissioner Clyburn served as acting chairwoman of the FCC in 2013 that the agency really engaged in discussions about caption quality. As there wasn’t enough of a record at the time to justify choosing a metric to measure caption quality, a compromise was reached and the FCC mandated the “best practices” approach for closed captioning in its 2014 Caption Quality Order.
This approach did not result in good quality captions for consumers, however, with major issues identified in all approaches used in live captioning: the Electronic Newsroom Technique, human captioners, and Automatic Speech Recognition (ASR). As a result, in 2019 Dr. Vogler and Gallaudet University joined eight other leading associations representing the deaf to again petition the FCC for rulemaking on closed captioning quality metrics. In simple terms, the petition acknowledges that things are not working and that something needs to be done. Dr. Vogler believes that implementing reasonable quality metrics is non-negotiable for consumers. Change of this kind is a long-term undertaking, but the wheels have been set in motion.
Dr. Vogler also sees an opportunity with ASR technology, which was previously not good enough to be considered for captioning. It now performs surprisingly well in some cases but can also fail in others. The problem is that ASR previously fell completely outside the FCC’s “best practices” approach, with no practices set for it at all, yet some stations use it anyway without accountability or guidance, often resulting in very poor or incomprehensible captions. The petition aims to bring ASR into captioning provision in a controlled manner: at first under the “best practices” approach, until it is decided how to measure caption quality accurately.
As Dr. Vogler explains, advances in ASR do in fact show great potential and the case for it will be strengthened by continued collection of captioning metrics that provide measurable evidence of ASR’s progressing accuracy. Gallaudet University has been working closely with AppTek to collect such data. From a policy perspective, ASR could dramatically expand where captioning is required, because as technology advances it becomes cheaper and more feasible to adopt. For the near term, the Petition’s goal is to raise the bar on accountability, so caption quality and consumer experience improve.
Dr. Vogler also talks about other areas that would benefit from captioning services. User-generated content is a prime target. An average person uploading videos to YouTube or social media sites would not typically caption their videos; they likely would not even know how, or be aware of the need for captioning. Dr. Vogler envisions a future where technology gives end users the tools to add automated captions to any audio or video they play back on their device, custom-tailored to each user’s needs. This technology, however, is still in its infancy.
This is an area where technologies like ASR will help tremendously as they improve. We already see improvements in punctuation and in using context information so that names are recognized correctly. Dr. Vogler expects that automated caption formatting will also improve, so that captions are easier to read. This is indeed an area where AppTek has invested considerable work developing a neural net model for Intelligent Line Segmentation, so that caption formatting is based on syntax and semantics to make captions more easily readable.
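As a rough illustration of the idea behind syntax-aware segmentation (AppTek’s actual Intelligent Line Segmentation is a trained neural model; the heuristic below is a hypothetical simplification), consider splitting a caption into two lines by scoring every possible split point, rewarding breaks at syntactic boundaries and balanced line lengths:

```python
# Hypothetical, rule-based simplification of syntax-aware line splitting.
# AppTek's Intelligent Line Segmentation uses a neural model; here we just
# score every split point, preferring breaks after punctuation or before a
# conjunction, plus balanced line lengths. 32 chars/line is a common limit.
CONJUNCTIONS = {"and", "but", "or", "because", "so"}

def split_two_lines(text: str, max_chars: int = 32):
    words = text.split()
    best, best_score = None, float("inf")
    for i in range(1, len(words)):
        top, bottom = " ".join(words[:i]), " ".join(words[i:])
        if len(top) > max_chars or len(bottom) > max_chars:
            continue  # a line over the limit is never acceptable
        score = abs(len(top) - len(bottom))  # prefer balanced lines
        if words[i - 1].endswith((",", ";", ":")) or words[i].lower() in CONJUNCTIONS:
            score -= 10  # reward a break at a syntactic boundary
        if score < best_score:
            best, best_score = (top, bottom), score
    return best

# A purely length-based wrap would cut inside the phrase "the dog"; scoring
# the clause boundary keeps each line a coherent syntactic unit.
lines = split_two_lines("he opened the door, and the dog ran into the yard")
```

The neural approach generalizes this by learning which break points read naturally, instead of relying on a fixed punctuation-and-conjunction list.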
A more complex hurdle to true quality is that the full context of the audio must be interpreted to correctly convey the experience, mood and meaning of a video. This is part of the decisions a captioner needs to make, such as how to treat overlapping conversation, or how to describe sound effects: just what is ‘scary music’? Today, this requires humans making decisions. Dr. Vogler believes that eventually machine learning, as well as novel ways to represent sound visually and tactilely, will be able to help there as well.
Podcasts are another popular form of content ripe for quality captioning, though there is currently no legal requirement to caption them. Because most are recorded with high-quality audio, auto-captioning technology can work well on those files. And for any professional media distribution, Dr. Vogler explains, the ASR output could be polished by a human in a hybrid workflow that is less costly and more efficient than fully manual captioning, thus expanding the market for captioning.
There are also webinars and live conference presentations, and even online communications. It’s not always possible to have a sign language interpreter available, especially for last-minute events, unless it is for telecom-based conversations through video relay services. These, however, are not widely available outside the USA. This is an area where ASR could be especially helpful.
Improving technology will make the entire captioning market bigger, less expensive and more efficient. Dr. Vogler points out that as technology makes new things possible, the entire captioning production workflow needs to be re-examined to assess where improvements and efficiencies can be gained. That might well involve adopting a hybrid approach that uses humans and technology in new ways to produce a better output.
Any successful future for captioning requires both a strong vision of where deaf/HOH consumers want captioning to be and their buy-in on new product decisions. Rather than getting stuck in traditional methods, Dr. Vogler encourages the entire captioning industry to apply more imagination in making the possible future into reality. As he hopefully puts it, “People need to be open to completely new ways of doing captioning. We should dare to dream a little bit.”
Our next post will include insights from a recent discussion with Dr. Pablo Romero-Fresco, live captioning expert and author of the NER metric for caption quality evaluation. Later in our series, we will look deeper into ASR for captioning, including suggested best practices for optimal performance. Check back soon!
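For context, the NER metric, as commonly described, scores live captions as (N − E − R) / N × 100, where N is the number of words in the captions, E counts edition errors and R counts recognition errors (typically weighted by severity), with 98% often cited as the acceptability threshold. A minimal sketch:

```python
# Minimal sketch of the NER caption-accuracy formula as commonly described:
# accuracy = (N - E - R) / N * 100, where N is the caption word count,
# E the (severity-weighted) edition error points and R the recognition
# error points. 98% is the often-cited acceptability threshold.
ACCEPTABLE = 98.0

def ner_score(n_words: int, edition_errors: float, recognition_errors: float) -> float:
    return (n_words - edition_errors - recognition_errors) / n_words * 100.0

# 500 words with 4 edition and 3 recognition error points -> 98.6, a pass.
score = ner_score(500, 4, 3)
```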
AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.