Commercial applications for Automatic Speech Recognition (ASR) have been around since 1990. The technology was bound to achieve significant success with typists, who no longer needed to rely on stenography to produce lengthy legal or medical transcripts. From simple beginnings, ASR’s capabilities continued to advance throughout the ’90s, eventually allowing for the use of naturally flowing speech – an exciting and significant leap forward.
Early commercial speech recognition systems were used in voice writing workflows for the production of live closed captions. Such speaker-dependent systems required setting up a user profile to train the software to adapt its acoustic model to the user’s voice.
But further developments in ASR technology over the past decade, combined with the availability of huge amounts of training data, made it possible to train speaker-independent systems—the output of which was so good that speaker adaptation was no longer necessary. Speaker-independent systems were consequently rolled out, able to recognize a wide variety of voices and accents, and better able to handle noise. Some of today’s ASR systems are speaker-adaptive, i.e. they are speaker-independent at core, but can adapt automatically to a speaker with a fraction of user training data, producing much lower error rates.
With the latest generation of ASR tools at their command, today’s voice writers can achieve accuracy levels previously attained by stenocaptioners alone, i.e. above the 98% quality threshold required for live captions, and in some cases even above 99% – a testament both to the voice writer’s skills and to the improvement of the ASR technology. In fact, ASR quality has improved so drastically in the last decade, with the wide-scale adoption of deep neural networks, that the technology is already deployed without a human in the loop as a fail-safe in live captioning workflows when a real-time captioner is not available. It is also offered as a regular part of captioning services, whether to meet the growing accessibility demands created by the mass proliferation of streamed meetings since the Covid-19 outbreak, or to serve smaller television stations and local programming where human-in-the-loop captioning is cost-prohibitive.
When deploying ASR in live captioning, it becomes clear that, although dialogue recognition is the primary task, there are other elements to captioning that also need to be taken into account and addressed if the process is to be successful.
Closed captioning is a service designed for deaf and hard-of-hearing (HOH) viewers and, as such, non-speech elements also need to be included in captions, such as descriptions of noises that are not ‘visible’, as well as speaker IDs. Speaker diarization and identification technology can be implemented to address the latter, though accuracy issues can be significant in the case of similar voices or very short interjections.
Caption latency is another issue, as it impacts a viewer’s ability to process the information in the captions. The average latency with current voice writing workflows in broadcast live captions in the US, UK or Canada is 4-7 seconds.
With state-of-the-art ASR systems running on powerful GPU processors, this seems to be a problem already addressed by AppTek, who report an overall latency as low as 2-3 seconds, depending on the difficulty of the material. “By applying a running asymmetric window over the sequence, we are able to run an efficient processing pipeline while making selective use of future context information, and we continue to improve our systems by researching variants of this approach”, Pavel Golik, AppTek’s Senior Speech Scientist, explains.
Traditional methods of measuring ASR quality, such as the word error rate (WER) widely used in the scientific community, focus on the spoken word sequence rather than on punctuation and capitalization, since the latter are not spoken and must be measured by other means. However, both are significant error types in captioning, as they affect how viewers process the captions, and they are addressed by more recent, manual metrics such as NER.
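As a point of reference, WER is simply the word-level edit distance between an ASR hypothesis and a reference transcript, normalized by the reference length. A minimal Python sketch (illustrative only, not any particular toolkit’s implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    word count, computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6 (~16.7%)
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that, as discussed above, a transcript with a low WER can still make for poor captions if punctuation, capitalization, or speaker labels are wrong, which is why caption-specific metrics exist.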
The live captioning tradition in the USA involves the use of all capitals, which simplifies things for ASR output, but recognizing and restoring punctuation is not as easy as it may sound. ASR punctuation models predict only full stops, commas, question and exclamation marks. These tend to work pretty well with scripted speech, as in broadcast news. However, “in spontaneous, emotional speech, sentences often become loose phrases and regular punctuation rules do not apply; try asking any three people to insert punctuation in the transcript of a Jerry Springer show, and there is bound to be a high level of disagreement”, Pavel Golik explains.
He and his group have integrated punctuation and capitalization models for 23 different languages, trained on several billion sentences for the high-resource languages. For each language, the company offers a fast model for real-time captioning and a slower, more accurate model for offline processing. The performance of a punctuation model is measured in terms of the F1 score (ranging from 0 to 1, with 1 being best), the harmonic mean of the model’s precision and recall on test data. The streaming punctuation model achieves an impressive F1 score of 0.96 on news data and 0.90 on subtitle output, which is more challenging. The offline model achieves improved results of 0.97 and 0.92 for news and subtitles respectively. Although it is hard to compare against the performance of other research groups on punctuation restoration – a small research field with no standardized training or evaluation data sets – we can safely say that AppTek’s punctuation performance on captions and subtitles is among the best in the market.
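For readers unfamiliar with the metric, F1 combines precision (how many of the predicted punctuation marks are correct) and recall (how many of the true marks are found) as their harmonic mean. A small illustrative computation with made-up counts, not AppTek’s evaluation data:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.
    tp: true positives, fp: false positives, fn: false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A hypothetical punctuation model that correctly predicts 90 marks,
# inserts 5 spurious ones, and misses 10 gives an F1 of about 0.923:
print(round(f1_score(tp=90, fp=5, fn=10), 3))
```

The harmonic mean matters: a model that over-punctuates aggressively (high recall, low precision) or barely punctuates at all (high precision, low recall) is penalized more than a simple average would suggest.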
Finally, caption presentation also affects viewer processing time. In real-time scenarios this is handled by the use of a black box in which the captions appear in a scrolling mode until the caption lines fill up or the speaker changes. Offline or pop-on captions, however, are formatted in blocks, with line and caption breaks according to the pace of the program, as well as syntax and semantics.
Until recently, the mainstream method for displaying ASR output in caption format relied primarily on speaker pauses for line and caption segmentation. In 2019, AppTek developed a patent-pending neural-net system for the Intelligent Line Segmentation of offline subtitles and captions, so that the ASR output matches professional choices more closely. The system also respects hard constraints in subtitling, such as the number of lines and the number of characters per line, as well as minimum and maximum subtitle durations, all of which can be configured to the client’s preference. This innovation has drawn highly positive feedback from professional subtitlers and industry executives alike.
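To make the hard constraints concrete, here is a deliberately naive, greedy line-packing sketch. It enforces only the character and line limits; a neural segmentation system such as AppTek’s additionally scores break points for syntax and semantics, which a greedy packer cannot do. The default limits below are illustrative, not the company’s settings.

```python
def segment_lines(words, max_chars=42, max_lines=2):
    """Greedily pack words into caption lines, respecting two hard
    constraints: characters per line and lines per subtitle.
    Returns a list of subtitles, each a list of line strings."""
    subtitles, lines, current = [], [], ""
    for word in words:
        candidate = (current + " " + word).strip()
        if len(candidate) <= max_chars:
            current = candidate          # word still fits on this line
        else:
            lines.append(current)        # close the current line
            current = word
            if len(lines) == max_lines:  # subtitle full: start a new one
                subtitles.append(lines)
                lines = []
    if current:
        lines.append(current)
    if lines:
        subtitles.append(lines)
    return subtitles

text = "this is a simple example of greedy caption line segmentation for live output"
for subtitle in segment_lines(text.split(), max_chars=20, max_lines=2):
    print(subtitle)
```

A break placed this way can easily split a noun from its article or a verb from its object, which is exactly the kind of unprofessional-looking segmentation that syntax-aware models are trained to avoid.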
ASR is not only used in live contexts, but also as a first pass in transcribing audio offline, where a post-editing workflow ensures same-language captions are 100% accurate. The availability of word confidence scores is one ASR feature that can prove extremely valuable in such cases: it allows captioners to pay particular attention to the chunks of text the ASR is least confident about, or to hide those chunks from view altogether, boosting post-editing productivity by sparing them from dealing with substandard content.
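As a simple illustration of how word confidence scores can drive such a post-editing workflow, the sketch below flags words that fall below a chosen threshold for review. The (word, score) tuple format and the 0.85 threshold are hypothetical; real ASR APIs expose confidence in various shapes.

```python
def flag_low_confidence(scored_words, threshold=0.85):
    """Mark words whose ASR confidence falls below a threshold so a
    post-editor can review (or hide) them.
    scored_words: list of (word, confidence) tuples, confidence in [0, 1]."""
    return [(word, round(conf, 2), conf < threshold)
            for word, conf in scored_words]

# Hypothetical ASR output with per-word confidence scores:
asr_output = [("the", 0.99), ("defendant", 0.97),
              ("pled", 0.62), ("guilty", 0.95)]

for word, conf, needs_review in flag_low_confidence(asr_output):
    marker = "?" if needs_review else " "
    print(f"{marker} {word} ({conf})")
```

In a real editor the flagged spans would be highlighted or collapsed rather than printed, but the principle is the same: direct scarce human attention to where the machine is least sure.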
Other applications of ASR in captioning include aligning existing scripts to the audio or video to automatically generate a caption file; aligning existing caption files to video for QA purposes; or aligning a caption file to a new cut of a film. The latter can prove particularly useful in workflows involving preliminary material that needs to be updated to new or final videos.
The video below shows examples of our offline ASR output adapted for captioning and subtitling purposes across a variety of content in different languages. Captions and subtitles created by professionals are also provided to facilitate comparison.
As audiovisual content continues to expand, it is difficult to scale its localization without help from technology. ASR may not be perfect yet, but it offers significant value in enabling accessibility faster across an ocean of content. With ASR making headway in ever more applications, people are looking at ways of furthering the state of the art with domain customization and client specialization to boost its quality.
In our next and final blog post on accessibility, we discuss open issues in ASR research and the developments that are taking place to overcome domain-specific stumbling blocks and solve hard problems. We thus hope to outline what the future holds for ASR technology and what we can expect of it with respect to captioning.
AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.