What Does the Future Hold for Automatic Speech Recognition? | Accessibility Series Part 8

July 28, 2020

Technology, like art, is a soaring exercise of the human imagination.

Daniel Bell, American Sociologist

Throughout our closed captioning discussion these last few months, we’ve explored many issues and opportunities that this exciting domain presents for enhancing accessibility and equality. From breakthrough research and development efforts at Gallaudet University, to Dr. Pablo Romero-Fresco’s NER model for assessing captioning quality, to the expanding capabilities of Automatic Speech Recognition (ASR) technology, progress in captioning technology has been extraordinary in recent years.

Right now, amidst modern advancements in artificial intelligence and machine learning, it’s possible to envision a not-so-distant future where fluent voice interaction with connected devices will be a matter of course; we’ll forget there was ever an era when we had to go shopping for milk because the fridge didn’t order it by itself. Additionally, next-generation neural machine translation and text-to-speech technologies, along with advancements in end-to-end speech translation, will lead to an unprecedented future of multilingual accessibility. That future will go well beyond the simple captioning of spoken words to full translation of what is spoken, dubbed on-screen and reproduced in the original speaker’s voice (think of a soccer announcer yelling “GOOAL!” in the same voice, energy and tone simultaneously across hundreds of languages).

In the meantime, the pandemic we are all living through is changing everything: our daily routines, the way businesses work, even the very fabric of our societies. The move to virtual workflows, workspaces and places has been rapidly forced on everyone, and its effect is bound to be lasting. The need for instant access to accurate information from any possible source has never been greater.

When originally drafting this article in the pre-COVID world, we were asking ourselves, “What will it take for us to use conversational interfaces in more than just a limited number of products, such as customer service calls and voice searches? What will it take for the accessibility barrier to be lowered so much that it is almost non-existent for deaf and hard-of-hearing individuals, and for us to live in a truly democratic society?”

Now we know what it will take. Now there’s no excuse for the lack of accessibility, or the lack of inclusivity, because it’s no longer just the deaf and hard of hearing who need accessible solutions. We all do, as our lives have become more constrained in the physical space. As communication becomes increasingly digital, ASR is becoming the enabler at the dawn of a fully online and connected society.

So what can we predict about the future on the basis of already existing ASR applications in captioning?

The Hard Stuff

One of the more difficult tasks for ASR is actually one of the simplest tasks for humans to perform, as the human brain is the best speech decoder to date: it knows how to recognize and ignore ambiguities and artefacts such as false starts, repetitions, filler words, hesitations, speech impediments—all the things that we encounter in spoken language. As Dr. Pablo Romero-Fresco points out during his interview, “when we talk, we don’t always make as much sense as we would like.” If we are focused on captions being verbatim, the output might be difficult to comprehend at times. “A little bit of tidying up that captioners do, which doesn’t mean censorship or manipulation, can help a lot in getting the meaning across.”

When training an ASR system to perform more like a human, training data can help simulate this to an extent by ignoring noisy speech attributes. For instance, transcribers typically transcribe audio by omitting such fillers or editing down repetitions, such as ‘Wait, wait, wait!’ to a single ‘Wait!’ Training with such data teaches the ASR software that it is OK to not output anything when it hears filler words.
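The kind of transcript clean-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the filler list and the `clean_transcript` helper are hypothetical, and real transcription guidelines are considerably more nuanced.

```python
import string

# Hypothetical filler list; a real project would curate one per language.
FILLERS = {"um", "uh", "er", "erm", "hmm"}

def clean_transcript(text: str) -> str:
    """Drop filler words and collapse immediate word repetitions."""
    kept = []         # tokens as they will appear in the output
    prev_core = None  # lowercase, punctuation-free form of the last kept token
    for token in text.split():
        core = token.strip(string.punctuation).lower()
        if core in FILLERS:
            continue  # the filler is simply not transcribed
        if kept and core and core == prev_core:
            # Repetition: keep the first word form, take the latest punctuation.
            word = kept[-1].rstrip(string.punctuation)
            tail = token[len(token.rstrip(string.punctuation)):]
            kept[-1] = word + tail
            continue
        kept.append(token)
        prev_core = core
    return " ".join(kept)

print(clean_transcript("Wait, wait, wait!"))           # Wait!
print(clean_transcript("Um I think uh we should go"))  # I think we should go
```

Note that this naive rule also collapses legitimate repetitions (“very very”) and does not repair punctuation left behind by a removed filler, which is one reason human judgment still matters in preparing training data.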

Another issue in ASR is new vocabulary. Humans can recognize a new word as such from the context, and even understand what it means without having heard it before. This is not the case with ASR: the system will try to match the sound with a word it has seen previously, even if that word makes no sense in the given context.

Proper names are often part of such new vocabulary and have always been a particular issue in captioning, irrespective of the workflow used in production. Working on vocabulary lists is part of the preparation both steno and voice writer captioners do before going on air. In the case of automated captions, making vocabulary updates is one of the essential techniques employed to reduce out-of-vocabulary (OOV) items in ASR. Integration of glossaries and terminology is one of the standard requirements of end clients, as is the case in any localization workflow.

This technique is particularly important when it comes to entertainment content, where fictional words are more common. ASR systems already use lexicons comprising a very large number of words, typically 250k to 500k, which is probably tenfold the number of words a human would be able to learn in a lifetime.

As a result, OOV items in ASR output tend to be a very small percentage (approximately 1%) of the total words in a file. However, this can easily increase to 2%-3% in the case of a film like Harry Potter, which includes many made-up words that are essential to understanding the plot. By including glossaries of such words and providing reasonable probabilities for them in the language model, the augmented recognition vocabulary is able to overcome this issue to a large extent.
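To make the OOV percentage concrete, here is a minimal sketch of how such a rate could be measured before and after adding a glossary. The tiny lexicon, the sample sentence and the `oov_rate` helper are all made up for illustration; the sketch also leaves out the language model probabilities mentioned above and only checks vocabulary coverage.

```python
def oov_rate(words, lexicon):
    """Return the fraction of tokens that are not covered by the lexicon."""
    tokens = [w.lower() for w in words]
    if not tokens:
        return 0.0
    missing = sum(1 for w in tokens if w not in lexicon)
    return missing / len(tokens)

# Toy data, far smaller than a real 250k-500k word lexicon.
lexicon = {"harry", "caught", "the", "ball", "during", "match"}
glossary = {"quaffle", "quidditch"}
transcript = "Harry caught the Quaffle during the Quidditch match".split()

before = oov_rate(transcript, lexicon)            # 2 of 8 tokens are unknown
after = oov_rate(transcript, lexicon | glossary)  # the glossary covers both

print(f"OOV rate before glossary: {before:.1%}")  # 25.0%
print(f"OOV rate after glossary:  {after:.1%}")   # 0.0%
```

The toy rate is far higher than the 1% to 3% quoted above simply because the sample is a single sentence; the calculation itself is the same on a full file.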

A better and more complete solution to the never-ending and laborious process of vocabulary updates, however, may come from a recent development that has taken ASR research labs by storm: neural attention models, which fold all ASR components (acoustic model, language model, and search) into one. Such models can do away with the OOV issue, as they do not work on full words, but on a rather small (5k-10k) list of ‘subwords.’ Pavel Golik, AppTek’s Senior Speech Scientist, explains that they “consist of fragments that are not linguistically defined but rather a statistically-motivated factorization of words.”

This allows attention models to compose new words that they have never ‘heard’ in the training data by using a sequence of subwords. Such a model would be able to spell ‘Louisville’ correctly, even if the city name was never in its training data. It could even spell the made-up German city of ‘Frankenburg’, because it would have ‘heard’ of other cities like Frankfurt and Augsburg.
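The idea that an unseen word can be assembled from known fragments can be illustrated with a toy segmentation routine. This is a sketch under stated assumptions, not how an attention model works internally: the subword inventory below is hand-picked rather than learned statistically, and `segment` is a hypothetical greedy longest-match helper.

```python
def segment(word, subwords):
    """Greedily split `word` into the longest known subwords, left to right."""
    word = word.lower()
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces

# Hand-picked fragments, as if derived from 'Frankfurt', 'Augsburg' and others.
subwords = {"frank", "furt", "augs", "burg", "en"}

print(segment("Frankfurt", subwords))    # ['frank', 'furt']
print(segment("Frankenburg", subwords))  # ['frank', 'en', 'burg']
```

A greedy split can fail where another split would succeed, and practical subword inventories usually include single characters so that every word can be covered.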

Equally confusing to ASR are cases of crosstalk, emotional speech, dialects and idiolects, children’s speech and so on. We have all seen ASR quality drop in the case of speaker overlap in talk shows, while even excellent ASR models struggle with non-standard or non-native speech. In some languages, there are also words typically used only in spoken utterances, with no standard spelling, such as in Arabic where there is a lot of regional slang. Swiss German is another great example, as it is used orally, while Standard German is used in the written form. That means a training corpus for a Swiss German ASR system is virtually non-existent to date.

In such scenarios, the older generation of Hidden Markov Model (HMM)-based ASR systems will try to output a word that is as close as possible phonetically, irrespective of whether the word makes sense in the context of the utterance. This can result in slang words like ‘fo shizzle’ being output as ‘fascism,’ creating a confusing, humorous, or even offensive and politically-loaded, headline-making output. Non-native accents and dialectal variations in the pronunciation of words are also better dealt with by attention models, which are able to spell ‘Spiel’ correctly even if the Bavarians pronounce it ‘Schbui.’

A particular case of this issue is code-switching, which is becoming increasingly common in our globalized world. Speakers of one language will incorporate foreign words into their speech and inflect them according to the rules of their own grammar, for instance by adding prefixes and suffixes. This happens a lot when Indian or Malay speakers, who are naturally used to speaking multiple languages, speak English. It also occurs in families with an immigration history, where bilingualism is a common trait; the amount of code-switching depends on the conversation topic and the speakers involved.

A common approach to overcoming this issue is to include the most frequent loan words in the ASR vocabulary, so that phrases like the German ‘Ich muss meinen Browser updaten’ (i.e. I need to update my browser) do not become ‘Brause abgehen’ (i.e. to take a shower). Attention models, however, will try to improvise on the basis of phonetics, much like professional transcribers do with words they’ve never heard before or which have no standard spelling. This makes multi-dialect or even multilingual models possible, with obvious applications such as captioning multilingual films.

The Next ASR Milestone?

Scientists are already working hard to improve on the state of the art in ASR. Five hundred hours of user-generated content are uploaded to YouTube every minute, so the amount of untranscribed audio data available is astronomically large. Leveraging the millions of hours of such untranscribed audio could provide the key to the next stage in ASR’s evolution.

Let’s look at the math. If a child is exposed to, say, 6 hours of speech per day, then in 6 years they will have been exposed to roughly 13,000 hours of speech. That is on a par with the amount of audio used to train a production ASR model. Over a lifetime, the same person will probably have been exposed to approximately 10 times more audio data. By extension we could say that the current ASR systems on the market perform like a six-year-old and, in theory, by leveraging the millions of hours of speech data on the web to train them, there will be nothing they won’t have ‘heard’ of before.
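The back-of-the-envelope figure is easy to check. The short calculation below uses only the numbers given in the text (6 hours a day, 6 years, a lifetime factor of about 10) and assumes 365 days per year.

```python
HOURS_PER_DAY = 6
DAYS_PER_YEAR = 365  # assumption: leap days ignored
YEARS = 6
LIFETIME_FACTOR = 10  # "approximately 10 times more" over a lifetime

hours_by_age_six = HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
lifetime_hours = LIFETIME_FACTOR * hours_by_age_six

print(hours_by_age_six)  # 13140, i.e. roughly 13,000 hours
print(lifetime_hours)    # 131400
```

Even the lifetime figure is small next to the millions of hours of untranscribed audio available online, which is the point of the comparison.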

Low-resource languages are the biggest challenge, as there just isn’t enough media content available to train ASR models on them. Small regional languages like Maltese and Estonian only have a very small number of speakers from a certain geographical area. On the other hand, African languages often cover large geographical areas and are actively evolving. But there is sometimes not enough written data for them, or a formal grammar that is used consistently, nor is there enough media. Some spoken languages, like Yiddish, Romani, and other lesser-known ones, are unfortunately doomed to become more of a second language to native speakers, making the amount of content in such languages much smaller.

The Future of ASR in Captioning

We have already seen that offline ASR is more accurate than streaming (live) models. That is because the acoustic model can access the future context (the next few seconds) and the decoder can postpone its decision until it reaches the end of an utterance. AppTek provides an efficient processing pipeline for streaming models as well, where selective use of future context is made. By giving the ASR some buffer time, more processing time becomes available, so streaming ASR can get closer to the quality of offline ASR. This is currently being done by some broadcasters for live captioning workflows that involve a human in the loop.
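The trade-off between latency and future context can be illustrated with a small buffering sketch. It is not AppTek’s pipeline; `with_lookahead` is a hypothetical helper that delays each audio frame until a fixed number of later frames has arrived, so that a downstream model could look at them before deciding.

```python
from collections import deque

def with_lookahead(frames, lookahead):
    """Yield (frame, future) pairs; `future` holds up to `lookahead` later frames."""
    buffer = deque()
    for frame in frames:
        buffer.append(frame)
        if len(buffer) > lookahead:
            current = buffer.popleft()
            yield current, list(buffer)
    # End of stream: flush what is left with a shrinking future context.
    while buffer:
        current = buffer.popleft()
        yield current, list(buffer)

# Frames are plain integers here; in practice they would be audio features.
for frame, future in with_lookahead([1, 2, 3, 4], lookahead=2):
    print(frame, future)
# 1 [2, 3]
# 2 [3, 4]
# 3 [4]
# 4 []
```

With `lookahead=0` every frame is emitted immediately with no future context, which corresponds to the lowest-latency streaming case; larger values add delay in exchange for more context.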

One of the most problematic issues in live automated captioning that can hinder understanding is appropriate speaker diarization. As latency is currently unavoidable in live captioning, the image cannot always be used as a clue by deaf and hard-of-hearing viewers to alert them that the speaker has changed. As a result, speaker changes that are not indicated properly in the captions can cause misunderstanding of the dialogue. That can become more severe when dialogue recognition is also not perfect.

Using the image in combination with the audio input can, however, provide the key to perfect speaker diarization. That would have a direct impact on the accuracy of the ASR output as well. One of the most confusing errors the ASR can make in captioning applications is when it combines two speakers’ speech into the same caption.

To overcome that, a machine learning model for accurate speaker diarization can be trained by utilizing both the visual and auditory signals of the input video. Such models allow for separating the voice of a single person in a multi-participant conversation, with all other voices and background sounds suppressed. This solution might have interesting applications in dubbing workflows as well. Single-channel separation can also prove particularly handy in contexts like conference calls, which have seen a tremendous increase in recent weeks and months. All major conferencing tool companies are now looking to implement captioning functionalities in their platforms to better serve their end users.
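Once speaker labels are available, keeping speakers apart in the captions is mostly bookkeeping. The sketch below assumes a diarization step has already tagged each recognized word with a speaker; the `(speaker, word)` input format and the `captions_by_speaker` helper are illustrative assumptions, not part of any particular product.

```python
from itertools import groupby

def captions_by_speaker(words):
    """Group consecutive (speaker, word) pairs so no caption mixes speakers."""
    return [
        (speaker, " ".join(word for _, word in group))
        for speaker, group in groupby(words, key=lambda pair: pair[0])
    ]

words = [
    ("A", "Are"), ("A", "you"), ("A", "ready?"),
    ("B", "Almost."),
    ("A", "Hurry!"),
]

for speaker, text in captions_by_speaker(words):
    print(f"[{speaker}] {text}")
# [A] Are you ready?
# [B] Almost.
# [A] Hurry!
```

The hard part remains producing reliable speaker labels in the first place, which is where combining the visual and auditory signals comes in.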

Aside from the issues described above, more is still required of ASR when it comes to captioning. Audio scene understanding is needed for two main reasons: so that non-verbal sounds are not confused with actual speech (i.e., you do not end up with ‘Ja, ja!’ as a caption in a German documentary showing birds squawking); and also so that non-verbal audio information can be retrieved and utilized as needed. Sounds in a film, the source of which is not visible, can then be described in captions for the deaf and hard of hearing, aiding their overall understanding of what is going on (e.g. “romantic music plays,” “cat meows,” “sirens wailing,” etc.).

Aside from being a very important source of information regarding sounds, images in a video stream can also be used to provide context so that ASR can make better decisions. This is one of the latest approaches, called multi-modal ASR. In this case, video and audio cues are combined to inform the ASR output. For example, police sirens in the background could provide the necessary context for the accurate recognition of the phrase “we got to run.” Pavel Golik again notes that “We expect scene understanding to be an area of ASR focus for most research labs in the coming years.” This can be a very exciting opportunity for ASR companies that have established close partnerships with others specializing in image recognition and emotion detection. For example, AppTek has such a relationship with iDenTV.


With scientists solving one challenging problem after another, ASR technology is continuing to evolve at an impressive pace. In the meantime, we should begin looking for new workflows that take greater advantage of the potential the technology offers today, and be open to new ways of producing captions, as Dr. Vogler points out. This could mean, for instance, using a human in the loop to correct fully automated ASR output in real time, as Dr. Pablo Romero-Fresco suggests, but it could also mean new use cases for the technology altogether.

What if we could do away with taking minutes and action items in the multiple conference calls we participate in every day and let a machine do this for us? Or be able to mute background noise coming from the other end of the line during a call? The possibilities and efficiencies offered by next-generation AI technologies are endless.

The simple truth is that communication remains one of the most vital components of human interaction and social development. As we continue to progress through these next generations of artificial intelligence technologies, the true markers of success will be the advancement of our own humanity, and the benefits to the individuals within our society regardless of language or ability to openly communicate.

AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry.   Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.

Copyright 2020 AppTek