Automatic Speech Recognition | Accessibility Series Part 6

March 12, 2020

Speech recognition is the technical process in which a spoken audio sequence (human speech) is automatically transformed to its underlying word sequence. It is also known as automatic speech recognition (ASR), voice recognition or speech to text (STT).

– Dr. Volker Steinbiss, Managing Director, AppTek GmbH

Automatic Speech Recognition (ASR) is increasingly present in our lives. Using natural language as a command to action has long captured our imagination and has been dramatized in films. People have always dreamed of intelligent robots that can understand and produce speech, such as HAL in 2001: A Space Odyssey (1968), C-3PO in Star Wars (1977) and KITT in Knight Rider (1982).

Half a century later we live in an era where such technological dreams are an everyday reality: voice-enabled devices field our calls to customer support teams; we dictate messages to our phones rather than type them; we switch the lights in our homes on and off with our voices, or use our voice to play our favorite songs. With over half of all digital search likely to be voice- and image-based within a handful of years, it can be easy to take all of this for granted and forget how much research has gone into making complex tasks such as voice search a reality.

The history of ASR science

The technology that makes all this possible dates back to 1952 at Bell Laboratories, where the first speech recognition system was built to recognize the digits 0 to 9. A decade later, an IBM system was trained to recognize a total of 16 words. Accelerated progress came in the 1970s, when the US Department of Defense funded DARPA’s Speech Understanding Research program, as it saw the value of speech recognition applications in the military. ‘Harpy’ was one of the program’s results, a recognition engine that could recognize just over 1,000 words.

But it was not until IBM pioneered the use of the Hidden Markov Model (HMM) for speech recognition in the 1970s that the technology was able to reach the benchmark of recognizing thousands of words, and developed the potential to understand many more. It was then that ASR methodology shifted from pattern matching to a probabilistic model. The latter made educated guesses on the basis of knowledge about syntactic and semantic dependencies in language, which it derived by automatically processing vast amounts of text.

The HMM framework has dominated the field since it first appeared. AppTek’s Senior Speech Scientist, Pavel Golik, explains that ASR can technically be described as a function that maps an audio signal to a sequence of written words – aiming at the correct underlying word sequence and trying to make as few errors as possible. If we formulate this in appropriate mathematical language, the problem changes into the form: "Given an audio snippet, what is the most likely word sequence that has been uttered?" At this point, everything – the decision, the learning of pronunciations or typical word sequences from data – is described with probabilities, and we can apply mathematical methods to well-defined pieces with which we know how to deal.

This yields two components for recognizing speech. One contains knowledge on how words and sentences can sound (the acoustic model) and the other contains knowledge about language structure and how typical a sentence is (the language model).
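In the standard textbook notation (the symbols here are our own choice, not taken from the interview), the question "what is the most likely word sequence?" can be written as a decision rule over candidate word sequences W given the acoustic observations X:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \; P(W \mid X)
        \;=\; \operatorname*{arg\,max}_{W} \;
              \underbrace{P(X \mid W)}_{\text{acoustic model}} \;
              \underbrace{P(W)}_{\text{language model}}
```

The second equality follows from Bayes' rule; the term P(X) is dropped because it does not depend on W and therefore does not change which word sequence wins.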

The two can be modeled in different ways. The state-of-the-art approach is to use artificial neural networks. While neural networks have been around since the 1950s, it wasn’t until the 1990s that we learned to use their power for efficiently modelling the two ASR components. It was only a few years ago that they made a big performance leap and were widely adopted in the speech recognition community.

The power of Deep Neural Networks

Deep Neural Networks (DNNs) require vast amounts of processing power, so it makes sense that their applications only came about recently. GPU performance has grown exponentially over the last few years, in parallel with the Nvidia stock price, and disk space has become virtually free. With significantly more computational power at our disposal, more training data and larger models can be used, which offer higher quality output.

But of course, improvements in hardware are only half the picture. The credit is due to the community of thousands of ASR scientists who contributed smart improvements to the way the computations are performed and how the parameters of the DNN models are trained. Two other factors that helped were the availability of more data (also fuelled by fast hardware and cheap storage solutions) and the lower entrance barrier resulting from publicly available neural net tools and platforms. Cloud computing made it much simpler to scale up, which allowed for massive deployments of such systems commercially.

The quality of current ASR output is reported to be at approximately 5% word error rate (WER) for US English in conversational speech recognition. Many research teams achieve great results on two difficult tasks (Switchboard and LibriSpeech) used for evaluation purposes. However, it is useful to understand the details behind these evaluation methods. For example, the test sets come pre-segmented by humans, which is not a real-world condition. Also, the 5% WER reported is an average point estimate and does not reflect the full distribution over a wide variety of recordings, which is what one has to deal with in the marketplace.
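For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name `word_error_rate` is our own, and real evaluation tools also normalize casing, punctuation and numbers before scoring, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("switch the lights on", "switch the light on"))  # 0.25
```

Scoring each recording separately with such a function, rather than reporting a single average, is one way to see the spread across different kinds of content.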

In media and entertainment, for instance, ASR is applied to a wide range of videos such as news, sports and weather, talk shows, cooking shows, films, series, specialized documentaries, trailers, and anything that is broadcast. Input audio may thus vary from heavily scripted to highly unscripted speech, and from a single talking head narrator to multiple speakers with a lot of overlap and background noise.

As a result, the quality of ASR varies widely depending on the input content, and most ASR systems are trained on news data and perform best on it. In other words, media and entertainment is a vertical encompassing a wide range of domains, yet the same ASR systems are applied across all of them in real-world scenarios.

How to achieve great quality in ASR

So what is conducive to great quality output in ASR? How can a company set itself apart from competitors when it comes to ASR services? “It all starts with having a very good baseline model and the right training data,” Pavel Golik explains.

Having identified the strategic importance of data early on, AppTek has invested significant effort over the years in collecting and creating vast amounts of training data. This includes data from both major verticals the company specializes in: telephony, for call center applications, and broadcast, which encompasses a wide variety of broadcast news and sports, radio, speeches, lectures, entertainment data, specialized informative programs, user generated content, and so on. The benefits are shown in practice through the high quality telephony and broadcast ASR models that the company makes available to its clients through APIs.

And what makes a good baseline model? The quality of the data itself and the ASR system used to train it. Most ASR providers will typically use an open-source ASR system to train their models – Kaldi probably being the best and most popular. This, however, constrains them to the boundaries of what the system is built to do. If you have both the knowledge and access to the ASR system itself though, you could “implement any desired modification rather than just follow the recipe,” Volker Steinbiss, Managing Director at AppTek GmbH, points out. With only a dozen or fewer ASR systems available worldwide, having access to one gives an ASR provider the added advantage of full control. AppTek is one of the few companies that have the luxury of training their models on their own proprietary ASR system.

To achieve even higher quality, custom lexicons and glossaries can also be fed into an ASR system to specialize it. Given the number of new terms and names that make their way into everyday vocabulary, ingesting news and other website content to keep the vocabulary and the language model up to date is a technique that needs to be applied periodically. When applying ASR to an open domain like captioning, it is obvious how program material like news and sports benefits from such lexicon updates. Broadcasters would thus have an interest in teaming up with their technology provider to establish a workflow that automatically provides the latter with a continuous stream of new vocabulary as soon as it appears.
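As a rough illustration of such a vocabulary update step, the sketch below scans freshly ingested text for words that are missing from the current lexicon and ranks them by frequency. It is not a description of AppTek's workflow: the function `find_new_terms`, its `min_count` threshold and the simple English-oriented tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter


def find_new_terms(text: str, lexicon: set[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Return out-of-vocabulary words seen at least min_count times, most frequent first."""
    # Naive tokenizer: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(token for token in tokens if token not in lexicon)
    return [(word, count) for word, count in counts.most_common() if count >= min_count]


# Hypothetical lexicon and news snippet
lexicon = {"the", "is", "a", "new", "virus", "spreading"}
news = "The coronavirus is spreading. Coronavirus is a new virus."
print(find_new_terms(news, lexicon))  # [('coronavirus', 2)]
```

The frequency threshold is there to filter out typos and one-off tokens; candidate terms would still need pronunciations and language model statistics before the recognizer can use them.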

Additionally, both the acoustic and the language models in any ASR system can be adapted to the speaker and the speech content. The acoustic model can be adapted for accent, regional dialect, speaking style, speaking rate, etc. The language model can be adapted for content (e.g. daily news vs. a sermon), as certain phrases are more common in some scenarios than in others. Even part of a sentence is often enough to identify a piece of content as a sports piece, a sermon or a pharmaceutical report. Adapting the ASR system to such a style would obviously improve the quality of its output.
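One common way to adapt a language model to content is to interpolate a general model with a smaller model estimated on in-domain text. The toy sketch below does this for single-word (unigram) probabilities only; production systems work with n-gram or neural models, and the function names and the `weight` parameter are our own illustrative choices rather than anything described by AppTek.

```python
from collections import Counter


def unigram_probs(text: str) -> dict[str, float]:
    """Estimate relative word frequencies from whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(general: dict[str, float], domain: dict[str, float], weight: float) -> dict[str, float]:
    """P(w) = weight * P_domain(w) + (1 - weight) * P_general(w)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    vocabulary = set(general) | set(domain)
    return {
        word: weight * domain.get(word, 0.0) + (1.0 - weight) * general.get(word, 0.0)
        for word in vocabulary
    }


# Hypothetical training snippets: general news text vs. sports text
general = unigram_probs("the market fell the market rose")
sports = unigram_probs("the striker scored the goal")
adapted = interpolate(general, sports, weight=0.5)
print(adapted["goal"] > general.get("goal", 0.0))  # True
```

After interpolation, sports terms such as "goal" receive probability mass they did not have in the general model, while the adapted distribution still sums to one.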

Language diversity at AppTek

While US English is still the most dominant audio language in the media and entertainment market, both for live and offline processing, other languages are increasingly gaining ground. Recent blockbusters such as the Spanish series Money Heist, the German Dark or the Oscar-winning Korean film Parasite have proven that non-English entertainment content travels well internationally. As a result, it is rapidly growing in volume. To best serve the media and entertainment vertical, AppTek offers a wide variety of languages and dialects for speech recognition, including five varieties of English and another five of Arabic.

In our next blog post, we continue our discussion with Pavel Golik, AppTek’s Senior Speech Scientist, on the application of ASR in captioning and subtitling in the media vertical, and the specific challenges this poses. Stay tuned!

AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.
