“Spotlight on Automatic Speech Recognition”, an interview with Eugen Beck

September 16, 2021

While automatic speech recognition (ASR) is one of the most mature of AI-enabled language technologies, the science behind it is continuously evolving. Today, the use of ASR is widespread across multiple domains ranging from voice-enabled applications to conversational analytics, interview transcriptions and more, and it continues to gain traction in satisfying increasing accessibility requirements in the USA and abroad. We spoke to AppTek’s Lead Science Architect for ASR, Eugen Beck, to find out more about the current state of ASR and the way forward.

Q: Tell us a bit about yourself – where did you grow up, what languages do you speak?

Eugen Beck

A: I was born in Russia but my family moved to Germany when I was three and a half years old, so I grew up and studied in Germany. As a result, my Russian is limited to household level, but I speak German and English fluently. I am very interested in languages and wanted to learn a language with a completely different structure from European languages. I started learning Chinese when I travelled to China a few years ago and it has been fascinating. I am now able to communicate in Chinese to get things done and discuss most topics with my teachers, though I still have a lot of trouble understanding accents.

Q: How did you decide what to study?

A: As a kid I loved playing computer games and wanted to make my own, so my mother started buying me programming books. I quickly discovered I had a knack for coding and it was a lot of fun for me, so I decided to study computer science – that’s how I moved to Aachen and I’ve been here ever since. Speech recognition came later, when I worked on my PhD. I had originally decided to study data mining as I’m interested in getting insights from data, but this did not work out for me, so I decided to switch to Prof. Dr.-Ing. Hermann Ney’s chair at the university and focus on the more challenging topic of human language technologies instead – I’ve never looked back.

Q: Why did you choose speech recognition over machine translation (MT)?

A: Both technologies were very interesting to me, but in speech you have (at least in theory) a ground truth, i.e. the task is a lot more clearly defined. In MT you can have different equally valid translations, as language is not only about the spoken word, but also about culture and in some cases it’s very hard to bring the context across in different languages. In Chinese for instance, there are thousands of sayings and you may be able to translate them in a sentence or two, but in Chinese many of them are only four characters long. In MT you measure the output using a BLEU score, but this is of course a very imperfect measure of semantic similarity. In ASR, things are more straightforward in that respect, although on a technical level ASR is a harder task than MT simply because the amount of data that is being processed is a lot larger. This means you need to be computationally more efficient, which I enjoy, as it fits with my desire to process large amounts of data and design efficient algorithms. I like to focus on efficiency and this is part of my work here at AppTek.

Q: What do you like best about working at AppTek?

A: I had been working in Prof. Ney’s scientific team for about a decade before joining AppTek full time, and I also completed an internship at Google, so I had the experience of working in a large company too. What I like best about AppTek is the freedom it offers me to explore my own ideas and the fact that I have more control over the technology stack. In a smaller company it is easier to get things done and there are still a lot of challenging tasks to solve, so I prefer working in such an environment.

I also like the fact that I can continue using RWTH Aachen’s speech toolkit, RASR, which I’ve been maintaining for a while and which I know how to optimize well. In fact, I can carry over all the tools that I used at university, where I was building an automation pipeline, and this makes my life easier.

Q: Tell us more about your work at AppTek – what is your vision?

A: Until recently, the priority at AppTek had been language expansion, so as to cover all the major languages requested of us. Now that we cover a large selection of languages, we are focusing on adding more automation to the pipeline, so we can iterate on languages regularly and not just on the basis of client demand.

We have just completed a major redesign of our streaming server that increases parallelization and extensibility. My vision is to make the speech team scalable, so everybody can train models and rapidly update them. The goal is to update all our models on a monthly basis, as soon as we crawl or get new data in, so we are always at the cutting-edge for any language we offer. Another thing we work on is bringing together our batch and streaming models.

Q: You talk about batch and streaming ASR – aren’t the same models used for both? Is there a difference in quality?

A: Yes, the same models are used for both, but the difference is in how fast one collapses the search space. The machine will recognize the ending of a word very shortly after it has ended in the speech signal, but this word end is only one of many hypotheses. You might need to hear the next word to be sure that the previous word is correct, that it makes sense in the context of the next word, so it takes some time for the search space over the past to collapse. For ASR applications like live captioning there is a limit, because you need to output the words as quickly as possible. We look at the best hypothesis at each time frame, and although the last couple of words might still change, everything else is fixed, so we prune away all other hypotheses. This means we might lose some accuracy, as we could later have found another word sequence that fits the data better, but if there is a latency requirement, we need to collapse the search space faster.
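The pruning and early-commit idea Beck describes can be sketched in a few lines. This is an illustrative toy, not AppTek’s actual decoder: the hypothesis scores are invented, and the `prune` and `stable_prefix` helpers are hypothetical names for the two operations — keeping only the best-scoring hypotheses, then emitting the words that all survivors already agree on.

```python
def prune(hypotheses, beam_size):
    """Beam pruning: keep only the beam_size best-scoring hypotheses."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:beam_size]

def stable_prefix(hypotheses):
    """Words shared by every surviving hypothesis are safe to output now."""
    word_seqs = [h[0] for h in hypotheses]
    prefix = []
    for words in zip(*word_seqs):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break  # hypotheses disagree from here on; keep waiting
    return prefix

# Hypotheses are (word sequence, log score) pairs after a few frames.
hyps = [
    (["the", "cat", "sat"], -1.2),
    (["the", "cat", "sad"], -1.9),
    (["the", "cap", "sat"], -4.5),   # low score: pruned away by the beam
]
hyps = prune(hyps, beam_size=2)
print(stable_prefix(hyps))  # the first two words are already fixed
```

With a tighter beam the prefix stabilizes sooner (lower latency) but a better-fitting hypothesis may already have been discarded — exactly the accuracy/latency trade-off described above.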

Q: Aside from captioning, there are other ASR applications that also require very low latency, e.g. interpreting. What is the minimum latency you can have with streaming ASR?

A: What adds to the latency is that there are a number of things that need to happen in the pipeline. When we get the speech signal, first we perform feature extraction. We put the signal through our voice-activity detector to cut out the speech parts. Then we put it through the neural network, which gives us the scores for each phoneme, and then we try to find the best path for each word. So by the time the decoder looks at the word end, a second might have passed already from the time the audio was actually spoken.
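The stages listed above can be sketched as a toy pipeline. Every function here is a stand-in invented for illustration — a real system uses spectral features, a trained voice-activity detector, and a neural acoustic model — but the structure (features → VAD → scoring → decoding) matches the description, and each stage adds its own share of the latency.

```python
def extract_features(samples, frame_size=4):
    """Chop the raw signal into fixed-size frames (stand-in for feature extraction)."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def is_speech(frame, threshold=0.5):
    """Toy voice-activity detector: keep frames with enough average energy."""
    return sum(abs(x) for x in frame) / len(frame) > threshold

def acoustic_scores(frame):
    """Stand-in for the neural network that scores each phoneme per frame."""
    energy = sum(abs(x) for x in frame) / len(frame)
    return {"AH": energy, "S": 1.0 - energy}

def decode(samples):
    """Run every stage in order and pick the best-scoring phoneme per speech frame."""
    frames = extract_features(samples)
    speech = [f for f in frames if is_speech(f)]
    scores = [acoustic_scores(f) for f in speech]
    return [max(s, key=s.get) for s in scores]

signal = [0.0, 0.1, 0.0, 0.1,   # near-silence: dropped by the VAD
          0.9, 0.8, 0.9, 0.7]   # speech frame: scored and decoded
print(decode(signal))  # prints ['AH']
```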

At AppTek we have traditionally prioritized quality over speed or latency. Thus our current generation of streaming models has a latency of around 1.6 seconds, which works fine for many applications. In live captioning, for example, the accepted latency can be quite a few seconds. In other applications like interpreting, you can overcome some of the latency by pushing through recognition updates in real time. We’re working on bringing the latency further down without compromising quality by integrating a new transformer neural-network model into our streaming system, which has the capacity for lower latency as it requires less future context.

Q: In subtitling, as in other markets, the commercial demand is to produce different versions for the different language variants, e.g. English or Spanish. Do you need to have different acoustic models for them?

A: The better question here is how closely related these languages are. An ASR model includes the spelling and the pronunciation, and also multiple pronunciation variants. So it’s not a problem to have a model that includes data from different language variants, as long as the model is large enough to learn from the data. If the variation is too large for the given size of the model, it might degrade the recognition accuracy. I prefer to work with larger models, for example a global English model to include all English variants, as this fits more commercial use cases.

Q: What is your opinion on the hybrid vs. end-to-end approaches in ASR? What is the next milestone in ASR?

A: One could write a whole book about the topic of hybrid versus end-to-end, but, in short, end-to-end systems do not use a pronunciation dictionary and are thus to some extent easier to build, but often underperform on tasks with small amounts of data. The decision about which architecture to use can depend on many factors, and the distinction between hybrid and end-to-end is not black and white. For example, RNN-T (a neural-network architecture used in many end-to-end systems) can also be used in hybrid systems (i.e. together with a pronunciation lexicon), where it also performs well. I think the more important aspects are the choice of NN architecture and the preparation of your data.

Neural-network acoustic models have been around for a long time, and only with advances in computing power did they start to overtake probabilistic models like Gaussian mixture models. Even then, improvements were gradual. The trend of using more and more computing resources and data will certainly continue and is probably the most important factor in the advances still to be made in the field. One of the current hot topics in the community is unsupervised training on vast amounts of unlabeled data, which promises nice gains but also requires significant infrastructure and data to work well.

For use-cases where you have a lot of high-quality data to train systems with, ASR quality is already very good as measured by word error rate. When it comes to improving the ASR for specific domains or noisy environments though, there is still significant room for improvement. Then there are also textual features that are important to customers, but do not enjoy as much popularity in the research community, like punctuation and speaker diarization. This is one of the areas where AppTek distinguishes itself from many of our competitors.


AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.
