By Yota Georgakopoulou
A: AppTek is a great choice of workplace for me because it combines the agility of a small startup with many years of experience, and a core of people who are interested in research and allow you time for scientific publications. In such an environment it is easy to have your ideas implemented and put into production, and to do so very fast. We serve client requests very quickly; I like the short decision path and the great interactions with customers, the users of our technology, from whom we can easily get direct feedback and improve our models. This is something I particularly enjoy, so after spending four years at a large company with a lot more bureaucracy, I decided to come back to a smaller one.
At AppTek I am the lead architect for MT, which means I lead a team of people who are all very talented and experienced researchers. Most of them work independently to a large extent, and I simply direct them down the path we need to go. But they also come up with their own ideas and develop parallel paths of their own. The other part of my role has to do with coming up with new ideas in the core MT research or in the applied research side of things, i.e., how to do things better. This creates an opportunity, especially for the more junior members of the team, to come up with their own ideas and perform quick implementations and tests that I may not have time to do myself. The result is new solutions that can be tested and implemented quickly to solve specific problems. I also enjoy programming and training models myself, and in a company of this size I still manage to find the time to do so.
The other thing that is special about AppTek is the collaboration between its different language technology teams. Some of the people in my team also work on automatic speech recognition (ASR) tasks, and both teams work on speech translation and revoicing/dubbing automation. In this task we combine both technologies in a single, direct model which translates from speech input in one language to text or even speech in another language.
A: Speech translation has been around for a while, but its quality wasn’t great until recently. There has been tremendous progress in the field in the past couple of years: direct speech translation systems can now compete with the older, so-called ‘cascaded’ systems, which are built by chaining separate components from each technology.
An important benefit of direct speech translation is that one can avoid some recognition errors thanks to the constraints of MT, but at the same time, if the transcript is not output as a deliverable, then the system might not have as wide a variety of practical applications. I don’t believe this is a major showstopper though. In applications such as live subtitling and simultaneous interpreting, where there is no time to correct recognition errors, such systems will be of great use.
We already see implementations of this technology being appreciated by its users, such as the implementation that has recently been made in Zoom. Needless to say, an even more forward-thinking application of the technology is in revoicing scenarios, such as in this demo, which was still created with the old cascaded approach. There is already research happening to automate the lip synchronization aspect as well, which is needed in dubbing workflows.
A: Even in ASR, the research community is moving away from systems that use phonetic modeling to what is called ‘end-to-end’ systems, where there is no explicit phoneme representation, and you go directly from the spoken signal to words or subwords, and everything in between is modeled with neural network layers of different kinds. This works quite well.
The extension to direct speech translation is to replace these words or subwords in the source language with words or subwords in the target language. You also add more layers of different types of attention to semantic units, plus layers which group acoustic features together, as otherwise the sequence becomes too long to translate. This way the model learns internally whatever is best for its representation to output the target translation. This is how you can avoid recognition errors, because of the additional constraint that the text should be translatable, that the output from the speech signal should be something that fits the context for translation.
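The frame-grouping step mentioned above can be sketched in a few lines. This is an illustrative stand-in only (real systems typically do the downsampling with strided convolutional layers inside the network), showing why grouping consecutive acoustic frames makes the sequence short enough to translate:

```python
from typing import List

def stack_frames(frames: List[List[float]], factor: int = 4) -> List[List[float]]:
    """Group every `factor` consecutive acoustic feature frames into one
    wider vector, shortening the sequence by that factor so that attention
    over a whole spoken utterance stays tractable."""
    stacked = []
    for i in range(0, len(frames) - factor + 1, factor):
        merged = []
        for frame in frames[i:i + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked

# A 100-frame utterance with 80-dim features becomes 25 frames of 320 dims.
utterance = [[0.0] * 80 for _ in range(100)]
reduced = stack_frames(utterance, factor=4)
```

Audio is sampled far more densely than text (roughly 100 feature frames per second against a handful of words), which is why this reduction is needed before the translation layers take over.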
Speech translation in its basic form requires training data which are very hard to get: triplets of the spoken signal, in its .wav form for instance, its transcript and also its translation. The best models today are the ones that use additional data by pre-training some parts of the neural network with data that has only transcripts and translation but no audio, or only transcripts and audio but no translation, or even by creating synthetic data where you generate a translation with a regular MT system, and you use it in training.
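The data situation described above can be pictured with a simple record type. This is an illustrative structure, not AppTek’s actual data format; the stand-in MT function shows how a transcript-only or audio-plus-transcript example can be turned into a usable synthetic triplet:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SpeechTranslationExample:
    audio_path: Optional[str]   # e.g. a .wav file; None for text-only data
    transcript: Optional[str]   # source-language text
    translation: Optional[str]  # target-language text

# A full triplet is rare; partial examples are far easier to collect.
full = SpeechTranslationExample("clip1.wav", "Guten Tag", "Good day")
mt_only = SpeechTranslationExample(None, "Wie geht's?", "How are you?")  # pre-trains translation layers
asr_only = SpeechTranslationExample("clip2.wav", "Danke", None)          # pre-trains the acoustic encoder

def make_synthetic(example: SpeechTranslationExample,
                   mt_system: Callable[[str], str]) -> SpeechTranslationExample:
    """Fill in a missing translation with a regular MT system's output,
    creating a synthetic triplet for direct speech translation training."""
    if example.translation is None and example.transcript is not None:
        example.translation = mt_system(example.transcript)
    return example

# With a toy stand-in MT function, the ASR-only pair becomes a full triplet.
synthetic = make_synthetic(asr_only, lambda text: {"Danke": "Thank you"}[text])
```

The pre-training idea is that the partial examples each teach one part of the network, and only the scarce full (or synthetic) triplets are needed to train the model end to end.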
A: We are now creating MT systems which are customized to different types of metadata – it is like injecting world knowledge of some kind into the text. This is what I believe the next milestone in MT will be.
There are two types of user input you can have in such MT systems. The first relates to the desired output style, e.g., formal or informal, or the length of the translation. The latter is quite useful for both subtitling and dubbing. The system provides length-controlled output by rephrasing with shorter words of the same meaning. In the case of shorter translations, the system may also omit certain words instead of simply paraphrasing, and it usually omits the less important ones. We don’t really know how it chooses which words to omit; it just learns to do so automatically from training data that contain such shorter translations and have been tagged as such.
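One common way to realize this kind of tagging (a sketch of the general technique, not necessarily AppTek’s implementation) is to prepend pseudo-tokens to the source sentence. The same tags are attached to the matching examples during training and then set by the user at translation time; the tag names here are made up for illustration:

```python
from typing import Optional

def add_control_tokens(source: str,
                       style: Optional[str] = None,
                       length: Optional[str] = None) -> str:
    """Prefix the source sentence with control tags the model learns to obey.

    During training, sentences whose reference translations are e.g. short or
    informal get the matching tag; at inference, the user chooses the tags
    to steer the output style and length."""
    tags = []
    if style is not None:
        tags.append(f"<style:{style}>")
    if length is not None:
        tags.append(f"<length:{length}>")
    return " ".join(tags + [source])

tagged = add_control_tokens("See you tomorrow!", style="informal", length="short")
# The MT model sees: "<style:informal> <length:short> See you tomorrow!"
```

Because the tags are just extra tokens in the input, no change to the model architecture is needed; the network picks up the correlation between a tag and the properties of the reference translations on its own.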
The other type of input is providing the system with information about the incoming text, so that it can use context to inform the MT output. Such context is not just the surrounding words, but also the context of the entire document. The latter can be information about the genre of the document, or about the speaker and his or her gender, which may affect the MT output because of the grammar of some languages. It even goes beyond this to analyzing images or videos, as the source text in the creative industries is often audiovisual, in order to provide additional information that can help disambiguate certain words, e.g., the ‘bank’ of a river versus the financial institution. In media localization, post-production scripts often contain such visual information, which could of course be used instead of having to analyze the video.
This is a very active area of work right now and it is often referred to as ‘personalized’ MT. This doesn’t mean of course that everyone gets their own MT system, but that the MT is a single system that can use additional inputs from the user or from other sources to produce better translations.
A: The inputs are provided both during the system training phase as well as during the actual translation, as the system is being used. The information can either be inferred from the data itself (e.g., if we have the document information, we can infer its genre), or it can come from a tag that is provided manually or by some separate algorithm.
This development also poses a challenge, because we have reached a point where we have horrendously large models and equally large training datasets, and this data can be very noisy precisely because there is so much of it. I think that at some point this must stop, because you cannot go further without digging deeper to see what is really happening. Just throwing more data at these models is not environmentally friendly to begin with, but it will also not allow us to learn what these models are doing.
To overcome this problem, I believe the few-shot learning approach is promising. In this scenario, you provide the system with a few examples to learn from, just like you do with a child – it does not see too many examples of language, but it still learns quickly. This is especially important if you want to use more world knowledge. Even with the computational resources available today, it is not feasible to analyze an entire video and all its frames; you need to be able to quickly extract only the important parts. Instead, what has been happening so far is that you throw everything in and hope that the system will extract the useful parts – sometimes it does and sometimes it doesn’t.
Knowing what to pick out is precisely where the challenge lies, that’s why this is a hot research topic. The work is still done with neural networks: you provide them with different inputs or ask them to pay attention to different things. Once this problem is solved, the progress of AI will be far greater going forward; I believe this will be the next milestone in the technology.
A: We have seen over the last few years that neural MT is of better quality than previous generations of MT systems, but there are some practical issues around it that are not solved yet. I think the time has come to solve them and we see that people are now paying attention to things like better glossary integration and quality estimation, which still do not work as well as they could for neural MT.
Quality estimation, for instance, is a very hard problem. Even with the best methods, it is easy to estimate when a translation is good, but not yet when it is bad. There are also a lot of new metrics, such as LEPOR and COMET, with more coming out every year. This is another problem the research community needs to address: papers are still accepted on the basis of BLEU score improvements for neural systems, when it has already been shown that BLEU is not an appropriate metric for their evaluation.
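To make the BLEU criticism concrete, here is a bare-bones, smoothed sentence-level BLEU (a simplification of the real metric; production evaluation uses tools such as sacreBLEU). Because it only counts surface n-gram overlap with a single reference, a fluent neural translation that happens to be a valid paraphrase is punished:

```python
import math
from collections import Counter

def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: modified n-gram precisions (add-one smoothed)
    combined by geometric mean, times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing so one empty n-gram order doesn't zero the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(sum(log_precisions) / max_n)

exact = simple_bleu("the cat sat on the mat", "the cat sat on the mat")       # 1.0
paraphrase = simple_bleu("a cat was sitting on the mat", "the cat sat on the mat")
```

The paraphrase is a perfectly acceptable translation of the same source, yet it scores far below the exact match – the kind of blindness that newer learned metrics like COMET try to overcome.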
This is also what we are working on at AppTek. As I mentioned before, we are rolling out speech translation and metadata-informed MT, but we are also working on better glossary integration and quality estimation. Additionally, we are continuing to expand our language coverage. We have multilingual systems that also cover low-resource languages, like African languages, for which we now see more of a demand.
A: Exactly. Arle Lommel from Common Sense Advisory calls this ‘responsive MT’, i.e. MT that considers context and is able to adjust itself on the basis of user feedback. He also believes such adaptation of the MT at the segment level through metadata enrichment is the next MT milestone, one that will truly augment translators with more tools to do their work more easily and quickly. Some of the initial results we’ve seen for the subtitling use case do support this statement. For me, it shows that AppTek's metadata-informed MT will lay the foundation for responsive MT in the near future.
AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.