A: I am half French, half Irish, and a nomad at heart: I was born in Madagascar and grew up in several different places. My mother tongue is English, because of my Irish mother, but I learned French from the age of four, when my family moved to France. We moved around a lot, and I have lived in Africa and North America too, but I spent most of my childhood and schooling in France. As a result, I now speak better French than English, and I think in French. I completed my Ph.D. in automatic speech recognition (ASR) in France, travelled again, mostly in Africa, and when I returned to Europe I decided to move to Germany, where I have been based ever since.
I originally went to Frankfurt, where I worked on expert systems using symbolic processing, which forms the basis of AI. Symbolic processing is about working with concepts, semantics and logic in order to build systems that try to simulate, at a very abstract level, the way people reason. The same principles are used in rule-based machine translation (MT). I thoroughly enjoyed this work, but when Philips Research in Hamburg offered me a position alongside Prof. Dr.-Ing. Hermann Ney, I decided to switch back from symbolic AI systems to probabilistic modeling, as the potential of bottom-up, data-driven methods was clear to me. This was the first turning point in my career.
A: Although I speak three languages fluently and have a basic understanding of a couple more, language was never my passion; I simply use language in my work. I would rather say that the three things that played a determining role in my development were all sport related. I took up rowing in Avignon and was lucky to be trained by the late French world gold medalist René Duhamel, who taught us a lot about hard work and instilled in us the notions of excellence and endurance. He was my first mentor. I then went into rock climbing, which is one step ahead of rowing in terms of technique. Rowing has much less to do with muscle than one might think: it’s about understanding how a boat moves, not interrupting its speed, and training your heart rate and endurance. Rock climbing is about shifting your weight and balance optimally so that you don’t lose energy; it’s about precision, self-confidence and especially humility with respect to nature. Precision is a core aspect of my work, and rock climbing helped me there. The third sport I took up was sailing, which has a lot to do with physics. I have been sailing high-speed catamarans specifically, which also involves technology and new ways of thinking, stepping away from tradition and old experiences.
A: NLU did not exist when I was studying for my Ph.D.; it came much later in my life. I started with ASR research, working on acoustic-phonetic modeling at Philips for about a decade, and then I moved to the business side, which was the second turning point in my career. Curiosity has always been my main driver; I felt I had reached a plateau in research, and going into business felt like a good challenge. I spent the next decade building business strategies and ecosystems of companies in the ASR market, starting with Nuance Communications. I was the company’s first employee in Europe and built up a team of 15 people, which very quickly accounted for 25% of the company’s turnover.
Once I began missing the technical challenge in my work, I started trying out various things that slowly led me in the NLU direction. I am a wine aficionado and decided to crawl the web to build a database of wines and wineries, including wine ratings over the years, as harvest years can be better or worse for each winery. The challenge was that the name of one winery can differ from another by only one or two characters, while at the same time web references to wineries are often error-prone, e.g., names can be misspelled or part of the name can be missing. I solved this as a pure hidden Markov modeling problem, as in speech recognition.
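To make the matching difficulty concrete, here is a minimal sketch that uses plain edit distance rather than the hidden Markov model described above. The winery names, the `max_distance` threshold and the function names are hypothetical and chosen only for illustration. It shows how a misspelled mention can be equally close to two known names, which is the kind of ambiguity a probabilistic model is better suited to resolve.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def match_winery(mention: str, known: List[str], max_distance: int = 2) -> Optional[str]:
    """Return the closest known name, or None if nothing is close or the result is a tie."""
    scored = sorted((edit_distance(mention.lower(), name.lower()), name) for name in known)
    best_distance, best_name = scored[0]
    if best_distance > max_distance:
        return None  # nothing close enough
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # two names are equally close: ambiguous
    return best_name


# Hypothetical names, two of which differ by a single character.
WINERIES = ["Chateau Belmont", "Chateau Belmond", "Domaine Lafont"]

print(match_winery("Domaine Lafon", WINERIES))   # one clear nearest neighbour
print(match_winery("Chateau Belmon", WINERIES))  # tie between two names
```

A distance threshold alone cannot break the tie in the second call. A statistical model can bring in further evidence, such as how frequent each name is or which errors are likely.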
I also wanted to organize my life better, so I started building a system that would analyze my emails and tell me what I needed to do next, where and when. In other words, it would replace a personal assistant who reads my emails, understands that I am trying to organize a meeting with somebody, and automatically populates my calendar with an item containing all the relevant information: whom I am meeting (including their LinkedIn profile), the topic, where the meeting will take place, how to get there, and possibly a selection of hotels and restaurants in the vicinity of the meeting place. I even built an iPhone app to that effect.
These questions brought me back to research, in NLU this time, to solve such problems. Working on how human understanding basically works felt like going back to my roots, and this was the third turning point in my career. I joined AppTek’s team of scientists, which I find is an excellent environment to satisfy my need to keep learning from all the amazing colleagues I work with and the expertise they bring to the table.
A: In brief, NLU is about understanding textual content (transcribed by ASR or by a person), extracting information from it, and building a dynamic knowledge graph of this information for the document we are analyzing. We can then use this knowledge graph to perform a variety of tasks, such as monitoring compliance or script adherence in call centers, which we work on a lot at AppTek.
As a technology, it is newer than ASR and MT and therefore less mature, so it largely follows the path the other two technologies have taken. Rule-based systems are still used, but when good-quality annotated corpora are available, we make use of machine learning methods in NLU as well. This is the case for named entity recognition (NER), where we differentiate between person, location, organization and facility names. Although existing NER corpora are based on written text, the models we build on these corpora can be applied to spoken content, which typically contains features such as hesitations and disfluencies.
The situation becomes harder when we want to identify spoken phone numbers, credit card numbers or email addresses. Think of a call center agent trying to understand the phone number or email address of a customer: the spoken words include hesitations (“hm”), syntactical information (“followed by”, “finally”, “and then”), repetitions and also spelling in various ways (“g for George”, “e as in Evelyn”, “l like love”). In such a scenario, while we are annotating spoken corpora to build machine learning (ML) models, we also make use of rule-based systems.
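As a rough illustration of what a rule-based pass over such an utterance might look like, here is a minimal sketch. The vocabularies, the regular expression and the function name `decode_spelled` are assumptions made for this example; they do not describe WMatch or any AppTek system. Repetitions and self-corrections are deliberately left unhandled.

```python
import re

# Hypothetical vocabularies; a production system would need much larger ones.
FILLERS = {"hm", "uh", "um", "er"}
CONNECTIVES = ("followed by", "and then", "finally")
DIGITS = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
SYMBOLS = {"at": "@", "dot": ".", "dash": "-", "underscore": "_"}

# Matches spelling phrases such as "g for george", "e as in evelyn", "l like love".
SPELLING = re.compile(r"\b([a-z]) (?:for|as in|like) \w+")


def decode_spelled(utterance: str) -> str:
    """Turn a spelled-out phone number or email address into its written form."""
    text = re.sub(r"[^\w\s]", " ", utterance.lower())  # drop punctuation
    text = " ".join(text.split())                      # collapse whitespace
    for phrase in CONNECTIVES:                         # drop "followed by", "and then", ...
        text = re.sub(rf"\b{phrase}\b", " ", text)
    text = SPELLING.sub(r"\1", text)                   # keep only the spelled letter
    pieces = []
    for token in text.split():
        if token in FILLERS:
            continue                                   # skip hesitations
        pieces.append(DIGITS.get(token, SYMBOLS.get(token, token)))
    return "".join(pieces)


print(decode_spelled("g for George, e as in Evelyn, hm, l like love, "
                     "followed by at example dot com"))
print(decode_spelled("five five five, and then, hm, one two"))
```

Rules like these are brittle: “oh” may be a zero or a hesitation, and “at” may be an ordinary word. That is one reason annotated spoken corpora and ML models are developed alongside them.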
In NLU we need to be pragmatic: we have to understand what type of data is available and what can be done with it for the task we want to solve. At AppTek, we have full command of ML methods through our own ML toolkit, RETURNN, and we also have a very powerful rule-based tagging system called WMatch that can be used in place of ML models when annotated data is insufficient for a given task.
A: Quality annotated data is an important issue for NLU. There isn’t much of it, because creating it takes significantly more time than creating, say, transcription data for ASR. The main reason is that a higher level of skill is needed. When manually transcribing a spoken corpus, one can tell objectively what a person said. But when you annotate a corpus for NLU, you need to take the context into account, relate it to the sentence you are annotating, and only then can you abstract the meaning of the sentence, which is not a job you could easily get a crowd of workers to perform. For example, when annotating “I love Paris”, “Paris” could be annotated or tagged as a “person”, a “city”, or a “god”. It is only after looking into the context, e.g., reading the next sentence, that one can decide what the correct tag is.
Typically, inter-annotator agreement for NLU tasks is 85-90%. When it comes to sentiment analysis (also part of NLU), where emotional intelligence comes into play, agreement between annotators is only 60-70%, which shows you how hard the task is. By contrast, inter-annotator agreement on transcription tasks is typically above 95%. It’s not only that annotators need to think and understand the context, but also that they need to de-bias themselves from their own context, so as to be as objective as possible.
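The interview does not say how these agreement figures are computed. As a minimal sketch, assuming they are raw percent agreement between two annotators, the calculation looks as follows. Cohen’s kappa, a common chance-corrected alternative, is shown alongside for comparison. The labels are invented for the example.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items to which both annotators assigned the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label independently.
    p_expected = sum(counts_a[label] * counts_b[label]
                     for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Invented tags for ten mentions of "Paris"; the annotators disagree on one item.
annotator_1 = ["city", "person", "city", "god", "city",
               "person", "city", "city", "person", "city"]
annotator_2 = ["city", "person", "city", "person", "city",
               "person", "city", "city", "person", "city"]

print(f"percent agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"Cohen's kappa:     {cohen_kappa(annotator_1, annotator_2):.2f}")
```

Raw agreement is easy to read but flatters tasks with one dominant label, which is why the chance-corrected figure comes out lower.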
There is a great need for NLU data, and language service providers could probably help with this task. The domains themselves vary a lot depending on market needs. The call center domain is of particular interest to AppTek, where our ASR system is the best-performing one, especially for English. But there are also other domains, of course.
A: At AppTek we are working on creating annotated data for NLU purposes along with rule-based systems in order to be able to propose answers for diverse NLU tasks. So, we are working hard at making NLU machine learnable. But I am not sure when NLU tasks will be able to run on completely data-driven systems, as reasoning is such a large part of the equation. A task as simple as changing the time of a meeting between several participants involves analyzing data specific only to that meeting (in other words, deciding what is relevant and what is not), and also understanding what the change implies for the calendar invite or the organization of the meeting.
Compliance or script adherence is also a very popular NLU topic (e.g., did a call center agent follow the script they were supposed to follow?). We can easily calculate the distance between two sentences or even compare their meaning. Things get trickier when it comes to numbers and named entities. The call center does not need to know whether an agent mentioned a monetary amount; it wants to know whether the amount is the correct one. Consider the variability: $17.95 can be “seventeen dollars and ninety-five cents”, “seventeen ninety five”, “seventeen and ninety five”, etc. Further, as we work on ASR transcriptions, which are not error free, we have to deal with subtle errors such as “seventeen” being recognized as “seventy”. We thus need to come up with ways to account for the possibility of such errors in the transcript when performing the NLU analysis. We are currently building a framework to work on such issues.
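One way to picture the amount check is the following minimal sketch. It assumes amounts below one hundred dollars, only the three phrasings listed above, and a small hand-written table of “-teen”/“-ty” confusions. All names are hypothetical, and this is not the framework under development at AppTek.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
        60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}
# Assumed ASR confusions: a "-teen" word recognized as its "-ty" lookalike.
CONFUSABLE = {"thirteen": "thirty", "fourteen": "forty", "fifteen": "fifty",
              "sixteen": "sixty", "seventeen": "seventy", "eighteen": "eighty",
              "nineteen": "ninety"}


def say(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens * 10] if ones == 0 else f"{TENS[tens * 10]} {ONES[ones]}"


def spoken_variants(dollars: int, cents: int) -> set:
    """The three phrasings mentioned in the text; real speech has many more."""
    d, c = say(dollars), say(cents)
    return {f"{d} dollars and {c} cents", f"{d} {c}", f"{d} and {c}"}


def normalize(text: str) -> str:
    """Lowercase, turn hyphens and punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z\s]", " ", text.lower()).split())


def check_amount(transcript: str, dollars: int, cents: int) -> str:
    """Return 'match', 'possible match' (likely ASR confusion) or 'no match'."""
    text = f" {normalize(transcript)} "  # padding gives whole-word matching
    variants = spoken_variants(dollars, cents)
    if any(f" {v} " in text for v in variants):
        return "match"
    for v in variants:
        for teen, ty in CONFUSABLE.items():
            if teen in v.split() and f" {v.replace(teen, ty)} " in text:
                return "possible match"
    return "no match"


print(check_amount("That will be seventeen dollars and ninety-five cents.", 17, 95))
print(check_amount("your total is seventy ninety five", 17, 95))
```

Returning a separate “possible match” label, rather than forcing a yes or no, leaves room for a human reviewer or a later scoring step to decide how much weight a likely recognition error deserves.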
When we develop new methods such as these, we always think about how to use them in a general way. This means we can apply the same principles to search, to information extraction from a document, or to summarization.
A: NLU is the technology that allows you to map content into (new) knowledge that is used to make decisions. Abstracting and extracting information from a document is only half the story in NLU. The other half is inferencing or reasoning within the given context, or in other words applying rules that we use ourselves as human beings every day, such as making use of time zone differences to calculate a local time, the location, the planning of the journey (calculating the time needed to go to a certain place), etc. We need to provide the system with rules and/or methods that describe how to infer the needed information to make the final decision, i.e., to automatically create a calendar invite in this example.
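Two of the everyday rules mentioned here, converting a meeting time into another participant’s local time and working out when to leave, can be written down directly. The sketch below uses Python’s standard `zoneinfo` module, which relies on the system time zone database or the `tzdata` package being available. The meeting date, the cities, the travel time and the function names are invented for the example.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def to_local(start: datetime, tz_name: str) -> datetime:
    """Express a timezone-aware meeting start in another participant's zone."""
    return start.astimezone(ZoneInfo(tz_name))


def leave_by(start: datetime, travel_minutes: int, buffer_minutes: int = 10) -> datetime:
    """Latest departure time given the travel time plus a safety buffer."""
    return start - timedelta(minutes=travel_minutes + buffer_minutes)


# Invented example: a 15:00 meeting in Berlin with a participant in New York.
meeting = datetime(2024, 3, 12, 15, 0, tzinfo=ZoneInfo("Europe/Berlin"))

print("New York participant sees:", to_local(meeting, "America/New_York").strftime("%H:%M"))
print("Leave the office by:      ", leave_by(meeting, travel_minutes=35).strftime("%H:%M"))
```

The date falls in the weeks when the United States has already switched to daylight saving time and Germany has not, so the difference is five hours rather than the usual six. Exceptions like this have to be encoded somewhere for the resulting calendar invite to be correct.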
As of today, it is not possible for a machine to learn the rules needed to infer the expected, exact and correct meaning, decision or conclusion for a task analyzed within a document or conversation; very well annotated data would be needed for this to happen. Ideally, we would want to build a single data structure that includes the entity recognition, the knowledge graph and the rules, all within one probabilistic model. While the path is rather clear, how long it will take to obtain a complete learning system that incorporates all three levels (entity, knowledge graph, rules/inferencing) remains an open question.
Creating a computable calendar invite is an NLU product. This in itself is a building block for more complex solutions or apps. We are in a phase now where such building blocks are being built. At some point NLU development will accelerate to combine all these building blocks into a single meaningful system. One could say that, if robots are to understand humans one day, this will have a lot to do with NLU.
AppTek provides an artificial intelligence and machine learning-based automatic speech recognition, machine translation and natural language understanding platform for organizations in a variety of markets, such as media and entertainment, call centers, government, enterprise business and others across the globe. Available via the cloud or on-premise, AppTek delivers the highest quality real-time streaming and batch speech technology solutions in the industry. Featuring scientists and research engineers who are recognized amongst the best and most experienced in the world, the company’s solutions cover a wide array of languages, dialects, and channels.