AppTek has been building large language models (LLMs) for a variety of human language technology (HLT) applications. This work ranges from developing custom LLMs for enterprise-specific applications to augmenting existing automatic speech recognition (ASR), neural machine translation (NMT), text-to-speech (TTS), and natural language processing and understanding (NLP/U) models to improve the accuracy and fluency of these systems. Additionally, AppTek’s Data Science team has incorporated its ABCD (AI Bias Correction Data-Science) methodology and evaluation protocols, together with AppTek’s 4D methodology, which covers a balanced representation of diverse Dialects, Demographics, Domains and Devices, to increase diversity, equity and ethics in multilingual LLMs and to offer more inclusive and ethical natural language generation systems for enterprise applications.
Recently, public attention surrounding LLMs has grown considerably, covering both the possibilities and the potential perils that come with them. LLMs have made remarkable progress in natural language processing, generating text that closely resembles human responses to user prompts or instructions formulated in natural language. These advancements have led to the development of increasingly innovative LLM-based applications and services, including more advanced text classification, language translation, summarization, question answering systems, sentiment analysis, personalized recommendation systems, and content creation, with significant and sometimes even dramatic accuracy improvements on these tasks compared to state-of-the-art specialized solutions. Despite these breakthroughs, there is still much to be built into LLMs, especially with respect to the diversity of demographics, domains, dialects and devices they cover, given the long-term downstream impact they can have.
With the growing public attention and usage surrounding GPT, more and more enterprises are approaching AppTek to better understand what large language models are (and aren’t), the types of applications they could serve within their organization, what’s involved in building and customizing GPT-based or other LLMs, and the types of ethical concerns and risks that need to be addressed when deploying such systems. This article offers a general overview of how AppTek is building large language models today, along with guidance on the process and the ethical considerations involved in building a customized LLM.
Large language models are developed with the objective of producing fluent natural language text as a continuation of a prefix/prompt provided by the user. They are created by training a large artificial neural network on extensive volumes of text data that encompass enormous amounts of written content, such as what was produced on the internet over a considerable period of time. As a result, these models often generate well-formed and refined responses to user-initiated prompts, delivering polished sentences at roughly the level an average-performing college student would produce. The important difference from the not-so-large language models of the previous generation is that there is no training towards a specific task or domain. The prompt alone is often enough to lead the LLM to produce a reasonable response. Most natural language processing scientists attribute this behavior to the enormous amounts of training data and the large size of the LLM’s neural network.
The process of training a large language model involves feeding it with extensive amounts of text data from various sources. Through this, the model learns patterns and relationships within a language, spanning words, phrases, sentences, and even whole documents. Large language models are renowned for their massive number of parameters, which can range from billions to trillions. This enables them to capture intricate linguistic nuances, dependencies, and long-range contexts, leading to a remarkable level of fluency.
GPT (Generative Pre-trained Transformer) is a specific type of large language model based on the Transformer architecture, which relies on self-attention mechanisms to capture contextual relationships between words. The model was introduced with the intention that it would be fine-tuned, in a second step, for various NLP tasks. Other large language models use different architectures and training methods to achieve a similar goal: providing a base model that can then be adapted to various natural language processing tasks using small amounts of task-specific data and/or prompt engineering and reinforcement learning.
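To make the idea of self-attention more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only, not GPT's actual implementation: the three tokens, their two-dimensional embeddings, and the function names are made up for this example, and the learned query/key/value projections and multiple attention heads of a real Transformer are left out (queries, keys and values are simply the embeddings themselves).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position mixes all value vectors,
    weighted by how well its query matches each key."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional embeddings for three tokens (made-up numbers).
tokens = ["the", "river", "bank"]
x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, attn = self_attention(x, x, x)  # simplification: queries = keys = values = x
```

Each row of `attn` sums to 1 and tells how strongly one token attends to every token in the sequence, which is how context from other words is folded into each word's representation.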
LLMs are versatile models that can be utilized in various applications involving natural language text generation. These applications range from creative text generation and sentiment analysis to question answering systems and the summarization of lengthy text into shorter, easily digestible forms. Even though fact grounding of the generated texts is not guaranteed, chatbot prototypes have been proposed that can communicate with users while comprehending their intent. LLM-based systems have also been developed to assist with the creation of software code, to improve our understanding of protein structures in health care, and to support many other tasks that are not directly language-related.
Creating a generative LLM involves several steps, including:
• Data Collection: Gather high-quality data from a variety of sources, ranging from pre-existing data sets such as books and articles to customer supplied data, covering a wide variety of domains and content which should be incorporated in the model.
• Data Preprocessing: Clean the text data by removing any noise, such as HTML tags, punctuation and special characters, and tokenize the text into smaller units, such as words and sentences.
• Model Architecture: Decide on the architecture of the LLM, including the number of layers, the size of the hidden state, the number of attention heads, and the sequence length.
• Training: Train the LLM on the preprocessed text data using a large amount of computing resources, such as GPUs or TPUs. The model is trained to predict the next word given the preceding words, and the model parameters are updated during training according to how well the predictions of the model being trained match the ground truth.
• Fine-tuning: Fine-tune the pre-trained LLM on a specific task, such as text generation or language translation, by providing it with task-specific training data and adjusting the model's parameters to fit the task.
• Evaluation: Evaluate the system on downstream tasks, including text-generation tasks, and assess the model's propensity for bias, toxicity, cultural insensitivity, etc. Conduct post-training analysis to identify any biases in the model's output and take corrective measures as needed.
• Deployment: Deploy the trained and fine-tuned LLM to a production environment, such as a web application or mobile app, where it can be used to generate text or perform other natural language processing tasks.
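The preprocessing and training steps above can be sketched on a toy scale. The example below is only an illustration under strong simplifying assumptions: instead of a neural network it uses a count-based bigram model, which predicts the next word from a single preceding word, and the sample text and function names are invented for this sketch. It does show the same two ideas, though: cleaning and tokenizing raw text, then estimating next-word probabilities from data.

```python
import re
from collections import Counter, defaultdict

def preprocess(raw):
    """Strip HTML tags, lowercase, and split the text into word tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.findall(r"[a-z']+", text.lower())

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()} if total else {}

raw = "<p>The model reads text. The model predicts the next word.</p>"
tokens = preprocess(raw)
model = train_bigram(tokens)
probs = next_word_probs(model, "the")  # distribution over words seen after "the"
```

An LLM replaces the count table with a neural network that conditions on a long context rather than one word, and it is trained by gradient updates rather than counting, but the training target (the next word given the preceding words) is the same.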
Large language models are being utilized in automatic speech recognition (ASR) systems to enhance the precision and naturalness of transcribed speech. The language modeling aspect of LLMs involves predicting the probability distribution of the next word based on the previous context. This helps LLMs identify the most likely next word or character in a given speech segment, thereby improving the accuracy and fluency of the transcribed ASR speech output. However, when using LLMs in ASR systems, researchers have to balance the weights given to the acoustic model and the language model to prevent the LLM from hallucinating words that were not spoken. Despite this challenge, integrating LLMs into ASR systems has resulted in significant improvements in the accuracy and naturalness of ASR output, thereby making the technology more accessible and user-friendly for a wider range of applications and use cases.
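One common way to picture the balance between the acoustic model and the language model is a weighted sum of their log-probabilities over a list of candidate transcripts. The sketch below is a hypothetical illustration, not AppTek's system: the candidate transcripts, the scores, and the `lm_weight` parameter name are all made up, and production systems typically apply such weighting inside the decoder rather than on a three-item list.

```python
def rescore(hypotheses, lm_weight):
    """Pick the hypothesis with the best weighted sum of log-probabilities."""
    def combined(h):
        return h["am_logprob"] + lm_weight * h["lm_logprob"]
    return max(hypotheses, key=combined)

# Hypothetical n-best list for one utterance; all scores are invented.
nbest = [
    {"text": "recognize speech", "am_logprob": -4.0, "lm_logprob": -3.0},
    {"text": "wreck a nice beach", "am_logprob": -3.5, "lm_logprob": -9.0},
    # Fluent, but contains a word the acoustics do not support.
    {"text": "recognize speech today", "am_logprob": -12.0, "lm_logprob": -1.0},
]

acoustic_only = rescore(nbest, lm_weight=0.0)   # language model ignored
balanced = rescore(nbest, lm_weight=0.5)        # both models contribute
overweighted = rescore(nbest, lm_weight=10.0)   # language model dominates
```

With these invented scores, ignoring the language model selects the acoustically closest but implausible transcript, a moderate weight selects the intended one, and an excessive weight selects the fluent transcript containing a word that was never spoken, which is the hallucination risk described above.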
Large language models are incorporated into neural machine translation (NMT) models to improve the language modeling component of the system, which is the one that affects the fluency of the translation. This allows the model to generate more natural and contextually appropriate translations. There are several methods to improve the accuracy and efficiency of NMT systems by incorporating LLMs, including initializing NMT model weights with pre-trained LLM weights, combining LLM and NMT models, optimizing LLM prompts to improve NMT output, providing document-level and other context with LLMs for NMT, and using LLMs to create synthetic data for NMT. Additionally, prompting the LLM to generate synthetic data of a particular style, domain, or speaker gender can improve the performance of metadata-aware NMT systems. These approaches aim to leverage the vast data that LLMs are trained on and their ability to capture long-range dependencies.
Large language models can improve the naturalness and expressiveness of streaming-based (incremental) Text-To-Speech (TTS) systems as well, where the output is generated incrementally with little or no future context. Additionally, LLMs can be used to improve grapheme-to-phoneme conversion accuracy during the input preprocessing step by disambiguating word pronunciations based on their context.
Recent approaches to TTS leverage LLM technology to improve the naturalness and speaker adaptation capabilities of TTS systems. These tackle the TTS task as an audio language modeling problem, where the audio is first compressed by means of a neural audio encoder and an auto-regressive decoder generates the output iteratively.
There are various applications of LLMs in natural language understanding (NLU). One such application involves using pre-trained LLMs with higher capacity to achieve better results on downstream tasks such as Named Entity Recognition, Sentiment Analysis, Intent Recognition in Dialog, and further populating missing information in Knowledge Graphs (KGs). Another use case is to leverage the well-formed generated answers of LLMs in customer service dialogue systems. To overcome the challenge of aligning generated facts and entities with a particular application, LLMs can be adapted using customer/domain-specific scripts and/or structured knowledge sources. This involves adjusting the LLM's weights, probability distributions, and other parameters to suit the customer/domain-specific context.
To better understand what LLMs lack with respect to natural language understanding and conversational AI, one can compare them to AI-based image-generator tools such as DALL-E 2 or Midjourney. Both LLMs and image generators are creative models, the creativity being defined by the statistics/weights of the model applied to a given context (initial photo/image or prompt). Like image generators that can produce a Van Gogh or Picasso-like version of the initial image, LLMs will generate fantasy worlds that are enjoyable to read. However, what is expected from generated text within conversational AI is organized, structured, fact-based information that refers to the real world. LLMs have been trained on stories (unstructured content), and therefore the challenge of using them for NLU is ensuring that the generated answers are fact-grounded and follow existing world knowledge and structure rather than just producing a nice piece of art.
To effectively incorporate structured information (knowledge, facts) into LLMs, there are several approaches that can be taken at different stages of the process, such as during data pre-processing, model training, or inference. These approaches include:
• Pre-processing: Filtering and organizing training data according to domain, devices, demographics, and dialects/languages.
• Model training: Adding structured information to the training data.
• Inference time: Re-scoring the search space for domain consistency or dynamically grounding generated text with additional (structured) information.
One approach to structuring knowledge from unstructured texts is to translate entity-related sentences into small KGs and integrate them into document-related larger KGs. The knowledge graph representation of a text can be used for fact-checking against the domain. For instance, irrelevant information can be pruned, or existing information can be enhanced to align with the domain.
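As a rough illustration of fact-checking against a domain knowledge graph, the sketch below represents a KG as a set of (subject, relation, object) triples and sorts generated triples into supported, contradicted, and unknown. Everything here is hypothetical: the entities, the relations, and the `check_triples` helper are invented, the step that extracts triples from generated text is assumed to exist elsewhere, and the check assumes each (subject, relation) pair has exactly one valid object.

```python
# A domain knowledge graph as a set of (subject, relation, object) triples.
# All entities and relations are invented for illustration.
domain_kg = {
    ("router x1", "supports", "wifi 6"),
    ("router x1", "warranty_years", "2"),
}

def check_triples(generated, kg):
    """Split generated triples into supported, contradicted, and unknown.

    Assumes each (subject, relation) pair has a single valid object."""
    known = {(s, r): o for s, r, o in kg}
    report = {"supported": [], "contradicted": [], "unknown": []}
    for s, r, o in generated:
        if (s, r) not in known:
            report["unknown"].append((s, r, o))
        elif known[(s, r)] == o:
            report["supported"].append((s, r, o))
        else:
            report["contradicted"].append((s, r, o))
    return report

# Triples assumed to have been extracted from an LLM-generated answer.
generated = [
    ("router x1", "supports", "wifi 6"),
    ("router x1", "warranty_years", "5"),
    ("router x1", "color", "black"),
]
report = check_triples(generated, domain_kg)
```

Contradicted triples are candidates for correction against the domain KG, while unknown triples could either be pruned as irrelevant or reviewed as potential additions, depending on the application.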
Developing a responsible and ethical LLM entails careful consideration of the ethical implications and a commitment to implementing relevant practices during development and deployment. This may necessitate addressing various issues, such as:
• Bias: LLMs can be trained on biased data, which can result in biased outputs. This can lead to perpetuating and amplifying existing social biases.
• Privacy: LLMs often require vast amounts of data to be trained on, which can raise concerns about privacy. The data used to train the model may contain personally identifiable information (PII), and the model itself may have the ability to generate text from PII that can be used for malicious purposes.
• Misinformation: LLMs can generate realistic-looking text, which can be used to spread misinformation and fake news. This can have serious consequences, particularly in the areas of politics and mental health.
• Ownership: LLMs require significant computing resources to train, which can be prohibitively expensive for smaller organizations. This can result in a concentration of power and ownership over the models, which can have implications for fairness and competition.
• Regulation: The development and deployment of LLMs may require new regulations to ensure ethical and responsible use. These regulations may include requirements for transparency, accountability, and oversight.
Reducing bias in LLMs is important for ensuring accuracy and reliability in predictions and recommendations. Biased outputs can lead to unfair outcomes for underrepresented communities and reinforce existing inequalities. Additionally, building trust and credibility in LLMs is necessary to ensure their usefulness and effectiveness.
Here are some of the ways we can mitigate bias:
• Data Collection: Collect a diverse and representative set of training data that includes a wide range of perspectives and experiences. This can help reduce the risk of bias by ensuring that the model is exposed to a broad range of language use.
• Data Preprocessing: Carefully preprocess the training data to remove any biased or discriminatory language, such as racial or gender stereotypes. This can help ensure that the model is not influenced by harmful language and can reduce the risk of perpetuating biases.
• Balanced Training Data: Ensure that the training data is balanced with respect to different demographics and social groups. This can help prevent the model from being skewed towards any particular group or perspective.
• Regularization Techniques: Use regularization techniques such as dropout, weight decay, or early stopping during training. These techniques can help prevent the model from overfitting to the training data and generalize better to new inputs.
• Evaluation Metrics: Use evaluation metrics that explicitly measure fairness and bias, such as demographic parity or equalized odds. This can help identify and quantify any bias in the model's output and guide further improvements.
• Reinforcement Learning from Human Feedback (RLHF) with a Reward Model: Train a reward model to predict the rewards that a human expert would give to an agent for performing a given action in a given state. The reward model is then used to guide the behavior of the agent (here, the LLM), so that the agent learns to take actions that are more likely to receive positive feedback from the expert.
• Post-Training Analysis: Conduct post-training analysis to identify any biases in the model's output and take corrective measures as needed. This can involve examining the model's outputs for patterns of bias, as well as conducting user studies to gather feedback on the model's performance and identify areas for improvement.
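The fairness metrics named in the list above can be computed in a few lines for a binary decision. The sketch below is a simplified illustration with made-up predictions, labels and group memberships: it reports the demographic parity gap (difference in positive-prediction rate between groups) and the equalized odds gaps (differences in true-positive and false-positive rate). How an LLM's free-text output is mapped to such binary decisions, for example via a toxicity classifier, is an assumption left outside this example.

```python
def rate(flags):
    """Fraction of truthy values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rate between groups."""
    rates = [rate([p for p, g in zip(preds, groups) if g == grp])
             for grp in set(groups)]
    return max(rates) - min(rates)

def equalized_odds_gaps(preds, labels, groups):
    """Largest between-group gaps in true-positive and false-positive rate."""
    tprs, fprs = [], []
    for grp in set(groups):
        rows = [(p, y) for p, y, g in zip(preds, labels, groups) if g == grp]
        tprs.append(rate([p for p, y in rows if y]))      # among actual positives
        fprs.append(rate([p for p, y in rows if not y]))  # among actual negatives
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Invented data: model decisions, ground-truth labels, and group membership.
preds  = [1, 1, 0, 1, 0, 0, 0, 1]
labels = [1, 0, 0, 1, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

dp_gap = demographic_parity_gap(preds, groups)
tpr_gap, fpr_gap = equalized_odds_gaps(preds, labels, groups)
```

A gap of zero means the groups are treated identically under that metric; larger gaps point to groups for which the model's decisions or error rates differ and where further data balancing or analysis may be needed.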
Eliminating bias from LLMs is challenging, as biases can be subtle and pervasive. Working with AppTek’s ABCD methodology (AI Bias Correction Data-Science), along with its 4D methodology, which calls for a balanced representation of diverse demographics, domains, dialects and devices/channels, is a great first step for enterprise customers to reduce the risk of bias and toxicity in customized LLMs.
In brief, LLMs, including GPT, represent a groundbreaking milestone in artificial intelligence and human language technology. Some call it a tipping point towards a new, transformative state of the art. As always, we should not overestimate the short-term effects of the current breakthrough and, more importantly, not underestimate its long-term effects either. This is our opportunity to help create a truly inclusive and fair society by building language models that are fit for one.
At AppTek, LLMs have actively powered our human language technology, and with GPT they have now revolutionized how we interact with and process natural language. We acknowledge the significance of LLMs in enabling cognition, generation and comprehension at an unprecedented level of complexity, and we strive to harness this capability to improve communication across the globe. By leveraging LLMs, we can facilitate language processing with greater speed, accuracy, and nuance than ever before. This technology enables us to break down linguistic barriers and engage in cross-cultural communication. As we move forward, we look to further augment the capabilities of LLMs and expand their potential to enhance natural human communication and interaction.
AppTek is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.