ClimateGPT: Climate-Specific Large Language Model for Factual and Fluent Information Retrieval and Generation

December 4, 2023
AppTek

As nations convene at the United Nations Climate Change Conference (COP28) in the UAE this month to reassess and recalibrate climate strategies, one crucial aspect that often hinders timely and informed decision-making is the accessibility of accurate information across the diversity of languages needed for the many countries participating in the decision-making process. Enter ClimateGPT, a groundbreaking project by AppTek, EQTYLab, and Erasmus AI, aimed at harnessing the power of Large Language Models (LLMs) to provide broader and faster global access to climate-specific intelligence.

The importance of climate information exchange

Climate change is not merely an environmental concern; it is a challenge that permeates every facet of our lives. Air pollution, extreme weather events, degraded water quality: from public health to economic stability, the impacts of climate change are widespread and complex.
To address the multifaceted climate crisis effectively, decision-makers are better served by a collective knowledge base, termed 'climate social intelligence'. Building it involves fostering a global understanding of climate issues, circulating critical climate information across languages, promoting informed decision-making, and accelerating positive change.

Traditional media such as books, films, articles and conferences have played a significant role in raising awareness, but LLMs offer a unique advantage when it comes to disseminating information. As we have witnessed this past year, LLMs are already revolutionizing communication, research and innovation. Their ability to comprehend and generate human-like text from user-friendly prompts is unparalleled, making them indispensable in today's digital landscape.

Christian Dugast speaking at the COP28 Summit

ClimateGPT is a specialized LLM adapted to the topic of climate change and will be showcased to the public for the first time at COP28 by AppTek scientists David Thulke and Christian Dugast. Other AppTek scientists who contributed to the project include Yingbo Gao, Rricha Jalota, Abdallah Nasir, Taylor Tragemann, Katie Nguyen, Evgenii Tsymbalov and Evgeny Matusov. The goal of the project is to harness collective climate social intelligence to address climate challenges. ClimateGPT is more than just a chatbot; it is a climate social intelligence platform that assists governments, organizations, and individuals in making informed decisions.

Domain adaptation pre-training for climate expertise

To begin, the scientists selected Llama 2 as the baseline model due to its flexibility, scalability and optimized architecture for handling large data sets. While the Llama 2 baseline model, trained on 2 trillion tokens, is versatile on general-domain content, it lacks deep expertise in scientific, climate-specific information. Recognizing this, AppTek scientists used the Erasmus climate data set for domain-adaptive pre-training, specializing the model for the climate science domain with an additional 300B tokens drawn from books, patents, scientific publications, news and policy documents within the climate domain. From this data, the model develops a more refined understanding of climate-specific concepts, terminologies and contextual nuances.

There are different approaches to domain adaptation, and the team experimented with several to determine which would yield the best results before deploying a model in production. A common approach is continued pre-training of a general-purpose foundation model; this was one of the approaches used for ClimateGPT, with Erasmus AI providing the curated in-domain climate data for this purpose.
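For illustration, here is a minimal sketch of what continued pre-training on in-domain text can look like with the Hugging Face transformers library. The checkpoint name, data file and hyperparameters below are stand-ins, not the project's actual configuration.

```python
# Minimal sketch of continued pre-training (causal language modeling) on an
# in-domain corpus. Checkpoint, data file and hyperparameters are
# illustrative placeholders, not the actual ClimateGPT setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"              # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token      # Llama has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Plain-text climate corpus, one document per line (hypothetical file).
raw = load_dataset("text", data_files={"train": "climate_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="climategpt-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,   # emulate a large effective batch
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train,
    # mlm=False selects the causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```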

The team also experimented with training a smaller model from scratch on a larger curated data set covering multiple scientific domains, including climate. This provides complete control over the training data, a critical factor in a field prone to misinformation and bias.


Instruction fine-tuning for more comprehensive insights

Once pre-training is complete, the resulting climate language models have a deeper understanding of the target domain than the foundation models. However, since these models were merely trained to predict the next token in the pre-training dataset, using them for specific downstream tasks requires careful prompting and few-shot examples. Adapting them to follow users' instructions and generate text in a style appropriate for this use case requires instruction fine-tuning.

Instruction fine-tuning enables natural and fluent conversations within ClimateGPT for its users: policy makers, scientists and journalists who need to educate themselves on the complex issues surrounding climate, in their native language. Climate science is inherently interdisciplinary, encompassing natural science, economics, and social aspects. The goal is therefore to develop a model capable of addressing queries from these three critical perspectives and providing the comprehensive insights crucial for informed discussions and decision-making.

Data was also needed for instruction fine-tuning, and it was collected by AppTek in three phases. The bulk was gathered by AppTek's TechOps team, where the work was carried out by data annotators, non-experts in the climate domain, working on a platform built for this purpose. This was complemented by data from experts, PhD and graduate students specializing in the climate domain, who provided pairs of questions and answers on the topic to demonstrate what would be expected in real user interactions. Finally, an interview with a seasoned expert was arranged to better understand how the model could serve users looking for information. One of the differentiators of this project's approach is that no synthetic data was used for training.

A common approach to instruction fine-tuning is to use a different LLM to generate synthetic training data. For ClimateGPT, the team's focus was instead on high-quality data created by humans, in order to ensure the highest possible accuracy of the resulting model.
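To make this concrete, the sketch below shows one common way human-written question/answer pairs can be turned into supervised fine-tuning examples. The instruction template, tokenizer choice and loss masking shown here are generic assumptions, not ClimateGPT's actual format.

```python
# Sketch: converting human-collected Q/A pairs into supervised fine-tuning
# examples. Template and masking scheme are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_example(question: str, answer: str, max_len: int = 2048):
    # Simple instruction template (assumed, not the actual ClimateGPT one).
    prompt = f"### Instruction:\n{question}\n\n### Response:\n"
    prompt_ids = tokenizer(prompt)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
    answer_ids.append(tokenizer.eos_token_id)

    input_ids = (prompt_ids + answer_ids)[:max_len]
    # Mask the prompt with -100 so the loss is computed only on the
    # human expert's answer, not on the question itself.
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}

example = build_example(
    "How does climate change affect crop yields?",
    "Rising temperatures and shifting rainfall patterns reduce yields of "
    "staple crops in many regions, though effects vary by crop and latitude.",
)
```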


Assessing the accuracy of the model’s responses

A common problem with LLMs is hallucination. A metaphorical term, hallucination in this context refers to generations that deviate into undesirable regions of the probability space, leading to off-track responses. Coupled with the fact that a model's knowledge is 'frozen' in time and does not incorporate new facts without additional training, mitigating this issue is key to ensuring the model's usefulness.

To mitigate this, external knowledge is incorporated in the form of manually curated scientific papers and reports. By providing the model with relevant content during the generation phase, we boost its ability to produce focused and relevant responses. Furthermore, manual and automatic evaluation methods have been used to assess the correctness of the responses generated by ClimateGPT on three different climate-specific test sets. This has resulted in an improvement in the model's accuracy from an initial 63% to 73%.
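A minimal retrieval-augmented sketch of this idea follows, under assumptions: an open sentence-embedding model stands in for the actual retriever, and the passages are invented examples of curated source snippets.

```python
# Sketch of retrieval-augmented generation: fetch the most relevant curated
# passages and place them in the prompt. Encoder, corpus and prompt format
# are illustrative stand-ins, not the ClimateGPT production components.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

# Hypothetical snippets from manually curated papers and reports.
passages = [
    "Global mean sea level rose by about 20 cm between 1901 and 2018.",
    "Heatwaves have become more frequent and intense since the 1950s.",
]
passage_emb = retriever.encode(passages, convert_to_tensor=True)

def build_prompt(question: str, k: int = 2) -> str:
    q_emb = retriever.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, passage_emb, top_k=k)[0]
    context = "\n".join(passages[h["corpus_id"]] for h in hits)
    return (f"Answer using only the sources below.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:")

prompt = build_prompt("How fast is sea level rising?")
# `prompt` would then be passed to the fine-tuned climate model to generate
# a grounded answer.
```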


Machine translation (MT) for multilinguality

The problem of access to reliable information on climate change is not only one of bias or inaccuracy in online data. It is also that a large volume of valid scientific information is available only in English, while the interested parties who need to access it come from many different linguistic backgrounds.
Due to the lack of multilingual scientific data that could have been included in the model's pre-training or instruction fine-tuning, machine translation is employed to solve this problem. MT thus becomes an integral component of ClimateGPT, used both to translate user queries from any language into English and to translate the model's responses, generated in English, back into the user's language.
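The translate-in/translate-out flow can be sketched as below; the open MT checkpoints are public stand-ins for AppTek's proprietary, climate-adapted MT systems, and `climate_llm` is a hypothetical callable wrapping the fine-tuned model.

```python
# Sketch of the translate-in / translate-out pipeline around the
# English-only climate model. Open MT models are stand-ins for AppTek's
# proprietary systems; climate_llm is a hypothetical LLM callable.
from transformers import pipeline

to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
from_en = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def answer_in_german(query_de: str, climate_llm) -> str:
    # 1) Translate the user's query into English.
    query_en = to_en(query_de)[0]["translation_text"]
    # 2) Run the English-only climate model.
    answer_en = climate_llm(query_en)
    # 3) Translate the answer back into the user's language.
    return from_en(answer_en)[0]["translation_text"]
```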

AppTek's baseline machine translation systems were fine-tuned to the climate domain and serve as the machine translation component of the pipeline. Several domain adaptation experiments were conducted using climate-related parallel sentence pairs extracted from the available data. Furthermore, a glossary override feature was applied to ensure that climate terms are always translated accurately.
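As a deliberately simplified stand-in for a glossary override (production systems enforce terminology inside the decoder itself), the check below flags translations that miss an approved term rendering; the term pairs are hypothetical examples.

```python
# Simplified stand-in for a glossary override: flag translations whose
# approved term rendering is missing, so they can be re-translated with
# the term constrained. Term pairs are hypothetical examples.
GLOSSARY_EN_DE = {
    "carbon sink": "Kohlenstoffsenke",
    "tipping point": "Kippunkt",
}

def glossary_violations(source_en: str, translation_de: str) -> list[str]:
    """Return glossary terms whose approved rendering is missing."""
    src = source_en.lower()
    return [en for en, de in GLOSSARY_EN_DE.items()
            if en in src and de not in translation_de]

print(glossary_violations(
    "Forests act as a carbon sink.",
    "Wälder wirken als CO2-Speicher.",   # misses the approved term
))  # -> ['carbon sink']
```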


Green computing for environmental responsibility

Acknowledging the environmental impact of computing resources, the ClimateGPT team takes a responsible stance. The model is developed on a high-performance computing cluster, provided by MLFoundry, that is entirely powered by clean, renewable energy. Although it is challenging to secure high-end GPU computing resources, especially ones powered by green energy, the decision to partner with an environmentally conscious provider committed to clean energy sources reflects a commitment to minimizing the project's carbon footprint.

Enhancing access to climate information for more informed decisions

ClimateGPT represents a significant stride in leveraging cutting-edge technology to combat climate change. By combining the power of LLMs with a dedicated focus on climate science, the project aims to enhance global climate social intelligence with faster and more robust access to information for policy makers, scientists and journalists.

“It has been very exciting to work on this project and experiment with different solutions to train a high-performing domain-adapted LLM,” says David Thulke, one of the leading scientists in the ClimateGPT project at AppTek. “We plan to publish the models we’ve built as open source, to help the scientific community progress faster with respect to finding efficient ways to harness the power of LLMs for new, domain-specific applications.”

As nations and organizations grapple with the complexities of climate action, having an advanced tool like ClimateGPT at their disposal can pave the way for more informed decisions and, ultimately, a sustainable future for our planet. To learn more about the ClimateGPT project, click here.

ABOUT APPTEK.ai

AppTek.ai is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U), large language models (LLMs) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek's solutions cover a wide array of languages/dialects, channels, domains and demographics.
