
Innovative collaborations with private, non-federal partners are a cornerstone of the work done by NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT). One such collaboration with International Business Machines (IBM) has yielded INDUS, a suite of large language models (LLMs) designed specifically for the domains of Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics. These models are trained on curated scientific corpora drawn from diverse data sources, making them powerful tools for advancing scientific research.
The INDUS Advantage
INDUS includes two types of models: encoders and sentence transformers. Encoders convert natural language text into numeric codes that can be processed by the LLM. The INDUS encoders were trained on a corpus of 60 billion tokens encompassing data from astrophysics, planetary science, Earth science, heliophysics, and biological and physical sciences. What sets INDUS apart is its custom tokenizer, developed by the IMPACT-IBM collaborative team, which improves upon generic tokenizers by recognizing scientific terms like “biomarkers” and “phosphorylated.” Over half of the 50,000-word vocabulary contained in INDUS is unique to the specific scientific domains used for its training. The encoder models were then fine-tuned on approximately 268 million text pairs, including titles/abstracts and questions/answers, to produce the sentence transformer models.
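To illustrate why a domain-specific vocabulary matters, the sketch below splits the two example terms with a toy greedy longest-match subword tokenizer. It is not the INDUS tokenizer. The function name `tokenize` and both vocabularies are hypothetical and invented for illustration. The point is only that a vocabulary containing the scientific term keeps it as one token, while a generic vocabulary fragments it.

```python
def tokenize(word, vocab):
    """Greedy longest-match subword split of a single word (toy example)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry covers this character; emit an unknown marker.
            pieces.append("[UNK]")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


# Hypothetical toy vocabularies; real vocabularies hold tens of thousands of entries.
generic_vocab = {"bio", "mark", "ers", "phos", "phor", "yl", "ated"}
domain_vocab = generic_vocab | {"biomarkers", "phosphorylated"}

for word in ("biomarkers", "phosphorylated"):
    print(word, tokenize(word, generic_vocab), tokenize(word, domain_vocab))
```

Fewer fragments per term generally means shorter input sequences and a single learned representation for the concept, which is one plausible reason a custom vocabulary can help on domain benchmarks.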
By equipping INDUS with domain-specific vocabulary, the IMPACT-IBM team achieved superior performance over open, non-domain-specific LLMs on benchmarks for biomedical tasks, scientific question-answering, and Earth science entity recognition tests. Designed for diverse linguistic tasks and retrieval-augmented generation, INDUS excels at processing researcher questions, retrieving relevant documents, and generating accurate answers. For latency-sensitive applications, the team developed smaller, faster versions of both the encoder and sentence transformer models.
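The retrieval step in retrieval-augmented generation can be sketched as ranking passages by similarity to the question. In the minimal example below, `embed` is a hypothetical stand-in that uses bag-of-words counts. A real pipeline would replace it with a sentence transformer model that returns dense vectors. The passages and function names are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding (word counts); a sentence transformer would go here."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, top_k=1):
    """Return the top_k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


# Hypothetical passages, made up for this example.
passages = [
    "solar flares release bursts of radiation from the sun",
    "phosphorylated proteins can serve as biomarkers in spaceflight studies",
    "sea surface temperature is measured by satellite radiometers",
]
top = retrieve("which proteins serve as biomarkers", passages)
print(top[0])
```

In a full RAG system, the retrieved passages would then be supplied to a generative model as context for composing the answer.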
Validation and Performance
Validation tests have shown that INDUS excels in retrieving relevant passages from scientific corpora in response to a NASA-curated test set of about 400 questions. IBM researcher Bishwaranjan Bhattacharjee noted, “We achieved superior performance by not only having a custom vocabulary but also a large specialized corpus for training the encoder model and a good training strategy. For the smaller, faster versions, we used neural architecture search to obtain a model architecture and knowledge distillation to train it with supervision of the larger model.”
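Knowledge distillation, as mentioned in the quote, trains a small "student" model to match the outputs of a larger "teacher." The source does not say which objective the team used, so the sketch below assumes one common formulation: the KL divergence between temperature-softened output distributions, scaled by the squared temperature. The logits and the names `softmax` and `distillation_loss` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for one training example.
teacher = [4.0, 1.0, 0.5]
loss_close = distillation_loss([3.5, 1.2, 0.4], teacher)
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher)
print(loss_close, loss_far)
```

A student whose outputs track the teacher's receives a small loss, while one that disagrees receives a larger loss; minimizing this quantity over the training data is what "supervision of the larger model" amounts to under this assumed objective.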
Real-World Applications
INDUS has been evaluated using data from NASA’s Biological and Physical Sciences (BPS) Division. Dr. Sylvain Costes, the NASA BPS project manager for Open Science, highlighted the benefits of incorporating INDUS: “Integrating INDUS with the Open Science Data Repository (OSDR) Application Programming Interface (API) enabled us to develop and trial a chatbot that offers more intuitive search capabilities for navigating individual datasets. We are currently exploring ways to improve OSDR’s internal curation data system by leveraging INDUS to enhance our curation team’s productivity and reduce the manual effort required daily.”
At the NASA Goddard Earth Sciences Data and Information Services Center (GES-DISC), the INDUS model was fine-tuned using labeled data from domain experts to categorize publications specifically citing GES-DISC data into applied research areas. According to NASA principal data scientist Dr. Armin Mehrabian, this fine-tuning “significantly improves the identification and retrieval of publications that reference GES-DISC datasets, which aims to improve the user journey in finding their required datasets.” Furthermore, the INDUS encoder models are integrated into the GES-DISC knowledge graph, supporting a variety of other projects, including the dataset recommendation system and GES-DISC GraphRAG.
Kaylin Bugbee, team lead of NASA’s Science Discovery Engine (SDE), emphasized the benefits INDUS offers to existing applications: “Large language models are rapidly changing the search experience. The Science Discovery Engine, a unified, insightful search interface for all of NASA’s open science data and information, has prototyped integrating INDUS into its search engine. Initial results have shown that INDUS improved the accuracy and relevancy of the returned results.”
The Future of Scientific Research with INDUS
INDUS enhances scientific research by providing researchers with improved access to vast amounts of specialized knowledge. It can understand complex scientific concepts and reveal new research directions based on existing data. INDUS also enables researchers to extract relevant information from a wide array of sources, improving efficiency. Aligned with NASA and IBM’s commitment to open and transparent artificial intelligence, the INDUS models are openly available on Hugging Face. For the benefit of the scientific community, the team has released the developed models and will release the benchmark datasets that span named entity recognition for climate change, extractive QA for Earth science, and information retrieval for multiple domains. The INDUS encoder models are adaptable for science domain applications, and the INDUS retriever models support information retrieval in RAG applications.
INDUS represents a significant leap forward in the application of large language models to scientific research, paving the way for new discoveries and innovations in our understanding of the universe.