Harnessing Instruction-Tuned Large Language Models to Mine Structured Omics Data for Predicting Chemical Toxicity
Book chapter

Yufan Liu, Guoping Lian and Tao Chen
Computer Aided Chemical Engineering, pp.2815-2820
2024

Abstract

Keywords: Chemical safety; fine-tuning; GPT; Large Language Models (LLMs); omics technologies; structured information extraction
Chemical safety and toxicology are important considerations in designing safer and more sustainable products and processes. Omics technologies, including transcriptomics, proteomics, and metabolomics, provide crucial insights into chemical toxicity by identifying molecular-level changes after chemical exposure and elucidating regulatory pathways. Despite the vast literature on this topic, there is a lack of comprehensive datasets detailing chemical perturbations and their outcomes. A tool that can efficiently and accurately extract structured data from scientific literature is needed. Large Language Models (LLMs) such as GPT-4 offer the potential for efficient information retrieval from intricate texts. However, optimising their factuality and desired behaviour often requires labour-intensive human feedback. To address this, our work introduces a semi-automated pipeline for structured information extraction from voluminous literature. First, articles that name any type of omics in the title or abstract and mention pathway analysis in the text were retrieved from PubMed. Next, GPT-4 was used to extract data points, including omics type, perturbation, perturbation type and study results, from selected abstracts in a zero-shot manner. After manual correction, these data were used to fine-tune the GPT-3.5-turbo model. The fine-tuned model then processed a new batch of abstracts, and its output was validated by GPT-4. Discrepancies were manually reconciled, and the consolidated data were used to further fine-tune the GPT-3.5-turbo model. After several iterations of reconciliation and fine-tuning, the resulting model demonstrated high accuracy and alignment in extracting structured data from literature with minimal human intervention, which holds the potential to accelerate knowledge transformation. Additionally, we present a structured dataset covering omics type, perturbations, perturbation types and results, among other fields, that can be used for future omics studies.
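
The validation step described above, in which the fine-tuned model's output is checked against GPT-4 and only the discrepancies go to a human, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record layout, the field names (`omics_type`, `perturbation`, `perturbation_type`, `results`), the exact-match-after-normalisation rule and the helper names are all assumptions made for illustration, and the example values are invented rather than taken from the dataset. No model API is called; the sketch only shows how two sets of extractions might be compared and triaged.

```python
from dataclasses import dataclass, asdict

# Assumed field names, mirroring the data points listed in the abstract.
FIELDS = ("omics_type", "perturbation", "perturbation_type", "results")


@dataclass(frozen=True)
class ExtractionRecord:
    """One structured extraction from one abstract (hypothetical layout)."""
    omics_type: str
    perturbation: str
    perturbation_type: str
    results: str


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_discrepancies(candidate: ExtractionRecord,
                       reference: ExtractionRecord) -> list[str]:
    """Return the names of fields on which the two extractions disagree."""
    cand, ref = asdict(candidate), asdict(reference)
    return [f for f in FIELDS if normalise(cand[f]) != normalise(ref[f])]


def triage(pairs: dict[str, tuple[ExtractionRecord, ExtractionRecord]]):
    """Split abstracts into auto-accepted records and ones needing manual review.

    `pairs` maps an abstract identifier to (fine-tuned output, validator output).
    """
    accepted, needs_review = {}, {}
    for abstract_id, (candidate, reference) in pairs.items():
        diffs = find_discrepancies(candidate, reference)
        if diffs:
            needs_review[abstract_id] = diffs
        else:
            accepted[abstract_id] = candidate
    return accepted, needs_review


# Invented example values, for illustration only.
pairs = {
    "abstract-1": (
        ExtractionRecord("Transcriptomics", "bisphenol A", "chemical",
                         "altered estrogen signalling"),
        ExtractionRecord("transcriptomics", "Bisphenol A", "chemical",
                         "altered  estrogen signalling"),
    ),
    "abstract-2": (
        ExtractionRecord("proteomics", "cadmium", "chemical",
                         "oxidative stress response"),
        ExtractionRecord("metabolomics", "cadmium", "chemical",
                         "oxidative stress response"),
    ),
}
accepted, needs_review = triage(pairs)
```

In this sketch the agreed records would join the next fine-tuning set directly, while the flagged fields would be reconciled by hand first. Exact matching after normalisation is a deliberately strict choice: free-text fields such as `results` would likely need a softer comparison in practice.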
