Harnessing Instruction-Tuned Large Language Models to Mine Structured Omics Data for Predicting Chemical Toxicity

Yufan Liu; Guoping Lian; Tao Chen

doi:10.1016/B978-0-443-28824-1.50470-1

Back

Book chapter

Harnessing Instruction-Tuned Large Language Models to Mine Structured Omics Data for Predicting Chemical Toxicity

Yufan Liu, Guoping Lian and Tao Chen

Computer Aided Chemical Engineering, pp.2815-2820

2024

DOI: https://doi.org/10.1016/B978-0-443-28824-1.50470-1

Abstract

Chemical safety

fine-tuning GPT

Large Language Models (LLMs)

omics technologies

Structured information extraction

Chemical safety and toxicology are important considerations in designing safer and sustainable products and processes. Omics technologies, including transcriptomics, proteomics, and metabolomics, provide crucial insights into chemical toxicity by identifying molecular-level changes post-chemical exposure and elucidating regulatory pathways. Despite the vast literature on this topic, there's a lack of comprehensive datasets detailing chemical perturbations and their outcomes. A tool that can efficiently and accurately extract structured data from scientific literature is needed. Large Language Models (LLMs) like GPT-4 offer the potential for efficient information retrieval from intricate texts. However, optimising their factuality and desired behaviour often requires labour-intensive human feedback. Addressing this, our work introduces a semi-automated pipeline for structured information extraction from voluminous literature. Initially, literature that contain any type of omics in the title or abstract and mention pathway analysis in the text were obtained from PubMed. Subsequently, GPT-4 was employed to extract data points including omics type, perturbation, perturbation type and study results, from selected literature abstracts in a zero-shot manner. After manual corrections, this data served to fine-tune the GPT-3.5-turbo model. This fine-tuned model then processed a new batch of abstracts, with its output validated by GPT-4. Discrepancies were manually reconciled, and the consolidated data was used to further fine-tune the GPT-3.5-turbo model. Following an iterative process of reconciliation and fine-tuning, the resulting model demonstrated high accuracy and alignment in extracting structured data from literature with minimal human intervention, which holds the potential to accelerate knowledge transformation. Additionally, we present a structured dataset encapsulating omics type, perturbations, perturbation types, results, etc., that can be used for future omics studies.

Metrics

1 Record Views

Details

Title: Harnessing Instruction-Tuned Large Language Models to Mine Structured Omics Data for Predicting Chemical Toxicity
Creators: Yufan Liu - University of Surrey
Guoping Lian - University of Surrey
Tao Chen - School of Chemistry and Chemical Engineering, University of Surrey,Stag Hill, Guildford GU2 7XH, UK
Publication Details: Computer Aided Chemical Engineering, pp.2815-2820
Number of pages: 6
Publication Date: 2024
Identifiers: 991121791002346
Academic Unit: School of Chemistry & Chemical Engineering
Language: English
Resource Type: Book chapter

Harnessing Instruction-Tuned Large Language Models to Mine Structured Omics Data for Predicting Chemical Toxicity

Abstract

Metrics

Details

Usage Policy