Leveraging Large Language Models for Contextual Prioritization of Contaminants of Emerging Concern in Chemical Mixtures

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://figshare.com/articles/dataset/Leveraging_Large_Language_Models_for_Contextual_Prioritization_of_Contaminants_of_Emerging_Concern_in_Chemical_Mixtures/31971918

Description:

Effective management of chemical mixtures presents a continuing challenge due to the growing diversity and inadequate characterization of contaminants of emerging concern (CECs). While recent advances in nontarget analysis enable the generation of extensive chemical inventories, key bottlenecks have shifted to postidentification interpretation within heterogeneous data. Here, we present an agent-based workflow that integrates large language models (LLMs) with functional categories, potential sources, and toxicology information to support risk prioritization. The practical technical components and evaluation benchmarks for LLMs were established, showing that optimized prompts and the best-performing model (GPT-4-Turbo) among the seven candidates enhanced user alignment with context perfectly. Integrating real-world data through retrieval-augmented generation enabled us to retrieve 100% truthful content, and further fine-tuning nearly doubled response consistency, substantially reducing hallucination. The workflow was validated using two mixture scenarios to assess the applicability across matrices and chemical contexts. The agent enabled complete functional and source annotation of chemicals by querying the NORMAN Network and achieved ∼85% accuracy for substances absent from existing databases by emulating NORMAN-aligned logic. This capability allowed mixture-level interpretation of chemical inventory, revealing dominant categories and industrial sources, such as lubricants in shale gas flowback produced water and semiconductor-related industrial intermediates, which contributed to elevated risks in the studied scenarios.
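The lookup-then-fallback annotation and the mixture-level summary described above can be illustrated with a minimal sketch. It is not the authors' implementation: the reference table, the compound names, the category labels, and the functions `annotate`, `fallback_classify`, and `dominant_categories` are all hypothetical stand-ins, and the fallback is a placeholder for whatever model-based step handles substances missing from the database.

```python
from collections import Counter

# Hypothetical stand-in for a reference database; this is NOT the NORMAN API
# and the entries are invented purely for illustration.
REFERENCE_DB = {
    "compound_a": "lubricant",
    "compound_b": "industrial intermediate",
}


def fallback_classify(name):
    """Placeholder for a model-based step; here it only flags the gap."""
    return "unclassified"


def annotate(inventory, db=REFERENCE_DB, fallback=fallback_classify):
    """Return {name: (category, origin)}, preferring the database over the fallback."""
    annotations = {}
    for name in inventory:
        key = name.strip().lower()  # normalize before lookup
        if key in db:
            annotations[name] = (db[key], "database")
        else:
            annotations[name] = (fallback(key), "fallback")
    return annotations


def dominant_categories(annotations):
    """Count chemicals per category, most frequent first."""
    return Counter(category for category, _ in annotations.values()).most_common()


if __name__ == "__main__":
    result = annotate(["Compound_A", "compound_a ", "Compound_B", "compound_x"])
    for category, count in dominant_categories(result):
        print(category, count)
```

Recording the origin of each label ("database" or "fallback") keeps looked-up annotations separate from inferred ones, so the less certain inferred labels can be reviewed on their own.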

Created:

2026-04-09
