Leveraging Large Language Models for Contextual Prioritization of Contaminants of Emerging Concern in Chemical Mixtures
Source: NIAID Data Ecosystem | Indexed: 2026-05-10
Download link:
https://figshare.com/articles/dataset/Leveraging_Large_Language_Models_for_Contextual_Prioritization_of_Contaminants_of_Emerging_Concern_in_Chemical_Mixtures/31971918
Description:
Effective management of chemical mixtures presents a continuing
challenge due to the growing diversity and inadequate characterization
of contaminants of emerging concern (CECs). While recent advances
in nontarget analysis enable the generation of extensive chemical
inventories, key bottlenecks have shifted to postidentification interpretation
within heterogeneous data. Here, we present an agent-based workflow
that integrates large language models (LLMs) with functional categories,
potential sources, and toxicology information to support risk prioritization.
We established the practical technical components and evaluation benchmarks
for LLMs, showing that optimized prompts, combined with the best-performing
of seven candidate models (GPT-4-Turbo), brought responses into full
alignment with user context. Integrating real-world data through retrieval-augmented
generation enabled us to retrieve 100% truthful content, and further
fine-tuning nearly doubled response consistency, substantially reducing
hallucination. The workflow was validated using two mixture scenarios
to assess the applicability across matrices and chemical contexts.
The agent enabled complete functional and source annotation of chemicals
by querying the NORMAN Network and achieved ∼85% accuracy for
substances absent from existing databases by emulating NORMAN-aligned
logic. This capability allowed mixture-level interpretation of the chemical
inventory, revealing dominant categories and industrial sources, such
as lubricants in shale gas flowback produced water and semiconductor-related
industrial intermediates, which contributed to elevated risks in the
studied scenarios.
Created:
2026-04-09



