Large Language Models in Materials Science: Assessing RAG Evaluation Frameworks through graphene synthesis

Indexed by NIAID Data Ecosystem on 2026-05-10

Download link:

https://data.mendeley.com/datasets/ry7phxn4js

Description:

Retrieval-Augmented Generation (RAG) systems increasingly support scientific research, yet evaluating their performance in specialized domains remains challenging because scientific knowledge is technically complex and demands precision. This study presents the first systematic analysis of automated evaluation frameworks for scientific RAG systems, using graphene synthesis in materials science as a representative case study. We develop a comprehensive evaluation protocol comparing four assessment approaches across 20 domain-specific questions: RAGAS (an automated RAG evaluation framework), BERTScore, LLM-as-a-Judge, and expert human evaluation. Our analysis of the automated evaluators reveals that BERTScore lacks the interpretability and score sensitivity required to distinguish meaningful performance differences, while LLM-as-a-Judge fails to capture the benefits of retrieval augmentation. In contrast, RAGAS captures the relative performance improvements from retrieval augmentation, identifying gains in RAG-augmented systems (a 0.52-point improvement for Gemini and a 1.03-point improvement for Qwen on a 10-point scale) and showing particular sensitivity to retrieval benefits in smaller, open-source models. However, RAGAS still exhibits fundamental limitations in absolute score interpretation for scientific content. These findings establish methodological guidelines for scientific RAG evaluation and highlight critical considerations for researchers deploying AI systems in specialized domains.
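The gains reported above are score differences on a 10-point scale. The sketch below shows one way such a gain could be computed from per-question scores, assuming it is the mean difference between RAG-augmented and baseline scores over the same questions. The description does not state the exact aggregation, and the function name `mean_gain` and the sample scores are hypothetical, not taken from the dataset.

```python
from statistics import mean


def mean_gain(baseline, augmented):
    """Mean per-question score difference (augmented minus baseline).

    Assumption: both lists hold scores on the same scale (e.g. 0-10)
    for the same questions, in the same order.
    """
    if len(baseline) != len(augmented):
        raise ValueError("score lists must cover the same questions")
    return mean(a - b for b, a in zip(baseline, augmented))


# Hypothetical scores for four questions; NOT values from the dataset.
baseline_scores = [6.0, 5.5, 7.0, 6.5]
rag_scores = [7.0, 6.0, 7.5, 8.0]

gain = mean_gain(baseline_scores, rag_scores)
print(f"mean gain from retrieval augmentation: {gain:.2f} points")
```

A positive value means the RAG-augmented system scored higher on average. A per-question difference is used here instead of a difference of overall means so that a question missing from either list is caught by the length check.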

Created:

2026-02-10
