YESciEval Corpus
Source: DataCite Commons (updated 2025-05-28, indexed 2026-05-03)
Download link: https://data.uni-hannover.de/dataset/4dbd7b0c-ca2b-4f10-b1d7-50c2d4b32de2
Description:
**YESciEval** is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (ScienceQ&A). It features multi-domain and biomedical question-answering instances with both standard and adversarial variants. The dataset is part of the [YESciEval framework](https://github.com/sciknoworg/YESciEval), developed to support robust, transparent, and scalable evaluation using open-source LLMs.
---
## 🔍 Overview
YESciEval provides:
- **ScienceQ&A datasets** generated using open-source LLMs
- **Adversarial variants** designed using fine-grained rubric-based heuristics
- **Evaluation scores** from multiple LLMs acting as evaluators (LLM-as-a-judge)
---
## 📂 Dataset Structure
The dataset is organized into two main parts:
### 1. **Benign (Original) ScienceQ&A Data**
Synthesized answers to research questions based on abstracts from relevant papers.
- Sources:
  - **ORKGSyn**: Multidisciplinary questions from the Open Research Knowledge Graph
  - **BioASQ**: Biomedical questions from the BioASQ challenge
- Format: For each Q&A instance:
  - `question`: research question
  - `abstracts`: relevant paper abstracts
  - `answer`: LLM-generated synthesis
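
The three fields listed above can be modeled as a small record type. This is a minimal sketch, assuming each instance is stored as a JSON object with exactly these keys and that `abstracts` is a list of strings. The names `QAInstance` and `parse_instance` and the sample values are hypothetical, and the actual file layout in the repository may differ.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class QAInstance:
    """One benign ScienceQ&A instance (hypothetical container, not shipped with the dataset)."""
    question: str         # research question
    abstracts: List[str]  # relevant paper abstracts (assumed: list of strings)
    answer: str           # LLM-generated synthesis


def parse_instance(raw: str) -> QAInstance:
    """Parse one JSON object into a QAInstance, failing loudly on missing keys."""
    obj = json.loads(raw)
    missing = {"question", "abstracts", "answer"} - set(obj)
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")
    return QAInstance(obj["question"], list(obj["abstracts"]), obj["answer"])


# Made-up sample record, for illustration only.
sample = json.dumps({
    "question": "What methods are used for X?",
    "abstracts": ["Abstract of paper A.", "Abstract of paper B."],
    "answer": "Papers A and B both address X.",
})
instance = parse_instance(sample)
print(instance.question, len(instance.abstracts))
```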
### 2. **Adversarial ScienceQ&A Data**
Each benign answer is perturbed with two types of adversarial modifications:
- **Subtle Perturbations**: Realistic, lightweight errors designed to be difficult for models to detect
- **Extreme Perturbations**: Significant modifications that should be easily identifiable by robust evaluators
Perturbations target nine qualitative rubrics:
- Cohesion
- Conciseness
- Readability
- Coherence
- Integration
- Relevancy
- Correctness
- Completeness
- Informativeness
Each rubric has a defined subtle and extreme perturbation heuristic.
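
Since each of the nine rubrics has one subtle and one extreme heuristic, the perturbation space can be enumerated as rubric/severity pairs. The sketch below only enumerates those pairs. It does not implement the heuristics themselves, and the names `RUBRICS`, `SEVERITIES`, and `perturbation_grid` are hypothetical.

```python
from itertools import product

# The nine rubrics listed above.
RUBRICS = [
    "Cohesion", "Conciseness", "Readability",
    "Coherence", "Integration", "Relevancy",
    "Correctness", "Completeness", "Informativeness",
]

# The two perturbation severities described above.
SEVERITIES = ["subtle", "extreme"]


def perturbation_grid():
    """Return every (rubric, severity) pair, one per defined heuristic."""
    return list(product(RUBRICS, SEVERITIES))


grid = perturbation_grid()
print(len(grid))  # 9 rubrics x 2 severities = 18
```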
---
## 🧪 Evaluation Outputs
Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records:
- A 1–5 Likert score for each rubric
- A rationale for the score
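
A consumer of these files may want to sanity-check each record before aggregating scores. The following is a minimal sketch, assuming one JSON object per response that maps each rubric name to an object with `score` and `rationale` keys. That layout and those key names are assumptions made for illustration, so check the released files for the actual schema.

```python
import json

RUBRICS = (
    "Cohesion", "Conciseness", "Readability",
    "Coherence", "Integration", "Relevancy",
    "Correctness", "Completeness", "Informativeness",
)


def validate_evaluation(raw: str) -> dict:
    """Check that every rubric has an integer 1-5 score and a non-empty rationale."""
    record = json.loads(raw)
    for rubric in RUBRICS:
        entry = record[rubric]  # raises KeyError if a rubric is missing
        score = entry["score"]
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{rubric}: score must be an integer from 1 to 5, got {score!r}")
        if not str(entry["rationale"]).strip():
            raise ValueError(f"{rubric}: rationale is empty")
    return record


# Made-up record under the assumed layout, for illustration only.
sample = json.dumps(
    {rubric: {"score": 4, "rationale": "placeholder rationale"} for rubric in RUBRICS}
)
record = validate_evaluation(sample)
mean_score = sum(record[r]["score"] for r in RUBRICS) / len(RUBRICS)
print(mean_score)
```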
---
## 📊 Statistics
### ORKGSyn (33 disciplines)
- **Benign**: 348 Q&A pairs
- **Subtle Adversarial**: 348 Q&A pairs
- **Extreme Adversarial**: 348 Q&A pairs
### BioASQ (Biomedical)
- **Benign**: 73 Q&A pairs
- **Subtle Adversarial**: 73 Q&A pairs
- **Extreme Adversarial**: 73 Q&A pairs
Total evaluations: ~45,000 across models and variants.
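
The per-variant counts above add up as follows. This sketch only sums the figures stated in this section. It makes no assumption about how the ~45,000 evaluations break down across evaluator models, and the dictionary layout is hypothetical.

```python
# Q&A pair counts as stated in the Statistics section.
COUNTS = {
    "ORKGSyn": {"benign": 348, "subtle": 348, "extreme": 348},
    "BioASQ": {"benign": 73, "subtle": 73, "extreme": 73},
}

# Total Q&A pairs per source, summed over the three variants.
per_source = {source: sum(variants.values()) for source, variants in COUNTS.items()}

# Total Q&A pairs across both sources and all variants.
total = sum(per_source.values())

print(per_source)  # {'ORKGSyn': 1044, 'BioASQ': 219}
print(total)       # 1263
```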
---
## 🗃️ Access
The dataset is also released on the [YESciEval GitHub repository](https://github.com/sciknoworg/YESciEval/tree/main/experiments/dataset).
A dedicated repository such as this one, containing only the dataset files, simplifies integration into benchmarking pipelines.
---
## 📜 Citation
If you use this dataset, please cite the following:
> D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). **YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering**. *Proceedings of ACL 2025*. [Preprint](https://github.com/sciknoworg/YESciEval)
---
## 🛠️ License
This dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/deed.en).
---
## 🙋‍♀️ Questions?
For questions or collaborations, contact [Jennifer D’Souza](mailto:jennifer.dsouza@tib.eu).
Provider: LUIS
Created: 2025-05-28



