
YESciEval Corpus

Source: DataCite Commons. Updated 2025-05-28; indexed 2026-05-03.
Download link:
https://data.uni-hannover.de/dataset/4dbd7b0c-ca2b-4f10-b1d7-50c2d4b32de2
Description:
**YESciEval** is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (scienceQ&A). It features multi-domain and biomedical question-answering instances with both standard and adversarial variants. The dataset is part of the [YESciEval framework](https://github.com/sciknoworg/YESciEval), developed to support robust, transparent, and scalable evaluation using open-source LLMs.

---

## 🔍 Overview

YESciEval provides:

- **ScienceQ&A datasets** generated using open-source LLMs
- **Adversarial variants** designed using fine-grained rubric-based heuristics
- **Evaluation scores** from multiple LLMs acting as evaluators (LLM-as-a-judge)

---

## 📂 Dataset Structure

The dataset is organized into two main parts:

### 1. **Benign (Original) ScienceQ&A Data**

Synthesized answers to research questions based on abstracts from relevant papers.

- Sources:
  - **ORKGSyn**: Multidisciplinary questions from the Open Research Knowledge Graph
  - **BioASQ**: Biomedical questions from the BioASQ challenge
- Format: For each Q&A instance:
  - `question`: research question
  - `abstracts`: relevant paper abstracts
  - `answer`: LLM-generated synthesis

### 2. **Adversarial ScienceQ&A Data**

Each benign answer is perturbed with two types of adversarial modifications:

- **Subtle Perturbations**: Realistic, lightweight errors designed to be difficult for models to detect
- **Extreme Perturbations**: Significant modifications that should be easily identifiable by robust evaluators

Perturbations target nine qualitative rubrics:

- Cohesion
- Conciseness
- Readability
- Coherence
- Integration
- Relevancy
- Correctness
- Completeness
- Informativeness

Each rubric has a defined subtle and extreme perturbation heuristic.

---

## 🧪 Evaluation Outputs

Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators.
For each response, a JSON file records:

- A 1–5 Likert score for each rubric
- A rationale for the score

---

## 📊 Statistics

### ORKGSyn (33 disciplines)

- **Benign**: 348 Q&A pairs
- **Subtle Adversarial**: 348 Q&A pairs
- **Extreme Adversarial**: 348 Q&A pairs

### BioASQ (Biomedical)

- **Benign**: 73 Q&A pairs
- **Subtle Adversarial**: 73 Q&A pairs
- **Extreme Adversarial**: 73 Q&A pairs

Total evaluations: approximately 45,000 across models and variants.

---

## 🗃️ Access

The dataset is also released on the [YESciEval GitHub repository](https://github.com/sciknoworg/YESciEval/tree/main/experiments/dataset). A dedicated repository such as this one, containing only the dataset files, can be used to simplify integration into benchmarking pipelines.

---

## 📜 Citation

If you use this dataset, please cite the following:

> D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). **YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering**. *Proceedings of ACL 2025*. [Preprint](https://github.com/sciknoworg/YESciEval)

---

## 🛠️ License

This dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/deed.en).

---

## 🙋‍♀️ Questions?

For questions or collaborations, contact [Jennifer D’Souza](mailto:jennifer.dsouza@tib.eu).
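
As an illustration of how the evaluation outputs described above might be consumed, the following Python sketch parses one evaluation record and computes the mean per-rubric score drop between benign and adversarial answers. The dataset card only states that each JSON file holds a 1–5 Likert score and a rationale per rubric, so the exact layout used here (one object per rubric with hypothetical `rating` and `rationale` keys) and the helper names `parse_evaluation` and `mean_score_drop` are assumptions; check the released files and adjust the keys before relying on it.

```python
import json
from statistics import mean

# The nine rubrics named in the dataset description.
RUBRICS = [
    "Cohesion", "Conciseness", "Readability", "Coherence", "Integration",
    "Relevancy", "Correctness", "Completeness", "Informativeness",
]


def parse_evaluation(raw: str) -> dict[str, int]:
    """Parse one evaluation record and return {rubric: score}.

    ASSUMED layout (not confirmed by the dataset card):
    {"Cohesion": {"rating": 4, "rationale": "..."}, ...}
    """
    record = json.loads(raw)
    scores = {}
    for rubric in RUBRICS:
        score = int(record[rubric]["rating"])
        if not 1 <= score <= 5:
            raise ValueError(f"{rubric}: score {score} is outside the 1-5 Likert range")
        scores[rubric] = score
    return scores


def mean_score_drop(benign: list[dict[str, int]],
                    adversarial: list[dict[str, int]]) -> dict[str, float]:
    """Mean per-rubric drop (benign minus adversarial) over paired records."""
    if len(benign) != len(adversarial):
        raise ValueError("benign and adversarial lists must be paired one-to-one")
    return {
        rubric: mean(b[rubric] - a[rubric] for b, a in zip(benign, adversarial))
        for rubric in RUBRICS
    }


# Made-up records for demonstration only; these are not values from the dataset.
benign_raw = json.dumps({r: {"rating": 4, "rationale": "example"} for r in RUBRICS})
adversarial_raw = json.dumps(
    {r: {"rating": 2 if r == "Correctness" else 4, "rationale": "example"} for r in RUBRICS}
)
drops = mean_score_drop([parse_evaluation(benign_raw)], [parse_evaluation(adversarial_raw)])
print(drops)
```

A robust evaluator would be expected to show a clear positive drop on the rubric targeted by a perturbation, and a larger drop for extreme than for subtle variants; a drop near zero suggests the perturbation went undetected.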
Provider:
LUIS
Created:
2025-05-28