SciDQA
ModelScope community. Updated 2025-12-05; indexed 2025-02-01.
Download link:
https://modelscope.cn/datasets/yale-nlp/SciDQA
Description:
# SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers
📄 [Paper](https://arxiv.org/pdf/2411.05338) | 💻 [Code](https://github.com/yale-nlp/SciDQA)

Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new reading comprehension dataset of 2,937 QA pairs that challenges LLMs to deeply understand scientific articles. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors, ensuring a thorough examination of the literature. We enhance the dataset's quality through a process that carefully filters out lower-quality questions, decontextualizes the content, tracks the source document across different versions, and incorporates a bibliography for multi-document question answering. Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials, and require multi-document reasoning.
We evaluate several open-source and proprietary LLMs across various configurations to explore their capabilities in generating relevant and factual responses. Our comprehensive evaluation, based on metrics for surface-level similarity and LLM judgements, highlights notable performance discrepancies. SciDQA represents a rigorously curated, naturally derived scientific QA dataset, designed to facilitate research on complex scientific text understanding.
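
To make "surface-level similarity" concrete, the following is a minimal sketch of a token-overlap F1 score between a generated answer and a reference answer. It is an illustrative stand-in and not necessarily one of the metrics used in the paper; the function name `token_f1` and the lowercase, whitespace-based tokenization are assumptions made for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(count_pred, count_ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration only; not taken from SciDQA.
    score = token_f1("the model uses attention", "the model uses self attention")
    print(f"token F1: {score:.3f}")
```

A score like this rewards shared wording only, which is why the evaluation described above pairs surface-level metrics with LLM judgements.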
### Licence
Open Data Commons Attribution License (ODC-By) v1.0
### How to use the dataset
#### Setting up the repo:
`git clone https://github.com/yale-nlp/SciDQA.git`
`conda create -n scidqa python=3.11`
`conda activate scidqa`
`pip install -r requirements.txt`
#### Usage:
To use the QA dataset, load it as a pandas DataFrame:
```
import pandas as pd
scidqa_df = pd.read_excel('src/data/scidqa.xlsx')
print(scidqa_df.columns)
```
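
Once the spreadsheet is loaded, a quick sanity check is to count QA pairs per paper. The sketch below assumes only that the DataFrame has a `pid` column (the column used in the snippets in this README); the helper name `qa_pairs_per_paper`, the toy ids, and the `question` column are hypothetical and exist only so the example runs without the real file.

```python
import pandas as pd


def qa_pairs_per_paper(df: pd.DataFrame, id_col: str = "pid") -> pd.Series:
    """Return the number of QA rows for each paper id, largest count first."""
    if id_col not in df.columns:
        raise KeyError(f"expected a '{id_col}' column, found {list(df.columns)}")
    # value_counts sorts by count in descending order by default.
    return df[id_col].value_counts()


if __name__ == "__main__":
    # Toy stand-in for the real spreadsheet; the ids and 'question' column are made up.
    toy_df = pd.DataFrame(
        {"pid": ["p1", "p1", "p2"], "question": ["q-a", "q-b", "q-c"]}
    )
    counts = qa_pairs_per_paper(toy_df)
    print(counts)
    print("papers:", counts.size, "QA pairs:", int(counts.sum()))
```

With the real data, pass `scidqa_df` in place of `toy_df`.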
The paper metadata (title and abstract) is present in `src/data/relevant_ptabs.pkl` and can be used as follows:
```
import pickle
paper_id = scidqa_df['pid'][0]
with open('src/data/relevant_ptabs.pkl', 'rb') as fp:
    papers_tabs = pickle.load(fp)
print('Paper title: ', papers_tabs[paper_id]['title'])
print('Paper abstract: ', papers_tabs[paper_id]['abs'])
```
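
When iterating over many QA rows, a small lookup helper avoids a `KeyError` if an id is missing from the metadata. This is a sketch under the assumption that the pickle holds a mapping of paper id to a dict with `title` and `abs` keys, as the snippet above suggests; the helper name `get_title_and_abstract` and the toy entry are hypothetical.

```python
from typing import Mapping, Optional, Tuple


def get_title_and_abstract(
    papers_tabs: Mapping[str, Mapping[str, str]], paper_id: str
) -> Optional[Tuple[str, str]]:
    """Return (title, abstract) for paper_id, or None if the id is absent."""
    entry = papers_tabs.get(paper_id)
    if entry is None:
        return None
    return entry["title"], entry["abs"]


if __name__ == "__main__":
    # Toy metadata mimicking the assumed layout; not real SciDQA entries.
    toy_tabs = {"p1": {"title": "An Example Title", "abs": "An example abstract."}}
    print(get_title_and_abstract(toy_tabs, "p1"))
    print(get_title_and_abstract(toy_tabs, "missing-id"))
```

As with any pickle file, load `relevant_ptabs.pkl` only from a source you trust, since unpickling can execute arbitrary code.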
To use the full text of the papers behind the QA pairs, load the `src/data/papers_fulltext_nougat.pkl` file. It can be used as follows:
```
import pickle
paper_id = scidqa_df['pid'][0]
with open('src/data/papers_fulltext_nougat.pkl', 'rb') as fp:
    paper_fulltext_dict = pickle.load(fp)
print("Full text of the manuscript at submission:\n", paper_fulltext_dict['initial'][paper_id])
print("Full text of the camera-ready manuscript:\n", paper_fulltext_dict['final'][paper_id])
```
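
Because the full-text pickle keeps two versions of each manuscript, it can be convenient to choose one with a fallback. The sketch below assumes the layout shown above (`'initial'` and `'final'` keys, each mapping paper id to text) and, as a further assumption, that a paper might be missing from one of the versions; the helper name `get_fulltext` and the toy data are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def get_fulltext(
    fulltext: Mapping[str, Mapping[str, str]],
    paper_id: str,
    preferred: Sequence[str] = ("final", "initial"),
) -> Optional[str]:
    """Return the first non-empty text for paper_id, trying versions in order."""
    for version in preferred:
        text = fulltext.get(version, {}).get(paper_id)
        if text:
            return text
    return None


if __name__ == "__main__":
    # Toy data mimicking the assumed layout; not real SciDQA content.
    toy_fulltext = {
        "initial": {"p1": "submitted draft", "p2": "only a draft"},
        "final": {"p1": "camera-ready text"},
    }
    print(get_fulltext(toy_fulltext, "p1"))
    print(get_fulltext(toy_fulltext, "p2"))
    print(get_fulltext(toy_fulltext, "p1", preferred=("initial",)))
```

Whether the submitted or the camera-ready version is the right context for a given question is a choice for the experiment; the `preferred` order above is only a default for the sketch.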
SciDQA data can be used directly from [HF](https://huggingface.co/datasets/yale-nlp/SciDQA) as follows:
```
from datasets import load_dataset
scidqa = load_dataset("yale-nlp/SciDQA")
```
### Citation
```
@inproceedings{singh-etal-2024-scidqa,
title = "{S}ci{DQA}: A Deep Reading Comprehension Dataset over Scientific Papers",
author = "Singh, Shruti and
Sarkar, Nandan and
Cohan, Arman",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1163",
doi = "10.18653/v1/2024.emnlp-main.1163",
pages = "20908--20923",
abstract = "Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new dataset for reading comprehension that challenges language models to deeply understand scientific articles, consisting of 2,937 QA pairs. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors, ensuring a thorough examination of the literature. We enhance the dataset{'}s quality through a process that carefully decontextualizes the content, tracks the source document across different versions, and incorporates a bibliography for multi-document question-answering. Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials, and require multi-document reasoning. We evaluate several open-source and proprietary LLMs across various configurations to explore their capabilities in generating relevant and factual responses, as opposed to simple review memorization. Our comprehensive evaluation, based on metrics for surface-level and semantic similarity, highlights notable performance discrepancies. SciDQA represents a rigorously curated, naturally derived scientific QA dataset, designed to facilitate research on complex reasoning within the domain of question answering for scientific texts.",
}
```
Provider:
maas
Created:
2025-01-29



