qasper

Name: qasper
Creator: maas
Published: 2025-12-05 16:36:13
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/allenai/qasper

Resource description:

# Dataset Card for Qasper

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
    - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
    - [Who are the source language producers?](#who-are-the-source-language-producers)
  - [Annotations](#annotations)
    - [Annotation process](#annotation-process)
    - [Who are the annotators?](#who-are-the-annotators)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://allenai.org/data/qasper](https://allenai.org/data/qasper)
- **Demo:** [https://qasper-demo.apps.allenai.org/](https://qasper-demo.apps.allenai.org/)
- **Paper:** [https://arxiv.org/abs/2105.03011](https://arxiv.org/abs/2105.03011)
- **Blogpost:** [https://medium.com/ai2-blog/question-answering-on-scientific-research-papers-f6d6da9fd55c](https://medium.com/ai2-blog/question-answering-on-scientific-research-papers-f6d6da9fd55c)
- **Leaderboards:** [https://paperswithcode.com/dataset/qasper](https://paperswithcode.com/dataset/qasper)

### Dataset Summary

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers.

### Supported Tasks and Leaderboards

- `question-answering`: The dataset can be used to train a model for Question Answering. Success on this task is typically measured by achieving a *high* [F1 score](https://huggingface.co/metrics/f1). The [official baseline model](https://github.com/allenai/qasper-led-baseline) currently achieves 33.63 Token F1 score & uses [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). This task has an active leaderboard which can be found [here](https://paperswithcode.com/sota/question-answering-on-qasper).
- `evidence-selection`: The dataset can be used to train a model for Evidence Selection. Success on this task is typically measured by achieving a *high* [F1 score](https://huggingface.co/metrics/f1). The [official baseline model](https://github.com/allenai/qasper-led-baseline) currently achieves 39.37 F1 score & uses [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). This task has an active leaderboard which can be found [here](https://paperswithcode.com/sota/evidence-selection-on-qasper).

### Languages

English, as it is used in research papers.
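
The Token F1 score cited under Supported Tasks can be illustrated with a minimal sketch. This is a simplified token-overlap F1 in the style commonly used for extractive QA, not the official Qasper evaluator: the whitespace tokenization, the lowercasing, and the choice to take the maximum over several reference answers are all assumptions made for illustration, and the function names `token_f1` and `best_f1` are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed: lowercase, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; exactly one empty string is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Assumed convention: score against every reference and keep the best."""
    return max(token_f1(prediction, ref) for ref in references)
```

For example, `token_f1("a b c", "a b")` has precision 2/3 and recall 1, which gives an F1 of 0.8. Scores reported on the leaderboard should be reproduced with the official baseline code rather than with this sketch.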
## Dataset Structure

### Data Instances

A typical instance in the dataset:

```
{
  'id': "Paper ID (string)",
  'title': "Paper Title",
  'abstract': "paper abstract ...",
  'full_text': {
    'paragraphs': [["section1_paragraph1_text", "section1_paragraph2_text", ...], ["section2_paragraph1_text", "section2_paragraph2_text", ...]],
    'section_name': ["section1_title", "section2_title"], ...
  },
  'qas': {
    'answers': [
      {
        'annotation_id': ["q1_answer1_annotation_id", "q1_answer2_annotation_id"],
        'answer': [
          {
            'unanswerable': False,
            'extractive_spans': ["q1_answer1_extractive_span1", "q1_answer1_extractive_span2"],
            'yes_no': False,
            'free_form_answer': "q1_answer1",
            'evidence': ["q1_answer1_evidence1", "q1_answer1_evidence2", ..],
            'highlighted_evidence': ["q1_answer1_highlighted_evidence1", "q1_answer1_highlighted_evidence2", ..]
          },
          {
            'unanswerable': False,
            'extractive_spans': ["q1_answer2_extractive_span1", "q1_answer2_extractive_span2"],
            'yes_no': False,
            'free_form_answer': "q1_answer2",
            'evidence': ["q1_answer2_evidence1", "q1_answer2_evidence2", ..],
            'highlighted_evidence': ["q1_answer2_highlighted_evidence1", "q1_answer2_highlighted_evidence2", ..]
          }
        ],
        'worker_id': ["q1_answer1_worker_id", "q1_answer2_worker_id"]
      },
      {...["question2's answers"]..},
      {...["question3's answers"]..}
    ],
    'question': ["question1", "question2", "question3"...],
    'question_id': ["question1_id", "question2_id", "question3_id"...],
    'question_writer': ["question1_writer_id", "question2_writer_id", "question3_writer_id"...],
    'nlp_background': ["question1_writer_nlp_background", "question2_writer_nlp_background", ...],
    'topic_background': ["question1_writer_topic_background", "question2_writer_topic_background", ...],
    'paper_read': ["question1_writer_paper_read_status", "question2_writer_paper_read_status", ...],
    'search_query': ["question1_search_query", "question2_search_query", "question3_search_query"...],
  }
}
```

### Data Fields

The following is an excerpt from the dataset README:

Within "qas", some fields should be obvious.
Here is some explanation about the others:

#### Fields specific to questions:

- "nlp_background" shows the experience the question writer had. The values can be "zero" (no experience), "two" (0 - 2 years of experience), "five" (2 - 5 years of experience), and "infinity" (> 5 years of experience). The field may be empty as well, indicating the writer has chosen not to share this information.
- "topic_background" shows how familiar the question writer was with the topic of the paper. The values are "unfamiliar", "familiar", "research" (meaning that the topic is the research area of the writer), or null.
- "paper_read", when specified, shows whether the question writer has read the paper.
- "search_query", if not empty, is the query the question writer used to find the abstract of the paper from a large pool of abstracts we made available to them.

#### Fields specific to answers

Unanswerable answers have "unanswerable" set to true. The remaining answers have exactly one of the following fields being non-empty.

- "extractive_spans" are spans in the paper which serve as the answer.
- "free_form_answer" is a written out answer.
- "yes_no" is true iff the answer is Yes, and false iff the answer is No.

"evidence" is the set of paragraphs, figures or tables used to arrive at the answer. Tables or figures start with the string "FLOAT SELECTED".

"highlighted_evidence" is the set of sentences the answer providers selected as evidence if they chose textual evidence. The text in the "evidence" field is a mapping from these sentences to the paragraph level. That is, if you see textual evidence in the "evidence" field, it is guaranteed to be entire paragraphs, while that is not the case with "highlighted_evidence".
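
The field descriptions above can be turned into a small sketch that walks the column-style "qas" field, classifies each answer, and separates paragraph evidence from table or figure evidence. The helper names (`answer_type`, `split_evidence`, `iter_answers`) and the toy record are hypothetical and not part of the dataset or its tooling. The sketch also assumes that an answer with empty "extractive_spans" and an empty "free_form_answer" is a yes/no answer, which follows from the "exactly one non-empty field" rule but is not checked against real data here.

```python
def answer_type(answer: dict) -> str:
    """Classify one answer dict using the fields described above."""
    if answer["unanswerable"]:
        return "unanswerable"
    if answer["extractive_spans"]:
        return "extractive"
    if answer["free_form_answer"]:
        return "free_form"
    # Assumption: with the other fields empty, `yes_no` carries the answer.
    return "yes_no"


def split_evidence(evidence: list[str]) -> tuple[list[str], list[str]]:
    """Separate paragraph evidence from table/figure ("FLOAT SELECTED") evidence."""
    floats = [e for e in evidence if e.startswith("FLOAT SELECTED")]
    paragraphs = [e for e in evidence if not e.startswith("FLOAT SELECTED")]
    return paragraphs, floats


def iter_answers(paper: dict):
    """Yield (question, answer dict) pairs; question i pairs with answers[i]."""
    qas = paper["qas"]
    for question, answers in zip(qas["question"], qas["answers"]):
        for answer in answers["answer"]:
            yield question, answer


# Hypothetical toy record for illustration only, not taken from the dataset.
toy_paper = {
    "qas": {
        "question": ["Which corpus is used?", "Is the code released?"],
        "answers": [
            {"answer": [{
                "unanswerable": False,
                "extractive_spans": ["the toy corpus"],
                "yes_no": None,
                "free_form_answer": "",
                "evidence": ["We train on the toy corpus.", "FLOAT SELECTED: Table 1"],
                "highlighted_evidence": ["We train on the toy corpus."],
            }]},
            {"answer": [{
                "unanswerable": False,
                "extractive_spans": [],
                "yes_no": True,
                "free_form_answer": "",
                "evidence": [],
                "highlighted_evidence": [],
            }]},
        ],
    }
}

for question, answer in iter_answers(toy_paper):
    paragraphs, floats = split_evidence(answer["evidence"])
    print(question, answer_type(answer), len(paragraphs), len(floats))
```

Checking "unanswerable" first keeps the remaining branches simple, since every other answer is expected to populate exactly one of the three answer fields.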
### Data Splits

|                     | Train | Valid |
| ------------------- | ----- | ----- |
| Number of papers    | 888   | 281   |
| Number of questions | 2593  | 1005  |
| Number of answers   | 2675  | 1764  |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

NLP papers: The full text of the papers is extracted from [S2ORC](https://huggingface.co/datasets/s2orc) (Lo et al., 2020)

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

"The annotators are NLP practitioners, not expert researchers, and it is likely that an expert would score higher"

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Crowdsourced NLP practitioners

### Licensing Information

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0)

### Citation Information

```
@inproceedings{Dasigi2021ADO,
  title={A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers},
  author={Pradeep Dasigi and Kyle Lo and Iz Beltagy and Arman Cohan and Noah A. Smith and Matt Gardner},
  year={2021}
}
```

### Contributions

Thanks to [@cceyda](https://github.com/cceyda) for adding this dataset.


Provider:

maas

Created:

2025-05-27

Collection summary

Dataset introduction