ScienceQA
Source: ModelScope community (updated 2026-05-15, indexed 2024-06-08)
Download link: https://modelscope.cn/datasets/AI-ModelScope/ScienceQA
Description:
# Dataset Card for ScienceQA
## Table of Contents
- [Dataset Card for ScienceQA](#dataset-card-for-scienceqa)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://scienceqa.github.io/index.html#home](https://scienceqa.github.io/index.html#home)
- **Repository:** [https://github.com/lupantech/ScienceQA](https://github.com/lupantech/ScienceQA)
- **Paper:** [https://arxiv.org/abs/2209.09513](https://arxiv.org/abs/2209.09513)
- **Leaderboard:** [https://paperswithcode.com/dataset/scienceqa](https://paperswithcode.com/dataset/scienceqa)
- **Point of Contact:** [Pan Lu](https://lupantech.github.io/) or file an issue on [GitHub](https://github.com/lupantech/ScienceQA/issues)
### Dataset Summary
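ScienceQA is a multimodal, multiple-choice science question answering benchmark introduced in the paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". It contains 21,208 questions across its train, validation, and test splits, many of them annotated with a lecture and a solution.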
### Supported Tasks and Leaderboards
Multi-modal Multiple Choice
### Languages
English
## Dataset Structure
### Data Instances
Explore more samples [here](https://scienceqa.github.io/explore.html).
```python
{'image': Image,
'question': 'Which of these states is farthest north?',
'choices': ['West Virginia', 'Louisiana', 'Arizona', 'Oklahoma'],
'answer': 0,
'hint': '',
'task': 'closed choice',
'grade': 'grade2',
'subject': 'social science',
'topic': 'geography',
'category': 'Geography',
'skill': 'Read a map: cardinal directions',
'lecture': 'Maps have four cardinal directions, or main directions. Those directions are north, south, east, and west.\nA compass rose is a set of arrows that point to the cardinal directions. A compass rose usually shows only the first letter of each cardinal direction.\nThe north arrow points to the North Pole. On most maps, north is at the top of the map.',
'solution': 'To find the answer, look at the compass rose. Look at which way the north arrow is pointing. West Virginia is farthest north.'}
```
Some records may lack any or all of the `image`, `lecture`, and `solution` fields.
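As a minimal sketch of how a record like the one above might be consumed, the snippet below renders it as a lettered multiple-choice prompt and resolves `answer` to its option text. The helper names `format_question` and `correct_choice` are hypothetical and not part of the dataset or its repository; the sketch assumes `answer` is a zero-based index, as the sample instance suggests.

```python
from string import ascii_uppercase


def format_question(record: dict) -> str:
    """Render a ScienceQA-style record as a lettered multiple-choice prompt."""
    lines = []
    # `hint` can be an empty string, so include it only when it has content.
    if record.get("hint"):
        lines.append(f"Context: {record['hint']}")
    lines.append(f"Question: {record['question']}")
    for letter, choice in zip(ascii_uppercase, record["choices"]):
        lines.append(f"({letter}) {choice}")
    return "\n".join(lines)


def correct_choice(record: dict) -> str:
    """Return the text of the correct option (assumes a zero-based `answer`)."""
    return record["choices"][record["answer"]]


# Text fields copied from the sample instance; the image is omitted here.
example = {
    "question": "Which of these states is farthest north?",
    "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
    "answer": 0,
    "hint": "",
}

prompt = format_question(example)
print(prompt)
print("Answer:", correct_choice(example))
```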
### Data Fields
- `image` : Contextual image (may be missing)
- `question` : Question text, related to the `lecture`
- `choices` : List of answer options for the `question`, exactly one of which is correct
- `answer` : Zero-based index into `choices` of the correct answer
- `hint` : Hint to help answer the `question` (may be an empty string)
- `task` : Task type, e.g. `closed choice`
- `grade` : Grade level within K-12, e.g. `grade2`
- `subject` : High-level subject area: natural science, social science, or language science
- `topic` : A subcategory of `subject`, e.g. `geography`
- `category` : A subcategory of `topic`
- `skill` : A description of the skill the question tests
- `lecture` : A relevant lecture that the `question` is generated from
- `solution` : Explanation of how to solve the `question`
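The field list can be mirrored in a small typed container, which makes the constraints explicit. The sketch below is illustrative only: the class name `ScienceQARecord` and the method `missing_optional_fields` are hypothetical, and it assumes that missing fields appear as `None` or an empty string.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ScienceQARecord:
    """Hypothetical typed view of one record, following the field list above."""
    question: str
    choices: List[str]
    answer: int
    hint: str = ""
    task: str = ""
    grade: str = ""
    subject: str = ""
    topic: str = ""
    category: str = ""
    skill: str = ""
    lecture: str = ""
    solution: str = ""
    image: Optional[Any] = None  # assumed to be None when no image is attached

    def __post_init__(self) -> None:
        # Questions with a single choice were removed during curation.
        if len(self.choices) < 2:
            raise ValueError("a question needs at least two choices")
        if not 0 <= self.answer < len(self.choices):
            raise ValueError("answer must be a valid index into choices")

    def missing_optional_fields(self) -> List[str]:
        """Names of the optional fields that are absent or empty."""
        optional = ("image", "lecture", "solution")
        return [name for name in optional if not getattr(self, name)]


record = ScienceQARecord(
    question="Which of these states is farthest north?",
    choices=["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
    answer=0,
    subject="social science",
    topic="geography",
    category="Geography",
)
print(record.missing_optional_fields())
```

Validating in `__post_init__` keeps malformed records from propagating into training or evaluation code.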
### Data Splits
| Split | Size (bytes) | Examples |
|---|---|---|
| train | 16416902 | 12726 |
| validation | 5404896 | 4241 |
| test | 5441676 | 4241 |
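The split sizes above work out to roughly a 60/20/20 division. A quick check, using only the example counts listed in this section:

```python
# Example counts copied from the split listing above.
splits = {"train": 12726, "validation": 4241, "test": 4241}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total examples: {total}")  # total examples: 21208
for name, count in splits.items():
    print(f"{name}: {count} ({fractions[name]:.1%})")
# train: 12726 (60.0%)
# validation: 4241 (20.0%)
# test: 4241 (20.0%)
```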
## Dataset Creation
### Curation Rationale
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA).
### Source Data
ScienceQA is collected from elementary and high school science curricula.
#### Initial Data Collection and Normalization
See [Annotations](#annotations) below.
#### Who are the source language producers?
See [Annotations](#annotations) below.
### Annotations
Questions in the ScienceQA dataset are sourced from open resources managed by IXL Learning, an online learning platform curated by experts in K-12 education. The dataset includes problems that align with California Common Core Content Standards. To construct ScienceQA, we downloaded the original science problems and extracted their individual components (e.g. questions, hints, images, options, answers, lectures, and solutions) using heuristic rules.

We manually removed invalid questions, such as those with only one choice, those containing faulty data, and duplicates, to comply with fair use and transformative use under the law. Where multiple correct answers applied, we kept only one. We also shuffled the answer options of each question so that the choices do not follow any specific pattern. To make the dataset easy to use, we then used semi-automated scripts to reformat the lectures and solutions, so that special structures in the text, such as tables and lists, are easily distinguishable from plain text passages.

Similar to the ImageNet, ReClor, and PMR datasets, ScienceQA is available for non-commercial research purposes only, and the copyright belongs to the original authors. To ensure data quality, we developed a data exploration tool to review examples in the collected dataset, and experts manually revised incorrect annotations. The tool can be accessed at https://scienceqa.github.io/explore.html.
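The filtering and shuffling steps described above can be illustrated with a short sketch. This is not the authors' script: the function names, the fixed seed, and the definition of a duplicate (same question text and same set of options) are assumptions made for illustration.

```python
import random
from typing import Dict, Iterable, List


def shuffle_choices(record: Dict, rng: random.Random) -> Dict:
    """Shuffle the options and remap `answer` to the correct option's new position."""
    order = list(range(len(record["choices"])))
    rng.shuffle(order)
    shuffled = [record["choices"][i] for i in order]
    # Position j now holds the option that was at index order[j].
    return {**record, "choices": shuffled, "answer": order.index(record["answer"])}


def clean(records: Iterable[Dict], seed: int = 0) -> List[Dict]:
    """Drop single-choice and duplicate questions, then shuffle the options."""
    rng = random.Random(seed)
    seen = set()
    kept = []
    for record in records:
        if len(record["choices"]) < 2:
            continue  # invalid: only one choice
        # Assumed duplicate key: question text plus the set of options.
        key = (record["question"], tuple(sorted(record["choices"])))
        if key in seen:
            continue
        seen.add(key)
        kept.append(shuffle_choices(record, rng))
    return kept


raw = [
    {"question": "Which of these states is farthest north?",
     "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"], "answer": 0},
    {"question": "A question with one option", "choices": ["Only"], "answer": 0},
    {"question": "Which of these states is farthest north?",
     "choices": ["Louisiana", "West Virginia", "Arizona", "Oklahoma"], "answer": 1},
]
cleaned = clean(raw)
print(len(cleaned), cleaned[0]["choices"][cleaned[0]["answer"]])
```

Remapping the index through `order.index` keeps `answer` pointing at the same option text regardless of how the options are permuted.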
#### Annotation process
See above
#### Who are the annotators?
See above
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
- Pan Lu (1, 3)
- Swaroop Mishra (2, 3)
- Tony Xia (1)
- Liang Qiu (1)
- Kai-Wei Chang (1)
- Song-Chun Zhu (1)
- Oyvind Tafjord (3)
- Peter Clark (3)
- Ashwin Kalyan (3)

Affiliations:
1. University of California, Los Angeles
2. Arizona State University
3. Allen Institute for AI
### Licensing Information
[Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
### Citation Information
[BibTeX](http://www.bibtex.org/)-formatted reference for the dataset:
```
@inproceedings{lu2022learn,
title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
year={2022}
}
```
### Contributions
Thanks to [Derek Thomas](https://huggingface.co/derek-thomas) [@datavistics](https://github.com/datavistics) for adding this dataset.
Provider: maas
Created: 2024-05-09



