---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- cc
multilinguality:
- monolingual
pretty_name: semeval-task-7-2013
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- asag
- short-answer
- grading
- semantic-similarity
task_categories:
- text-classification
task_ids:
- natural-language-inference
dataset_info:
  features:
  - name: split
    dtype: string
  - name: classification_type
    dtype: string
  - name: corpus
    dtype: string
  - name: test_set
    dtype: string
  - name: question_qtype
    dtype: string
  - name: question_id
    dtype: string
  - name: question_module
    dtype: string
  - name: question_stype
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer_quality
    dtype: string
  - name: reference_answer_id
    dtype: string
  - name: reference_answer_file_id
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer_count
    dtype: float64
  - name: student_answer_match
    dtype: string
  - name: student_answer_id
    dtype: string
  - name: student_answer_label
    dtype: string
  - name: student_answer
    dtype: string
  - name: label_5way
    dtype: string
  splits:
  - name: test
    num_bytes: 11688998
    num_examples: 23656
  - name: train
    num_bytes: 23544814
    num_examples: 47866
  download_size: 1488533
  dataset_size: 35233812
---
# Dataset Card for SemEval 2013 Task 7 Dataset
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
### Dataset Summary
This dataset contains student responses to questions from two distinct corpora, _BEETLE_ and _SCIENTSBANK_. The _BEETLE_ corpus consists of 56 questions in an electricity and circuits domain, each requiring an answer of 1-2 sentences, with approximately 3000 student answers in total. The _SCIENTSBANK_ corpus consists of 197 questions in 15 different science domains, with approximately 10,000 student answers. _BEETLE_ contains up to 6 reference answers of differing quality for each question, while _SCIENTSBANK_ contains only one.
The dataset was originally published as part of an open source competition. It was [introduced by Dzikovska et al. in this paper](https://aclanthology.org/S13-2045.pdf), but the official version of the data was difficult to find in 2022. It was eventually [found on Kaggle at this link](https://www.kaggle.com/datasets/smiles28/semeval-2013-2-and-3-way), and those XML files are used here.
The XML files were preprocessed to combine them into a single dataframe containing all metadata.
The Kaggle dataset only contains the 2-way and 3-way labels for each data point. [An additional GitHub repository](https://github.com/ashudeep/Student-Response-Analysis) was found which contains the original 5-way labels for the _BEETLE_ subset; these can be joined to the data (explained below).
### Supported Tasks and Leaderboards
### Languages
## Dataset Structure
The data is tabular, with 19 columns. Each column holds one piece of information from the original XML files, expanded into dataframe format. The _BEETLE_ corpus contains 56 unique questions and approximately 3000 answers, while _SCIENTSBANK_ contains 197 unique questions and approximately 10,000 answers.
Each question in the _BEETLE_ dataset has between 1 and 6 reference answers. These answers are of differing quality, labelled 'MINIMAL', 'GOOD' or 'BEST'. Where multiple reference answers are provided, each student answer is joined to each reference answer. For example, for a question with reference answers `A`, `B` and `C`, and student answers `1`, `2`, `3`, `4`, the responses for this question are laid out in the dataframe as follows:
| reference_answer | student_answer |
| ---------------- | -------------- |
| A | 1 |
| A | 2 |
| A | 3 |
| A | 4 |
| B | 1 |
| B | 2 |
| B | 3 |
| B | 4 |
| C | 1 |
| C | 2 |
| C | 3 |
| C | 4 |
As a result, _BEETLE_ contributes more rows to the final dataset than _SCIENTSBANK_, which has only one reference answer per question.
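The pairing described above amounts to a per-question cross join. The following is a minimal sketch using pandas on toy data; the question ID `Q1` and the two small frames are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data for a single question; the IDs are illustrative only.
references = pd.DataFrame(
    {"question_id": ["Q1"] * 3, "reference_answer": ["A", "B", "C"]}
)
students = pd.DataFrame(
    {"question_id": ["Q1"] * 4, "student_answer": ["1", "2", "3", "4"]}
)

# Merging on question_id pairs every reference answer with every
# student answer that belongs to the same question.
pairs = references.merge(students, on="question_id", how="inner")

print(pairs[["reference_answer", "student_answer"]])
print(len(pairs))  # 3 reference answers x 4 student answers = 12 rows
```

A question with a single reference answer, as in _SCIENTSBANK_, would produce exactly one row per student answer under the same join.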
### Data Instances
The data is in CSV format. A single example looks like the following:
| split | classification_type | corpus | test_set | question_qtype | question_id | question_module | question_stype | question | reference_answer_quality | reference_answer_id | reference_answer_file_id | reference_answer | student_answer_count | student_answer_match | student_answer_id | student_answer_label | student_answer | label_5way |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|test | 2way | beetle | unseen-answers | Q_EXPLAIN_SPECIFIC | HYBRID_BURNED_OUT_EXPLAIN_Q1 | SwitchesBulbsParallel | PREDICT | Explain your reasoning. | BEST | answer371 | HYBRID_BURNED_OUT_EXPLAIN_Q1_ANS1 | If bulb A burns out, B and C are no longer in a closed path with the battery | 1 | | SwitchesBulbsParallel-HYBRID_BURNED_OUT_EXPLAIN_Q1.sbj15-l2.qa123 | incorrect | because the paths will still be closed | non_domain |
### Data Fields
- 'split': string, the set that the response belongs to, either 'training' or 'test'
- 'classification_type': string, whether the classification was '2way' or '3way'
- 'corpus': string, the corpus the question belongs to, either 'beetle' or 'scientsbank'
- 'test_set': string, the part of the test set the response belongs to (if any), either 'test-unseen-answers', 'test-unseen-questions' or 'test-unseen-domains' (scientsbank only)
- 'question_qtype': string (beetle only), the type of question
- 'question_id': string, the question id
- 'question_module': string, the question module
- 'question_stype': string (beetle only), unknown meaning
- 'question': string, the question text
- 'reference_answer_quality': string (beetle only), the type of reference answer. Can be 'MINIMAL', 'GOOD' or 'BEST'
- 'reference_answer_id': string, the reference answer id
- 'reference_answer_file_id': string, the reference answer file id
- 'reference_answer': string, the reference answer text
- 'student_answer_count': float64, unknown meaning
- 'student_answer_match': string, unknown meaning
- 'student_answer_id': string, the student answer id
- 'student_answer_label': string, the label given to the answer. In 2-way, it is 'CORRECT' or 'INCORRECT'. In 3-way, it is 'CORRECT', 'INCORRECT' or 'CONTRADICTORY'
- 'student_answer': string, the student answer text
- 'label_5way': string (beetle only), contains the original 5-way classification of the student answer. Can be 'CORRECT', 'PARTIALLY_CORRECT_INCOMPLETE', 'CONTRADICTORY', 'IRRELEVANT', 'NON_DOMAIN'
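The 5-way labels can be collapsed into coarser schemes. The sketch below assumes the collapse used for the coarser tasks in the task paper (everything other than correct and contradictory becomes incorrect for 3-way; everything other than correct becomes incorrect for 2-way). The mapping and the helper `collapse_5way` are illustrative and should be checked against the labels actually present in the data. Labels are lower-cased first because this card shows them in both cases.

```python
import pandas as pd

# Assumed collapse: labels that are neither correct nor contradictory
# are treated as "incorrect" in the 3-way scheme.
FIVE_TO_THREE = {
    "correct": "correct",
    "partially_correct_incomplete": "incorrect",
    "contradictory": "contradictory",
    "irrelevant": "incorrect",
    "non_domain": "incorrect",
}


def collapse_5way(labels: pd.Series, n_way: int = 3) -> pd.Series:
    """Map 5-way labels to a 3-way or 2-way scheme (hypothetical helper)."""
    normalized = labels.str.lower()
    three_way = normalized.map(FIVE_TO_THREE)
    if n_way == 3:
        return three_way
    if n_way == 2:
        # Keep missing labels missing; everything else that is not
        # "correct" becomes "incorrect".
        keep = three_way.isna() | (three_way == "correct")
        return three_way.where(keep, "incorrect")
    raise ValueError("n_way must be 2 or 3")


labels = pd.Series(["CORRECT", "non_domain", "CONTRADICTORY"])
print(collapse_5way(labels, n_way=3).tolist())
print(collapse_5way(labels, n_way=2).tolist())
```

Rows without a 5-way label (for example _SCIENTSBANK_ rows) stay missing rather than being silently marked incorrect.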
### Data Splits
The data was already split into training and test sets at the time of acquisition.
The test set comprises unseen answers, unseen questions and, for _SCIENTSBANK_, unseen domains (since it covers multiple domains).
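Selecting one part of the test set can be sketched as below. The helper `select_test_subset` is hypothetical. It matches on the end of the `test_set` value because this card shows the value both with and without a `test-` prefix, and it assumes `split` holds the string `test` for test rows.

```python
import pandas as pd


def select_test_subset(df: pd.DataFrame, subset: str) -> pd.DataFrame:
    """Return test rows whose test_set value ends with `subset`."""
    is_test = df["split"] == "test"
    # Training rows may have no test_set value, so treat missing as empty.
    matches = df["test_set"].fillna("").str.endswith(subset)
    return df[is_test & matches]


# Toy rows; the values are illustrative only.
toy = pd.DataFrame(
    {
        "split": ["training", "test", "test", "test"],
        "corpus": ["beetle", "beetle", "scientsbank", "scientsbank"],
        "test_set": [None, "unseen-answers", "unseen-domains", "unseen-questions"],
    }
)

print(select_test_subset(toy, "unseen-answers"))
print(select_test_subset(toy, "unseen-domains"))
```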
## Dataset Creation
### Curation Rationale
The dataset is intended for fine-tuning and benchmarking automatic short answer grading (ASAG) models. It is one of the core datasets in the ASAG literature, so results obtained on it can be compared with existing work.
### Source Data
The data was sourced [from this Kaggle link](https://www.kaggle.com/datasets/smiles28/semeval-2013-2-and-3-way). It is unknown whether this is the original state of the data or whether it was preprocessed before this stage, because we were unable to access the original.
Details of how the dataset was created are given in [the paper by Dzikovska et al.](https://aclanthology.org/S13-2045.pdf)
The 5-way labels were accessed from [this public GitHub repository](https://github.com/ashudeep/Student-Response-Analysis). The required data is contained at:
- Training: https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/trainingGold.txt
- Test (Unseen Answer): https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/testGold-UA.txt
- Test (Unseen Question): https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/testGold-UQ.txt
These labels are joined to the Kaggle data using the answer id. At this stage, the 5-way classifications are only available for the _BEETLE_ subset; for _SCIENTSBANK_ only the less granular 2-way and 3-way classifications are available.
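The join on answer id can be sketched with a pandas left merge. Both frames below are toy stand-ins: the two-column layout assumed for the parsed gold files and the ID values are hypothetical, and parsing the gold text files is not shown.

```python
import pandas as pd

# Toy stand-in for the dataframe built from the Kaggle XML files.
answers = pd.DataFrame(
    {
        "corpus": ["beetle", "beetle", "scientsbank"],
        "student_answer_id": ["b.1", "b.2", "s.1"],
        "student_answer_label": ["incorrect", "correct", "correct"],
    }
)

# Toy stand-in for the parsed 5-way gold files (assumed layout:
# one row per answer id with its 5-way label).
gold_5way = pd.DataFrame(
    {
        "student_answer_id": ["b.1", "b.2"],
        "label_5way": ["non_domain", "correct"],
    }
)

# A left join keeps every answer; rows with no gold entry get a missing
# label. validate raises if an answer id appears twice in the gold data.
joined = answers.merge(
    gold_5way, on="student_answer_id", how="left", validate="many_to_one"
)

print(joined)
```

With this approach, rows for which no 5-way label exists keep an empty `label_5way` instead of being dropped.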
#### Initial Data Collection and Normalization
#### Who are the source language producers?
### Annotations
#### Annotation process
Annotations have already been retrieved.
#### Who are the annotators?
Annotations have already been retrieved. The annotators were those of the original _BEETLE_ and _SCIENTSBANK_ corpora.
### Personal and Sensitive Information
## Considerations for Using the Data
### Social Impact of Dataset
### Discussion of Biases
The _BEETLE_ corpus contains multiple reference answers of differing quality for each question, while _SCIENTSBANK_ contains only one. When each student answer is joined to each reference answer, more _BEETLE_ rows are generated, because every student answer is duplicated once per reference answer. This can be reduced by keeping only the 'BEST' _BEETLE_ reference answers, although _BEETLE_ may still be overrepresented when a question has multiple 'BEST' answers (which does happen).
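One possible way to apply this remedy is sketched below. The helper `one_reference_per_answer` is hypothetical: it keeps only 'BEST' references for _BEETLE_ rows and then retains a single row per student answer within each classification type, which also handles questions with several 'BEST' references. The column values in the toy frame are illustrative.

```python
import pandas as pd


def one_reference_per_answer(df: pd.DataFrame) -> pd.DataFrame:
    """Keep BEST references for beetle, then one row per student answer."""
    is_beetle = df["corpus"] == "beetle"
    is_best = df["reference_answer_quality"] == "BEST"
    kept = df[~is_beetle | is_best]
    # Several BEST references can exist for one question, so also drop
    # repeats of the same student answer within a classification type.
    return kept.drop_duplicates(
        subset=["classification_type", "student_answer_id"]
    )


toy = pd.DataFrame(
    {
        "corpus": ["beetle", "beetle", "beetle", "scientsbank"],
        "classification_type": ["2way"] * 4,
        "reference_answer_quality": ["BEST", "BEST", "GOOD", None],
        "reference_answer_id": ["r1", "r2", "r3", "r4"],
        "student_answer_id": ["sa1", "sa1", "sa1", "sa2"],
    }
)

balanced = one_reference_per_answer(toy)
print(balanced)
```

Note that a _BEETLE_ question with no 'BEST' reference answer would lose all of its rows under this filter, so it is worth checking for such questions before applying it.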
### Other Known Limitations
## Additional Information
[This repository](https://github.com/ashudeep/Student-Response-Analysis) appears to provide preprocessing scripts for this dataset. It also contains the original 5-way labels for the _BEETLE_ subset, which are helpful for drawing custom classification boundaries.
### Dataset Curators
### Licensing Information
### Citation Information
Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. In: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA
### Contributions