
Atomi/sem_eval_2013_task_7

Source: Hugging Face. Updated 2022-11-17; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/Atomi/sem_eval_2013_task_7
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- cc
multilinguality:
- monolingual
pretty_name: semeval-task-7-2013
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- asag
- short-answer
- grading
- semantic-similarity
task_categories:
- text-classification
task_ids:
- natural-language-inference
dataset_info:
  features:
  - name: split
    dtype: string
  - name: classification_type
    dtype: string
  - name: corpus
    dtype: string
  - name: test_set
    dtype: string
  - name: question_qtype
    dtype: string
  - name: question_id
    dtype: string
  - name: question_module
    dtype: string
  - name: question_stype
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer_quality
    dtype: string
  - name: reference_answer_id
    dtype: string
  - name: reference_answer_file_id
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer_count
    dtype: float64
  - name: student_answer_match
    dtype: string
  - name: student_answer_id
    dtype: string
  - name: student_answer_label
    dtype: string
  - name: student_answer
    dtype: string
  - name: label_5way
    dtype: string
  splits:
  - name: test
    num_bytes: 11688998
    num_examples: 23656
  - name: train
    num_bytes: 23544814
    num_examples: 47866
  download_size: 1488533
  dataset_size: 35233812
---

# Dataset Card for SemEval 2013 Task 7 Dataset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

### Dataset Summary

This dataset contains responses to questions from two distinct corpora, _BEETLE_ and _SCIENTSBANK_. The _BEETLE_ corpus consists of 56 questions in the electricity and circuits domain, each requiring an answer of 1-2 sentences, with approximately 3,000 answers in total. The _SCIENTSBANK_ corpus consists of 197 questions in 15 different science domains, with approximately 10,000 answers. _BEETLE_ contains up to 6 reference answers of differing quality for each question, while _SCIENTSBANK_ contains only one.

The dataset was originally published as part of an open source competition. It was [introduced by Dzikovska et al. in this paper](https://aclanthology.org/S13-2045.pdf), but the official version of the data was difficult to find in 2022. It was eventually [found on Kaggle at this link](https://www.kaggle.com/datasets/smiles28/semeval-2013-2-and-3-way), and it is these XML files that are used here.

The XML files were preprocessed to combine all separate files into a single dataframe containing all metadata.

The Kaggle dataset only contains the 2-way and 3-way labels for each data point. [An additional GitHub repository](https://github.com/ashudeep/Student-Response-Analysis) was found which contains the original 5-way labels for the _BEETLE_ subset, and these can be joined to the data (explained below).

### Supported Tasks and Leaderboards

### Languages

## Dataset Structure

The data is tabular and contains 19 columns, one for each piece of information in the original XML files, expanded into dataframe format.
The _BEETLE_ corpus contains 56 unique questions and approximately 3,000 answers, while _SCIENTSBANK_ contains 197 unique questions and approximately 10,000 answers.

Each question in the _BEETLE_ dataset can have between 1 and 6 reference answers. These answers are of differing quality, and can be either 'MINIMAL', 'GOOD' or 'BEST'. In cases where multiple reference answers are provided, each student answer is joined to each reference answer. For example, for a given question with reference answers `A`, `B` and `C`, and student answers `1`, `2`, `3`, `4`, all responses for this question would be formatted as follows in the dataframe:

| reference_answer | student_answer |
| ---------------- | -------------- |
| A | 1 |
| A | 2 |
| A | 3 |
| A | 4 |
| B | 1 |
| B | 2 |
| B | 3 |
| B | 4 |
| C | 1 |
| C | 2 |
| C | 3 |
| C | 4 |

So, each student answer is joined to each reference answer. As a result, _BEETLE_ contributes more rows per student answer to the final dataset than _SCIENTSBANK_, because _SCIENTSBANK_ contains only one reference answer per question.

### Data Instances

The data is in CSV format. A single example from the data looks like the following:

| split | classification_type | corpus | test_set | question_qtype | question_id | question_module | question_stype | question | reference_answer_quality | reference_answer_id | reference_answer_file_id | reference_answer | student_answer_count | student_answer_match | student_answer_id | student_answer_label | student_answer | label_5way |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| test | 2way | beetle | unseen-answers | Q_EXPLAIN_SPECIFIC | HYBRID_BURNED_OUT_EXPLAIN_Q1 | SwitchesBulbsParallel | PREDICT | Explain your reasoning. | BEST | answer371 | HYBRID_BURNED_OUT_EXPLAIN_Q1_ANS1 | If bulb A burns out, B and C are no longer in a closed path with the battery | 1 | | SwitchesBulbsParallel-HYBRID_BURNED_OUT_EXPLAIN_Q1.sbj15-l2.qa123 | incorrect | because the paths will still be closed | non_domain |

### Data Fields

- `split`: string, the set that the response belongs to, either 'training' or 'test'
- `classification_type`: string, whether the classification was '2way' or '3way'
- `corpus`: string, the corpus the question belongs to, either 'beetle' or 'scientsbank'
- `test_set`: string, the part of the test set the row belongs to (if it belongs to one), either 'test-unseen-answers', 'test-unseen-domains' (scientsbank only) or 'test-unseen-questions'
- `question_qtype`: string (beetle only), the type of question
- `question_id`: string, the question id
- `question_module`: string, the question module
- `question_stype`: string (beetle only), meaning unclear
- `question`: string, the question text
- `reference_answer_quality`: string (beetle only), the type of reference answer. Can be 'MINIMAL', 'GOOD' or 'BEST'
- `reference_answer_id`: string, the reference answer id
- `reference_answer_file_id`: string, the reference answer file id
- `reference_answer`: string, the reference answer text
- `student_answer_count`: float64 in the feature metadata, meaning unknown
- `student_answer_match`: string, meaning unknown
- `student_answer_id`: string, the student answer id
- `student_answer_label`: string, the label given to the answer. In 2-way, it is 'CORRECT' or 'INCORRECT'. In 3-way, it is 'CORRECT', 'INCORRECT' or 'CONTRADICTORY'
- `student_answer`: string, the student answer text
- `label_5way`: string (beetle only), the original 5-way classification of the student answer. Can be 'CORRECT', 'PARTIALLY_CORRECT_INCOMPLETE', 'CONTRADICTORY', 'IRRELEVANT' or 'NON_DOMAIN'

### Data Splits

The data was pre-split at the time of acquisition.
The test set comprises unseen answers, unseen questions and, for _SCIENTSBANK_, unseen domains (since it covers multiple domains).

## Dataset Creation

### Curation Rationale

The dataset is to be used for fine-tuning and benchmarking automatic marking models. It is one of the key datasets in the ASAG literature, so it enables us to compare our results to existing work.

### Source Data

The data was sourced [from this Kaggle link](https://www.kaggle.com/datasets/smiles28/semeval-2013-2-and-3-way). We were unable to access the original, so it is unknown whether this is the original state of the data or whether it was preprocessed before this stage.

[The dataset creation information is described here by Dzikovska et al.](https://aclanthology.org/S13-2045.pdf)

The 5-way labels were accessed from [this public GitHub repository](https://github.com/ashudeep/Student-Response-Analysis). The required data is contained at:

- Training: https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/trainingGold.txt
- Test (Unseen Answer): https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/testGold-UA.txt
- Test (Unseen Question): https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/testGold-UQ.txt

These labels are joined to the Kaggle data using the answer id. At this stage, we only have the 5-way classifications for the _BEETLE_ subset; for _SCIENTSBANK_ we unfortunately only have the less granular 2-way and 3-way classifications.

#### Initial Data Collection and Normalization

#### Who are the source language producers?

### Annotations

#### Annotation process

Annotations have already been retrieved.

#### Who are the annotators?

Annotations have already been retrieved. Annotators came from the _BEETLE_ and _SCIENTSBANK_ corpora.
### Personal and Sensitive Information

## Considerations for Using the Data

### Social Impact of Dataset

### Discussion of Biases

The _BEETLE_ corpus contains multiple reference answers of differing quality for each question, while _SCIENTSBANK_ contains only one. This means that when each student answer is joined to each reference answer, more _BEETLE_ rows are generated (because every student answer is duplicated for each reference answer). This can be remedied by filtering to include only 'BEST' _BEETLE_ reference answers, although if multiple 'BEST' answers are provided for a question (which does happen), _BEETLE_ may still be overrepresented.

### Other Known Limitations

## Additional Information

[This repository](https://github.com/ashudeep/Student-Response-Analysis) appears to provide preprocessing scripts for this dataset. It may also contain the original 5-way labels, which could be helpful if we want to draw our own classification boundaries.

### Dataset Curators

### Licensing Information

### Citation Information

Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013b) Semeval-2013 task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. In: Proceedings of the 6th International Workshop on Semantic Evaluation (SEMEVAL-2013), Association for Computational Linguistics, Atlanta, Georgia, USA

### Contributions

Provider:
Atomi
Summary of original information

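Dataset card: SemEval 2013 Task 7 dataset

Dataset description

Dataset summary

This dataset contains responses to questions from two distinct corpora, BEETLE and SCIENTSBANK. The BEETLE corpus consists of 56 questions in the electricity and circuits domain, each requiring an answer of 1-2 sentences, with approximately 3,000 answers. The SCIENTSBANK corpus consists of 197 questions across 15 different science domains, with approximately 10,000 answers. BEETLE has up to 6 reference answers of differing quality per question, while SCIENTSBANK has only one reference answer per question.

Supported tasks and leaderboards

Languages

Dataset structure

Data instances

The data is tabular and contains 19 columns, one for each piece of information in the original XML files, expanded into dataframe format. The BEETLE corpus contains 56 unique questions and approximately 3,000 answers, while SCIENTSBANK contains 197 unique questions and approximately 10,000 answers.

Each question in the BEETLE dataset can have between 1 and 6 reference answers. These answers differ in quality and are marked MINIMAL, GOOD or BEST. When multiple reference answers are provided, each student answer is paired with each reference answer. For example, for a question with reference answers A, B and C and student answers 1, 2, 3 and 4, all responses are laid out in the dataframe as follows: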

reference_answer student_answer
A 1
A 2
A 3
A 4
B 1
B 2
B 3
B 4
C 1
C 2
C 3
C 4

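So, each student answer is paired with each reference answer. As a result, BEETLE contributes more rows per student answer to the final dataset than SCIENTSBANK, because SCIENTSBANK has only one reference answer per question.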
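The pairing described above amounts to a per-question Cartesian product of reference answers and student answers. A minimal sketch in Python, assuming the answers for one question are available as plain lists; the function name `pair_answers` is illustrative and is not taken from the dataset's own preprocessing code:

```python
from itertools import product


def pair_answers(reference_answers, student_answers):
    """Pair every student answer with every reference answer of one question."""
    return [
        {"reference_answer": ref, "student_answer": stu}
        for ref, stu in product(reference_answers, student_answers)
    ]


# The example from the text: 3 reference answers and 4 student answers.
pairs = pair_answers(["A", "B", "C"], ["1", "2", "3", "4"])
print(len(pairs))  # 12
```

With a single reference answer, as in SCIENTSBANK, the same function returns exactly one row per student answer.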

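Data fields

  • split: string, the set the response belongs to, training or test
  • classification_type: string, whether the classification is 2way or 3way
  • corpus: string, the corpus the question belongs to, beetle or scientsbank
  • test_set: string, the part of the test set the row belongs to (if any): test-unseen-answers, test-unseen-domains (scientsbank only) or test-unseen-questions
  • question_qtype: string (beetle only), the type of question
  • question_id: string, the question ID
  • question_module: string, the question module
  • question_stype: string (beetle only), meaning unclear
  • question: string, the question text
  • reference_answer_quality: string (beetle only), the type of reference answer. Can be MINIMAL, GOOD or BEST
  • reference_answer_id: string, the reference answer ID
  • reference_answer_file_id: string, the reference answer file ID
  • reference_answer: string, the reference answer text
  • student_answer_count: float64 in the card's feature metadata, meaning unknown
  • student_answer_match: string, meaning unknown
  • student_answer_id: string, the student answer ID
  • student_answer_label: string, the label given to the answer. In 2-way it is CORRECT or INCORRECT. In 3-way it is CORRECT, INCORRECT or CONTRADICTORY
  • student_answer: string, the student answer text
  • label_5way: string (beetle only), the original 5-way classification of the student answer. Can be CORRECT, PARTIALLY_CORRECT_INCOMPLETE, CONTRADICTORY, IRRELEVANT or NON_DOMAIN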

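Data splits

The data was pre-split at the time of acquisition.

The test set comprises unseen answers, unseen questions and, for SCIENTSBANK, unseen domains (since it covers multiple domains).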
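Because split, label scheme and test subset are all stored as columns, an evaluation subset is obtained by filtering rows. The sketch below assumes the rows have been read into a list of dictionaries keyed by the column names; `select_rows` and the sample rows are hypothetical. The test subset is matched by suffix because the card shows both 'unseen-answers' (in the example row) and 'test-unseen-answers' (in the field list) as spellings.

```python
def select_rows(rows, split, classification_type, test_set=None):
    """Return rows for one split and label scheme, optionally one test subset.

    `test_set` is compared by suffix so that both 'unseen-answers' and
    'test-unseen-answers' match the argument 'unseen-answers'.
    """
    selected = []
    for row in rows:
        if row["split"] != split:
            continue
        if row["classification_type"] != classification_type:
            continue
        if test_set is not None and not (row.get("test_set") or "").endswith(test_set):
            continue
        selected.append(row)
    return selected


# Hypothetical rows carrying only the columns used here.
rows = [
    {"split": "test", "classification_type": "2way", "corpus": "beetle", "test_set": "unseen-answers"},
    {"split": "test", "classification_type": "3way", "corpus": "beetle", "test_set": "unseen-answers"},
    {"split": "test", "classification_type": "2way", "corpus": "scientsbank", "test_set": "unseen-domains"},
    {"split": "training", "classification_type": "2way", "corpus": "beetle", "test_set": None},
]

unseen_answers_2way = select_rows(rows, "test", "2way", "unseen-answers")
```

Filtering on `classification_type` first keeps 2-way and 3-way rows from being mixed in a single evaluation.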

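Dataset creation

Curation rationale

The dataset is used for fine-tuning and benchmarking automatic marking models. It is one of the key datasets in the ASAG literature, so it enables us to compare our results to existing work.

Source data

The data was sourced from Kaggle (https://www.kaggle.com/datasets/smiles28/semeval-2013-2-and-3-way). We were unable to access the original data, so it is unknown whether this is the original state of the data or whether it was preprocessed before this stage.

The dataset creation information is described by Dzikovska et al. (https://aclanthology.org/S13-2045.pdf).

The 5-way labels were obtained from a public GitHub repository (https://github.com/ashudeep/Student-Response-Analysis). The required data is located at:

  • Training: https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/trainingGold.txt
  • Test (unseen answers): https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/testGold-UA.txt
  • Test (unseen questions): https://raw.githubusercontent.com/ashudeep/Student-Response-Analysis/master/semevalFormatProcessing-5way/testGold-UQ.txt

These labels are joined to the Kaggle data using the answer ID. At this stage, we only have the 5-way classifications for the BEETLE subset; for SCIENTSBANK we unfortunately only have the coarser 2-way and 3-way classifications.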
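The join by answer ID can be pictured as a left join that leaves the 5-way label empty wherever no gold label exists. The card does not describe the layout of the gold files, so the sketch below leaves parsing out and assumes the labels have already been read into a dictionary from answer ID to label; `attach_5way_labels`, the IDs and the mapping are all hypothetical.

```python
def attach_5way_labels(rows, labels_by_answer_id):
    """Left-join 5-way labels onto rows using the student answer id.

    Rows without a matching id (for example every SCIENTSBANK row)
    receive None in the `label_5way` column.
    """
    return [
        {**row, "label_5way": labels_by_answer_id.get(row["student_answer_id"])}
        for row in rows
    ]


# Hypothetical ids and labels, for illustration only.
rows = [
    {"corpus": "beetle", "student_answer_id": "beetle-answer-1", "student_answer_label": "incorrect"},
    {"corpus": "scientsbank", "student_answer_id": "scients-answer-1", "student_answer_label": "correct"},
]
labels_by_answer_id = {"beetle-answer-1": "non_domain"}

joined = attach_5way_labels(rows, labels_by_answer_id)
```

A left join keeps every Kaggle row, so the 2-way and 3-way labels for SCIENTSBANK remain usable even though no 5-way label is available for them.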

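Initial data collection and normalization

Who are the source language producers?

Annotations

Annotation process

Annotations have already been retrieved.

Who are the annotators?

Annotations have already been retrieved. Annotators came from the BEETLE and SCIENTSBANK corpora.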

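Personal and sensitive information

Considerations for using the data

Social impact of dataset

Discussion of biases

The BEETLE corpus contains multiple reference answers of differing quality for each question, while SCIENTSBANK contains only one. This means that when each student answer is paired with each reference answer, more BEETLE rows are generated (because every student answer is duplicated for each reference answer). This can be remedied by filtering to include only BEST BEETLE reference answers, although if multiple BEST answers are provided for a question (which does happen), BEETLE may still be overrepresented.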
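The remedy above, plus one possible way of handling questions with several BEST answers, can be sketched as follows. Keeping only the first BEST reference per student answer is an assumption of this sketch, not something the card prescribes, and `reduce_beetle_duplication` and the sample rows are hypothetical.

```python
def reduce_beetle_duplication(rows):
    """Keep one BEST reference answer per BEETLE student answer.

    Non-BEETLE rows pass through unchanged because they already have a
    single reference answer per question.
    """
    seen = set()
    kept = []
    for row in rows:
        if row["corpus"] == "beetle":
            if row["reference_answer_quality"] != "BEST":
                continue
            key = (
                row["split"],
                row["classification_type"],
                row["question_id"],
                row["student_answer_id"],
            )
            if key in seen:
                continue  # a further BEST reference for the same student answer
            seen.add(key)
        kept.append(row)
    return kept


def _row(corpus, quality, answer_id):
    # Helper that builds a hypothetical row with only the columns used above.
    return {
        "split": "training",
        "classification_type": "2way",
        "corpus": corpus,
        "question_id": "q1",
        "student_answer_id": answer_id,
        "reference_answer_quality": quality,
    }


rows = [
    _row("beetle", "BEST", "s1"),
    _row("beetle", "BEST", "s1"),
    _row("beetle", "GOOD", "s1"),
    _row("scientsbank", None, "s2"),
]
balanced = reduce_beetle_duplication(rows)
```

Dropping the extra BEST references discards some information, so this is a trade-off between balance across corpora and coverage of reference answers.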

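Other known limitations

Additional information

The GitHub repository (https://github.com/ashudeep/Student-Response-Analysis) appears to provide preprocessing scripts for this dataset. It may also contain the original 5-way labels, which could be helpful if we want to draw our own classification boundaries.

Dataset curators

Licensing information

Citation information

Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013b) Semeval-2013 task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. In: Proceedings of the 6th International Workshop on Semantic Evaluation (SEMEVAL-2013), Association for Computational Linguistics, Atlanta, Georgia, USA

Contributions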

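Collected summary
Dataset introduction
Construction
This dataset originates from the SemEval 2013 Task 7 competition, which was designed to evaluate automatic grading models on short-answer scoring. It consists of two main corpora, BEETLE and SCIENTSBANK. The BEETLE corpus contains 56 questions in the electricity and circuits domain, each with 1 to 6 reference answers of differing quality, while the SCIENTSBANK corpus covers 197 questions across 15 science domains, each with a single reference answer. The data was originally distributed as XML files, which were preprocessed and merged into a single dataframe containing all metadata. In addition, 5-way labels for the BEETLE subset were obtained from an external GitHub repository, which adds a finer-grained level of annotation.
Usage
The dataset is mainly used to train and evaluate automatic grading models. Researchers can load the dataset, train a model on the questions and reference answers, and evaluate it against the student answer labels. The data is pre-split into training and test sets, and the test set is further divided into unseen answers, unseen questions and unseen domains (SCIENTSBANK only). To reduce redundancy when using the BEETLE corpus, it is advisable to prefer reference answers of 'BEST' quality. The dataset supports several classification tasks: 2-way and 3-way labels are available for both corpora, and 5-way labels for BEETLE only.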
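Choosing a label scheme can be sketched as turning rows into (reference answer, student answer, label) triples. This assumes rows are dictionaries keyed by the column names in the card; `to_examples` is a hypothetical helper, and the sample row reuses values from the card's example instance.

```python
def to_examples(rows, scheme):
    """Build (reference_answer, student_answer, label) triples for one scheme.

    For '2way' and '3way' the label comes from `student_answer_label` on rows
    of that classification type; for '5way' it comes from `label_5way`, which
    is only populated for BEETLE.
    """
    if scheme not in {"2way", "3way", "5way"}:
        raise ValueError(f"unknown label scheme: {scheme}")
    examples = []
    for row in rows:
        if scheme == "5way":
            label = row.get("label_5way")
        elif row["classification_type"] == scheme:
            label = row["student_answer_label"]
        else:
            label = None
        if label:
            examples.append((row["reference_answer"], row["student_answer"], label))
    # The same pair can occur once per classification type; keep it once.
    return list(dict.fromkeys(examples))


base = {
    "reference_answer": "If bulb A burns out, B and C are no longer in a closed path with the battery",
    "student_answer": "because the paths will still be closed",
    "student_answer_label": "incorrect",
    "label_5way": "non_domain",
}
rows = [
    {**base, "classification_type": "2way"},
    {**base, "classification_type": "3way"},
]

two_way = to_examples(rows, "2way")
five_way = to_examples(rows, "5way")
```

Rows without a 5-way label, such as SCIENTSBANK rows, are skipped when `scheme` is '5way'.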
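Background and challenges
Background overview
The SemEval 2013 Task 7 dataset is an important resource for automatic short answer grading (ASAG), introduced by Dzikovska et al. in 2013. It contains student answers from two corpora (BEETLE and SCIENTSBANK) and is intended for assessing the quality of students' answers to science questions. The BEETLE corpus focuses on the electricity and circuits domain, with 56 questions and about 3,000 answers, while the SCIENTSBANK corpus covers 15 science domains, with 197 questions and about 10,000 answers. By providing reference answers of several quality levels together with labelled student answers, the dataset offers experimental data for training and evaluating automatic grading models. Its release advanced ASAG research and provides a reference point for automated assessment in educational technology.
Current challenges
The dataset poses several challenges for automatic grading. First, the diversity and complexity of student answers make it hard for models to capture semantic similarity accurately, especially across multiple domains and question types. Second, each BEETLE question has several reference answers of varying quality, which complicates data processing and may bias a model towards certain answer types during training. In addition, acquiring and integrating the original data is itself difficult: part of the data has to be obtained from third-party platforms (Kaggle and GitHub), and the 5-way labels are available only for the BEETLE subset, which limits fine-grained analysis of the SCIENTSBANK data. These challenges raise the technical bar for researchers and indicate where future versions of the dataset could improve.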
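Common scenarios
Classic use cases
In ASAG research, the Atomi/sem_eval_2013_task_7 dataset is used to evaluate and improve models that grade short student answers. It contains student answers from the BEETLE and SCIENTSBANK corpora, covering questions on circuits and several science domains. Researchers train models on it to judge automatically whether a student answer is correct, and to explore the role of semantic similarity in grading.
Academic problems addressed
The dataset addresses the semantic understanding and answer matching problems that are common in automatic grading systems. By pairing student answers with reference answers of several quality levels, it lets researchers analyse model behaviour at different levels of semantic complexity. The 5-way labels additionally make fine-grained grading possible, helping researchers understand semantic differences between student answers and supporting the application of natural language processing in education.
Practical applications
In practice, the Atomi/sem_eval_2013_task_7 dataset is used to develop educational tools such as automatic grading systems in online learning platforms. Such tools can assess student answers in real time and give immediate feedback, reducing teachers' workload. The dataset is also used to study cross-domain knowledge transfer, supporting grading models that generalize across subjects.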
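Recent research on the dataset
Latest research directions
In ASAG, the SemEval 2013 Task 7 dataset remains a classic benchmark for short-answer grading and semantic similarity analysis. In recent years, researchers have worked on improving grading accuracy and generalization with deep learning models, particularly when handling multiple reference answers and cross-domain questions. The BEETLE and SCIENTSBANK corpora provide varied training data, which allows finer-grained grading strategies such as the use of 5-way labels to be explored. With the rise of large language models (LLMs), the dataset is also used to evaluate how well models understand complex science questions and generate high-quality reference answers. This research supports the development of educational technology and informs semantic understanding and inference tasks in natural language processing.
The content above was collected and summarized by 遇见数据集.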