nkazi/SciEntsBank
Source: Hugging Face. Updated: 2024-01-10. Indexed: 2024-03-04.
Download link: https://hf-mirror.com/datasets/nkazi/SciEntsBank
Resource description:
---
pretty_name: SciEntsBank
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': correct
          '1': contradictory
          '2': partially_correct_incomplete
          '3': irrelevant
          '4': non_domain
  splits:
  - name: train
    num_bytes: 232655
    num_examples: 4969
  - name: test_ua
    num_bytes: 52730
    num_examples: 540
  - name: test_uq
    num_bytes: 35716
    num_examples: 733
  - name: test_ud
    num_bytes: 177307
    num_examples: 4562
  dataset_size: 498408
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_ua
    path: data/test-ua-*
  - split: test_uq
    path: data/test-uq-*
  - split: test_ud
    path: data/test-ud-*
---
# Dataset Card for "SciEntsBank"
SciEntsBank is one of the two distinct subsets within the Student Response Analysis (SRA) corpus, the other subset being the
[Beetle](https://huggingface.co/datasets/nkazi/Beetle) dataset. Derived from student answers gathered by Nielsen et al. [1],
this dataset comprises nearly 11K responses to 197 assessment questions spanning 15 diverse science domains. The dataset
features three labeling schemes: (a) 5-way, (b) 3-way, and (c) 2-way. The dataset includes a training set and three distinct
test sets: (a) Unseen Answers (`test_ua`), (b) Unseen Questions (`test_uq`), and (c) Unseen Domains (`test_ud`).
- **Authors:** Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang
- **Paper:** [SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge](https://aclanthology.org/S13-2045)
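
As a quick sanity check on the "nearly 11K" figure, the split sizes declared in the card metadata can be summed. The sketch below hard-codes the `num_examples` values from the metadata; the `SPLIT_SIZES` name is only illustrative and is not part of the dataset or the `datasets` library.

```python
# Split sizes copied from the `num_examples` fields in the card metadata.
SPLIT_SIZES = {"train": 4969, "test_ua": 540, "test_uq": 733, "test_ud": 4562}

# Total number of responses across all four splits.
total = sum(SPLIT_SIZES.values())

# Responses held out for evaluation (the three test sets combined).
test_total = total - SPLIT_SIZES["train"]

print(f"total responses: {total}")
print(f"test responses:  {test_total}")
```

Note that the three test sets together are larger than the training set, mostly because of the Unseen Domains split.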
## Loading Dataset
```python
from datasets import load_dataset
dataset = load_dataset('nkazi/SciEntsBank')
```
## Labeling Schemes
The authors released the dataset with annotations using five labels (i.e., 5-way labeling scheme) for Automated Short-Answer Grading (ASAG).
Additionally, the authors introduced two alternative labeling schemes, the 3-way and 2-way schemes, both derived from the 5-way
labeling scheme and designed for Recognizing Textual Entailment (RTE). In the 3-way labeling scheme, the categories "partially correct but
incomplete", "irrelevant", and "non-domain" are consolidated into a unified category labeled as "incorrect". On the other hand, the 2-way
labeling scheme simplifies the classification into a binary system where all labels except "correct" are merged under the "incorrect" category.
The `label` column in this dataset presents the 5-way labels. For 3-way and 2-way labels, use the code provided below to derive them
from the 5-way labels. After converting the labels, please verify the label distribution. Code to print the label distribution is
also given below.
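
Before touching the dataset object, it may help to see the collapse rules on their own. The following is a minimal plain-Python sketch of the two mappings described above, operating on label names only; `to_3way` and `to_2way` are hypothetical helper names introduced for illustration and are not part of the `datasets` library.

```python
# The five original label names, in id order (taken from the card metadata).
FIVE_WAY = [
    "correct",
    "contradictory",
    "partially_correct_incomplete",
    "irrelevant",
    "non_domain",
]

# Labels that the 3-way scheme folds into "incorrect".
MERGED_IN_3WAY = {"partially_correct_incomplete", "irrelevant", "non_domain"}


def to_3way(label: str) -> str:
    """Collapse a 5-way label name into the 3-way scheme."""
    if label not in FIVE_WAY:
        raise ValueError(f"unknown 5-way label: {label!r}")
    return "incorrect" if label in MERGED_IN_3WAY else label


def to_2way(label: str) -> str:
    """Collapse a 5-way label name into the 2-way scheme."""
    if label not in FIVE_WAY:
        raise ValueError(f"unknown 5-way label: {label!r}")
    return "correct" if label == "correct" else "incorrect"


# Show how each original label lands in the two reduced schemes.
for name in FIVE_WAY:
    print(f"{name:30s} -> {to_3way(name):14s} -> {to_2way(name)}")
```

The only difference between the two reduced schemes is the treatment of "contradictory", which stays a separate class in 3-way and is merged into "incorrect" in 2-way.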
### 5-way to 3-way
```python
from datasets import ClassLabel
dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 2, 'irrelevant': 2, 'non_domain': 2}, 'label')
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'contradictory', 'incorrect']))
```
Using `align_labels_with_mapping()`, we are mapping "partially correct but incomplete", "irrelevant", and "non-domain" to the same id. Subsequently,
we are using `cast_column()` to redefine the class labels (i.e., the label feature) where the id 2 corresponds to the "incorrect" label.
### 5-way to 2-way
```python
from datasets import ClassLabel
dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 1, 'irrelevant': 1, 'non_domain': 1}, 'label')
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'incorrect']))
```
In the above code, the label "correct" is mapped to 0 to maintain consistency with both the 5-way and 3-way labeling schemes. If the preference is to
represent "correct" with id 1 and "incorrect" with id 0, either adjust the label map accordingly or run the following to switch the ids:
```python
dataset = dataset.align_labels_with_mapping({'incorrect': 0, 'correct': 1}, 'label')
```
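
Conceptually, swapping the ids means looking up each example's label name and re-encoding it with the new map. The sketch below illustrates that idea on a plain list of ids; `realign` is a hypothetical helper written for this example and is not how the `datasets` library implements `align_labels_with_mapping()`.

```python
def realign(ids, names, label2id):
    """Re-encode integer labels so each label name gets the id given in label2id."""
    return [label2id[names[i]] for i in ids]


# Encoding after the 2-way conversion above: id 0 = correct, id 1 = incorrect.
names = ["correct", "incorrect"]
ids = [0, 1, 1, 0]

# Swap the ids so that "correct" becomes 1 and "incorrect" becomes 0.
flipped = realign(ids, names, {"incorrect": 0, "correct": 1})
print(flipped)
```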
### Saving and loading 3-way and 2-way datasets
Use the following code to store the dataset with the 3-way (or 2-way) labeling scheme locally to eliminate the need to convert labels each time the dataset is loaded:
```python
dataset.save_to_disk('SciEntsBank_3way')
```
Here, `SciEntsBank_3way` is the path/directory where the dataset will be stored. Use the following code to load the dataset from the same local directory:
```python
from datasets import DatasetDict
dataset = DatasetDict.load_from_disk('SciEntsBank_3way')
```
### Printing Label Distribution
Use the following code to print the label distribution:
```python
def print_label_dist(dataset):
    for split_name in dataset:
        print(split_name, ':')
        num_examples = 0
        for label in dataset[split_name].features['label'].names:
            count = dataset[split_name]['label'].count(dataset[split_name].features['label'].str2int(label))
            print(' ', label, ':', count)
            num_examples += count
        print(' total :', num_examples)

print_label_dist(dataset)
```
## Label Distribution
<style>
.label-dist th:not(:first-child), .label-dist td:not(:first-child) {
width: 15%;
}
</style>
<div class="label-dist">
### 5-way
Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Partially correct but incomplete | 1,324 | 113 | 175 | 986
Irrelevant | 1,115 | 133 | 193 | 1,222
Non-domain | 23 | 3 | - | 20
Total | 4,969 | 540 | 733 | 4,562
### 3-way
Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Incorrect | 2,462 | 249 | 368 | 2,228
Total | 4,969 | 540 | 733 | 4,562
### 2-way
Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Incorrect | 2,961 | 307 | 432 | 2,645
Total | 4,969 | 540 | 733 | 4,562
</div>
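
The 3-way and 2-way tables follow arithmetically from the 5-way counts, so they can be cross-checked without loading the dataset. The sketch below hard-codes the 5-way table; it assumes the "-" entry for Non-domain in Test UQ means zero examples, and the names `FIVE_WAY_COUNTS`, `three_way`, and `two_way` are illustrative only.

```python
# 5-way counts copied from the table above.
# Assumption: the "-" for non_domain in test_uq is treated as 0.
FIVE_WAY_COUNTS = {
    "train": {"correct": 2008, "contradictory": 499,
              "partially_correct_incomplete": 1324, "irrelevant": 1115, "non_domain": 23},
    "test_ua": {"correct": 233, "contradictory": 58,
                "partially_correct_incomplete": 113, "irrelevant": 133, "non_domain": 3},
    "test_uq": {"correct": 301, "contradictory": 64,
                "partially_correct_incomplete": 175, "irrelevant": 193, "non_domain": 0},
    "test_ud": {"correct": 1917, "contradictory": 417,
                "partially_correct_incomplete": 986, "irrelevant": 1222, "non_domain": 20},
}

# Labels folded into "incorrect" under the 3-way scheme.
MERGED_3WAY = ("partially_correct_incomplete", "irrelevant", "non_domain")


def three_way(counts):
    """Derive 3-way counts from the 5-way counts of one split."""
    return {
        "correct": counts["correct"],
        "contradictory": counts["contradictory"],
        "incorrect": sum(counts[name] for name in MERGED_3WAY),
    }


def two_way(counts):
    """Derive 2-way counts: everything except "correct" becomes "incorrect"."""
    return {
        "correct": counts["correct"],
        "incorrect": sum(n for name, n in counts.items() if name != "correct"),
    }


for split, counts in FIVE_WAY_COUNTS.items():
    print(split, three_way(counts), two_way(counts))
```

If the counts printed by `print_label_dist()` after conversion differ from these derived values, the label mapping was most likely applied incorrectly or applied twice.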
## Citation
```tex
@inproceedings{dzikovska2013semeval,
    title = {{S}em{E}val-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge},
    author = {Dzikovska, Myroslava and Nielsen, Rodney and Brew, Chris and Leacock, Claudia and Giampiccolo, Danilo and Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Dang, Hoa Trang},
    year = 2013,
    month = jun,
    booktitle = {Second Joint Conference on Lexical and Computational Semantics ({SEM}), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation ({S}em{E}val 2013)},
    editor = {Manandhar, Suresh and Yuret, Deniz},
    publisher = {Association for Computational Linguistics},
    address = {Atlanta, Georgia, USA},
    pages = {263--274},
    url = {https://aclanthology.org/S13-2045},
}
```
## References
1. Rodney D. Nielsen, Wayne Ward, James H. Martin, and Martha Palmer. 2008. Annotating students' understanding of science
concepts. In *Proceedings of the Sixth International Language Resources and Evaluation Conference*, Marrakech, Morocco.
Provider:
nkazi
Original information summary
Dataset overview
Basic information
- Dataset name: SciEntsBank
- License: cc-by-4.0
- Language: English
- Task category: text classification
- Size: 10K<n<100K
Dataset structure
Features
- id: string
- question: string
- reference_answer: string
- student_answer: string
- label: class label
  - Class names:
    - 0: correct
    - 1: contradictory
    - 2: partially_correct_incomplete
    - 3: irrelevant
    - 4: non_domain
Data splits
- train:
  - Bytes: 232655
  - Examples: 4969
- test_ua:
  - Bytes: 52730
  - Examples: 540
- test_uq:
  - Bytes: 35716
  - Examples: 733
- test_ud:
  - Bytes: 177307
  - Examples: 4562
Dataset size
- Total bytes: 498408
Configuration
- Config name: default
- Data files:
  - train: data/train-*
  - test_ua: data/test-ua-*
  - test_uq: data/test-uq-*
  - test_ud: data/test-ud-*
Labeling schemes
- 5-way:
  - correct
  - contradictory
  - partially_correct_incomplete
  - irrelevant
  - non_domain
- 3-way:
  - correct
  - contradictory
  - incorrect (merges partially_correct_incomplete, irrelevant, non_domain)
- 2-way:
  - correct
  - incorrect (merges contradictory, partially_correct_incomplete, irrelevant, non_domain)
Label distribution
5-way
| Label | Train | Test UA | Test UQ | Test UD |
|---|---|---|---|---|
| Correct | 2,008 | 233 | 301 | 1,917 |
| Contradictory | 499 | 58 | 64 | 417 |
| Partially correct but incomplete | 1,324 | 113 | 175 | 986 |
| Irrelevant | 1,115 | 133 | 193 | 1,222 |
| Non-domain | 23 | 3 | - | 20 |
| Total | 4,969 | 540 | 733 | 4,562 |
3-way
| Label | Train | Test UA | Test UQ | Test UD |
|---|---|---|---|---|
| Correct | 2,008 | 233 | 301 | 1,917 |
| Contradictory | 499 | 58 | 64 | 417 |
| Incorrect | 2,462 | 249 | 368 | 2,228 |
| Total | 4,969 | 540 | 733 | 4,562 |
2-way
| Label | Train | Test UA | Test UQ | Test UD |
|---|---|---|---|---|
| Correct | 2,008 | 233 | 301 | 1,917 |
| Incorrect | 2,961 | 307 | 432 | 2,645 |
| Total | 4,969 | 540 | 733 | 4,562 |
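Collected summary
Dataset introduction
Construction
The SciEntsBank dataset is built from student answers to science questions, collected and annotated by Nielsen et al. It contains nearly 11,000 student answers to 197 assessment questions spanning 15 science domains. Three labeling schemes are available: the 5-way scheme targets Automated Short-Answer Grading (ASAG), while the 3-way and 2-way schemes were designed for Recognizing Textual Entailment (RTE). The data is split into a training set and three test sets (Unseen Answers, Unseen Questions, Unseen Domains), so that models can be evaluated under different conditions.
Characteristics
The distinguishing features of SciEntsBank are its multiple labeling schemes and its broad coverage of science domains. The 5-way scheme makes fine-grained distinctions about answer correctness, while the 3-way and 2-way schemes simplify the classification for different research needs. The three test sets (unseen answers, unseen questions, unseen domains) allow a model's generalization to be assessed in progressively harder settings.
Usage
The dataset can be loaded with the Hugging Face `datasets` library. It ships with 5-way labels by default, which can be converted to the 3-way or 2-way scheme in code. The converted dataset can be saved locally to avoid repeating the conversion. The dataset card also provides code for printing the label distribution, which helps users check the distribution under each labeling scheme before training and evaluation.
Background and challenges
Background overview
SciEntsBank is one of the two subsets of the Student Response Analysis (SRA) corpus, the other being the Beetle dataset. It is built from student answers collected by Nielsen et al. (2008), covers 15 science domains, and contains nearly 11,000 answers to 197 assessment questions. The main researchers include Myroslava Dzikovska and Rodney Nielsen, among others, and the core research question is how to automatically assess students' understanding of science concepts. With its 5-way, 3-way, and 2-way labeling schemes, SciEntsBank serves as a resource for Automated Short-Answer Grading (ASAG) and Recognizing Textual Entailment (RTE) and has been influential in educational technology.
Current challenges
Building SciEntsBank involved several challenges. First, representative questions had to be drawn from diverse science domains, and the answers to them had to reflect students' level of understanding accurately. Second, the labeling schemes have to balance granularity against tractability for models; in particular, keeping the classification accurate when simplifying from 5-way to 3-way and 2-way is a key difficulty. Finally, the varied test conditions (unseen answers, unseen questions, unseen domains) make generalization harder to achieve, since a model must perform consistently across all of them.
Common scenarios
Typical use case
SciEntsBank is a standard benchmark in Automated Short-Answer Grading (ASAG), used mainly to assess the quality of student answers to science questions. Each example provides a question, a reference answer, and a student answer, and the 5-way, 3-way, and 2-way labeling schemes give researchers a flexible evaluation framework.
Practical applications
In practice, SciEntsBank is used in educational assessment systems that help teachers automate grading and reduce their workload. It is also used to develop intelligent tutoring systems that analyze the accuracy and relevance of student answers in order to give personalized learning advice.
Derived and related work
The release of SciEntsBank prompted a large body of follow-up research, especially in automated grading and textual entailment. Researchers have developed a variety of grading models and algorithms on top of it, advancing the application of natural language processing in education. Together with related datasets such as Beetle, it forms the Student Response Analysis (SRA) corpus.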