APURVAASINHAAAA/SciEntsBank
Source: Hugging Face | Updated: 2026-03-25 | Indexed: 2026-03-29
Download link: https://hf-mirror.com/datasets/APURVAASINHAAAA/SciEntsBank
---
pretty_name: SciEntsBank
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': correct
          '1': contradictory
          '2': partially_correct_incomplete
          '3': irrelevant
          '4': non_domain
  splits:
  - name: train
    num_bytes: 232655
    num_examples: 4969
  - name: test_ua
    num_bytes: 52730
    num_examples: 540
  - name: test_uq
    num_bytes: 35716
    num_examples: 733
  - name: test_ud
    num_bytes: 177307
    num_examples: 4562
  dataset_size: 498408
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_ua
    path: data/test-ua-*
  - split: test_uq
    path: data/test-uq-*
  - split: test_ud
    path: data/test-ud-*
---
# Dataset Card for "SciEntsBank"
SciEntsBank is one of the two distinct subsets within the Student Response Analysis (SRA) corpus, the other subset being the
[Beetle](https://huggingface.co/datasets/nkazi/Beetle) dataset. Derived from student answers gathered by Nielsen et al. [1],
this dataset comprises nearly 11K responses to 197 assessment questions spanning 15 diverse science domains. The dataset
features three labeling schemes: (a) 5-way, (b) 3-way, and (c) 2-way. The dataset includes a training set and three distinct
test sets: (a) Unseen Answers (`test_ua`), (b) Unseen Questions (`test_uq`), and (c) Unseen Domains (`test_ud`).
- **Authors:** Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang
- **Paper:** [SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge](https://aclanthology.org/S13-2045)
## Loading Dataset
```python
from datasets import load_dataset
dataset = load_dataset('nkazi/SciEntsBank')
```
## Labeling Schemes
The authors released the dataset with annotations using five labels (i.e., the 5-way labeling scheme) for Automated Short-Answer Grading (ASAG).
They also introduced two alternative labeling schemes designed for Recognizing Textual Entailment (RTE), the 3-way and 2-way schemes, both derived
from the 5-way scheme. In the 3-way labeling scheme, the categories "partially correct but incomplete", "irrelevant", and "non-domain" are
consolidated into a single category labeled "incorrect". The 2-way labeling scheme reduces the task to binary classification: all labels
except "correct" are merged into the "incorrect" category.
The `label` column in this dataset holds the 5-way labels. For 3-way and 2-way labels, use the code provided below to derive them
from the 5-way labels. After converting the labels, please verify the label distribution. Code to print the label distribution is
also given below.
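The consolidation rules described above can also be written out as plain lookup tables, independent of the `datasets` library. The following is a minimal sketch; the names `FIVE_TO_THREE`, `FIVE_TO_TWO`, and `convert_label` are illustrative assumptions and are not part of the dataset or of any library.

```python
# Illustrative lookup tables (hypothetical names) for collapsing 5-way labels.
FIVE_WAY = [
    'correct',
    'contradictory',
    'partially_correct_incomplete',
    'irrelevant',
    'non_domain',
]

# 3-way: the last three 5-way categories are merged into "incorrect".
FIVE_TO_THREE = {
    'correct': 'correct',
    'contradictory': 'contradictory',
    'partially_correct_incomplete': 'incorrect',
    'irrelevant': 'incorrect',
    'non_domain': 'incorrect',
}

# 2-way: everything except "correct" becomes "incorrect".
FIVE_TO_TWO = {
    name: ('correct' if name == 'correct' else 'incorrect')
    for name in FIVE_WAY
}


def convert_label(label, scheme):
    """Map a 5-way label name to the '3way' or '2way' scheme."""
    if scheme == '3way':
        return FIVE_TO_THREE[label]
    if scheme == '2way':
        return FIVE_TO_TWO[label]
    raise ValueError(f'unknown scheme: {scheme}')


for name in FIVE_WAY:
    print(name, '->', convert_label(name, '3way'), '/', convert_label(name, '2way'))
```

Working with label names rather than ids keeps the sketch readable; the `datasets`-based snippets in this section operate on the integer ids instead.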
### 5-way to 3-way
```python
from datasets import ClassLabel
dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 2, 'irrelevant': 2, 'non_domain': 2}, 'label')
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'contradictory', 'incorrect']))
```
Using `align_labels_with_mapping()`, we are mapping "partially correct but incomplete", "irrelevant", and "non-domain" to the same id. Subsequently,
we are using `cast_column()` to redefine the class labels (i.e., the label feature) where the id 2 corresponds to the "incorrect" label.
### 5-way to 2-way
```python
from datasets import ClassLabel
dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 1, 'irrelevant': 1, 'non_domain': 1}, 'label')
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'incorrect']))
```
In the above code, the label "correct" is mapped to 0 to maintain consistency with both the 5-way and 3-way labeling schemes. If the preference is to
represent "correct" with id 1 and "incorrect" with id 0, either adjust the label map accordingly or run the following to switch the ids:
```python
dataset = dataset.align_labels_with_mapping({'incorrect': 0, 'correct': 1}, 'label')
```
### Saving and loading 3-way and 2-way datasets
Use the following code to store the dataset with the 3-way (or 2-way) labeling scheme locally to eliminate the need to convert labels each time the dataset is loaded:
```python
dataset.save_to_disk('SciEntsBank_3way')
```
Here, `SciEntsBank_3way` is the path/directory where the dataset will be stored. Use the following code to load the dataset from the same local directory/path:
```python
from datasets import DatasetDict
dataset = DatasetDict.load_from_disk('SciEntsBank_3way')
```
### Printing Label Distribution
Use the following code to print the label distribution:
```python
def print_label_dist(dataset):
    for split_name in dataset:
        print(split_name, ':')
        num_examples = 0
        for label in dataset[split_name].features['label'].names:
            count = dataset[split_name]['label'].count(dataset[split_name].features['label'].str2int(label))
            print(' ', label, ':', count)
            num_examples += count
        print(' total :', num_examples)

print_label_dist(dataset)
```
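For a quick check on a plain list of integer label ids, the same tally can be computed in a single pass with `collections.Counter`. This is only a sketch on made-up toy data; the function name `label_distribution` and the sample ids are hypothetical and not taken from the dataset.

```python
from collections import Counter


def label_distribution(label_ids, names):
    """Count occurrences of each label id and report them by label name."""
    counts = Counter(label_ids)
    # Labels that never occur get an explicit zero count.
    return {name: counts.get(idx, 0) for idx, name in enumerate(names)}


names = ['correct', 'contradictory', 'incorrect']
toy_ids = [0, 2, 2, 1, 0, 2]  # toy data, not taken from the dataset

dist = label_distribution(toy_ids, names)
for name, count in dist.items():
    print(name, ':', count)
print('total :', sum(dist.values()))
```

Counting once and then looking up each label avoids rescanning the label column for every class, which may matter on the larger splits.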
## Label Distribution
<style>
.label-dist table {display: table; width: 100%;}
.label-dist th:not(:first-child), .label-dist td:not(:first-child) {
width: 15%;
}
</style>
<div class="label-dist">
### 5-way
Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Partially correct but incomplete | 1,324 | 113 | 175 | 986
Irrelevant | 1,115 | 133 | 193 | 1,222
Non-domain | 23 | 3 | - | 20
Total | 4,969 | 540 | 733 | 4,562
### 3-way
Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Incorrect | 2,462 | 249 | 368 | 2,228
Total | 4,969 | 540 | 733 | 4,562
### 2-way
Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Incorrect | 2,961 | 307 | 432 | 2,645
Total | 4,969 | 540 | 733 | 4,562
</div>
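The 3-way and 2-way tables follow arithmetically from the 5-way counts. The sketch below re-derives the merged "incorrect" counts and the split totals from the 5-way table; the variable names are illustrative, and the "-" entry for non-domain answers in Test UQ is treated as 0, which is an assumption consistent with that split's total of 733.

```python
# 5-way counts per split, copied from the 5-way table above.
five_way = {
    'train': {'correct': 2008, 'contradictory': 499,
              'partially_correct_incomplete': 1324, 'irrelevant': 1115, 'non_domain': 23},
    'test_ua': {'correct': 233, 'contradictory': 58,
                'partially_correct_incomplete': 113, 'irrelevant': 133, 'non_domain': 3},
    'test_uq': {'correct': 301, 'contradictory': 64,
                'partially_correct_incomplete': 175, 'irrelevant': 193, 'non_domain': 0},
    'test_ud': {'correct': 1917, 'contradictory': 417,
                'partially_correct_incomplete': 986, 'irrelevant': 1222, 'non_domain': 20},
}

# Categories merged into "incorrect" under the 3-way scheme.
merged = ('partially_correct_incomplete', 'irrelevant', 'non_domain')

derived = {}
for split, counts in five_way.items():
    incorrect_3way = sum(counts[name] for name in merged)
    # The 2-way scheme additionally folds "contradictory" into "incorrect".
    incorrect_2way = incorrect_3way + counts['contradictory']
    total = sum(counts.values())
    derived[split] = (incorrect_3way, incorrect_2way, total)
    print(split, '3-way incorrect:', incorrect_3way,
          '2-way incorrect:', incorrect_2way, 'total:', total)
```

The derived values can be compared against the "Incorrect" and "Total" rows of the 3-way and 2-way tables after converting the labels.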
## Citation
Please consider adding a **footnote** linking to this dataset page (e.g., `SciEntsBank\footnote{https://huggingface.co/datasets/nkazi/SciEntsBank}` in LaTeX)
when first mentioning the dataset in your paper, alongside citing the authors/paper. This will promote the availability of this dataset on
Hugging Face and make it more accessible to researchers, given that the original repository is no longer available.
```tex
@inproceedings{dzikovska2013semeval,
  title = {{S}em{E}val-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge},
  author = {Dzikovska, Myroslava and Nielsen, Rodney and Brew, Chris and Leacock, Claudia and Giampiccolo, Danilo and Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Dang, Hoa Trang},
  year = 2013,
  month = jun,
  booktitle = {Second Joint Conference on Lexical and Computational Semantics ({SEM}), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation ({S}em{E}val 2013)},
  editor = {Manandhar, Suresh and Yuret, Deniz},
  publisher = {Association for Computational Linguistics},
  address = {Atlanta, Georgia, USA},
  pages = {263--274},
  url = {https://aclanthology.org/S13-2045},
}
```
## References
1. Rodney D. Nielsen, Wayne Ward, James H. Martin, and Martha Palmer. 2008. Annotating students' understanding of science
concepts. In *Proceedings of the Sixth International Language Resources and Evaluation Conference*, Marrakech, Morocco.
Provider: APURVAASINHAAAA



