nkazi/SciEntsBank

Name: nkazi/SciEntsBank
Creator: nkazi
Published: 2024-01-10 03:56:32
License: not described in the listing (the dataset card below declares cc-by-4.0)

Source: Hugging Face. Updated: 2024-01-10. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/nkazi/SciEntsBank


Resource description:

---
pretty_name: SciEntsBank
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': correct
          '1': contradictory
          '2': partially_correct_incomplete
          '3': irrelevant
          '4': non_domain
  splits:
  - name: train
    num_bytes: 232655
    num_examples: 4969
  - name: test_ua
    num_bytes: 52730
    num_examples: 540
  - name: test_uq
    num_bytes: 35716
    num_examples: 733
  - name: test_ud
    num_bytes: 177307
    num_examples: 4562
  dataset_size: 498408
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_ua
    path: data/test-ua-*
  - split: test_uq
    path: data/test-uq-*
  - split: test_ud
    path: data/test-ud-*
---

# Dataset Card for "SciEntsBank"

SciEntsBank is one of the two distinct subsets within the Student Response Analysis (SRA) corpus, the other subset being the [Beetle](https://huggingface.co/datasets/nkazi/Beetle) dataset. Derived from student answers gathered by Nielsen et al. [1], this dataset comprises nearly 11K responses to 197 assessment questions spanning 15 diverse science domains. The dataset features three labeling schemes: (a) 5-way, (b) 3-way, and (c) 2-way. The dataset includes a training set and three distinct test sets: (a) Unseen Answers (`test_ua`), (b) Unseen Questions (`test_uq`), and (c) Unseen Domains (`test_ud`).

- **Authors:** Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang
- **Paper:** [SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge](https://aclanthology.org/S13-2045)

## Loading Dataset

```python
from datasets import load_dataset

dataset = load_dataset('nkazi/SciEntsBank')
```

## Labeling Schemes

The authors released the dataset with annotations using five labels (i.e., the 5-way labeling scheme) for Automated Short-Answer Grading (ASAG). Additionally, the authors introduced two alternative labeling schemes, the 3-way and 2-way schemes, both derived from the 5-way scheme and designed for Recognizing Textual Entailment (RTE). In the 3-way labeling scheme, the categories "partially correct but incomplete", "irrelevant", and "non-domain" are consolidated into a single category labeled "incorrect". The 2-way labeling scheme simplifies the classification into a binary one in which all labels except "correct" are merged into the "incorrect" category.

The `label` column in this dataset holds the 5-way labels. For 3-way and 2-way labels, use the code provided below to derive them from the 5-way labels. After converting the labels, please verify the label distribution. Code to print the label distribution is also given below.

### 5-way to 3-way

```python
from datasets import ClassLabel

dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 2, 'irrelevant': 2, 'non_domain': 2}, 'label')
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'contradictory', 'incorrect']))
```

With `align_labels_with_mapping()`, "partially correct but incomplete", "irrelevant", and "non-domain" are mapped to the same id. `cast_column()` then redefines the class labels (i.e., the label feature) so that id 2 corresponds to the "incorrect" label.

### 5-way to 2-way

```python
from datasets import ClassLabel

dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 1, 'irrelevant': 1, 'non_domain': 1}, 'label')
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'incorrect']))
```

In the above code, the label "correct" is mapped to 0 to stay consistent with both the 5-way and 3-way labeling schemes. If you prefer to represent "correct" with id 1 and "incorrect" with id 0, either adjust the label map accordingly or run the following to switch the ids:

```python
dataset = dataset.align_labels_with_mapping({'incorrect': 0, 'correct': 1}, 'label')
```

### Saving and loading 3-way and 2-way datasets

Use the following code to store the dataset with the 3-way (or 2-way) labeling scheme locally, so the labels do not have to be converted each time the dataset is loaded:

```python
dataset.save_to_disk('SciEntsBank_3way')
```

Here, `SciEntsBank_3way` is the path/directory where the dataset will be stored. Use the following code to load the dataset from the same local directory/path:

```python
from datasets import DatasetDict

dataset = DatasetDict.load_from_disk('SciEntsBank_3way')
```

### Printing Label Distribution

Use the following code to print the label distribution:

```python
def print_label_dist(dataset):
    # Iterate over the splits of the DatasetDict (train, test_ua, ...).
    for split_name in dataset:
        print(split_name, ':')
        num_examples = 0
        # Count how many rows carry each class label in this split.
        for label in dataset[split_name].features['label'].names:
            count = dataset[split_name]['label'].count(dataset[split_name].features['label'].str2int(label))
            print(' ', label, ':', count)
            num_examples += count
        print('  total :', num_examples)

print_label_dist(dataset)
```

## Label Distribution

<style>
.label-dist th:not(:first-child), .label-dist td:not(:first-child) {
    width: 15%;
}
</style>

<div class="label-dist">

### 5-way

Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Partially correct but incomplete | 1,324 | 113 | 175 | 986
Irrelevant | 1,115 | 133 | 193 | 1,222
Non-domain | 23 | 3 | - | 20
Total | 4,969 | 540 | 733 | 4,562

### 3-way

Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Incorrect | 2,462 | 249 | 368 | 2,228
Total | 4,969 | 540 | 733 | 4,562

### 2-way

Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Incorrect | 2,961 | 307 | 432 | 2,645
Total | 4,969 | 540 | 733 | 4,562

</div>

## Citation

```tex
@inproceedings{dzikovska2013semeval,
    title = {{S}em{E}val-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge},
    author = {Dzikovska, Myroslava and Nielsen, Rodney and Brew, Chris and Leacock, Claudia and Giampiccolo, Danilo and Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Dang, Hoa Trang},
    year = 2013,
    month = jun,
    booktitle = {Second Joint Conference on Lexical and Computational Semantics ({SEM}), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation ({S}em{E}val 2013)},
    editor = {Manandhar, Suresh and Yuret, Deniz},
    publisher = {Association for Computational Linguistics},
    address = {Atlanta, Georgia, USA},
    pages = {263--274},
    url = {https://aclanthology.org/S13-2045},
}
```

## References

1. Rodney D. Nielsen, Wayne Ward, James H. Martin, and Martha Palmer. 2008. Annotating students' understanding of science concepts. In *Proceedings of the Sixth International Language Resources and Evaluation Conference*, Marrakech, Morocco.


Provider:

nkazi

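Summary of original information

Dataset overview

Basic information

Dataset name: SciEntsBank
License: cc-by-4.0
Language: English
Task category: text classification
Size: 10K<n<100K

Dataset structure

Features

id: string
question: string
reference_answer: string
student_answer: string
label: class label
- Class names: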
  - 0: correct
  - 1: contradictory
  - 2: partially_correct_incomplete
  - 3: irrelevant
  - 4: non_domain

Data splits

train:
- Bytes: 232655
- Examples: 4969
test_ua:
- Bytes: 52730
- Examples: 540
test_uq:
- Bytes: 35716
- Examples: 733
test_ud:
- Bytes: 177307
- Examples: 4562

Dataset size

Total bytes: 498408
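
As a quick sanity check on the split metadata above, the per-split figures can be summed in plain Python. This is a minimal sketch: the names `SPLITS` and `totals` are illustrative and are not part of the dataset or of the `datasets` library.

```python
# Split metadata copied from the listing above: (num_bytes, num_examples).
SPLITS = {
    "train": (232655, 4969),
    "test_ua": (52730, 540),
    "test_uq": (35716, 733),
    "test_ud": (177307, 4562),
}


def totals(splits):
    """Return (total_bytes, total_examples) summed over all splits."""
    total_bytes = sum(num_bytes for num_bytes, _ in splits.values())
    total_examples = sum(num_examples for _, num_examples in splits.values())
    return total_bytes, total_examples


total_bytes, total_examples = totals(SPLITS)
print(total_bytes, total_examples)
```

The byte total agrees with the reported dataset size of 498408, and the example total of 10,804 is consistent with the "nearly 11K responses" stated in the dataset card.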

Configuration

Config name: default
- Data files:
  - train: data/train-*
  - test_ua: data/test-ua-*
  - test_uq: data/test-uq-*
  - test_ud: data/test-ud-*

Labeling schemes

5-way:
- correct
- contradictory
- partially_correct_incomplete
- irrelevant
- non_domain
3-way:
- correct
- contradictory
- incorrect (merges partially_correct_incomplete, irrelevant, non_domain)
2-way:
- correct
- incorrect (merges contradictory, partially_correct_incomplete, irrelevant, non_domain)
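
The merging rules listed above can also be written down without the `datasets` library, which may help when the label ids are already held in plain lists or arrays. The sketch below is illustrative only: the helper names `to_three_way` and `to_two_way` are assumptions, and the id assignments follow the ones used in the dataset card (5-way ids 0 to 4, "correct" kept at id 0 in every scheme).

```python
# 5-way class names in id order, as declared in the dataset card.
FIVE_WAY = [
    "correct",
    "contradictory",
    "partially_correct_incomplete",
    "irrelevant",
    "non_domain",
]
THREE_WAY = ["correct", "contradictory", "incorrect"]
TWO_WAY = ["correct", "incorrect"]


def to_three_way(label_id):
    """Map a 5-way id to a 3-way id; everything past 'contradictory' becomes 'incorrect'."""
    name = FIVE_WAY[label_id]
    if name in ("correct", "contradictory"):
        return THREE_WAY.index(name)
    return THREE_WAY.index("incorrect")


def to_two_way(label_id):
    """Map a 5-way id to a 2-way id; every label except 'correct' becomes 'incorrect'."""
    if FIVE_WAY[label_id] == "correct":
        return TWO_WAY.index("correct")
    return TWO_WAY.index("incorrect")


five_way_ids = [0, 1, 2, 3, 4]
print([to_three_way(i) for i in five_way_ids])
print([to_two_way(i) for i in five_way_ids])
```

Keeping the class names in lists and looking ids up by name means the mapping does not silently break if the id order of a target scheme is changed later.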

Label distribution

5-way

Label	Train	Test UA	Test UQ	Test UD
Correct	2,008	233	301	1,917
Contradictory	499	58	64	417
Partially correct but incomplete	1,324	113	175	986
Irrelevant	1,115	133	193	1,222
Non-domain	23	3	-	20
Total	4,969	540	733	4,562

3-way

Label	Train	Test UA	Test UQ	Test UD
Correct	2,008	233	301	1,917
Contradictory	499	58	64	417
Incorrect	2,462	249	368	2,228
Total	4,969	540	733	4,562

2-way

Label	Train	Test UA	Test UQ	Test UD
Correct	2,008	233	301	1,917
Incorrect	2,961	307	432	2,645
Total	4,969	540	733	4,562
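
The three tables above are mutually consistent, which can be checked by merging the 5-way counts according to the 3-way and 2-way rules. The sketch below is a worked check, not part of the dataset tooling: the names `FIVE_WAY_COUNTS` and `merge_incorrect` are illustrative, and the "-" entry for Non-domain in Test UQ is assumed to mean zero.

```python
# 5-way counts copied from the table above ('-' for test_uq non_domain assumed to be 0).
FIVE_WAY_COUNTS = {
    "train": {"correct": 2008, "contradictory": 499,
              "partially_correct_incomplete": 1324, "irrelevant": 1115, "non_domain": 23},
    "test_ua": {"correct": 233, "contradictory": 58,
                "partially_correct_incomplete": 113, "irrelevant": 133, "non_domain": 3},
    "test_uq": {"correct": 301, "contradictory": 64,
                "partially_correct_incomplete": 175, "irrelevant": 193, "non_domain": 0},
    "test_ud": {"correct": 1917, "contradictory": 417,
                "partially_correct_incomplete": 986, "irrelevant": 1222, "non_domain": 20},
}


def merge_incorrect(counts, kept):
    """Keep the labels in `kept` and sum all remaining labels into 'incorrect'."""
    merged = {name: counts[name] for name in kept}
    merged["incorrect"] = sum(n for name, n in counts.items() if name not in kept)
    return merged


for split, counts in FIVE_WAY_COUNTS.items():
    three_way = merge_incorrect(counts, kept=("correct", "contradictory"))
    two_way = merge_incorrect(counts, kept=("correct",))
    print(split, sum(counts.values()), three_way["incorrect"], two_way["incorrect"])
```

For the training split, for example, the 3-way "Incorrect" count is 1,324 + 1,115 + 23 = 2,462, and the 2-way count adds the 499 contradictory answers to give 2,961, matching the tables.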

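Collected summary

Dataset introduction

Construction

SciEntsBank is derived from student answers to science questions, collected and annotated by Nielsen et al. It contains nearly 11,000 student answers to 197 assessment questions spanning 15 science domains. Three labeling schemes are available: the 5-way scheme, intended for Automated Short-Answer Grading (ASAG), and the 3-way and 2-way schemes, derived from it for Recognizing Textual Entailment (RTE). The data is divided into a training set and three test sets (Unseen Answers, Unseen Questions, Unseen Domains), so that models can be evaluated under different conditions.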

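Characteristics

The distinguishing features of SciEntsBank are its multiple labeling schemes and its broad coverage of science domains. The 5-way scheme makes fine-grained distinctions about the correctness of an answer, while the 3-way and 2-way schemes simplify the classification for studies that need coarser labels. The three test sets (unseen answers, unseen questions, unseen domains) make it possible to measure how well a model generalizes to new answers, new questions, and new domains respectively.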

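Usage

SciEntsBank can be loaded with the Hugging Face `datasets` library. The dataset ships with 5-way labels by default, which can be converted in code to the 3-way or 2-way scheme. The converted dataset can be saved locally to avoid repeating the conversion on every load. The dataset card also provides code for printing the label distribution, which helps confirm the class counts under each scheme before training and evaluating a model.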

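Background and challenges

Background overview

SciEntsBank is one of the two subsets of the Student Response Analysis (SRA) corpus; the other is the Beetle dataset. It is built from student answers collected by Nielsen et al. (2008), covers 15 science domains, and contains nearly 11,000 answers to 197 assessment questions. The main researchers include Myroslava Dzikovska, Rodney Nielsen, and others, and the core research question is how to assess students' understanding of science concepts automatically. With its three labeling schemes (5-way, 3-way, and 2-way), SciEntsBank serves as a resource for Automated Short-Answer Grading (ASAG) and Recognizing Textual Entailment (RTE) and has been influential in educational technology.

Current challenges

Building SciEntsBank involved several challenges. First, representative questions had to be drawn from diverse science domains, and the answers to those questions had to reflect students' level of understanding accurately. Second, the labeling schemes have to balance granularity against what a model can reasonably learn; in particular, accuracy must be preserved when simplifying from 5-way to 3-way and 2-way labels. Finally, the three test conditions (unseen answers, unseen questions, unseen domains) make generalization harder to achieve, since a model is expected to perform consistently across all of them.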

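Common scenarios

Classic use cases

SciEntsBank is a standard resource for Automated Short-Answer Grading (ASAG), where it is used to assess the quality of student answers to science questions. Each example provides a question, a reference answer, and a student answer, and the 5-way, 3-way, and 2-way labeling schemes give researchers a flexible evaluation framework.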
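
To make the shape of an example concrete, the sketch below models one record with the five fields declared in the dataset card and turns it into a pair of text segments for a pair classifier. It is only an illustration: the `Example` class, the `as_text_pair` helper, and the way the question is concatenated with the reference answer are assumptions, and the record itself is invented rather than taken from the dataset.

```python
from dataclasses import dataclass

# 5-way class names in id order, as declared in the dataset card.
LABEL_NAMES = [
    "correct",
    "contradictory",
    "partially_correct_incomplete",
    "irrelevant",
    "non_domain",
]


@dataclass(frozen=True)
class Example:
    """One row with the feature names listed in the dataset card."""
    id: str
    question: str
    reference_answer: str
    student_answer: str
    label: int


def as_text_pair(example):
    """Return (first_segment, second_segment); this layout is an assumption."""
    first = f"{example.question} {example.reference_answer}"
    second = example.student_answer
    return first, second


# Invented record for illustration only; NOT taken from SciEntsBank.
demo = Example(
    id="demo-0",
    question="Why does ice float on water?",
    reference_answer="Ice is less dense than liquid water.",
    student_answer="Because ice is lighter for the same volume.",
    label=0,
)
pair = as_text_pair(demo)
print(pair, LABEL_NAMES[demo.label])
```

Other layouts are equally plausible, for instance passing only the reference answer and the student answer when the task is framed as textual entailment.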

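Practical applications

In practice, SciEntsBank is used in building educational assessment systems that help teachers automate grading and reduce their workload. It is also used in developing intelligent tutoring systems, which analyze the accuracy and relevance of student answers in order to give individualized learning suggestions.

Derived and related work

The release of SciEntsBank prompted a substantial body of follow-up research, particularly in automated grading and textual entailment. Researchers have built a range of grading models and algorithms on it, advancing the use of natural language processing in education. Together with the Beetle dataset, it makes up the Student Response Analysis (SRA) corpus.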

The content above was collected and summarized by 遇见数据集.
