APURVAASINHAAAA/SciEntsBank

Hugging Face · updated 2026-03-25 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/APURVAASINHAAAA/SciEntsBank
---
pretty_name: SciEntsBank
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': correct
          '1': contradictory
          '2': partially_correct_incomplete
          '3': irrelevant
          '4': non_domain
  splits:
  - name: train
    num_bytes: 232655
    num_examples: 4969
  - name: test_ua
    num_bytes: 52730
    num_examples: 540
  - name: test_uq
    num_bytes: 35716
    num_examples: 733
  - name: test_ud
    num_bytes: 177307
    num_examples: 4562
  dataset_size: 498408
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_ua
    path: data/test-ua-*
  - split: test_uq
    path: data/test-uq-*
  - split: test_ud
    path: data/test-ud-*
---

# Dataset Card for "SciEntsBank"

SciEntsBank is one of the two distinct subsets within the Student Response Analysis (SRA) corpus, the other subset being the [Beetle](https://huggingface.co/datasets/nkazi/Beetle) dataset. Derived from student answers gathered by Nielsen et al. [1], this dataset comprises nearly 11K responses to 197 assessment questions spanning 15 diverse science domains. The dataset features three labeling schemes: (a) 5-way, (b) 3-way, and (c) 2-way. The dataset includes a training set and three distinct test sets: (a) Unseen Answers (`test_ua`), (b) Unseen Questions (`test_uq`), and (c) Unseen Domains (`test_ud`).

- **Authors:** Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang
- **Paper:** [SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge](https://aclanthology.org/S13-2045)

## Loading Dataset

```python
from datasets import load_dataset

dataset = load_dataset('nkazi/SciEntsBank')
```

## Labeling Schemes

The authors released the dataset with annotations using five labels (i.e., the 5-way labeling scheme) for Automated Short-Answer Grading (ASAG). The authors also introduced two alternative labeling schemes for Recognizing Textual Entailment (RTE), the 3-way and 2-way schemes, both derived from the 5-way scheme. In the 3-way scheme, the categories "partially correct but incomplete", "irrelevant", and "non-domain" are consolidated into a single category labeled "incorrect". The 2-way scheme reduces the task to binary classification: every label except "correct" is merged into the "incorrect" category.

The `label` column in this dataset holds the 5-way labels. To obtain 3-way or 2-way labels, derive them from the 5-way labels with the code below. After converting the labels, verify the label distribution; code to print the label distribution is also given below.

### 5-way to 3-way

```python
from datasets import ClassLabel

# Map the three "incorrect" sub-categories to the same id (2).
dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 2, 'irrelevant': 2, 'non_domain': 2}, 'label')
# Redefine the label feature so that id 2 is named "incorrect".
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'contradictory', 'incorrect']))
```

`align_labels_with_mapping()` maps "partially correct but incomplete", "irrelevant", and "non-domain" to the same id. `cast_column()` then redefines the class labels (i.e., the label feature) so that id 2 corresponds to the "incorrect" label.

### 5-way to 2-way

```python
from datasets import ClassLabel

# Map every label except "correct" to id 1.
dataset = dataset.align_labels_with_mapping({'correct': 0, 'contradictory': 1, 'partially_correct_incomplete': 1, 'irrelevant': 1, 'non_domain': 1}, 'label')
# Redefine the label feature so that id 1 is named "incorrect".
dataset = dataset.cast_column('label', ClassLabel(names=['correct', 'incorrect']))
```

In the above code, the label "correct" is mapped to 0 to stay consistent with the 5-way and 3-way labeling schemes. To represent "correct" with id 1 and "incorrect" with id 0 instead, either adjust the label map accordingly or run the following to switch the ids:

```python
dataset = dataset.align_labels_with_mapping({'incorrect': 0, 'correct': 1}, 'label')
```

### Saving and loading 3-way and 2-way datasets

Use the following code to store the dataset with the 3-way (or 2-way) labeling scheme locally, so that the labels do not have to be converted each time the dataset is loaded:

```python
dataset.save_to_disk('SciEntsBank_3way')
```

Here, `SciEntsBank_3way` is the path/directory where the dataset will be stored.
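
Because the saved copy freezes whichever labeling scheme was applied, it may help to check what counts the conversion should produce before relying on it. The sketch below is plain Python and does not touch the `datasets` library. The helper `collapse_counts` and the constant names are hypothetical, introduced only for illustration; the train-split counts are copied from the Label Distribution tables on this card, and the mappings mirror the conversion snippets above but use label names rather than ids.

```python
# Train-split counts for the 5-way scheme, copied from the Label Distribution tables.
FIVE_WAY_TRAIN = {
    'correct': 2008,
    'contradictory': 499,
    'partially_correct_incomplete': 1324,
    'irrelevant': 1115,
    'non_domain': 23,
}

# 3-way: the three "incorrect" sub-categories collapse into one label.
TO_3WAY = {
    'correct': 'correct',
    'contradictory': 'contradictory',
    'partially_correct_incomplete': 'incorrect',
    'irrelevant': 'incorrect',
    'non_domain': 'incorrect',
}

# 2-way: everything except "correct" collapses into "incorrect".
TO_2WAY = {label: ('correct' if label == 'correct' else 'incorrect')
           for label in FIVE_WAY_TRAIN}


def collapse_counts(counts, mapping):
    """Sum 5-way counts into the coarser labels given by `mapping` (hypothetical helper)."""
    collapsed = {}
    for label, count in counts.items():
        target = mapping[label]
        collapsed[target] = collapsed.get(target, 0) + count
    return collapsed


three_way = collapse_counts(FIVE_WAY_TRAIN, TO_3WAY)
two_way = collapse_counts(FIVE_WAY_TRAIN, TO_2WAY)
print(three_way)  # {'correct': 2008, 'contradictory': 499, 'incorrect': 2462}
print(two_way)    # {'correct': 2008, 'incorrect': 2961}
```

If the counts printed for the converted train split differ from these, the label map is the first thing to re-check.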

Use the following code to load the dataset from the same local directory/path:

```python
from datasets import DatasetDict

dataset = DatasetDict.load_from_disk('SciEntsBank_3way')
```

### Printing Label Distribution

Use the following code to print the label distribution:

```python
def print_label_dist(dataset):
    # Iterate over every split (train, test_ua, test_uq, test_ud).
    for split_name in dataset:
        print(split_name, ':')
        num_examples = 0
        for label in dataset[split_name].features['label'].names:
            # Count the rows whose id matches this label name.
            count = dataset[split_name]['label'].count(dataset[split_name].features['label'].str2int(label))
            print(' ', label, ':', count)
            num_examples += count
        print('  total :', num_examples)

print_label_dist(dataset)
```

## Label Distribution

<style>
.label-dist table {display: table; width: 100%;}
.label-dist th:not(:first-child), .label-dist td:not(:first-child) {
  width: 15%;
}
</style>

<div class="label-dist">

### 5-way

Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Partially correct but incomplete | 1,324 | 113 | 175 | 986
Irrelevant | 1,115 | 133 | 193 | 1,222
Non-domain | 23 | 3 | - | 20
Total | 4,969 | 540 | 733 | 4,562

### 3-way

Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Contradictory | 499 | 58 | 64 | 417
Incorrect | 2,462 | 249 | 368 | 2,228
Total | 4,969 | 540 | 733 | 4,562

### 2-way

Label | Train | Test UA | Test UQ | Test UD
--- | --: | --: | --: | --:
Correct | 2,008 | 233 | 301 | 1,917
Incorrect | 2,961 | 307 | 432 | 2,645
Total | 4,969 | 540 | 733 | 4,562

</div>

## Citation

Please consider adding a **footnote** linking to this dataset page (e.g., `SciEntsBank\footnote{https://huggingface.co/datasets/nkazi/SciEntsBank}` in LaTeX) when first mentioning the dataset in your paper, alongside citing the authors/paper. This will promote the availability of this dataset on Hugging Face and make it more accessible to researchers, given that the original repository is no longer available.

```tex
@inproceedings{dzikovska2013semeval,
  title     = {{S}em{E}val-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge},
  author    = {Dzikovska, Myroslava and Nielsen, Rodney and Brew, Chris and Leacock, Claudia and Giampiccolo, Danilo and Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Dang, Hoa Trang},
  year      = 2013,
  month     = jun,
  booktitle = {Second Joint Conference on Lexical and Computational Semantics ({SEM}), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation ({S}em{E}val 2013)},
  editor    = {Manandhar, Suresh and Yuret, Deniz},
  publisher = {Association for Computational Linguistics},
  address   = {Atlanta, Georgia, USA},
  pages     = {263--274},
  url       = {https://aclanthology.org/S13-2045},
}
```

## References

1. Rodney D. Nielsen, Wayne Ward, James H. Martin, and Martha Palmer. 2008. Annotating students' understanding of science concepts. In *Proceedings of the Sixth International Language Resources and Evaluation Conference*, Marrakech, Morocco.

Provider: APURVAASINHAAAA