csabstruct

Name: csabstruct
Creator: maas
Published: 2025-11-12 16:35:14
License: No description available

ModelScope community. Updated 2025-11-12; indexed 2025-05-31.

Download link:

https://modelscope.cn/datasets/allenai/csabstruct

Resource description:

# CSAbstruct

CSAbstruct was created as part of *"Pretrained Language Models for Sequential Sentence Classification"* ([ACL Anthology][2], [arXiv][1], [GitHub][6]). It contains 2,189 manually annotated computer science abstracts, with each sentence labeled according to its rhetorical role in the abstract, similar to the [PUBMED-RCT][3] categories.

## Dataset Construction Details

CSAbstruct is a new dataset of annotated computer science abstracts with sentence labels according to their rhetorical roles. The key difference between this dataset and [PUBMED-RCT][3] is that PubMed abstracts are written according to a predefined structure, whereas computer science abstracts are free-form. Therefore, there is more variety in writing styles in CSAbstruct.

CSAbstruct is collected from the Semantic Scholar corpus [(Ammar et al., 2018)][4]. Each sentence is annotated by 5 workers on the [Figure-eight platform][5] with one of 5 categories: `{BACKGROUND, OBJECTIVE, METHOD, RESULT, OTHER}`.

We use 8 abstracts (with 51 sentences) as test questions to train crowdworkers. Annotators whose accuracy is less than 75% are disqualified from doing the actual annotation job. The annotations are aggregated using the agreement on a single sentence, weighted by the accuracy of each annotator on the initial test questions. A confidence score is associated with each instance, based on the annotators' initial accuracy and the agreement of all annotators on that instance. We then split the dataset 75%/15%/10% into train/dev/test partitions, such that the test set has the highest confidence scores.

The agreement rate on a random subset of 200 sentences is 75%, which is quite high given the difficulty of the task. Compared with [PUBMED-RCT][3], our dataset exhibits a wider variety of writing styles, since its abstracts are not written with an explicit structural template.

## Dataset Statistics

| Statistic                | Avg ± std   |
|--------------------------|-------------|
| Doc length in sentences  | 6.7 ± 1.99  |
| Sentence length in words | 21.8 ± 10.0 |

| Label         | % in Dataset |
|---------------|--------------|
| `BACKGROUND`  | 33%          |
| `METHOD`      | 32%          |
| `RESULT`      | 21%          |
| `OBJECTIVE`   | 12%          |
| `OTHER`       | 3%           |

## Citation

If you use this dataset, please cite the following paper:

```
@inproceedings{Cohan2019EMNLP,
  title={Pretrained Language Models for Sequential Sentence Classification},
  author={Arman Cohan and Iz Beltagy and Daniel King and Bhavana Dalvi and Dan Weld},
  year={2019},
  booktitle={EMNLP},
}
```

[1]: https://arxiv.org/abs/1909.04054
[2]: https://aclanthology.org/D19-1383
[3]: https://github.com/Franck-Dernoncourt/pubmed-rct
[4]: https://aclanthology.org/N18-3011/
[5]: https://www.figure-eight.com/
[6]: https://github.com/allenai/sequential_sentence_classification
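
The accuracy-weighted aggregation described under Dataset Construction Details can be illustrated with a small sketch. The dataset card does not give the exact formula, so everything below is an assumption: the function name `aggregate_votes`, the use of each annotator's test-question accuracy directly as the vote weight, and the confidence defined as the winning label's share of the total weight are all hypothetical choices, not the authors' implementation.

```python
from collections import defaultdict

# The five rhetorical-role labels listed in the dataset card.
LABELS = {"BACKGROUND", "OBJECTIVE", "METHOD", "RESULT", "OTHER"}


def aggregate_votes(votes, accuracy):
    """Aggregate crowd labels for one sentence (illustrative sketch only).

    votes:    list of (annotator_id, label) pairs for a single sentence.
    accuracy: dict mapping annotator_id to that annotator's accuracy on the
              initial test questions, in [0, 1]. Using it as the vote weight
              is an assumption, not a documented formula.
    Returns (label, confidence), where confidence is the winning label's
    share of the total vote weight.
    """
    weight_per_label = defaultdict(float)
    for annotator, label in votes:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        weight_per_label[label] += accuracy[annotator]

    total = sum(weight_per_label.values())
    if total == 0:
        raise ValueError("no votes with positive weight")

    # Sort label names first so ties are broken deterministically
    # (alphabetically first label wins).
    best = max(sorted(weight_per_label), key=weight_per_label.get)
    return best, weight_per_label[best] / total


if __name__ == "__main__":
    # Made-up votes from five hypothetical annotators.
    votes = [("a1", "METHOD"), ("a2", "METHOD"), ("a3", "RESULT"),
             ("a4", "METHOD"), ("a5", "BACKGROUND")]
    accuracy = {"a1": 0.9, "a2": 0.8, "a3": 1.0, "a4": 0.8, "a5": 0.75}
    print(aggregate_votes(votes, accuracy))
```

Under this scheme a sentence on which all five annotators agree gets confidence 1.0, while a split vote among equally accurate annotators gets a lower score, which matches the card's description of confidence as depending on both initial accuracy and agreement.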

Provider:

maas

Created:

2025-05-27

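Collected summary

Dataset introduction

Background and challenges

Background overview

CSAbstruct is a dataset of 2,189 computer science abstracts in which every sentence is manually annotated with its rhetorical role (for example, BACKGROUND or METHOD). Compared with PUBMED-RCT, it shows more varied text structure, because computer science abstracts are written more freely. The data was collected through a crowdsourcing platform, quality-controlled using annotator accuracy and agreement, and finally split into train, dev, and test sets.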
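
The dataset's 75%/15%/10% train/dev/test split, in which the test set holds the highest-confidence annotations, could be reproduced along the following lines. This is a sketch under stated assumptions: the function name `split_by_confidence`, the idea of one confidence value per abstract (for instance the mean of its sentence confidences), and the random assignment of the remaining abstracts to train and dev are all hypothetical, since the source does not describe the procedure at that level of detail.

```python
import random


def split_by_confidence(items, test_frac=0.10, dev_frac=0.15, seed=0):
    """Split abstracts so the test set gets the highest confidence scores.

    items: list of (abstract_id, confidence) pairs. A single per-abstract
           confidence is an assumption made for this sketch.
    Returns a dict with "train", "dev", and "test" lists of abstract ids.
    """
    # Rank abstracts from most to least confident.
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    n_test = round(len(ranked) * test_frac)
    n_dev = round(len(ranked) * dev_frac)

    # The top-ranked slice becomes the test set.
    test = ranked[:n_test]

    # Shuffle the remainder reproducibly, then carve out dev and train.
    rest = ranked[n_test:]
    random.Random(seed).shuffle(rest)
    dev, train = rest[:n_dev], rest[n_dev:]

    return {
        "train": [abstract_id for abstract_id, _ in train],
        "dev": [abstract_id for abstract_id, _ in dev],
        "test": [abstract_id for abstract_id, _ in test],
    }


if __name__ == "__main__":
    # 100 made-up abstracts with synthetic confidence scores.
    demo = [(i, i / 100) for i in range(100)]
    splits = split_by_confidence(demo)
    print({name: len(ids) for name, ids in splits.items()})
```

Putting the most reliable annotations in the test set means evaluation numbers are less affected by label noise, at the cost of training on somewhat noisier labels.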

The content above was collected and summarized by 遇见数据集.