
csabstruct

ModelScope Community. Updated 2025-11-12. Indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/csabstruct
Resource description:
# CSAbstruct

CSAbstruct was created as part of *"Pretrained Language Models for Sequential Sentence Classification"* ([ACL Anthology][2], [arXiv][1], [GitHub][6]).

It contains 2,189 manually annotated computer science abstracts with sentences annotated according to their rhetorical roles in the abstract, similar to the [PUBMED-RCT][3] categories.

## Dataset Construction Details

CSAbstruct is a new dataset of annotated computer science abstracts with sentence labels according to their rhetorical roles. The key difference between this dataset and [PUBMED-RCT][3] is that PubMed abstracts are written according to a predefined structure, whereas computer science papers are free-form. Therefore, there is more variety in writing styles in CSAbstruct.

CSAbstruct is collected from the Semantic Scholar corpus [(Ammar et al., 2018)][4]. Each sentence is annotated by 5 workers on the [Figure-eight platform][5], with one of 5 categories `{BACKGROUND, OBJECTIVE, METHOD, RESULT, OTHER}`.

We use 8 abstracts (with 51 sentences) as test questions to train crowdworkers. Annotators whose accuracy is less than 75% are disqualified from doing the actual annotation job. The annotations are aggregated using the agreement on a single sentence weighted by the accuracy of the annotator on the initial test questions. A confidence score is associated with each instance based on the annotator initial accuracy and agreement of all annotators on that instance. We then split the dataset 75%/15%/10% into train/dev/test partitions, such that the test set has the highest confidence scores. Agreement rate on a random subset of 200 sentences is 75%, which is quite high given the difficulty of the task. Compared with [PUBMED-RCT][3], our dataset exhibits a wider variety of writing styles, since its abstracts are not written with an explicit structural template.
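The aggregation and split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The source does not give the exact confidence formula, so the sketch assumes confidence is the accuracy-weighted share of votes for the winning label. It also assumes the dev set takes the next-highest confidence band after the test set, which the source does not state. The function names `aggregate` and `split_by_confidence` are hypothetical.

```python
from collections import defaultdict


def aggregate(votes):
    """Aggregate one sentence's crowd labels.

    votes: list of (label, annotator_accuracy) pairs, where the accuracy is
    the annotator's score on the initial test questions.
    Returns (winning_label, confidence). Confidence here is an ASSUMED
    formula: the winning label's share of the total accuracy weight.
    """
    weights = defaultdict(float)
    for label, accuracy in votes:
        weights[label] += accuracy
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def split_by_confidence(items, train_pct=75, dev_pct=15):
    """Split (item_id, confidence) pairs into train/dev/test.

    Items are ranked by ascending confidence, so the test partition (the
    remainder after train and dev) receives the highest-confidence items.
    Giving dev the next-highest band is an ASSUMPTION of this sketch.
    """
    ranked = sorted(items, key=lambda pair: pair[1])
    n = len(ranked)
    n_train = n * train_pct // 100
    n_dev = n * dev_pct // 100
    train = ranked[:n_train]
    dev = ranked[n_train:n_train + n_dev]
    test = ranked[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Five workers label one sentence; accuracies are made-up illustration values.
    votes = [("METHOD", 0.90), ("METHOD", 0.80), ("RESULT", 0.95),
             ("METHOD", 0.85), ("RESULT", 0.80)]
    label, confidence = aggregate(votes)
    print(label, round(confidence, 3))

    items = [(i, i / 20) for i in range(20)]
    train, dev, test = split_by_confidence(items)
    print(len(train), len(dev), len(test))
```

Weighting votes by annotator accuracy lets a small number of reliable annotators outvote a larger number of less reliable ones. An unweighted majority vote would not capture that.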
## Dataset Statistics

| Statistic                | Avg ± std   |
|--------------------------|-------------|
| Doc length in sentences  | 6.7 ± 1.99  |
| Sentence length in words | 21.8 ± 10.0 |

| Label         | % in Dataset |
|---------------|--------------|
| `BACKGROUND`  | 33%          |
| `METHOD`      | 32%          |
| `RESULT`      | 21%          |
| `OBJECTIVE`   | 12%          |
| `OTHER`       | 3%           |

## Citation

If you use this dataset, please cite the following paper:

```
@inproceedings{Cohan2019EMNLP,
  title={Pretrained Language Models for Sequential Sentence Classification},
  author={Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, Dan Weld},
  year={2019},
  booktitle={EMNLP},
}
```

[1]: https://arxiv.org/abs/1909.04054
[2]: https://aclanthology.org/D19-1383
[3]: https://github.com/Franck-Dernoncourt/pubmed-rct
[4]: https://aclanthology.org/N18-3011/
[5]: https://www.figure-eight.com/
[6]: https://github.com/allenai/sequential_sentence_classification

Provider:
maas
Created:
2025-05-27