csabstruct
Source: ModelScope community (魔搭社区). Updated 2025-11-12; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/allenai/csabstruct
Resource description:
# CSAbstruct
CSAbstruct was created as part of *"Pretrained Language Models for Sequential Sentence Classification"* ([ACL Anthology][2], [arXiv][1], [GitHub][6]).
It contains 2,189 manually annotated computer science abstracts with sentences annotated according to their rhetorical roles in the abstract, similar to the [PUBMED-RCT][3] categories.
## Dataset Construction Details
CSAbstruct is a new dataset of annotated computer science abstracts with sentence labels according to their rhetorical roles.
The key difference between this dataset and [PUBMED-RCT][3] is that PubMed abstracts are written according to a predefined structure, whereas computer science papers are free-form.
Therefore, there is more variety in writing styles in CSAbstruct.
CSAbstruct is collected from the Semantic Scholar corpus [(Ammar et al., 2018)][4].
Each sentence is annotated by 5 workers on the [Figure-eight platform][5], with one of 5 categories `{BACKGROUND, OBJECTIVE, METHOD, RESULT, OTHER}`.
We use 8 abstracts (with 51 sentences) as test questions to train crowdworkers.
Annotators whose accuracy is less than 75% are disqualified from doing the actual annotation job.
The annotations are aggregated using the agreement on a single sentence weighted by the accuracy of the annotator on the initial test questions.
A confidence score is associated with each instance, based on the annotators' initial accuracy and the agreement of all annotators on that instance.
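
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `aggregate`, the input layout, and in particular the confidence formula (the winning label's share of the accuracy-weighted votes) are assumptions, since the exact formula is not given here.

```python
from collections import defaultdict

# The 75% threshold comes from the text; in the described pipeline, annotators
# below it never reach the real job, so the check here is only a guard.
MIN_ACCURACY = 0.75


def aggregate(votes, accuracy):
    """Aggregate the votes for one sentence.

    votes:    dict mapping annotator id -> chosen label
    accuracy: dict mapping annotator id -> accuracy on the test questions
    Returns (label, confidence). The confidence definition is an assumption.
    """
    weights = defaultdict(float)
    for annotator, label in votes.items():
        acc = accuracy[annotator]
        if acc < MIN_ACCURACY:
            continue  # disqualified annotator, vote ignored
        weights[label] += acc  # each vote counts in proportion to accuracy

    total = sum(weights.values())
    if total == 0:
        return None, 0.0  # no qualified votes for this sentence

    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Hypothetical votes from five annotators on a single sentence.
votes = {"a1": "METHOD", "a2": "METHOD", "a3": "RESULT", "a4": "RESULT", "a5": "METHOD"}
accuracy = {"a1": 1.0, "a2": 1.0, "a3": 0.8, "a4": 0.8, "a5": 0.8}
label, confidence = aggregate(votes, accuracy)
```

With these made-up inputs, `METHOD` collects a weight of 2.8 out of 4.4, so it is selected with a confidence of about 0.64 under the assumed formula.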
We then split the dataset 75%/15%/10% into train/dev/test partitions, such that the test set has the highest confidence scores.
Agreement rate on a random subset of 200 sentences is 75%, which is quite high given the difficulty of the task.
Compared with [PUBMED-RCT][3], our dataset exhibits a wider variety of writing styles, since its abstracts are not written with an explicit structural template.
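
A confidence-ordered 75%/15%/10% split like the one described above might look like the sketch below. The text only states that the test set holds the highest-confidence instances; assigning the next-highest band to dev, using one confidence value per abstract, and the name `split_by_confidence` are all assumptions made for illustration.

```python
def split_by_confidence(abstracts, fractions=(0.75, 0.15, 0.10)):
    """Split (abstract_id, confidence) pairs into train/dev/test.

    The highest-confidence items go to test, as stated in the text.
    Giving dev the next-highest band is an assumption of this sketch.
    """
    ranked = sorted(abstracts, key=lambda item: item[1])  # ascending confidence
    n = len(ranked)
    n_test = round(n * fractions[2])
    n_dev = round(n * fractions[1])
    n_train = n - n_dev - n_test  # remainder goes to train

    train = ranked[:n_train]
    dev = ranked[n_train:n_train + n_dev]
    test = ranked[n_train + n_dev:]
    return train, dev, test


# Hypothetical data: 20 abstracts with evenly spaced confidence scores.
toy = [(f"abs{i}", 0.50 + 0.02 * i) for i in range(20)]
train, dev, test = split_by_confidence(toy)
```

On the toy input this yields 15 training, 3 dev, and 2 test abstracts, with every test abstract scoring at least as high as any abstract in the other two partitions.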
## Dataset Statistics
| Statistic | Avg ± std |
|--------------------------|-------------|
| Doc length in sentences | 6.7 ± 1.99 |
| Sentence length in words | 21.8 ± 10.0 |

| Label | % in Dataset |
|---------------|--------------|
| `BACKGROUND` | 33% |
| `METHOD` | 32% |
| `RESULT` | 21% |
| `OBJECTIVE` | 12% |
| `OTHER` | 3% |
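
Statistics of the kind shown in the two tables above could be recomputed with a short script such as this one. It is a sketch under assumptions: the in-memory layout (a list of abstracts, each a list of `(sentence, label)` pairs), whitespace tokenization for word counts, and the use of the population standard deviation are not specified here and may differ from how the reported numbers were obtained.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(abstracts):
    """Compute length statistics and label percentages.

    abstracts: list of abstracts, each a list of (sentence, label) pairs.
    """
    doc_lengths = [len(doc) for doc in abstracts]
    # Whitespace tokenization is an assumption of this sketch.
    sent_lengths = [len(sent.split()) for doc in abstracts for sent, _ in doc]
    labels = Counter(label for doc in abstracts for _, label in doc)
    total = sum(labels.values())
    return {
        "doc_len": (mean(doc_lengths), pstdev(doc_lengths)),
        "sent_len": (mean(sent_lengths), pstdev(sent_lengths)),
        "label_pct": {lab: 100 * n / total for lab, n in labels.items()},
    }


# Hypothetical toy data, not taken from the dataset.
toy = [
    [("We study parsing .", "BACKGROUND"), ("Our model wins .", "RESULT")],
    [("a b", "BACKGROUND"), ("c d e", "METHOD"), ("f", "METHOD"), ("g h i j", "RESULT")],
]
stats = summarize(toy)
```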
## Citation
If you use this dataset, please cite the following paper:
```
@inproceedings{Cohan2019EMNLP,
title={Pretrained Language Models for Sequential Sentence Classification},
author={Arman Cohan and Iz Beltagy and Daniel King and Bhavana Dalvi and Dan Weld},
year={2019},
booktitle={EMNLP},
}
```
[1]: https://arxiv.org/abs/1909.04054
[2]: https://aclanthology.org/D19-1383
[3]: https://github.com/Franck-Dernoncourt/pubmed-rct
[4]: https://aclanthology.org/N18-3011/
[5]: https://www.figure-eight.com/
[6]: https://github.com/allenai/sequential_sentence_classification
Provider: maas
Created: 2025-05-27



