scitldr
Source: ModelScope community. Updated 2025-11-27; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/allenai/scitldr
# Dataset Card for SciTLDR
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/allenai/scitldr
- **Repository:** https://github.com/allenai/scitldr
- **Paper:** https://arxiv.org/abs/2004.15011
- **Leaderboard:**
- **Point of Contact:** {isabelc,kylel,armanc,danw}@allenai.org
### Dataset Summary
`SciTLDR`: Extreme Summarization of Scientific Documents
SciTLDR is a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SciTLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden.
### Supported Tasks and Leaderboards
summarization
### Languages
English
## Dataset Structure
SciTLDR is split into train/dev/test sets in a 60/20/20 ratio. In each file, every line is a JSON object formatted as follows:
```
{
"source":[
"sent0",
"sent1",
"sent2",
...
],
"source_labels":[binary list in which 1 is the oracle sentence],
"rouge_scores":[precomputed rouge-1 scores],
"paper_id":"PAPER-ID",
"target":[
"author-tldr",
"pr-tldr0",
"pr-tldr1",
...
],
"title":"TITLE"
}
```
The keys `rouge_scores` and `source_labels` are not required for any code to run; the precomputed ROUGE scores are provided for future research.
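The one-JSON-object-per-line layout described above can be read with the standard `json` module. The following is a minimal sketch, not the loader from the SciTLDR repository: the function name `iter_records` and the `REQUIRED_KEYS` tuple are illustrative assumptions (the card only says that `rouge_scores` and `source_labels` are optional, so the remaining four keys are treated as required here).

```python
import json
from typing import Iterable, Iterator

# Assumption: every key except the two optional ones is treated as required.
REQUIRED_KEYS = ("source", "paper_id", "target", "title")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of JSON Lines input."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        yield record


# A placeholder record mirroring the schema above (values are not real data).
sample_line = json.dumps({
    "source": ["sent0", "sent1", "sent2"],
    "paper_id": "PAPER-ID",
    "target": ["author-tldr", "pr-tldr0", "pr-tldr1"],
    "title": "TITLE",
})

records = list(iter_records([sample_line, ""]))
print(records[0]["title"], len(records[0]["target"]))
```

Because `iter_records` accepts any iterable of strings, an open file handle can be passed directly, for example `with open(path, encoding="utf-8") as handle: records = list(iter_records(handle))`, where `path` points at one of the split files.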
### Data Instances
{
"source": [
"Mixed precision training (MPT) is becoming a practical technique to improve the speed and energy efficiency of training deep neural networks by leveraging the fast hardware support for IEEE half-precision floating point that is available in existing GPUs.",
"MPT is typically used in combination with a technique called loss scaling, that works by scaling up the loss value up before the start of backpropagation in order to minimize the impact of numerical underflow on training.",
"Unfortunately, existing methods make this loss scale value a hyperparameter that needs to be tuned per-model, and a single scale cannot be adapted to different layers at different training stages.",
"We introduce a loss scaling-based training method called adaptive loss scaling that makes MPT easier and more practical to use, by removing the need to tune a model-specific loss scale hyperparameter.",
"We achieve this by introducing layer-wise loss scale values which are automatically computed during training to deal with underflow more effectively than existing methods.",
"We present experimental results on a variety of networks and tasks that show our approach can shorten the time to convergence and improve accuracy, compared with using the existing state-of-the-art MPT and single-precision floating point."
],
"source_labels": [
0,
0,
0,
1,
0,
0
],
"rouge_scores": [
0.2399999958000001,
0.26086956082230633,
0.19999999531250012,
0.38095237636054424,
0.2051282003944774,
0.2978723360796741
],
"paper_id": "rJlnfaNYvB",
"target": [
"We devise adaptive loss scaling to improve mixed precision training that surpass the state-of-the-art results.",
"Proposal for an adaptive loss scaling method during backpropagation for mix precision training where scale rate is decided automatically to reduce the underflow.",
"The authors propose a method to train models in FP16 precision that adopts a more elaborate way to minimize underflow in every layer simultaneously and automatically."
],
"title": "Adaptive Loss Scaling for Mixed Precision Training"
}
### Data Fields
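- `source`: The abstract (A), the abstract, introduction and conclusion (AIC), or the full text of the paper, depending on the configuration, stored as a list with one sentence per entry.
- `source_labels`: A list of binary labels aligned with `source`; 1 marks the oracle sentence and 0 marks every other sentence.
- `rouge_scores`: Precomputed ROUGE-1 scores, one per sentence in `source`.
- `paper_id`: arXiv paper ID.
- `target`: Multiple summaries for each paper (the author-written TLDR followed by the TLDRs derived from reviewer comments), stored as a list with one summary per entry.
- `title`: Title of the paper.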
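Since `source_labels` and `rouge_scores` are aligned with `source`, the oracle sentence can be picked out by position. The sketch below uses the label and score values from the data instance shown earlier, with placeholder sentence strings; the helper names are hypothetical. In that instance the labeled oracle sentence is also the one with the highest precomputed score, but the card does not state that this holds for every record, so treat it as an observation rather than a guarantee.

```python
from typing import List


def oracle_sentences(source: List[str], source_labels: List[int]) -> List[str]:
    """Return the sentences whose label is 1 (the oracle sentences)."""
    if len(source) != len(source_labels):
        raise ValueError("source and source_labels must have the same length")
    return [sent for sent, label in zip(source, source_labels) if label == 1]


def best_rouge_index(rouge_scores: List[float]) -> int:
    """Return the index of the sentence with the highest precomputed score."""
    return max(range(len(rouge_scores)), key=rouge_scores.__getitem__)


# Labels and scores copied from the example instance; sentences are placeholders.
source = [f"sent{i}" for i in range(6)]
source_labels = [0, 0, 0, 1, 0, 0]
rouge_scores = [
    0.2399999958000001,
    0.26086956082230633,
    0.19999999531250012,
    0.38095237636054424,
    0.2051282003944774,
    0.2978723360796741,
]

print(oracle_sentences(source, source_labels))  # ['sent3']
print(best_rouge_index(rouge_scores))           # 3
```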
### Data Splits
| | train | valid | test |
|-------------------|-------|--------|------|
| SciTLDR-A | 1992 | 618 | 619 |
| SciTLDR-AIC | 1992 | 618 | 619 |
| SciTLDR-FullText | 1992 | 618 | 619 |
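As a quick check of the split sizes in the table against the stated 60/20/20 ratio, the shares can be computed directly. This is plain arithmetic on the numbers above; the variable names are illustrative.

```python
# Split sizes from the table above (identical for all three configurations).
split_sizes = {"train": 1992, "valid": 618, "test": 619}

total = sum(split_sizes.values())
shares = {name: 100.0 * size / total for name, size in split_sizes.items()}

print(total)  # 3229
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The total of 3,229 papers matches the "3.2K papers" figure in the summary, and the shares come out to roughly 61.7% / 19.1% / 19.2%, so "60/20/20" is an approximation.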
## Dataset Creation
[More Information Needed]
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
https://allenai.org/
### Annotations
#### Annotation process
Annotators are given the title of a paper and the first 128 words of a reviewer
comment about it, and rewrite the summary contained in the comment (if one
exists) into a single sentence or an incomplete phrase. Summaries must be no
more than one sentence. Most summaries are between 15 and 25 words long, and
the average rewritten summary is 20 words long.
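The length constraints in this protocol can be illustrated with a few small helpers. This is only a sketch under stated assumptions: the card does not say how words were counted, so whitespace tokenization is assumed, and the function names and default bounds are hypothetical rather than part of any released annotation tooling.

```python
def first_n_words(text: str, n: int = 128) -> str:
    """Keep only the first n whitespace-separated words of a reviewer comment."""
    return " ".join(text.split()[:n])


def word_count(summary: str) -> int:
    """Count whitespace-separated words (an assumed definition of 'word')."""
    return len(summary.split())


def in_typical_range(summary: str, low: int = 15, high: int = 25) -> bool:
    """Check whether a summary falls in the 15 to 25 word range reported above."""
    return low <= word_count(summary) <= high


# The author-written TLDR from the example instance.
example_tldr = (
    "We devise adaptive loss scaling to improve mixed precision training "
    "that surpass the state-of-the-art results."
)

print(word_count(example_tldr))
print(in_typical_range(example_tldr))
```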
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
The dataset is intended to encourage further research on extreme summarization of scientific documents.
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Apache License 2.0
### Citation Information
@article{cachola2020tldr,
title={{TLDR}: Extreme Summarization of Scientific Documents},
author={Isabel Cachola and Kyle Lo and Arman Cohan and Daniel S. Weld},
journal={arXiv:2004.15011},
year={2020},
}
### Contributions
Thanks to [@Bharat123rox](https://github.com/Bharat123rox) for adding this dataset.
Provided by: maas
Created: 2025-05-27



