newsroom
Source: huggingface.co (indexed 2025-03-26)
Download link:
https://huggingface.co/datasets/lil-lab/newsroom
Resource description:
NEWSROOM is a large dataset for training and evaluating summarization systems.
It contains 1.3 million articles and summaries written by authors and
editors in the newsrooms of 38 major publications.
Dataset features include:
- text: Input news text.
- summary: Summary for the news.
Additional features:
- title: News title.
- url: URL of the news article.
- date: Date of the article.
- density: Extractive density.
- coverage: Extractive coverage.
- compression: Compression ratio.
- density_bin: low, medium, high.
- coverage_bin: extractive, abstractive.
- compression_bin: low, medium, high.
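The feature list above can be mirrored as a small typed record. This is a minimal sketch, not part of the dataset's own tooling: the class name `NewsroomRecord` and the helper `from_dict` are hypothetical, and the Python types (strings for text fields and bins, floats for the three scores) are assumptions, since the listing does not state them.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class NewsroomRecord:
    """One example, with field names taken from the feature list.

    The types are assumptions; the listing only names the fields.
    """

    text: str
    summary: str
    title: str
    url: str
    date: str
    density: float
    coverage: float
    compression: float
    density_bin: str
    coverage_bin: str
    compression_bin: str

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "NewsroomRecord":
        # Keep only the documented keys; a missing key raises KeyError.
        return cls(**{f.name: raw[f.name] for f in fields(cls)})
```

Calling `NewsroomRecord.from_dict(row)` on a parsed row keeps the eleven documented fields and silently drops any extra keys, which makes schema drift easier to spot early.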
This dataset can be downloaded upon request. Unzip the contents
(train.jsonl, dev.jsonl, test.jsonl) into the tfds folder.
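Once the files are unzipped, they can be inspected without any dataset library. The sketch below assumes each `.jsonl` file is plain UTF-8 JSON Lines (one JSON object per line) and that records carry the `density_bin` field listed above; the helper names `iter_jsonl` and `count_by_bin` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_bin(path: Path, key: str = "density_bin") -> Dict[str, int]:
    """Count records per value of a bin field, streaming the file once."""
    counts: Dict[str, int] = {}
    for record in iter_jsonl(path):
        label = record.get(key, "missing")
        counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, `count_by_bin(Path("dev.jsonl"))` would return a mapping from each `density_bin` value to the number of records carrying it. Streaming line by line avoids loading the full training split into memory.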
Provider:
Hugging Face



