giuliadc/newsroom_filtered_test_split

Name: giuliadc/newsroom_filtered_test_split
Creator: giuliadc
Published: 2024-07-17 16:33:12
License: No description available

Source: Hugging Face; updated 2024-07-17; indexed 2024-07-22

Download link:

https://hf-mirror.com/datasets/giuliadc/newsroom_filtered_test_split

Resource overview:

This dataset was created by filtering the test split of the Newsroom dataset with the Cleaner tool by Aumiller et al. The filter required a minimum summary length of 18 and a minimum reference text length of 250, both measured in whitespace-separated tokens (length_metric = "whitespace"), and removed every article-summary pair with a density greater than 2 so that the remaining summaries are more abstractive. From the filtered data, 10,000 samples were selected at random and retained. Further processing replaced line breaks, removed unnecessary columns, renamed columns, added unique IDs, and replaced HTML escape characters.

Provider:

giuliadc

Original information summary

Dataset overview

Task category

Summarization

Language

English

Dataset size

1K<n<10K

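Data processing steps

Data filtering:
- Filter the test split of the Newsroom dataset with the Cleaner tool developed by Aumiller et al.
- Parameters: min_length_summary = 18, min_length_reference = 250, length_metric = "whitespace".
- Remove all article-summary pairs with a density greater than 2, so that the remaining summaries lean abstractive.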
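
The filtering criteria above can be sketched in plain Python. This is not the Cleaner tool's API; it is a minimal stand-in that assumes each record is a dict with the field names `text`, `summary`, and a precomputed `density` value (all three names are assumptions here), and that the length thresholds are inclusive.

```python
def keep_pair(record, min_length_summary=18, min_length_reference=250,
              max_density=2.0):
    """Return True if an article-summary pair passes all three filters.

    Lengths are counted in whitespace-separated tokens, mirroring
    length_metric = "whitespace". Field names are assumed, not confirmed.
    """
    summary_length = len(record["summary"].split())
    reference_length = len(record["text"].split())
    return (
        summary_length >= min_length_summary
        and reference_length >= min_length_reference
        # Pairs with a density greater than 2 are removed.
        and record["density"] <= max_density
    )


def filter_records(records):
    """Keep only the records that pass keep_pair."""
    return [record for record in records if keep_pair(record)]
```
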
Data sampling:
- Randomly select 10k samples from the filtered data; the remaining samples are removed.
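
A minimal sketch of the sampling step, assuming the filtered records are held in a list. The source does not state a random seed or the sampling routine used; the `seed` parameter below is a hypothetical addition for reproducibility.

```python
import random


def sample_records(records, k=10_000, seed=0):
    """Draw k records uniformly at random without replacement.

    If fewer than k records are available, all of them are returned.
    The seed is an assumption; the original seed is not documented.
    """
    if len(records) <= k:
        return list(records)
    return random.Random(seed).sample(records, k)
```
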
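Data cleaning:
- Replace newline characters (\n) in the articles and summaries with spaces.
- Remove the following columns: date, density_bin, url, compression, coverage_bin, coverage, title.
- Rename the summary column to reference-summary.
- Add an id column and assign each sample a unique ID of the form newsroom-n, where n increases from 1.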
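
The cleaning steps above could look roughly like the following. It is a sketch on plain dicts rather than the author's actual script, and it assumes the article column is named `text` (the source names only the `summary` column and the removed columns).

```python
DROPPED_COLUMNS = ("date", "density_bin", "url", "compression",
                   "coverage_bin", "coverage", "title")


def clean_record(record, index):
    """Apply the listed cleaning steps to one record."""
    # Drop the columns that are not kept in the final dataset.
    cleaned = {key: value for key, value in record.items()
               if key not in DROPPED_COLUMNS}
    # Replace newlines with spaces in the article ("text" is an assumed name).
    cleaned["text"] = cleaned["text"].replace("\n", " ")
    # Rename summary to reference-summary, replacing newlines on the way.
    cleaned["reference-summary"] = cleaned.pop("summary").replace("\n", " ")
    # IDs have the form newsroom-n.
    cleaned["id"] = f"newsroom-{index}"
    return cleaned


def clean_all(records):
    """Clean every record, numbering IDs from 1."""
    return [clean_record(record, index)
            for index, record in enumerate(records, start=1)]
```
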
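HTML escape character replacement:
- Replace HTML escape sequences in the summaries with the corresponding Unicode characters, including:
  - “
  - ”
  - ’
  - (non-breaking space)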
  - –
  - —
  - £
  - "
  - &
  - ’
  - &8220;
  - &8221;
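
One way to approximate this replacement step is Python's standard `html.unescape`, which decodes every well-formed HTML entity and is therefore broader than the explicit list above. The malformed sequences `&8220;` and `&8221;` are not valid entities, so the sketch maps them by hand; treating them as the left and right double quotation marks (U+201C, U+201D) is an assumption based on the code points they resemble.

```python
import html

# Assumed mapping for the two malformed sequences in the list above.
MALFORMED_ENTITIES = {
    "&8220;": "\u201c",  # left double quotation mark
    "&8221;": "\u201d",  # right double quotation mark
}


def unescape_summary(summary):
    """Replace HTML escape sequences in a summary with Unicode characters."""
    for broken, character in MALFORMED_ENTITIES.items():
        summary = summary.replace(broken, character)
    # html.unescape handles named and numeric entities such as &amp; or &pound;.
    return html.unescape(summary)
```
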
