stacked-summaries/stacked-xsum-1024
Source: Hugging Face. Updated 2023-10-08; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/stacked-summaries/stacked-xsum-1024
---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
source_datasets:
- xsum
task_categories:
- summarization
pretty_name: 'Stacked XSUM: 1024 tokens max'
tags:
- stacked summaries
- xsum
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: document
    dtype: string
  - name: summary
    dtype: string
  - name: id
    dtype: int64
  - name: chapter_length
    dtype: int64
  - name: summary_length
    dtype: int64
  - name: is_stacked
    dtype: bool
  splits:
  - name: train
    num_bytes: 918588672
    num_examples: 320939
  - name: validation
    num_bytes: 51154057
    num_examples: 17935
  - name: test
    num_bytes: 51118088
    num_examples: 17830
  download_size: 653378162
  dataset_size: 1020860817
---
# stacked-xsum-1024
A "stacked" version of `xsum`.
1. Original Dataset: a copy of the base dataset.
2. Stacked Rows: rows of the original dataset are stacked subject to two length limits:
   - Maximum Input Length: 1024 tokens, as counted by the LongT5 model tokenizer.
   - Maximum Output Length: also 1024 tokens, as counted by the LongT5 model tokenizer.
3. Special Token: the `[NEXT_CONCEPT]` token marks a new topic **within** the same summary. It is recommended to add this special token to your model's tokenizer explicitly before training, so that it is recognized and processed correctly during downstream usage.
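The stacking idea and the role of `[NEXT_CONCEPT]` can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' pipeline: the helper names `count_tokens`, `stack_pair`, and `split_summary` are hypothetical, documents are joined with a blank line, and a whitespace word count stands in for the LongT5 tokenizer.

```python
from typing import List, Optional

NEXT_CONCEPT = "[NEXT_CONCEPT]"
MAX_TOKENS = 1024  # input and output limit stated in this card


def count_tokens(text: str) -> int:
    """Whitespace word count; a crude stand-in for the LongT5 tokenizer."""
    return len(text.split())


def stack_pair(row_a: dict, row_b: dict, max_tokens: int = MAX_TOKENS) -> Optional[dict]:
    """Stack two rows into one, or return None if a length limit is exceeded.

    The join format (blank line between documents, NEXT_CONCEPT between
    summaries) is an assumption, not taken from the original pipeline.
    """
    document = row_a["document"] + "\n\n" + row_b["document"]
    summary = f'{row_a["summary"]} {NEXT_CONCEPT} {row_b["summary"]}'
    if count_tokens(document) > max_tokens or count_tokens(summary) > max_tokens:
        return None
    return {"document": document, "summary": summary, "is_stacked": True}


def split_summary(summary: str) -> List[str]:
    """Split a stacked summary back into its per-topic parts."""
    return [part.strip() for part in summary.split(NEXT_CONCEPT) if part.strip()]


a = {"document": "The council approved the new bridge.", "summary": "Bridge approved."}
b = {"document": "Heavy rain closed two schools.", "summary": "Rain closes schools."}

stacked = stack_pair(a, b)
print(stacked["summary"])
print(split_summary(stacked["summary"]))
```

When training with the Hugging Face `transformers` library, the token is typically registered with `tokenizer.add_tokens(["[NEXT_CONCEPT]"])` followed by `model.resize_token_embeddings(len(tokenizer))`, so that it maps to a single ID instead of being split into sub-word pieces.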
## updates
- dec 3: upload initial version
- dec 4: upload v2 with basic data quality fixes (i.e. the `is_stacked` column)
- dec 5 0500: upload v3 which has pre-randomised order and duplicate rows for document+summary dropped
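
The v3 step (dropping rows whose document and summary are both duplicated, then fixing a random order) might look roughly like the sketch below. It is an illustration under assumptions, not the code used to build the dataset: `dedupe_and_shuffle` is a hypothetical helper, rows are plain dictionaries, and the default seed of 323 is the value reported in the repo log.

```python
import random


def dedupe_and_shuffle(rows, seed=323):
    """Drop exact (document, summary) duplicates, then shuffle reproducibly."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["document"], row["summary"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(unique)
    return unique


rows = [
    {"document": "doc A", "summary": "sum A"},
    {"document": "doc B", "summary": "sum B"},
    {"document": "doc A", "summary": "sum A"},  # exact duplicate: dropped
    {"document": "doc A", "summary": "sum C"},  # same document, new summary: kept
]

result = dedupe_and_shuffle(rows)
print(len(result))  # 3
```

Keying on the pair instead of on the document alone keeps rows that share a document but carry different summaries.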
## stats

## dataset details
see the repo `.log` file for more details.
train input:
```python
[2022-12-05 01:05:17] INFO:root:INPUTS - basic stats - train
[2022-12-05 01:05:17] INFO:root:{'num_columns': 5,
'num_rows': 204045,
'num_unique_target': 203107,
'num_unique_text': 203846,
'summary - average chars': 125.46,
'summary - average tokens': 30.383719277610332,
'text input - average chars': 2202.42,
'text input - average tokens': 523.9222230390355}
```
stacked train:
```python
[2022-12-05 04:47:01] INFO:root:stacked 181719 rows, 22326 rows were ineligible
[2022-12-05 04:47:02] INFO:root:dropped 64825 duplicate rows, 320939 rows remain
[2022-12-05 04:47:02] INFO:root:shuffling output with seed 323
[2022-12-05 04:47:03] INFO:root:STACKED - basic stats - train
[2022-12-05 04:47:04] INFO:root:{'num_columns': 6,
'num_rows': 320939,
'num_unique_chapters': 320840,
'num_unique_summaries': 320101,
'summary - average chars': 199.89,
'summary - average tokens': 46.29925001324239,
'text input - average chars': 2629.19,
'text input - average tokens': 621.541532814647}
```
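
The row counts in the two log excerpts above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from the logs; reading the final train split as "original rows plus stacked rows, minus duplicates" is an interpretation that happens to match the counts, not something the logs state explicitly.

```python
# Numbers copied from the log excerpts above.
original_rows = 204045       # INPUTS num_rows
stacked_rows = 181719        # "stacked 181719 rows"
ineligible_rows = 22326      # "22326 rows were ineligible"
dropped_duplicates = 64825   # "dropped 64825 duplicate rows"
final_rows = 320939          # STACKED num_rows

# Every original row was either stacked or ineligible.
assert stacked_rows + ineligible_rows == original_rows

# Original copy + stacked rows - duplicates gives the final train split.
assert original_rows + stacked_rows - dropped_duplicates == final_rows

# Relative growth in average length after stacking.
input_growth = 621.541532814647 / 523.9222230390355
summary_growth = 46.29925001324239 / 30.383719277610332
print(f"input tokens grew by a factor of {input_growth:.2f}")
print(f"summary tokens grew by a factor of {summary_growth:.2f}")
```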
## Citation
If you find this useful in your work, please consider citing us.
```
@misc {stacked_summaries_2023,
author = { {Stacked Summaries: Karim Foda and Peter Szemraj} },
title = { stacked-xsum-1024 (Revision 2d47220) },
year = 2023,
url = { https://huggingface.co/datasets/stacked-summaries/stacked-xsum-1024 },
doi = { 10.57967/hf/0390 },
publisher = { Hugging Face }
}
```
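Provider:
stacked-summaries
Original information summary
Dataset overview
Basic information
- Language: English
- License: Apache-2.0
- Size category: 100K<n<1M
- Source dataset: xsum
- Task category: summarization
- Pretty name: Stacked XSUM: 1024 tokens max
- Tags: stacked summaries, xsum
Configuration
- Default config:
  - Train data: data/train-*
  - Validation data: data/validation-*
  - Test data: data/test-*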
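Dataset information
- Features:
  - document: string
  - summary: string
  - id: int64
  - chapter_length: int64
  - summary_length: int64
  - is_stacked: bool
- Splits:
  - Train: 320939 examples, 918588672 bytes
  - Validation: 17935 examples, 51154057 bytes
  - Test: 17830 examples, 51118088 bytes
- Download size: 653378162 bytes
- Dataset size: 1020860817 bytes
Dataset processing
- Original dataset: a copy of the base dataset
- Stacked rows:
  - Maximum input length: 1024 tokens
  - Maximum output length: 1024 tokens
- Special token: the `[NEXT_CONCEPT]` token indicates a new topic within the same summary
Dataset statistics
- Train input:
  - Average summary characters: 125.46
  - Average summary tokens: 30.38
  - Average text input characters: 2202.42
  - Average text input tokens: 523.92
- Stacked train:
  - Average summary characters: 199.89
  - Average summary tokens: 46.30
  - Average text input characters: 2629.19
  - Average text input tokens: 621.54
Update log
- Dec 3: upload initial version
- Dec 4: upload v2 with basic data quality fixes
- Dec 5: upload v3 with pre-randomised order and duplicate document+summary rows dropped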