stacked-summaries/stacked-xsum-1024

Name: stacked-summaries/stacked-xsum-1024
Creator: stacked-summaries
Published: 2023-10-08 23:34:15
License: apache-2.0

Hugging Face · Updated 2023-10-08 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/stacked-summaries/stacked-xsum-1024

Resource description:

---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
source_datasets:
- xsum
task_categories:
- summarization
pretty_name: 'Stacked XSUM: 1024 tokens max'
tags:
- stacked summaries
- xsum
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: document
    dtype: string
  - name: summary
    dtype: string
  - name: id
    dtype: int64
  - name: chapter_length
    dtype: int64
  - name: summary_length
    dtype: int64
  - name: is_stacked
    dtype: bool
  splits:
  - name: train
    num_bytes: 918588672
    num_examples: 320939
  - name: validation
    num_bytes: 51154057
    num_examples: 17935
  - name: test
    num_bytes: 51118088
    num_examples: 17830
  download_size: 653378162
  dataset_size: 1020860817
---

# stacked-xsum-1024

A "stacked" version of `xsum`.

1. Original Dataset: copy of the base dataset
2. Stacked Rows: The original dataset is processed by stacking rows based on certain criteria:
   - Maximum Input Length: The maximum length for input sequences is 1024 tokens in the longt5 model tokenizer.
   - Maximum Output Length: The maximum length for output sequences is also 1024 tokens in the longt5 model tokenizer.
3. Special Token: The dataset utilizes the `[NEXT_CONCEPT]` token to indicate a new topic **within** the same summary. It is recommended to explicitly add this special token to your model's tokenizer before training, ensuring that it is recognized and processed correctly during downstream usage.

## updates

- dec 3: upload initial version
- dec 4: upload v2 with basic data quality fixes (i.e. the `is_stacked` column)
- dec 5 0500: upload v3 which has pre-randomised order and duplicate rows for document+summary dropped

## stats

![stats](https://i.imgur.com/TyyDthT.png)

## dataset details

see the repo `.log` file for more details.

train input

```python
[2022-12-05 01:05:17] INFO:root:INPUTS - basic stats - train
[2022-12-05 01:05:17] INFO:root:{'num_columns': 5, 'num_rows': 204045, 'num_unique_target': 203107, 'num_unique_text': 203846, 'summary - average chars': 125.46, 'summary - average tokens': 30.383719277610332, 'text input - average chars': 2202.42, 'text input - average tokens': 523.9222230390355}
```

stacked train:

```python
[2022-12-05 04:47:01] INFO:root:stacked 181719 rows, 22326 rows were ineligible
[2022-12-05 04:47:02] INFO:root:dropped 64825 duplicate rows, 320939 rows remain
[2022-12-05 04:47:02] INFO:root:shuffling output with seed 323
[2022-12-05 04:47:03] INFO:root:STACKED - basic stats - train
[2022-12-05 04:47:04] INFO:root:{'num_columns': 6, 'num_rows': 320939, 'num_unique_chapters': 320840, 'num_unique_summaries': 320101, 'summary - average chars': 199.89, 'summary - average tokens': 46.29925001324239, 'text input - average chars': 2629.19, 'text input - average tokens': 621.541532814647}
```

## Citation

If you find this useful in your work, please consider citing us.

```
@misc {stacked_summaries_2023,
    author    = { {Stacked Summaries: Karim Foda and Peter Szemraj} },
    title     = { stacked-xsum-1024 (Revision 2d47220) },
    year      = 2023,
    url       = { https://huggingface.co/datasets/stacked-summaries/stacked-xsum-1024 },
    doi       = { 10.57967/hf/0390 },
    publisher = { Hugging Face }
}
```

Provider:

stacked-summaries

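Summary of original information

Dataset overview

Basic information

Language: English
License: Apache-2.0
Size category: 100K<n<1M
Source dataset: xsum
Task category: summarization
Pretty name: Stacked XSUM: 1024 tokens max
Tags: stacked summaries, xsum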

Configuration

Default configuration:
- Train split: data/train-*
- Validation split: data/validation-*
- Test split: data/test-*

Dataset information

Features:
- document: string
- summary: string
- id: int64
- chapter_length: int64
- summary_length: int64
- is_stacked: bool
Splits:
- Train: 320939 examples, 918588672 bytes
- Validation: 17935 examples, 51154057 bytes
- Test: 17830 examples, 51118088 bytes
Download size: 653378162 bytes
Dataset size: 1020860817 bytes
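
As a quick sanity check on the figures above, the per-split byte counts should add up to the reported dataset size. The sketch below uses only numbers copied from this page; the variable names are illustrative and not part of the dataset.

```python
# Split sizes copied from the dataset information above: (examples, bytes).
splits = {
    "train": (320939, 918588672),
    "validation": (17935, 51154057),
    "test": (17830, 51118088),
}
download_size = 653378162   # bytes, as reported
dataset_size = 1020860817   # bytes, as reported

total_examples = sum(n for n, _ in splits.values())
total_bytes = sum(b for _, b in splits.values())

# The split byte counts sum exactly to the reported dataset size.
assert total_bytes == dataset_size

# The total row count falls inside the declared 100K<n<1M size category.
assert 100_000 < total_examples < 1_000_000

# Ratio of the reported download size to the reported dataset size.
download_ratio = download_size / dataset_size
print(total_examples, total_bytes, f"{download_ratio:.2f}")
```

The three splits hold 356704 examples in total, and the download size is roughly 64% of the reported dataset size.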

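Dataset processing

Original dataset: a copy of the base dataset
Stacked rows:
- Maximum input length: 1024 tokens (longt5 model tokenizer)
- Maximum output length: 1024 tokens (longt5 model tokenizer)
Special token: [NEXT_CONCEPT] marks a new topic within the same summary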
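
The stacking idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the whitespace token counter is a stand-in for the longt5 tokenizer, and the greedy grouping of consecutive rows, the space used to join documents, and the helper names (`count_tokens`, `stack_rows`, `split_summary`) are all assumptions made for illustration. Only the `[NEXT_CONCEPT]` token, the 1024-token limits, and the column names come from the dataset card.

```python
from typing import Callable, Dict, List

NEXT_CONCEPT = "[NEXT_CONCEPT]"  # special token named in the dataset card
MAX_INPUT_TOKENS = 1024          # input limit from the dataset card
MAX_OUTPUT_TOKENS = 1024         # output limit from the dataset card


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); NOT the longt5 tokenizer."""
    return len(text.split())


def _emit(docs: List[str], sums: List[str]) -> Dict[str, object]:
    """Build one output row from the documents and summaries in a group."""
    return {
        "document": " ".join(docs),                   # assumed separator
        "summary": f" {NEXT_CONCEPT} ".join(sums),    # token between topics
        "is_stacked": len(docs) > 1,
    }


def stack_rows(
    rows: List[Dict[str, str]],
    max_in: int = MAX_INPUT_TOKENS,
    max_out: int = MAX_OUTPUT_TOKENS,
    counter: Callable[[str], int] = count_tokens,
) -> List[Dict[str, object]]:
    """Greedily merge consecutive rows while both length limits hold."""
    stacked: List[Dict[str, object]] = []
    docs: List[str] = []
    sums: List[str] = []
    for row in rows:
        cand_doc = " ".join(docs + [row["document"]])
        cand_sum = f" {NEXT_CONCEPT} ".join(sums + [row["summary"]])
        too_long = counter(cand_doc) > max_in or counter(cand_sum) > max_out
        if docs and too_long:
            # Adding this row would exceed a limit: close the current group.
            stacked.append(_emit(docs, sums))
            docs, sums = [], []
        docs.append(row["document"])
        sums.append(row["summary"])
    if docs:
        stacked.append(_emit(docs, sums))
    return stacked


def split_summary(summary: str) -> List[str]:
    """Recover the individual summaries from a stacked summary."""
    return [part.strip() for part in summary.split(NEXT_CONCEPT)]


# Toy rows with tiny limits so that stacking is visible.
rows = [
    {"document": "a b c", "summary": "s1"},
    {"document": "d e", "summary": "s2"},
    {"document": "f g h i", "summary": "s3"},
]
stacked = stack_rows(rows, max_in=6, max_out=10)
print(stacked[0]["summary"])                 # s1 [NEXT_CONCEPT] s2
print(split_summary(stacked[0]["summary"]))  # ['s1', 's2']
```

Note that this sketch only produces the merged groups, whereas the dataset card describes the published data as containing both a copy of the original rows and the stacked rows. A single row that already exceeds a limit is emitted on its own here; how the real pipeline treats such rows is not stated on this page.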

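Dataset statistics

Train input (before stacking):
- Average summary length in characters: 125.46
- Average summary length in tokens: 30.38
- Average text input length in characters: 2202.42
- Average text input length in tokens: 523.92
Stacked train:
- Average summary length in characters: 199.89
- Average summary length in tokens: 46.30
- Average text input length in characters: 2629.19
- Average text input length in tokens: 621.54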
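
To make the effect of stacking on average lengths concrete, the short sketch below computes the growth factor for each statistic. The numbers are copied from the statistics above; the dictionary layout and names are illustrative only.

```python
# Averages copied from the statistics above: (before stacking, after stacking).
stats = {
    "summary chars": (125.46, 199.89),
    "summary tokens": (30.38, 46.30),
    "input chars": (2202.42, 2629.19),
    "input tokens": (523.92, 621.54),
}

# Growth factor = stacked average / original average.
growth = {name: after / before for name, (before, after) in stats.items()}

for name, ratio in growth.items():
    print(f"{name}: x{ratio:.2f}")
```

On these figures, average summaries grow by a factor of about 1.5 in tokens, while average inputs grow by about 1.2.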

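Update log

Dec 3: upload initial version
Dec 4: upload v2 with basic data quality fixes
Dec 5: upload v3, with pre-randomised order and duplicate document+summary rows dropped

Citation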
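
The v3 step (drop duplicate document+summary rows, then shuffle) can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the function name `dedupe_and_shuffle` is hypothetical and Python's `random` module stands in for whatever tooling was actually used. The seed 323 and the row counts are taken from the log lines in the dataset card.

```python
import random
from typing import Dict, List


def dedupe_and_shuffle(rows: List[Dict[str, str]], seed: int = 323) -> List[Dict[str, str]]:
    """Keep the first row for each (document, summary) pair, then shuffle."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["document"], row["summary"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Seeded shuffle so the output order is reproducible.
    random.Random(seed).shuffle(unique)
    return unique


sample = [
    {"document": "doc A", "summary": "sum A"},
    {"document": "doc B", "summary": "sum B"},
    {"document": "doc A", "summary": "sum A"},  # exact duplicate, dropped
]
result = dedupe_and_shuffle(sample)
print(len(result))  # 2

# Row accounting using the counts reported in the train logs.
original_rows = 204045   # 'num_rows' of the train input
stacked_rows = 181719    # "stacked 181719 rows"
dropped_rows = 64825     # "dropped 64825 duplicate rows"
remaining = original_rows + stacked_rows - dropped_rows
assert remaining == 320939  # "320939 rows remain"
```

The log figures are consistent with reading the final train split as the original rows plus the stacked rows, minus the dropped duplicates.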

@misc {stacked_summaries_2023, author = { {Stacked Summaries: Karim Foda and Peter Szemraj} }, title = { stacked-xsum-1024 (Revision 2d47220) }, year = 2023, url = { https://huggingface.co/datasets/stacked-summaries/stacked-xsum-1024 }, doi = { 10.57967/hf/0390 }, publisher = { Hugging Face } }
