
stacked-summaries/stacked-xsum-1024

Source: Hugging Face. Updated: 2023-10-08. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/stacked-summaries/stacked-xsum-1024
Description:
---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
source_datasets:
- xsum
task_categories:
- summarization
pretty_name: 'Stacked XSUM: 1024 tokens max'
tags:
- stacked summaries
- xsum
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: document
    dtype: string
  - name: summary
    dtype: string
  - name: id
    dtype: int64
  - name: chapter_length
    dtype: int64
  - name: summary_length
    dtype: int64
  - name: is_stacked
    dtype: bool
  splits:
  - name: train
    num_bytes: 918588672
    num_examples: 320939
  - name: validation
    num_bytes: 51154057
    num_examples: 17935
  - name: test
    num_bytes: 51118088
    num_examples: 17830
  download_size: 653378162
  dataset_size: 1020860817
---

# stacked-xsum-1024

a "stacked" version of `xsum`

1. Original Dataset: copy of the base dataset
2. Stacked Rows: The original dataset is processed by stacking rows based on certain criteria:
   - Maximum Input Length: The maximum length for input sequences is 1024 tokens in the longt5 model tokenizer.
   - Maximum Output Length: The maximum length for output sequences is also 1024 tokens in the longt5 model tokenizer.
3. Special Token: The dataset utilizes the `[NEXT_CONCEPT]` token to indicate a new topic **within** the same summary. It is recommended to explicitly add this special token to your model's tokenizer before training, ensuring that it is recognized and processed correctly during downstream usage.

## updates

- dec 3: upload initial version
- dec 4: upload v2 with basic data quality fixes (i.e. the `is_stacked` column)
- dec 5 0500: upload v3 which has pre-randomised order and duplicate rows for document+summary dropped

## stats

![stats](https://i.imgur.com/TyyDthT.png)

## dataset details

see the repo `.log` file for more details.

train input

```python
[2022-12-05 01:05:17] INFO:root:INPUTS - basic stats - train
[2022-12-05 01:05:17] INFO:root:{'num_columns': 5, 'num_rows': 204045, 'num_unique_target': 203107, 'num_unique_text': 203846, 'summary - average chars': 125.46, 'summary - average tokens': 30.383719277610332, 'text input - average chars': 2202.42, 'text input - average tokens': 523.9222230390355}
```

stacked train:

```python
[2022-12-05 04:47:01] INFO:root:stacked 181719 rows, 22326 rows were ineligible
[2022-12-05 04:47:02] INFO:root:dropped 64825 duplicate rows, 320939 rows remain
[2022-12-05 04:47:02] INFO:root:shuffling output with seed 323
[2022-12-05 04:47:03] INFO:root:STACKED - basic stats - train
[2022-12-05 04:47:04] INFO:root:{'num_columns': 6, 'num_rows': 320939, 'num_unique_chapters': 320840, 'num_unique_summaries': 320101, 'summary - average chars': 199.89, 'summary - average tokens': 46.29925001324239, 'text input - average chars': 2629.19, 'text input - average tokens': 621.541532814647}
```

## Citation

If you find this useful in your work, please consider citing us.

```
@misc {stacked_summaries_2023,
    author    = { {Stacked Summaries: Karim Foda and Peter Szemraj} },
    title     = { stacked-xsum-1024 (Revision 2d47220) },
    year      = 2023,
    url       = { https://huggingface.co/datasets/stacked-summaries/stacked-xsum-1024 },
    doi       = { 10.57967/hf/0390 },
    publisher = { Hugging Face }
}
```
Provider:
stacked-summaries
Summary of original information

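Dataset overview

Basic information

  • Language: English
  • License: Apache-2.0
  • Size category: 100K<n<1M
  • Source dataset: xsum
  • Task category: summarization
  • Pretty name: Stacked XSUM: 1024 tokens max
  • Tags: stacked summaries, xsum

Configuration

  • Default config:
    • Train data: data/train-*
    • Validation data: data/validation-*
    • Test data: data/test-*

Dataset information

  • Features:
    • document: string
    • summary: string
    • id: int64
    • chapter_length: int64
    • summary_length: int64
    • is_stacked: bool
  • Splits:
    • Train: 320939 examples, 918588672 bytes
    • Validation: 17935 examples, 51154057 bytes
    • Test: 17830 examples, 51118088 bytes
  • Download size: 653378162 bytes
  • Dataset size: 1020860817 bytes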
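
As a quick sanity check on the figures above, the per-split byte counts can be summed and compared against the reported dataset size. The snippet below is a minimal sketch that only uses the numbers listed in this section; the variable names are illustrative and not part of the dataset.

```python
# Split sizes and example counts, copied from the dataset information above.
split_bytes = {"train": 918588672, "validation": 51154057, "test": 51118088}
split_examples = {"train": 320939, "validation": 17935, "test": 17830}
dataset_size = 1020860817   # reported dataset size in bytes
download_size = 653378162   # reported download size in bytes

# The per-split byte counts should add up to the reported dataset size.
total_bytes = sum(split_bytes.values())
total_examples = sum(split_examples.values())

# Fraction of the on-disk size that has to be downloaded.
download_ratio = download_size / dataset_size

print(f"total bytes:    {total_bytes} (matches reported: {total_bytes == dataset_size})")
print(f"total examples: {total_examples}")
print(f"download / dataset size: {download_ratio:.2f}")
```

The combined example count across the three splits also falls inside the stated 100K<n<1M size category.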

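Dataset processing

  • Original dataset: a copy of the base dataset
  • Stacked rows:
    • Maximum input length: 1024 tokens (longt5 tokenizer)
    • Maximum output length: 1024 tokens (longt5 tokenizer)
  • Special token: the [NEXT_CONCEPT] token marks a new topic within the same summary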
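
The processing rules above can be illustrated with a small sketch. This is not the authors' stacking code: the exact procedure is not described here, so the whitespace-based `count_tokens` (a stand-in for the longt5 tokenizer), the blank-line document separator, and the helper names `stack_pair` and `split_summary` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

NEXT_CONCEPT = "[NEXT_CONCEPT]"
MAX_INPUT_TOKENS = 1024    # limit stated in the dataset card
MAX_OUTPUT_TOKENS = 1024   # limit stated in the dataset card

Row = Tuple[str, str]  # (document, summary)


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for the longt5 tokenizer: counts whitespace-separated words."""
    return len(text.split())


def stack_pair(row_a: Row, row_b: Row) -> Optional[Row]:
    """Combine two rows if the result fits both budgets, otherwise return None.

    Assumption: documents are joined with a blank line and summaries with the
    [NEXT_CONCEPT] token; the real pipeline may differ.
    """
    document = row_a[0] + "\n\n" + row_b[0]
    summary = f"{row_a[1]} {NEXT_CONCEPT} {row_b[1]}"
    if count_tokens(document) > MAX_INPUT_TOKENS:
        return None
    if count_tokens(summary) > MAX_OUTPUT_TOKENS:
        return None
    return document, summary


def split_summary(summary: str) -> List[str]:
    """Split a stacked summary back into its per-topic parts."""
    return [part.strip() for part in summary.split(NEXT_CONCEPT) if part.strip()]


row_a = ("First article text.", "Summary of the first article.")
row_b = ("Second article text.", "Summary of the second article.")
stacked = stack_pair(row_a, row_b)
if stacked is not None:
    print(split_summary(stacked[1]))
```

When training with the Hugging Face `transformers` library, the token is typically registered with `tokenizer.add_tokens(["[NEXT_CONCEPT]"])` followed by `model.resize_token_embeddings(len(tokenizer))`, so that it is encoded as a single token rather than split into sub-words.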

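Dataset statistics

  • Train input (before stacking):
    • Average summary characters: 125.46
    • Average summary tokens: 30.38
    • Average input text characters: 2202.42
    • Average input text tokens: 523.92
  • Stacked train:
    • Average summary characters: 199.89
    • Average summary tokens: 46.30
    • Average input text characters: 2629.19
    • Average input text tokens: 621.54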
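
The row counts reported in the dataset card's log lines can be reconciled with a few lines of arithmetic, and the averages above give a rough sense of how much stacking lengthens each example. The snippet is a sketch using only figures quoted in the card; the reading that the final train split is "original rows plus stacked rows, minus duplicates" is an inference from the numbers, not something the card states outright.

```python
# Row counts from the train logs in the dataset card.
original_rows = 204045       # rows in the original train split
stacked_rows = 181719        # "stacked 181719 rows"
ineligible_rows = 22326      # "22326 rows were ineligible"
dropped_duplicates = 64825   # "dropped 64825 duplicate rows"
final_rows = 320939          # "320939 rows remain"

# Every original row is reported as either stacked or ineligible.
accounted_rows = stacked_rows + ineligible_rows

# Inferred accounting: original + stacked - duplicates = final.
reconstructed_rows = original_rows + stacked_rows - dropped_duplicates

# Growth in average length (tokens) after stacking.
input_growth = 621.541532814647 / 523.9222230390355
summary_growth = 46.29925001324239 / 30.383719277610332

print(f"stacked + ineligible == original: {accounted_rows == original_rows}")
print(f"reconstructed == final:           {reconstructed_rows == final_rows}")
print(f"average input tokens grow by   x{input_growth:.2f}")
print(f"average summary tokens grow by x{summary_growth:.2f}")
```

Both average lengths stay well below the 1024-token limits, so on average the stacked rows use only part of the available budget.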

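Update log

  • Dec 3: uploaded the initial version
  • Dec 4: uploaded v2 with basic data quality fixes
  • Dec 5: uploaded v3, with pre-randomized row order and duplicate document+summary rows dropped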

Citation

@misc {stacked_summaries_2023,
    author    = { {Stacked Summaries: Karim Foda and Peter Szemraj} },
    title     = { stacked-xsum-1024 (Revision 2d47220) },
    year      = 2023,
    url       = { https://huggingface.co/datasets/stacked-summaries/stacked-xsum-1024 },
    doi       = { 10.57967/hf/0390 },
    publisher = { Hugging Face }
}
