
pszemraj/qmsum-cleaned

Source: Hugging Face | Updated 2024-02-18 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/pszemraj/qmsum-cleaned
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 1K<n<10K
source_datasets: tau/scrolls
task_categories:
- text2text-generation
- summarization
tags:
- scrolls
- qmsum
dataset_info:
- config_name: default
  features:
  - name: id
    dtype: string
  - name: pid
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: input_token_count
    dtype: int64
  - name: output_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 68960760
    num_examples: 1257
  - name: validation
    num_bytes: 15700972
    num_examples: 272
  - name: test
    num_bytes: 16120860
    num_examples: 281
  download_size: 42316972
  dataset_size: 100782592
- config_name: no-prefix
  features:
  - name: id
    dtype: string
  - name: pid
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 68944419
    num_examples: 1257
  - name: validation
    num_bytes: 15697436
    num_examples: 272
  - name: test
    num_bytes: 16117207
    num_examples: 281
  download_size: 6180898
  dataset_size: 100759062
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: no-prefix
  data_files:
  - split: train
    path: no-prefix/train-*
  - split: validation
    path: no-prefix/validation-*
  - split: test
    path: no-prefix/test-*
---

# qmsum-cleaned

## prefixes

It's worth noting that each "document" in `input` is prefixed by a question/prompt on what the model is supposed to do. **You may want to explicitly handle this in some way, or prefix your models trained on this dataset.**

Most frequent "prefixes" separated via [sentence-splitter](https://github.com/mediacloud/sentence-splitter) in the `train` split:

|    | Sentence | Count |
|---:|:---------|------:|
| 0 | Summarize the whole meeting. | 121 |
| 1 | Summarize the meeting | 25 |
| 2 | What did the team discuss about the product cost? | 4 |
| 3 | How did Marketing design the product evaluation? | 4 |
| 4 | Summarize the wrap up of the meeting. | 3 |
| 5 | What did the group discuss about user requirements of the new remote control? | 3 |
| 6 | What did the team discuss during the product evaluation? | 3 |
| 7 | Summarize the meeting. | 2 |
| 8 | Summarize what was said about digits form | 2 |
| 9 | What was discussed in the meeting? | 2 |

### wordcloud

Visualized as a wordcloud (`train` split):

![wc](prefix-train-wordcloud.png)

## token counts

![counts](https://i.imgur.com/rARAOvr.png)
Provider:
pszemraj
Summary of original information

Dataset overview

Basic information

  • Language: English
  • License: Apache-2.0
  • Size category: 1K<n<10K
  • Source dataset: tau/scrolls
  • Task categories:
    • text2text-generation
    • summarization
  • Tags:
    • scrolls
    • qmsum

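Dataset configurations

  • Configuration names: default, no-prefix
  • Features:
    • id: string
    • pid: string
    • input: string
    • output: string
    • input_token_count (default configuration only): int64
    • output_token_count (default configuration only): int64
    • prompt (no-prefix configuration only): string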

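Dataset splits

  • Train:
    • default configuration: 1257 examples, 68960760 bytes
    • no-prefix configuration: 1257 examples, 68944419 bytes
  • Validation:
    • default configuration: 272 examples, 15700972 bytes
    • no-prefix configuration: 272 examples, 15697436 bytes
  • Test:
    • default configuration: 281 examples, 16120860 bytes
    • no-prefix configuration: 281 examples, 16117207 bytes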

Download and dataset size

  • Download size:
    • default configuration: 42316972 bytes
    • no-prefix configuration: 6180898 bytes
  • Dataset size:
    • default configuration: 100782592 bytes
    • no-prefix configuration: 100759062 bytes
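
The per-split figures can be cross-checked against the reported dataset sizes with a few lines of arithmetic. The following is a minimal sketch with all numbers copied from the dataset card metadata; the names `SPLIT_BYTES`, `SPLIT_EXAMPLES`, `DATASET_SIZE`, `DOWNLOAD_SIZE`, and `summarize` are illustrative and are not part of the dataset or of any library.

```python
# Byte counts per split, copied from the dataset card metadata.
SPLIT_BYTES = {
    "default": {"train": 68_960_760, "validation": 15_700_972, "test": 16_120_860},
    "no-prefix": {"train": 68_944_419, "validation": 15_697_436, "test": 16_117_207},
}
# Example counts per split (identical for both configurations).
SPLIT_EXAMPLES = {"train": 1257, "validation": 272, "test": 281}
# Reported totals, also copied from the card.
DATASET_SIZE = {"default": 100_782_592, "no-prefix": 100_759_062}
DOWNLOAD_SIZE = {"default": 42_316_972, "no-prefix": 6_180_898}


def summarize(config: str) -> dict:
    """Sum the split sizes for one configuration and compare with the reported total."""
    split_total = sum(SPLIT_BYTES[config].values())
    return {
        "split_total_bytes": split_total,
        "matches_dataset_size": split_total == DATASET_SIZE[config],
        "total_examples": sum(SPLIT_EXAMPLES.values()),
        "dataset_size_mb": round(DATASET_SIZE[config] / 1e6, 1),
        "download_size_mb": round(DOWNLOAD_SIZE[config] / 1e6, 1),
    }


for name in SPLIT_BYTES:
    print(name, summarize(name))
```

For both configurations the three split sizes add up exactly to the reported dataset size, and the three splits together hold 1810 examples.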

Data file configuration

  • default configuration:
    • Train: data/train-*
    • Validation: data/validation-*
    • Test: data/test-*
  • no-prefix configuration:
    • Train: no-prefix/train-*
    • Validation: no-prefix/validation-*
    • Test: no-prefix/test-*
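
Because the card maps each configuration name to its own file patterns, selecting a configuration by name should be enough to pick up the right files. The sketch below assumes the Hugging Face `datasets` package is installed and that the repository is reachable; `load_qmsum_split`, `CONFIGS`, and `SPLITS` are hypothetical helper names introduced here, and only `load_dataset` comes from the library.

```python
# Configuration and split names as listed in the dataset card.
REPO_ID = "pszemraj/qmsum-cleaned"
CONFIGS = ("default", "no-prefix")
SPLITS = ("train", "validation", "test")


def load_qmsum_split(config: str = "default", split: str = "train"):
    """Load one split of one configuration, validating the names first."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the name checks work even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


# Example call (needs network access, so it is left commented out):
# train = load_qmsum_split("no-prefix", "train")
# print(train.column_names)
```

Validating the names up front gives a clear error before any download is attempted.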

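Input prefixes

  • Each input document is prefixed with a question/prompt that tells the model which task to perform. The most frequent prefix in the train split is "Summarize the whole meeting." (121 occurrences).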
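
One way to handle the prefix explicitly is to separate the leading question from the rest of the input and count how often each question occurs. The sketch below is a simplification: the dataset card used the `sentence-splitter` package, whereas this example swaps in a small regular expression that cuts at the first sentence-ending punctuation mark or newline. The sample strings are made up for illustration and are not taken from the dataset, and `split_prefix` and `count_prefixes` are hypothetical helper names.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# A boundary is whitespace that follows ., ? or !, or else a run of newlines.
# This is a rough stand-in for a real sentence splitter.
_BOUNDARY = re.compile(r"(?<=[.?!])\s+|\n+")


def split_prefix(text: str) -> Tuple[str, str]:
    """Return (prefix, remainder), where prefix is the first sentence or line."""
    text = text.lstrip()
    match = _BOUNDARY.search(text)
    if match is None:
        # No boundary found: treat the whole text as the document, with no prefix.
        return "", text
    return text[: match.start()].rstrip(), text[match.end():].lstrip()


def count_prefixes(inputs: Iterable[str]) -> Counter:
    """Count the leading question/prompt across a collection of inputs."""
    return Counter(split_prefix(text)[0] for text in inputs)


# Invented examples that mimic the "question, then transcript" layout.
samples = [
    "Summarize the whole meeting. Project Manager: Okay, welcome everyone.",
    "Summarize the whole meeting. Professor A: Let's get started.",
    "Summarize the meeting\nGrad B: Are we recording?",
]

for prefix, count in count_prefixes(samples).most_common():
    print(f"{count:>3}  {prefix}")
```

The regular expression will mis-split prompts that contain abbreviations or several sentences, so a dedicated sentence splitter is likely the better choice for reproducing the counts reported in the card.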