pszemraj/qmsum-cleaned

Name: pszemraj/qmsum-cleaned
Creator: pszemraj
Published: 2024-02-18 08:51:21
License: apache-2.0

Hugging Face · Updated 2024-02-18 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/pszemraj/qmsum-cleaned

Resource description:

---
language:
- en
license: apache-2.0
size_categories:
- 1K<n<10K
source_datasets: tau/scrolls
task_categories:
- text2text-generation
- summarization
tags:
- scrolls
- qmsum
dataset_info:
- config_name: default
  features:
  - name: id
    dtype: string
  - name: pid
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: input_token_count
    dtype: int64
  - name: output_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 68960760
    num_examples: 1257
  - name: validation
    num_bytes: 15700972
    num_examples: 272
  - name: test
    num_bytes: 16120860
    num_examples: 281
  download_size: 42316972
  dataset_size: 100782592
- config_name: no-prefix
  features:
  - name: id
    dtype: string
  - name: pid
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 68944419
    num_examples: 1257
  - name: validation
    num_bytes: 15697436
    num_examples: 272
  - name: test
    num_bytes: 16117207
    num_examples: 281
  download_size: 6180898
  dataset_size: 100759062
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: no-prefix
  data_files:
  - split: train
    path: no-prefix/train-*
  - split: validation
    path: no-prefix/validation-*
  - split: test
    path: no-prefix/test-*
---

# qmsum-cleaned

## prefixes

It's worth noting that each "document" in `input` is prefixed by a question/prompt on what the model is supposed to do. **You may want to explicitly handle this in some way, or prefix your models trained on this dataset.**

Most frequent "prefixes" separated via [sentence-splitter](https://github.com/mediacloud/sentence-splitter) in the `train` split:

|    | Sentence                                                                      |   Count |
|---:|:------------------------------------------------------------------------------|--------:|
|  0 | Summarize the whole meeting.                                                  |     121 |
|  1 | Summarize the meeting                                                         |      25 |
|  2 | What did the team discuss about the product cost?                             |       4 |
|  3 | How did Marketing design the product evaluation?                              |       4 |
|  4 | Summarize the wrap up of the meeting.                                         |       3 |
|  5 | What did the group discuss about user requirements of the new remote control? |       3 |
|  6 | What did the team discuss during the product evaluation?                      |       3 |
|  7 | Summarize the meeting.                                                        |       2 |
|  8 | Summarize what was said about digits form                                     |       2 |
|  9 | What was discussed in the meeting?                                            |       2 |

### wordcloud

Visualized as a wordcloud (`train` split):

![wc](prefix-train-wordcloud.png)

## token counts

![counts](https://i.imgur.com/rARAOvr.png)
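
The card suggests prefixing inputs for models trained on this data. A minimal sketch of doing that at inference time follows. The helper name `with_prompt`, the choice of default prompt (the most frequent prefix in the table), and the single-space separator are all assumptions; the card does not say how the prompt and transcript are joined, so it is worth inspecting a few rows of `input` before relying on this.

```python
# Hypothetical helper: prepend a task prompt to a transcript, mirroring the
# "prompt + document" layout the card describes for the `input` column.
DEFAULT_PROMPT = "Summarize the whole meeting."  # most frequent prefix in the train split


def with_prompt(document: str, prompt: str = DEFAULT_PROMPT, sep: str = " ") -> str:
    """Return prompt + sep + document, with stray outer whitespace trimmed.

    `sep` is an assumption; the card does not specify the separator.
    """
    return f"{prompt.strip()}{sep}{document.strip()}"
```

For example, `with_prompt(transcript, prompt="What was discussed in the meeting?")` would build a query-style input from a raw transcript.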

Provided by:

pszemraj

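Summary of original information

Dataset overview

Basic information

Language: English
License: Apache-2.0
Size category: 1K<n<10K
Source dataset: tau/scrolls
Task categories:
- Text-to-text generation (text2text-generation)
- Summarization
Tags:
- scrolls
- qmsum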

Dataset configurations

Config names: default, no-prefix
Features:
- id: string
- pid: string
- input: string
- output: string
- input_token_count: int64 (default config only)
- output_token_count: int64 (default config only)
- prompt: string (no-prefix config only)
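
Because the two configs expose different columns, a small check can catch code that assumes the wrong one. The sketch below encodes the column sets from the card's `dataset_info` block; the names `EXPECTED_COLUMNS` and `missing_columns` are hypothetical and only for illustration.

```python
# Column names per config, taken from the dataset card's `dataset_info` block.
EXPECTED_COLUMNS = {
    "default": {
        "id", "pid", "input", "output",
        "input_token_count", "output_token_count",
    },
    "no-prefix": {"id", "pid", "input", "output", "prompt"},
}


def missing_columns(record: dict, config: str = "default") -> set:
    """Return the expected columns that `record` lacks for the given config."""
    return EXPECTED_COLUMNS[config] - set(record)
```

An empty result means the record has every column listed for that config.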

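Dataset splits

Training set:
- default config: 1257 examples, 68960760 bytes
- no-prefix config: 1257 examples, 68944419 bytes
Validation set:
- default config: 272 examples, 15700972 bytes
- no-prefix config: 272 examples, 15697436 bytes
Test set:
- default config: 281 examples, 16120860 bytes
- no-prefix config: 281 examples, 16117207 bytes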
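
To put the split sizes in proportion, the example counts listed above can be turned into shares of the whole dataset. This is plain arithmetic on the listed numbers; the function name `split_shares` is made up for the example.

```python
# Example counts per split, as listed above (identical for both configs).
SPLIT_EXAMPLES = {"train": 1257, "validation": 272, "test": 281}


def split_shares(counts: dict) -> dict:
    """Map each split name to its fraction of all examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = split_shares(SPLIT_EXAMPLES)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three splits total 1810 examples, which works out to roughly 69.4% train, 15.0% validation, and 15.5% test.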

Download and dataset size

Download size:
- default config: 42316972 bytes
- no-prefix config: 6180898 bytes
Dataset size:
- default config: 100782592 bytes
- no-prefix config: 100759062 bytes
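
As a sanity check on the figures above, the per-split byte counts should add up to the reported dataset size for each config. The sketch below verifies that and also prints the ratio of download size to dataset size; it uses only numbers from this listing.

```python
# Byte counts copied from the listing above.
SIZES = {
    "default": {
        "splits": [68960760, 15700972, 16120860],  # train, validation, test
        "dataset_size": 100782592,
        "download_size": 42316972,
    },
    "no-prefix": {
        "splits": [68944419, 15697436, 16117207],  # train, validation, test
        "dataset_size": 100759062,
        "download_size": 6180898,
    },
}

for config, info in SIZES.items():
    # The per-split byte counts should add up to the reported dataset size.
    assert sum(info["splits"]) == info["dataset_size"], config
    ratio = info["download_size"] / info["dataset_size"]
    print(f"{config}: download is {ratio:.1%} of the reported dataset size")
```

Both sums match the reported totals; the download is about 42% of the dataset size for `default` and about 6% for `no-prefix`.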

Data file configuration

default config:
- train: data/train-*
- validation: data/validation-*
- test: data/test-*
no-prefix config:
- train: no-prefix/train-*
- validation: no-prefix/validation-*
- test: no-prefix/test-*
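
With the configs and splits laid out as above, loading one of them through the Hugging Face `datasets` library could look like the sketch below. The wrapper `load_qmsum` is a hypothetical convenience function, not part of any library; it needs the `datasets` package and network access when called, so the import is deferred until then.

```python
REPO_ID = "pszemraj/qmsum-cleaned"
CONFIGS = ("default", "no-prefix")
SPLITS = ("train", "validation", "test")


def load_qmsum(config: str = "default", split: str = "train"):
    """Load one split of one config from the Hugging Face Hub.

    Validates the names locally first, then defers to `datasets.load_dataset`.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    from datasets import load_dataset  # lazy import: only needed when loading

    return load_dataset(REPO_ID, config, split=split)
```

Going by the listing, `load_qmsum("no-prefix", "validation")` should return 272 rows, each with a separate `prompt` column.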

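Input prefixes

Each input document is prefixed with a question or prompt indicating the task the model should perform. Common prefixes include "Summarize the whole meeting."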
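
One way to handle the prefix explicitly is to split the leading sentence off each input and tally what comes out. The sketch below uses a simple regular expression as a stand-in for a proper sentence splitter, and it assumes the prompt is the first sentence of `input` and ends in `.`, `?`, or `!`. Prompts without closing punctuation, or ones containing abbreviations, will be split incorrectly. The function names are hypothetical.

```python
import re
from collections import Counter

# Stand-in for a real sentence splitter: the shortest leading run of text that
# ends in . ? or ! and is followed by whitespace. Assumes the prompt comes first.
_FIRST_SENTENCE = re.compile(r"\s*(.+?[.?!])\s+(.*)", re.DOTALL)


def split_prefix(text: str) -> tuple:
    """Return (prefix, remainder); prefix is "" when no sentence break is found."""
    match = _FIRST_SENTENCE.fullmatch(text)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


def count_prefixes(inputs) -> Counter:
    """Count how often each leading sentence occurs across `inputs`."""
    return Counter(split_prefix(text)[0] for text in inputs)
```

Calling `count_prefixes(rows["input"]).most_common(10)` on the train split would give a rough top-ten list of prompts, though the counts may differ from a tally made with a dedicated sentence splitter.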
