pszemraj/qmsum-cleaned
Source: Hugging Face. Updated 2024-02-18; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pszemraj/qmsum-cleaned
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 1K<n<10K
source_datasets: tau/scrolls
task_categories:
- text2text-generation
- summarization
tags:
- scrolls
- qmsum
dataset_info:
- config_name: default
  features:
  - name: id
    dtype: string
  - name: pid
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: input_token_count
    dtype: int64
  - name: output_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 68960760
    num_examples: 1257
  - name: validation
    num_bytes: 15700972
    num_examples: 272
  - name: test
    num_bytes: 16120860
    num_examples: 281
  download_size: 42316972
  dataset_size: 100782592
- config_name: no-prefix
  features:
  - name: id
    dtype: string
  - name: pid
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 68944419
    num_examples: 1257
  - name: validation
    num_bytes: 15697436
    num_examples: 272
  - name: test
    num_bytes: 16117207
    num_examples: 281
  download_size: 6180898
  dataset_size: 100759062
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: no-prefix
  data_files:
  - split: train
    path: no-prefix/train-*
  - split: validation
    path: no-prefix/validation-*
  - split: test
    path: no-prefix/test-*
---
# qmsum-cleaned
## prefixes
Note that each "document" in `input` is prefixed by a question or prompt stating what the model is supposed to do. **You may want to handle this explicitly in some way, or prepend a similar prompt to the inputs of models trained on this dataset.**
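One way to handle the prefix explicitly is to work from the `no-prefix` config, which the metadata lists with a separate `prompt` column. The sketch below rests on an assumption the card does not confirm: that `prompt` holds the question and `input` holds the transcript without it. `load_config` and `join_prompt` are hypothetical helper names, and the blank-line separator is an arbitrary choice.

```python
def load_config(name="no-prefix"):
    """Load one config of the dataset from the Hugging Face Hub (needs network)."""
    # Imported lazily so join_prompt works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("pszemraj/qmsum-cleaned", name)


def join_prompt(example, sep="\n\n"):
    """Rebuild a prefixed model input from a `no-prefix` row.

    Assumption: `prompt` is the question and `input` is the bare transcript.
    """
    prompt = example["prompt"].strip()
    document = example["input"].strip()
    # Fall back to the bare document when a row has an empty prompt.
    return f"{prompt}{sep}{document}" if prompt else document


if __name__ == "__main__":
    # Toy row for illustration only; not taken from the dataset.
    row = {"prompt": "Summarize the whole meeting.", "input": "PM: Let's start."}
    print(join_prompt(row))
```

Keeping the prompt in its own column makes it easy to swap in a different instruction format at training time without re-parsing the transcript.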
The most frequent "prefixes" in the `train` split, extracted with [sentence-splitter](https://github.com/mediacloud/sentence-splitter):
| | Sentence | Count |
|---:|:------------------------------------------------------------------------------|--------:|
| 0 | Summarize the whole meeting. | 121 |
| 1 | Summarize the meeting | 25 |
| 2 | What did the team discuss about the product cost? | 4 |
| 3 | How did Marketing design the product evaluation? | 4 |
| 4 | Summarize the wrap up of the meeting. | 3 |
| 5 | What did the group discuss about user requirements of the new remote control? | 3 |
| 6 | What did the team discuss during the product evaluation? | 3 |
| 7 | Summarize the meeting. | 2 |
| 8 | Summarize what was said about digits form | 2 |
| 9 | What was discussed in the meeting? | 2 |
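A table like the one above could be approximated by taking the first sentence of every `input` and counting duplicates. This is a rough sketch rather than the author's script: it swaps sentence-splitter for a naive regular expression, so abbreviations such as "Dr." will split too early, and `first_sentence` and `count_prefixes` are hypothetical names.

```python
import re
from collections import Counter

# Naive stand-in for a real sentence splitter: the first sentence ends at the
# first ".", "?" or "!" that is followed by whitespace or the end of the text.
_FIRST_SENTENCE = re.compile(r"^(.*?[.?!])(?:\s|$)", re.DOTALL)


def first_sentence(text):
    """Return the leading sentence of `text`, or its first line if none is found."""
    text = text.strip()
    match = _FIRST_SENTENCE.match(text)
    if match:
        return match.group(1).strip()
    return text.split("\n", 1)[0].strip()


def count_prefixes(inputs, top_k=10):
    """Count the most common leading sentences across an iterable of inputs."""
    return Counter(first_sentence(t) for t in inputs).most_common(top_k)


if __name__ == "__main__":
    # Toy inputs for illustration only; not taken from the dataset.
    toy_inputs = [
        "Summarize the whole meeting. PM: Hello everyone.",
        "Summarize the whole meeting. A: Shall we begin?",
        "What was discussed in the meeting? B: Mostly costs.",
    ]
    for sentence, count in count_prefixes(toy_inputs):
        print(count, sentence)
```

On the real data, `inputs` would be the `input` column of the `train` split; prefixes that lack terminal punctuation may be merged with the text that follows them.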
### wordcloud
Visualized as a wordcloud (`train` split):

## token counts
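The `default` config carries `input_token_count` and `output_token_count` columns, which can be used to keep only the examples that fit a model's context window. The sketch below is illustrative: this card does not say which tokenizer produced the counts, so treat them as approximate for any other tokenizer. `fits_budget` and the default limits are hypothetical.

```python
def fits_budget(example, max_input_tokens=16384, max_output_tokens=1024):
    """Return True if a row's precomputed token counts are within both limits."""
    return (
        example["input_token_count"] <= max_input_tokens
        and example["output_token_count"] <= max_output_tokens
    )


if __name__ == "__main__":
    # Toy rows for illustration only; the counts are made up.
    rows = [
        {"id": "a", "input_token_count": 9000, "output_token_count": 80},
        {"id": "b", "input_token_count": 30000, "output_token_count": 120},
    ]
    kept = [row["id"] for row in rows if fits_budget(row)]
    print(kept)
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`, which applies a function to each example and keeps the rows for which it returns `True`.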

Provider:
pszemraj
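Summary of original information
Dataset overview
Basic information
- Language: English
- License: Apache-2.0
- Size category: 1K<n<10K
- Source dataset: tau/scrolls
- Task categories:
  - Text-to-text generation
  - Summarization
- Tags:
  - scrolls
  - qmsum
Dataset configuration
- Config names: default, no-prefix
- Features:
  - id: string
  - pid: string
  - input: string
  - output: string
  - input_token_count (default config only): int64
  - output_token_count (default config only): int64
  - prompt (no-prefix config only): string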
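Dataset splits
- Train:
  - default config: 1257 examples, 68960760 bytes
  - no-prefix config: 1257 examples, 68944419 bytes
- Validation:
  - default config: 272 examples, 15700972 bytes
  - no-prefix config: 272 examples, 15697436 bytes
- Test:
  - default config: 281 examples, 16120860 bytes
  - no-prefix config: 281 examples, 16117207 bytes
Download and dataset size
- Download size:
  - default config: 42316972 bytes
  - no-prefix config: 6180898 bytes
- Dataset size:
  - default config: 100782592 bytes
  - no-prefix config: 100759062 bytes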
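Data file configuration
- default config:
  - Train: data/train-*
  - Validation: data/validation-*
  - Test: data/test-*
- no-prefix config:
  - Train: no-prefix/train-*
  - Validation: no-prefix/validation-*
  - Test: no-prefix/test-*
Input prefixes
- Each input document is prefixed with a question or prompt that tells the model which task to perform. Common prefixes include "Summarize the whole meeting."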



