taskmaster1
ModelScope community. Updated: 2025-12-05. Indexed: 2025-07-12.
Download link:
https://modelscope.cn/datasets/google-research-datasets/taskmaster1
Description:
# Dataset Card for Taskmaster-1
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Taskmaster-1](https://research.google/tools/datasets/taskmaster-1/)
- **Repository:** [GitHub](https://github.com/google-research-datasets/Taskmaster/tree/master/TM-1-2019)
- **Paper:** [Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset](https://arxiv.org/abs/1909.05358)
- **Leaderboard:** N/A
- **Point of Contact:** [Taskmaster Googlegroup](mailto:taskmaster-datasets@googlegroups.com)
### Dataset Summary
Taskmaster-1 is a goal-oriented conversational dataset of 13,215 task-based
dialogs spanning six domains. Two procedures were used to create the collection,
each with its own advantages. The first is a two-person, spoken "Wizard of Oz" (WOz) approach
in which trained agents and crowdsourced workers interact to complete the task. The second is
a "self-dialog" approach in which crowdsourced workers write the entire dialog themselves.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
A typical example looks like this:
```
{
  "conversation_id": "dlg-336c8165-068e-4b4b-803d-18ef0676f668",
  "instruction_id": "restaurant-table-2",
  "utterances": [
    {
      "index": 0,
      "segments": [],
      "speaker": "USER",
      "text": "Hi, I'm looking for a place that sells spicy wet hotdogs, can you think of any?"
    },
    {
      "index": 1,
      "segments": [
        {
          "annotations": [
            {
              "name": "restaurant_reservation.name.restaurant.reject"
            }
          ],
          "end_index": 37,
          "start_index": 16,
          "text": "Spicy Wet Hotdogs LLC"
        }
      ],
      "speaker": "ASSISTANT",
      "text": "You might enjoy Spicy Wet Hotdogs LLC."
    },
    {
      "index": 2,
      "segments": [],
      "speaker": "USER",
      "text": "That sounds really good, can you make me a reservation?"
    },
    {
      "index": 3,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Certainly, when would you like a reservation?"
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {
              "name": "restaurant_reservation.num.guests"
            },
            {
              "name": "restaurant_reservation.num.guests"
            }
          ],
          "end_index": 20,
          "start_index": 18,
          "text": "50"
        }
      ],
      "speaker": "USER",
      "text": "I have a party of 50 who want a really sloppy dog on Saturday at noon."
    }
  ]
}
```
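As a rough illustration of how such a record might be consumed, the sketch below parses a trimmed copy of the example above with Python's standard `json` module and collects every annotated span. The helper name `collect_annotated_spans` is hypothetical and not part of any Taskmaster tooling. Collapsing repeated annotation names on one segment (as seen in the last utterance) is a design choice of this sketch, not documented behavior.

```python
import json

# Trimmed copy of the example record above (two of the five utterances).
RECORD_JSON = """
{
  "conversation_id": "dlg-336c8165-068e-4b4b-803d-18ef0676f668",
  "instruction_id": "restaurant-table-2",
  "utterances": [
    {
      "index": 1,
      "segments": [
        {
          "annotations": [
            {"name": "restaurant_reservation.name.restaurant.reject"}
          ],
          "end_index": 37,
          "start_index": 16,
          "text": "Spicy Wet Hotdogs LLC"
        }
      ],
      "speaker": "ASSISTANT",
      "text": "You might enjoy Spicy Wet Hotdogs LLC."
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {"name": "restaurant_reservation.num.guests"},
            {"name": "restaurant_reservation.num.guests"}
          ],
          "end_index": 20,
          "start_index": 18,
          "text": "50"
        }
      ],
      "speaker": "USER",
      "text": "I have a party of 50 who want a really sloppy dog on Saturday at noon."
    }
  ]
}
"""


def collect_annotated_spans(record):
    """Return (utterance index, speaker, annotation name, span text) tuples."""
    spans = []
    for utterance in record["utterances"]:
        for segment in utterance["segments"]:
            # dict.fromkeys drops repeated names while keeping first-seen order.
            names = dict.fromkeys(a["name"] for a in segment["annotations"])
            for name in names:
                spans.append(
                    (utterance["index"], utterance["speaker"], name, segment["text"])
                )
    return spans


record = json.loads(RECORD_JSON)
spans = collect_annotated_spans(record)
for span in spans:
    print(span)
```

Utterances with an empty `segments` list simply contribute nothing to the result, so the same loop works on unannotated turns.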
### Data Fields
Each conversation in the data file has the following structure:
- `conversation_id`: A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
- `utterances`: A list of utterances that make up the conversation.
- `instruction_id`: A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.
Each utterance has the following fields:
- `index`: A 0-based index indicating the order of the utterances in the conversation.
- `speaker`: Either USER or ASSISTANT, indicating which role generated this utterance.
- `text`: The raw text of the utterance. For self-dialogs (`one_person_dialogs`), the text is written by the crowdsourced worker. For WOz dialogs (`woz_dialogs`), 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
- `segments`: A list of various text spans with semantic annotations.
Each segment has the following fields:
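- `start_index`: The 0-based character offset at which the annotated span starts in the utterance text.
- `end_index`: The character offset at which the annotated span ends in the utterance text. In the example above the end offset is exclusive, so `text[start_index:end_index]` yields the segment text.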
- `text`: The raw text that has been annotated.
- `annotations`: A list of annotation details for this segment.
Each annotation has a single field:
- `name`: The annotation name.
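The relationship between `start_index`, `end_index`, and the segment `text` can be checked with a short sketch. It assumes 0-based character offsets with an exclusive end, which is inferred from the single example record in this card rather than from an official specification. The function name `segment_matches_text` is hypothetical.

```python
def segment_matches_text(utterance_text, segment):
    """Check that a segment's offsets slice out exactly its own `text`.

    Assumes 0-based character offsets with an exclusive `end_index`.
    """
    start, end = segment["start_index"], segment["end_index"]
    return utterance_text[start:end] == segment["text"]


# Values copied from the example record in "Data Instances".
restaurant_ok = segment_matches_text(
    "You might enjoy Spicy Wet Hotdogs LLC.",
    {"start_index": 16, "end_index": 37, "text": "Spicy Wet Hotdogs LLC"},
)
guests_ok = segment_matches_text(
    "I have a party of 50 who want a really sloppy dog on Saturday at noon.",
    {"start_index": 18, "end_index": 20, "text": "50"},
)

# A deliberately shifted span, to show what a mismatch looks like.
shifted_ok = segment_matches_text(
    "You might enjoy Spicy Wet Hotdogs LLC.",
    {"start_index": 17, "end_index": 38, "text": "Spicy Wet Hotdogs LLC"},
)

print(restaurant_ok, guests_ok, shifted_ok)
```

A check like this is a cheap sanity test to run over a full file before training a span tagger, since transcription or tokenization steps can silently shift offsets.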
### Data Splits
- one_person_dialogs
The data in the `one_person_dialogs` config is split into `train`, `validation` and `test` splits.
| | train | validation | test |
|--------------|-------:|------------:|------:|
| N. Instances | 6168 | 770 | 770 |
- woz_dialogs
The `woz_dialogs` config has no default splits; all of its dialogs are provided as a single `train` split.
| | train |
|--------------|-------:|
| N. Instances | 5507 |
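As a quick consistency check, the split sizes in the two tables can be summed and compared with the 13,215 dialogs quoted in the Dataset Summary. The dictionary below just restates the table values; the variable names are illustrative.

```python
# Instance counts copied from the split tables above.
split_sizes = {
    "one_person_dialogs": {"train": 6168, "validation": 770, "test": 770},
    "woz_dialogs": {"train": 5507},
}

# Total per config, then across both configs.
config_totals = {name: sum(splits.values()) for name, splits in split_sizes.items()}
grand_total = sum(config_totals.values())

print(config_totals)
print(grand_total)
```

The self-dialog config accounts for 7,708 dialogs and the WOz config for 5,507, which together give the 13,215 dialogs stated in the summary.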
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The dataset is licensed under the `Creative Commons Attribution 4.0 License`.
### Citation Information
```
@inproceedings{48484,
title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
year = {2019}
}
```
### Contributions
Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
Provider:
maas
Created:
2025-07-07



