taskmaster2
ModelScope community. Updated 2025-12-05; indexed 2025-07-12.
Download link: https://modelscope.cn/datasets/google-research-datasets/taskmaster2
# Dataset Card for Taskmaster-2
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Taskmaster-1](https://research.google/tools/datasets/taskmaster-1/)
- **Repository:** [GitHub](https://github.com/google-research-datasets/Taskmaster/tree/master/TM-2-2020)
- **Paper:** [Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset](https://arxiv.org/abs/1909.05358)
- **Leaderboard:** N/A
- **Point of Contact:** [Taskmaster Googlegroup](mailto:taskmaster-datasets@googlegroups.com)
### Dataset Summary
Taskmaster is a dataset for goal-oriented conversations. The Taskmaster-2 dataset consists of 17,289 dialogs
across seven domains: restaurants, food ordering, movies, hotels, flights, music, and sports.
Unlike Taskmaster-1, which includes both written "self-dialogs" and spoken two-person dialogs,
Taskmaster-2 consists entirely of spoken two-person dialogs. In addition, while Taskmaster-1 is
almost exclusively task-based, Taskmaster-2 contains a good number of search- and recommendation-oriented dialogs.
All dialogs in this release were created using a Wizard of Oz (WOz) methodology in which crowdsourced
workers played the role of a 'user' and trained call center operators played the role of the 'assistant'.
In this way, users were led to believe they were interacting with an automated system that “spoke”
using text-to-speech (TTS) even though it was in fact a human behind the scenes.
As a result, users could express themselves however they chose in the context of an automated interface.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
A typical example looks like this:
```
{
  "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
  "instruction_id": "flight-6",
  "utterances": [
    {
      "index": 0,
      "segments": [],
      "speaker": "USER",
      "text": "Hi, I'm looking for a flight. I need to visit a friend."
    },
    {
      "index": 1,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Hello, how can I help you?"
    },
    {
      "index": 2,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Sure, I can help you with that."
    },
    {
      "index": 3,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "On what dates?"
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {
              "name": "flight_search.date.depart_origin"
            }
          ],
          "end_index": 37,
          "start_index": 27,
          "text": "March 20th"
        },
        {
          "annotations": [
            {
              "name": "flight_search.date.return"
            }
          ],
          "end_index": 45,
          "start_index": 41,
          "text": "22nd"
        }
      ],
      "speaker": "USER",
      "text": "I'm looking to travel from March 20th to 22nd."
    }
  ]
}
```
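The example above contains three consecutive `ASSISTANT` utterances, so downstream code may want to merge same-speaker runs into single turns. The following is a minimal sketch of one way to do that, using an abbreviated copy of the example (segments omitted). The helper name `merge_turns` is hypothetical and not part of the dataset or any library.

```python
from itertools import groupby

# Abbreviated copy of the example conversation above (segments omitted).
conversation = {
    "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
    "instruction_id": "flight-6",
    "utterances": [
        {"index": 0, "speaker": "USER",
         "text": "Hi, I'm looking for a flight. I need to visit a friend."},
        {"index": 1, "speaker": "ASSISTANT",
         "text": "Hello, how can I help you?"},
        {"index": 2, "speaker": "ASSISTANT",
         "text": "Sure, I can help you with that."},
        {"index": 3, "speaker": "ASSISTANT",
         "text": "On what dates?"},
        {"index": 4, "speaker": "USER",
         "text": "I'm looking to travel from March 20th to 22nd."},
    ],
}


def merge_turns(conv):
    """Hypothetical helper: join consecutive utterances by the same speaker."""
    # Sort by the 0-based `index` field so the order does not depend on the list order.
    ordered = sorted(conv["utterances"], key=lambda u: u["index"])
    return [
        (speaker, " ".join(u["text"] for u in group))
        for speaker, group in groupby(ordered, key=lambda u: u["speaker"])
    ]


turns = merge_turns(conversation)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

Whether merging is appropriate depends on the task; models that predict individual utterances would keep the original list instead.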
### Data Fields
Each conversation in the data file has the following structure:
- `conversation_id`: A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
- `utterances`: A list of utterances that make up the conversation.
- `instruction_id`: A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.
Each utterance has the following fields:
- `index`: A 0-based index indicating the order of the utterances in the conversation.
- `speaker`: Either USER or ASSISTANT, indicating which role generated this utterance.
- `text`: The raw text of the utterance. In self-dialogs (one_person_dialogs), the text is written by the crowdsourced worker. In WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
- `segments`: A list of various text spans with semantic annotations.
Each segment has the following fields:
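- `start_index`: The character offset at which the annotation starts in the utterance text.
- `end_index`: The character offset at which the annotation ends in the utterance text. In the example above this offset is exclusive: it points one past the last annotated character.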
- `text`: The raw text that has been annotated.
- `annotations`: A list of annotation details for this segment.
Each annotation has a single field:
- `name`: The annotation name.
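As a sketch of how these fields fit together, the snippet below slices the utterance text with `start_index` and `end_index` and pairs each span with its annotation names. It uses the index-4 utterance from the example in Data Instances. Treating the offsets as Python-style character offsets with an exclusive end is an assumption that holds for that example and is not otherwise documented here; the helper name `extract_slots` is hypothetical.

```python
# The index-4 utterance from the example in Data Instances.
utterance = {
    "index": 4,
    "speaker": "USER",
    "text": "I'm looking to travel from March 20th to 22nd.",
    "segments": [
        {
            "start_index": 27,
            "end_index": 37,
            "text": "March 20th",
            "annotations": [{"name": "flight_search.date.depart_origin"}],
        },
        {
            "start_index": 41,
            "end_index": 45,
            "text": "22nd",
            "annotations": [{"name": "flight_search.date.return"}],
        },
    ],
}


def extract_slots(utt):
    """Hypothetical helper: return (annotation name, span text) pairs."""
    slots = []
    for seg in utt["segments"]:
        # Assumption: offsets are character-based and end_index is exclusive.
        span = utt["text"][seg["start_index"]:seg["end_index"]]
        if span != seg["text"]:
            raise ValueError(f"offset mismatch: {span!r} != {seg['text']!r}")
        for ann in seg["annotations"]:
            slots.append((ann["name"], span))
    return slots


slots = extract_slots(utterance)
print(slots)
```

The explicit mismatch check is a cheap way to detect records where the offset convention differs from this assumption.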
### Data Splits
None of the configs has default splits. The table below lists the number of examples in each config.
| Config | Train |
|-------------------|--------|
| flights | 2481 |
| food-orderings | 1050 |
| hotels | 2355 |
| movies | 3047 |
| music | 1602 |
| restaurant-search | 3276 |
| sports | 3478 |
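As a quick consistency check, the per-config counts in the table can be summed and compared with the 17,289 dialogs reported in the Dataset Summary. The dictionary below is simply the table transcribed by hand; the variable names are illustrative.

```python
# Example counts per config, copied from the table above.
config_sizes = {
    "flights": 2481,
    "food-orderings": 1050,
    "hotels": 2355,
    "movies": 3047,
    "music": 1602,
    "restaurant-search": 3276,
    "sports": 3478,
}

total = sum(config_sizes.values())
largest = max(config_sizes, key=config_sizes.get)

print(f"total dialogs: {total}")
print(f"largest config: {largest}")
```

The sum matches the total given in the Dataset Summary, so the seven configs together account for the whole release.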
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The dataset is licensed under the Creative Commons Attribution 4.0 License.
### Citation Information
```
@inproceedings{48484,
title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
year = {2019}
}
```
### Contributions
Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
Provider: maas
Created: 2025-07-07



