Download link:

https://modelscope.cn/datasets/google-research-datasets/taskmaster2



Resource description:

# Dataset Card for Taskmaster-2

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Taskmaster-1](https://research.google/tools/datasets/taskmaster-1/)
- **Repository:** [GitHub](https://github.com/google-research-datasets/Taskmaster/tree/master/TM-2-2020)
- **Paper:** [Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset](https://arxiv.org/abs/1909.05358)
- **Leaderboard:** N/A
- **Point of Contact:** [Taskmaster Googlegroup](taskmaster-datasets@googlegroups.com)

### Dataset Summary

Taskmaster is a dataset for goal-oriented conversations. The Taskmaster-2 dataset consists of 17,289 dialogs in seven domains: restaurants, food ordering, movies, hotels, flights, music, and sports. Unlike Taskmaster-1, which includes both written "self-dialogs" and spoken two-person dialogs, Taskmaster-2 consists entirely of spoken two-person dialogs.
In addition, while Taskmaster-1 is almost exclusively task-based, Taskmaster-2 contains a good number of search- and recommendation-oriented dialogs. All dialogs in this release were created using a Wizard of Oz (WOz) methodology in which crowdsourced workers played the role of a 'user' and trained call center operators played the role of the 'assistant'. In this way, users were led to believe they were interacting with an automated system that “spoke” using text-to-speech (TTS) even though it was in fact a human behind the scenes. As a result, users could express themselves however they chose in the context of an automated interface.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

A typical example looks like this:

```
{
  "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
  "instruction_id": "flight-6",
  "utterances": [
    {
      "index": 0,
      "segments": [],
      "speaker": "USER",
      "text": "Hi, I'm looking for a flight. I need to visit a friend."
    },
    {
      "index": 1,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Hello, how can I help you?"
    },
    {
      "index": 2,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Sure, I can help you with that."
    },
    {
      "index": 3,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "On what dates?"
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {
              "name": "flight_search.date.depart_origin"
            }
          ],
          "end_index": 37,
          "start_index": 27,
          "text": "March 20th"
        },
        {
          "annotations": [
            {
              "name": "flight_search.date.return"
            }
          ],
          "end_index": 45,
          "start_index": 41,
          "text": "22nd"
        }
      ],
      "speaker": "USER",
      "text": "I'm looking to travel from March 20th to 22nd."
    }
  ]
}
```

### Data Fields

Each conversation in the data file has the following structure:

- `conversation_id`: A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
- `utterances`: A list of utterances that make up the conversation.
- `instruction_id`: A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.

Each utterance has the following fields:

- `index`: A 0-based index indicating the order of the utterances in the conversation.
- `speaker`: Either USER or ASSISTANT, indicating which role generated this utterance.
- `text`: The raw text of the utterance. In the case of self-dialogs (one_person_dialogs), this is written by the crowdsourced worker. In the case of the WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
- `segments`: A list of various text spans with semantic annotations.

Each segment has the following fields:

- `start_index`: The position of the start of the annotation in the utterance text.
- `end_index`: The position of the end of the annotation in the utterance text.
- `text`: The raw text that has been annotated.
- `annotations`: A list of annotation details for this segment.

Each annotation has a single field:

- `name`: The annotation name.

### Data Splits

No config defines default splits. The table below lists the number of examples in each config.

| Config            | Train |
|-------------------|-------|
| flights           | 2481  |
| food-orderings    | 1050  |
| hotels            | 2355  |
| movies            | 3047  |
| music             | 1602  |
| restaurant-search | 3276  |
| sports            | 3478  |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is licensed under the Creative Commons Attribution 4.0 License.

### Citation Information

[More Information Needed]

```
@inproceedings{48484,
  title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
  author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
  year = {2019}
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
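
### Example: Reading Segment Annotations

The segment fields described under Data Fields can be illustrated with a short Python sketch. It assumes `end_index` is exclusive, like a Python slice bound. The card does not state this, but the sample instance is consistent with it: characters 27 to 37 of the utterance are "March 20th". The function name `extract_annotated_spans` is hypothetical, and the conversation below is an abridged copy of the sample instance, not a separate record.

```python
from typing import Any


def extract_annotated_spans(conversation: dict[str, Any]) -> list[tuple[int, str, str]]:
    """Return (utterance index, annotation name, span text) for each annotation."""
    spans = []
    for utterance in conversation["utterances"]:
        text = utterance["text"]
        for segment in utterance["segments"]:
            # Assumption: end_index is exclusive, so a plain slice recovers the span.
            span = text[segment["start_index"]:segment["end_index"]]
            if span != segment["text"]:
                raise ValueError(
                    f"Span mismatch in utterance {utterance['index']}: "
                    f"{span!r} != {segment['text']!r}"
                )
            for annotation in segment["annotations"]:
                spans.append((utterance["index"], annotation["name"], span))
    return spans


# Abridged copy of the sample instance (utterances 3 and 4 only).
conversation = {
    "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
    "instruction_id": "flight-6",
    "utterances": [
        {"index": 3, "segments": [], "speaker": "ASSISTANT", "text": "On what dates?"},
        {
            "index": 4,
            "segments": [
                {
                    "annotations": [{"name": "flight_search.date.depart_origin"}],
                    "end_index": 37,
                    "start_index": 27,
                    "text": "March 20th",
                },
                {
                    "annotations": [{"name": "flight_search.date.return"}],
                    "end_index": 45,
                    "start_index": 41,
                    "text": "22nd",
                },
            ],
            "speaker": "USER",
            "text": "I'm looking to travel from March 20th to 22nd.",
        },
    ],
}

spans = extract_annotated_spans(conversation)
for utterance_index, name, span in spans:
    print(utterance_index, name, repr(span))
# 4 flight_search.date.depart_origin 'March 20th'
# 4 flight_search.date.return '22nd'
```

Comparing the sliced text against the stored `text` field is a cheap sanity check that the offsets are being interpreted as intended before the annotations are used for slot-filling or similar tasks.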


Application scenarios: