Download link:

https://modelscope.cn/datasets/google-research-datasets/taskmaster3

Resource overview:

# Dataset Card for taskmaster3

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Taskmaster](https://research.google/tools/datasets/taskmaster-1/)
- **Repository:** [GitHub](https://github.com/google-research-datasets/Taskmaster/tree/master/TM-3-2020)
- **Paper:** [Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset](https://arxiv.org/abs/1909.05358)
- **Leaderboard:** N/A
- **Point of Contact:** [Taskmaster Googlegroup](mailto:taskmaster-datasets@googlegroups.com)

### Dataset Summary

Taskmaster is a dataset for goal-oriented conversations. The Taskmaster-3 dataset consists of 23,757 movie ticketing dialogs. By "movie ticketing" we mean conversations where the customer's goal is to purchase tickets after deciding on theater, time, movie name, number of tickets, and date, or to opt out of the transaction.

This collection was created using the "self-dialog" method: a single crowdsourced worker is paid to create a conversation, writing the turns for both speakers, i.e. the customer and the ticketing agent.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

A typical example looks like this:

```
{
  "conversation_id": "dlg-ddee80da-9ffa-4773-9ce7-f73f727cb79c",
  "instructions": "SCENARIO: Pretend you’re *using a digital assistant to purchase tickets for a movie currently showing in theaters*. ...",
  "scenario": "4 exchanges with 1 error and predefined variables",
  "utterances": [
    {
      "apis": [],
      "index": 0,
      "segments": [
        {
          "annotations": [
            { "name": "num.tickets" }
          ],
          "end_index": 21,
          "start_index": 20,
          "text": "2"
        },
        {
          "annotations": [
            { "name": "name.movie" }
          ],
          "end_index": 42,
          "start_index": 37,
          "text": "Mulan"
        }
      ],
      "speaker": "user",
      "text": "I would like to buy 2 tickets to see Mulan."
    },
    {
      "index": 6,
      "segments": [],
      "speaker": "user",
      "text": "Yes.",
      "apis": [
        {
          "args": [
            { "arg_name": "name.movie", "arg_value": "Mulan" },
            { "arg_name": "name.theater", "arg_value": "Mountain AMC 16" }
          ],
          "index": 6,
          "name": "book_tickets",
          "response": [
            { "response_name": "status", "response_value": "success" }
          ]
        }
      ]
    }
  ],
  "vertical": "Movie Tickets"
}
```

### Data Fields

Each conversation in the data file has the following structure:

- `conversation_id`: A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
- `utterances`: A list of utterances that make up the conversation.
- `instructions`: Instructions for the crowdsourced worker used in creating the conversation.
- `vertical`: In this dataset the vertical for all dialogs is "Movie Tickets".
- `scenario`: The title of the instructions for each dialog.

Each utterance has the following fields:

- `index`: A 0-based index indicating the order of the utterances in the conversation.
- `speaker`: Either USER or ASSISTANT, indicating which role generated this utterance. (The sample above stores the value in lowercase, as `user`.)
- `text`: The raw text of the utterance. For self-dialogs (one_person_dialogs), this is written by the crowdsourced worker. For WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
- `segments`: A list of text spans with semantic annotations.
- `apis`: An array of API invocations made during the utterance.

Each API has the following structure:

- `name`: The name of the API invoked (e.g. find_movies).
- `index`: The index of the parent utterance.
- `args`: A `list` of `dict`s with keys `arg_name` and `arg_value`, which hold the name and the value of each argument respectively.
- `response`: A `list` of `dict`s with keys `response_name` and `response_value`, which hold the name and the value of each response item respectively.

Each segment has the following fields:

- `start_index`: The position of the start of the annotation in the utterance text.
- `end_index`: The position of the end of the annotation in the utterance text.
- `text`: The raw text that has been annotated.
- `annotations`: A list of annotation details for this segment.

Each annotation has a single field:

- `name`: The annotation name.

### Data Splits

There are no default splits for any config. The table below lists the number of examples in each config.

|             | Train |
|-------------|-------|
| n_instances | 23757 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is licensed under the `Creative Commons Attribution 4.0 License`.

### Citation Information

```
@inproceedings{48484,
  title  = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
  author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
  year   = {2019}
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
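
### Example: Reading a Conversation Record

As an illustration of the record layout described under Data Fields, the following is a minimal sketch that walks one conversation and collects its annotated spans and API calls. It works on a trimmed, hand-copied version of the sample record from Data Instances rather than on the downloaded files. The helper names `extract_slots` and `extract_api_calls` are hypothetical and not part of any Taskmaster tooling. Treating `end_index` as exclusive (Python slice semantics) is an assumption inferred from the sample, where characters 20 to 21 cover "2" and 37 to 42 cover "Mulan".

```python
# Trimmed copy of the sample record shown in the Data Instances section.
SAMPLE = {
    "conversation_id": "dlg-ddee80da-9ffa-4773-9ce7-f73f727cb79c",
    "vertical": "Movie Tickets",
    "utterances": [
        {
            "index": 0,
            "speaker": "user",
            "text": "I would like to buy 2 tickets to see Mulan.",
            "apis": [],
            "segments": [
                {"start_index": 20, "end_index": 21, "text": "2",
                 "annotations": [{"name": "num.tickets"}]},
                {"start_index": 37, "end_index": 42, "text": "Mulan",
                 "annotations": [{"name": "name.movie"}]},
            ],
        },
        {
            "index": 6,
            "speaker": "user",
            "text": "Yes.",
            "segments": [],
            "apis": [
                {
                    "index": 6,
                    "name": "book_tickets",
                    "args": [
                        {"arg_name": "name.movie", "arg_value": "Mulan"},
                        {"arg_name": "name.theater",
                         "arg_value": "Mountain AMC 16"},
                    ],
                    "response": [
                        {"response_name": "status",
                         "response_value": "success"},
                    ],
                }
            ],
        },
    ],
}


def extract_slots(conversation):
    """Return (utterance index, annotation name, span text) triples.

    Hypothetical helper. The span is re-sliced from the utterance text,
    assuming end_index is exclusive, instead of trusting segment["text"].
    """
    slots = []
    for utterance in conversation["utterances"]:
        for segment in utterance.get("segments", []):
            span = utterance["text"][segment["start_index"]:segment["end_index"]]
            for annotation in segment.get("annotations", []):
                slots.append((utterance["index"], annotation["name"], span))
    return slots


def extract_api_calls(conversation):
    """Flatten each API call's args and response lists into plain dicts.

    Hypothetical helper; the key names follow the Data Fields section.
    """
    calls = []
    for utterance in conversation["utterances"]:
        for api in utterance.get("apis", []):
            calls.append({
                "utterance_index": api["index"],
                "name": api["name"],
                "args": {a["arg_name"]: a["arg_value"]
                         for a in api.get("args", [])},
                "response": {r["response_name"]: r["response_value"]
                             for r in api.get("response", [])},
            })
    return calls


if __name__ == "__main__":
    for slot in extract_slots(SAMPLE):
        print(slot)
    for call in extract_api_calls(SAMPLE):
        print(call)
```

Re-slicing the span from `text` doubles as a sanity check: if the slice differs from the segment's own `text` field in a given record, the offset assumption does not hold for that record.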


Application scenarios: