
taskmaster1

ModelScope community. Updated 2025-12-05. Indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/google-research-datasets/taskmaster1
Resource description:
# Dataset Card for Taskmaster-1

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Taskmaster-1](https://research.google/tools/datasets/taskmaster-1/)
- **Repository:** [GitHub](https://github.com/google-research-datasets/Taskmaster/tree/master/TM-1-2019)
- **Paper:** [Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset](https://arxiv.org/abs/1909.05358)
- **Leaderboard:** N/A
- **Point of Contact:** [Taskmaster Googlegroup](taskmaster-datasets@googlegroups.com)

### Dataset Summary

Taskmaster-1 is a goal-oriented conversational dataset. It includes 13,215 task-based dialogs spanning six domains. Two procedures were used to create this collection, each with its own advantages. The first is a two-person, spoken "Wizard of Oz" (WOz) approach in which trained agents and crowdsourced workers interact to complete the task. The second is "self-dialog", in which crowdsourced workers write the entire dialog themselves.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

A typical example looks like this:

```
{
  "conversation_id": "dlg-336c8165-068e-4b4b-803d-18ef0676f668",
  "instruction_id": "restaurant-table-2",
  "utterances": [
    {
      "index": 0,
      "segments": [],
      "speaker": "USER",
      "text": "Hi, I'm looking for a place that sells spicy wet hotdogs, can you think of any?"
    },
    {
      "index": 1,
      "segments": [
        {
          "annotations": [
            {
              "name": "restaurant_reservation.name.restaurant.reject"
            }
          ],
          "end_index": 37,
          "start_index": 16,
          "text": "Spicy Wet Hotdogs LLC"
        }
      ],
      "speaker": "ASSISTANT",
      "text": "You might enjoy Spicy Wet Hotdogs LLC."
    },
    {
      "index": 2,
      "segments": [],
      "speaker": "USER",
      "text": "That sounds really good, can you make me a reservation?"
    },
    {
      "index": 3,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Certainly, when would you like a reservation?"
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {
              "name": "restaurant_reservation.num.guests"
            },
            {
              "name": "restaurant_reservation.num.guests"
            }
          ],
          "end_index": 20,
          "start_index": 18,
          "text": "50"
        }
      ],
      "speaker": "USER",
      "text": "I have a party of 50 who want a really sloppy dog on Saturday at noon."
    }
  ]
}
```

### Data Fields

Each conversation in the data file has the following structure:

- `conversation_id`: A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
- `utterances`: A list of utterances that make up the conversation.
- `instruction_id`: A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.

Each utterance has the following fields:

- `index`: A 0-based index indicating the order of the utterances in the conversation.
- `speaker`: Either USER or ASSISTANT, indicating which role generated this utterance.
- `text`: The raw text of the utterance. In self-dialogs (`one_person_dialogs`), this is written by the crowdsourced worker. In WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
- `segments`: A list of text spans with semantic annotations.

Each segment has the following fields:

- `start_index`: The position of the start of the annotation in the utterance text.
- `end_index`: The position of the end of the annotation in the utterance text.
- `text`: The raw text that has been annotated.
- `annotations`: A list of annotation details for this segment.

Each annotation has a single field:

- `name`: The annotation name.

### Data Splits

- one_person_dialogs

The data in the `one_person_dialogs` config is split into `train`, `dev` (listed as `validation` in the table below) and `test` splits.

|              | train  | validation  | test  |
|--------------|-------:|------------:|------:|
| N. Instances |   6168 |         770 |   770 |

- woz_dialogs

The data in the `woz_dialogs` config has no default splits; all instances are listed under `train`.

|              | train  |
|--------------|-------:|
| N. Instances |   5507 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is licensed under the `Creative Commons Attribution 4.0 License`.

### Citation Information

```
@inproceedings{48484,
  title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
  author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
  year = {2019}
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
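
### Example: reading annotated segments

The following is a minimal sketch, not part of the original card, of how the fields described under Data Fields fit together. It parses an abbreviated copy of the example conversation (only the two utterances that carry segments) and recovers each annotated span by slicing the utterance text. The helper name `extract_spans` is hypothetical. Treating `start_index` and `end_index` as character offsets with an exclusive end is an assumption that holds for this one example; the card does not state that it holds for the whole dataset.

```python
import json

# Abbreviated copy of the example conversation from the Data Instances section.
conversation = json.loads("""
{
  "conversation_id": "dlg-336c8165-068e-4b4b-803d-18ef0676f668",
  "instruction_id": "restaurant-table-2",
  "utterances": [
    {
      "index": 1,
      "segments": [
        {
          "annotations": [
            {"name": "restaurant_reservation.name.restaurant.reject"}
          ],
          "end_index": 37,
          "start_index": 16,
          "text": "Spicy Wet Hotdogs LLC"
        }
      ],
      "speaker": "ASSISTANT",
      "text": "You might enjoy Spicy Wet Hotdogs LLC."
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {"name": "restaurant_reservation.num.guests"},
            {"name": "restaurant_reservation.num.guests"}
          ],
          "end_index": 20,
          "start_index": 18,
          "text": "50"
        }
      ],
      "speaker": "USER",
      "text": "I have a party of 50 who want a really sloppy dog on Saturday at noon."
    }
  ]
}
""")


def extract_spans(conv):
    """Yield (utterance index, speaker, span text, annotation names) per segment.

    Hypothetical helper: assumes character offsets with an exclusive end_index.
    """
    for utt in conv["utterances"]:
        for seg in utt.get("segments", []):
            span = utt["text"][seg["start_index"]:seg["end_index"]]
            names = [ann["name"] for ann in seg["annotations"]]
            yield utt["index"], utt["speaker"], span, names


spans = list(extract_spans(conversation))
for row in spans:
    print(row)
```

Comparing each sliced span against the segment's own `text` field is a cheap consistency check to run before relying on the offsets. Note that annotation names can repeat within a segment, as in the second utterance here.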

Provider: maas
Created: 2025-07-07