schema_guided_dstc8

ModelScope community · updated 2025-12-05 · indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/google-research-datasets/schema_guided_dstc8
Resource overview:
# Dataset Card for The Schema-Guided Dialogue Dataset

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Github repository for The Schema-Guided Dialogue Dataset](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue)
- **Paper:** [Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset](https://arxiv.org/abs/1909.05855)
- **Point of Contact:** [abhirast@google.com](mailto:abhirast@google.com)

### Dataset Summary

The Schema-Guided Dialogue dataset (SGD) was developed for the Dialogue State Tracking task of the Eighth Dialogue Systems Technology Challenge (DSTC8). The SGD dataset consists of over 18k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 17 domains, ranging from banks and events to media, calendar, travel, and weather.
For most of these domains, the SGD dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios.

### Supported Tasks and Leaderboards

This dataset is designed to serve as an effective test-bed for intent prediction, slot filling, state tracking (i.e., estimating the user's goal) and language generation, among other tasks for large-scale virtual assistants:

- **Generative dialogue modeling** or `dialogue-modeling`: the text of the dialogues can be used to train a sequence model on the utterances. Performance on this task is typically evaluated with delexicalized-[BLEU](https://huggingface.co/metrics/bleu), inform rate and request success.
- **Intent state tracking**, a `multi-class-classification` task: predict the belief state of the user side of the conversation. Performance is measured by [F1](https://huggingface.co/metrics/f1).
- **Action prediction**, a `parsing` task: parse an utterance into the corresponding dialog acts for the system to use. [F1](https://huggingface.co/metrics/f1) is typically reported.

### Languages

The text in the dataset is in English (`en`).

## Dataset Structure

### Data Instances

- `dialogues` configuration (default): Each dialogue is represented as a sequence of turns, each containing a user or system utterance. The annotations for each turn are grouped into frames, where each frame corresponds to a single service. The annotations for user turns include the active intent, the dialogue state and the slot spans for the slot values mentioned in the turn. For system turns, the annotations are the system actions representing the semantics of the system utterance. Each system action is represented using a dialogue act with optional parameters.
- `schema` configuration: In addition to the dialogues, for each service used in the dataset, a normalized representation of the interface exposed is provided as the schema.
  The schema contains details like the name of the service, the list of tasks supported by the service (intents) and the attributes of the entities used by the service (slots). The schema also contains natural language descriptions of the service, intents and slots, which can be used to develop models that condition their predictions on the schema.

### Data Fields

Each dialog instance has the following fields:

- `dialogue_id`: A unique identifier for a dialogue.
- `services`: A list of services present in the dialogue.
- `turns`: A list of annotated system or user utterances. Each turn consists of the following fields:
  - `speaker`: The speaker for the turn. Either `USER` or `SYSTEM`.
  - `utterance`: A string containing the natural language utterance.
  - `frames`: A list of frames. Each frame contains annotations for a single service and consists of the following fields:
    - `service`: The name of the service corresponding to the frame. The slots and intents used in the following fields are taken from the schema of this service.
    - `slots`: A list of slot spans in the utterance, only provided for non-categorical slots. Each slot span contains the following fields:
      - `slot`: The name of the slot.
      - `start`: The index of the starting character in the utterance corresponding to the slot value.
      - `exclusive_end`: The index of the character just after the last character corresponding to the slot value in the utterance.
    - `actions`: A list of actions corresponding to the system. Each action has the following fields:
      - `act`: The type of action.
      - `slot`: (optional) A slot argument for some of the actions.
      - `values`: (optional) A list of values assigned to the slot. If the values list is non-empty, then the slot must be present.
      - `canonical_values`: (optional) The values in their canonicalized form as used by the service. It is a list of strings of the same length as values.
    - `service_call`: (system turns only, optional) The request sent to the service.
      It consists of the following fields:
      - `method`: The name of the intent or function of the service or API being executed.
      - `parameters`: A pair of lists of the same length: `parameter_slot_name` contains slot names and `parameter_canonical_value` contains the corresponding values in their canonicalized form.
    - `service_results`: (system turns only, optional) A list of entities containing the results obtained from the service. It is only available for turns in which a service call is made. Each entity is represented as a pair of lists of the same length: `service_slot_name` contains slot names and `service_canonical_value` contains the corresponding canonical values.
    - `state`: (user turns only) The dialogue state corresponding to the service. It consists of the following fields:
      - `active_intent`: The intent corresponding to the service of the frame which is currently being fulfilled by the system. It takes the value "NONE" if none of the intents are active.
      - `requested_slots`: A list of slots requested by the user in the current turn.
      - `slot_values`: A pair of lists of the same length: `slot_name` contains slot names and `slot_value_list` contains the corresponding lists of strings. For categorical slots, this list contains a single value assigned to the slot. For non-categorical slots, all the values in this list are spoken variations of each other and are equivalent (e.g., "6 pm", "six in the evening", "evening at 6" etc.).
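
The field layout above can be exercised with a small sketch. The turn below is hand-written for illustration and is not taken from the dataset; the service, intent and slot names, as well as the helper names `slot_spans` and `paired_lists_to_dict`, are assumptions. The layout of records returned by a particular loader may differ from this description, so treat this as a reading aid rather than a loader.

```python
from typing import Any, Dict, List

# Hypothetical user turn written to follow the field descriptions above.
turn: Dict[str, Any] = {
    "speaker": "USER",
    "utterance": "Book a table for two at 6 pm in San Jose.",
    "frames": [
        {
            "service": "Restaurants_1",  # illustrative service name
            "slots": [
                {"slot": "time", "start": 24, "exclusive_end": 28},
                {"slot": "city", "start": 32, "exclusive_end": 40},
            ],
            "state": {
                "active_intent": "ReserveRestaurant",  # illustrative intent
                "requested_slots": [],
                "slot_values": {
                    "slot_name": ["time", "city"],
                    "slot_value_list": [["6 pm"], ["San Jose"]],
                },
            },
        }
    ],
}


def slot_spans(utterance: str, frame: Dict[str, Any]) -> Dict[str, str]:
    """Recover the surface text of each slot span.

    `exclusive_end` is one past the last character, so it maps directly
    onto a Python slice.
    """
    return {
        span["slot"]: utterance[span["start"]:span["exclusive_end"]]
        for span in frame.get("slots", [])
    }


def paired_lists_to_dict(names: List[str], values: List[Any]) -> Dict[str, Any]:
    """Zip two equal-length lists (names, values) into a single mapping."""
    if len(names) != len(values):
        raise ValueError("paired lists must have the same length")
    return dict(zip(names, values))


frame = turn["frames"][0]
print(slot_spans(turn["utterance"], frame))
# {'time': '6 pm', 'city': 'San Jose'}

state = frame["state"]["slot_values"]
print(paired_lists_to_dict(state["slot_name"], state["slot_value_list"]))
# {'time': ['6 pm'], 'city': ['San Jose']}
```

The same pairing helper would apply to `parameters` in `service_call` and to each entity in `service_results`, since both are described as pairs of equal-length lists.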

The mapping from action ID to action name is as follows:

- 0: AFFIRM
- 1: AFFIRM_INTENT
- 2: CONFIRM
- 3: GOODBYE
- 4: INFORM
- 5: INFORM_COUNT
- 6: INFORM_INTENT
- 7: NEGATE
- 8: NEGATE_INTENT
- 9: NOTIFY_FAILURE
- 10: NOTIFY_SUCCESS
- 11: OFFER
- 12: OFFER_INTENT
- 13: REQUEST
- 14: REQUEST_ALTS
- 15: REQ_MORE
- 16: SELECT
- 17: THANK_YOU

### Data Splits

The dataset is split into a `train`, `validation`, and `test` split with the following sizes:

|                     | train | validation |  test |
|---------------------|------:|-----------:|------:|
| Number of dialogues | 16142 |       2482 |  4201 |
| Number of turns     | 48426 |       7446 | 12603 |

## Dataset Creation

### Curation Rationale

The data was collected by first using a dialogue simulator to generate dialogue outlines and then paraphrasing them to obtain natural utterances. Using a dialogue simulator ensures coverage of a large variety of dialogue flows, because similar flows are filtered out in the simulation phase to create a diverse dataset. Dialogues can also be generated together with their annotations, as opposed to a Wizard-of-Oz setup, which is prone to manual annotation errors.

### Source Data

#### Initial Data Collection and Normalization

The dialogue outlines are first generated by a simulator. The dialogue simulator interacts with the services to generate dialogue outlines. It consists of two agents playing the roles of the user and the system, interacting with each other using a finite set of actions specified through dialogue acts over a probabilistic automaton designed to capture varied dialogue trajectories. It is worth noting that the simulation automaton does not include any domain-specific constraints: all domain-specific constraints are encoded in the schema and scenario.

The dialogue paraphrasing framework then converts the outlines generated by the simulator into a natural conversation. Users may refer to the slot values in the dialogue acts in various ways during the conversation, e.g., “los angeles” may be referred to as “LA” or “LAX”.
To introduce these natural variations in the slot values, different slot values are replaced with a randomly selected variation while being kept consistent across user turns in a dialogue. The actions are then converted to pseudo-natural language utterances using a set of manually defined action-to-text templates, and the resulting utterances for the different actions in a turn are concatenated together.

Finally, the dialogue transformed by these steps is sent to the crowd workers to be reformulated into more natural language. One crowd worker is tasked with paraphrasing all utterances of a dialogue to ensure naturalness and coherence. The crowd workers are asked to exactly repeat the slot values in their paraphrases so that the span indices for the slots can be recovered via string matching.

#### Who are the source language producers?

The language structure is machine-generated, and the language realizations are produced by crowd workers. The dataset paper does not provide demographic information for the crowd workers.

### Annotations

#### Annotation process

The annotations are automatically obtained during the initial sampling process and by string matching after reformulation.

#### Who are the annotators?

[N/A]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The dataset was created by a team of researchers working at Google Mountain View.

### Licensing Information

The dataset is released under the CC BY-SA 4.0 license.

### Citation Information

For the DSTC8 task, please cite:

```
@article{corr/abs-2002-01359,
  author        = {Abhinav Rastogi and Xiaoxue Zang and Srinivas Sunkara and Raghav Gupta and Pranav Khaitan},
  title         = {Schema-Guided Dialogue State Tracking Task at {DSTC8}},
  journal       = {CoRR},
  volume        = {abs/2002.01359},
  year          = {2020},
  url           = {https://arxiv.org/abs/2002.01359},
  archivePrefix = {arXiv},
  eprint        = {2002.01359}
}
```

For the initial release paper, please cite:

```
@inproceedings{aaai/RastogiZSGK20,
  author    = {Abhinav Rastogi and Xiaoxue Zang and Srinivas Sunkara and Raghav Gupta and Pranav Khaitan},
  title     = {Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  booktitle = {The Thirty-Fourth {AAAI} Conference on Artificial Intelligence, {AAAI} 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, {IAAI} 2020, The Tenth {AAAI} Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2020, New York, NY, USA, February 7-12, 2020},
  pages     = {8689--8696},
  publisher = {{AAAI} Press},
  year      = {2020},
  url       = {https://aaai.org/ojs/index.php/AAAI/article/view/6394}
}
```

### Contributions

Thanks to [@yjernite](https://github.com/yjernite) for adding this dataset.

Provider: maas
Created: 2025-07-07