
ConvLab/sgd

Source: Hugging Face. Updated 2024-05-08. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ConvLab/sgd
Resource description:
---
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: SGD
size_categories:
- 10K<n<100K
task_categories:
- conversational
---

# Dataset Card for Schema-Guided Dialogue

- **Repository:** https://github.com/google-research-datasets/dstc8-schema-guided-dialogue
- **Paper:** https://arxiv.org/pdf/1909.05855.pdf
- **Leaderboard:** None
- **Who transforms the dataset:** Qi Zhu(zhuq96 at gmail dot com)

To use this dataset, you need to install [ConvLab-3](https://github.com/ConvLab/ConvLab-3) platform first. Then you can load the dataset via:

```
from convlab.util import load_dataset, load_ontology, load_database

dataset = load_dataset('sgd')
ontology = load_ontology('sgd')
database = load_database('sgd')
```

For more usage please refer to [here](https://github.com/ConvLab/ConvLab-3/tree/master/data/unified_datasets).

### Dataset Summary

The **Schema-Guided Dialogue (SGD)** dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, such as banks, events, media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks for developing large-scale virtual assistants. Additionally, the dataset contains unseen domains and services in the evaluation set to quantify the performance in zero-shot or few-shot settings.

- **How to get the transformed data from original data:**
  - Download [dstc8-schema-guided-dialogue-master.zip](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue/archive/refs/heads/master.zip).
  - Run `python preprocess.py` in the current directory.
- **Main changes of the transformation:**
  - Lower case original `act` as `intent`.
  - Add `count` slot for each domain, non-categorical, find span by text matching.
  - Categorize `dialogue acts` according to the `intent`.
  - Concatenate multiple values using `|`.
  - Retain `active_intent`, `requested_slots`, `service_call`.
- **Annotations:**
  - dialogue acts, state, db_results, service_call, active_intent, requested_slots.

### Supported Tasks and Leaderboards

NLU, DST, Policy, NLG, E2E

### Languages

English

### Data Splits

| split      | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
| ---------- | --------- | ---------- | ------- | ---------- | ----------- | --------------------- | -------------------- | ---------------------------- | ------------------------------- |
| train      | 16142     | 329964     | 20.44   | 9.75       | 1.84        | 100                   | -                    | 100                          | 100                             |
| validation | 2482      | 48726      | 19.63   | 9.66       | 1.84        | 100                   | -                    | 100                          | 100                             |
| test       | 4201      | 84594      | 20.14   | 10.4       | 2.02        | 100                   | -                    | 100                          | 100                             |
| all        | 22825     | 463284     | 20.3    | 9.86       | 1.87        | 100                   | -                    | 100                          | 100                             |

45 domains: ['Banks_1', 'Buses_1', 'Buses_2', 'Calendar_1', 'Events_1', 'Events_2', 'Flights_1', 'Flights_2', 'Homes_1', 'Hotels_1', 'Hotels_2', 'Hotels_3', 'Media_1', 'Movies_1', 'Music_1', 'Music_2', 'RentalCars_1', 'RentalCars_2', 'Restaurants_1', 'RideSharing_1', 'RideSharing_2', 'Services_1', 'Services_2', 'Services_3', 'Travel_1', 'Weather_1', 'Alarm_1', 'Banks_2', 'Flights_3', 'Hotels_4', 'Media_2', 'Movies_2', 'Restaurants_2', 'Services_4', 'Buses_3', 'Events_3', 'Flights_4', 'Homes_2', 'Media_3', 'Messaging_1', 'Movies_3', 'Music_3', 'Payment_1', 'RentalCars_3', 'Trains_1']

- **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
- **non-cat slot span**: how many values of non-categorical slots have span annotation in percentage.

### Citation

```
@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}
```

### Licensing Information

[**CC BY-SA 4.0**](https://creativecommons.org/licenses/by-sa/4.0/)
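The two percentages defined under Data Splits (`cat slot match` and `non-cat slot span`) can be illustrated with a small sketch. This is not the ConvLab-3 implementation: the function names, the `(slot, value)` pair layout, the `start`/`end` keys, and the toy ontology below are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def cat_slot_match(annotations: List[Tuple[str, str]],
                   possible_values: Dict[str, Set[str]]) -> float:
    """Percentage of categorical values that appear in the ontology's value list."""
    if not annotations:
        raise ValueError("no categorical annotations to score")
    matched = sum(1 for slot, value in annotations
                  if value in possible_values.get(slot, set()))
    return 100.0 * matched / len(annotations)


def non_cat_slot_span(annotations: List[dict]) -> float:
    """Percentage of non-categorical values that carry a span annotation."""
    if not annotations:
        raise ValueError("no non-categorical annotations to score")
    with_span = sum(1 for a in annotations
                    if a.get("start") is not None and a.get("end") is not None)
    return 100.0 * with_span / len(annotations)


# Toy data (hypothetical, not taken from the dataset).
toy_ontology = {"price_range": {"cheap", "moderate", "expensive"}}
toy_cat = [("price_range", "cheap"),
           ("price_range", "moderate"),
           ("price_range", "pricey")]      # not in the ontology
toy_non_cat = [{"value": "San Jose", "start": 10, "end": 18},
               {"value": "7 pm"}]          # no span annotation

cat_pct = cat_slot_match(toy_cat, toy_ontology)
span_pct = non_cat_slot_span(toy_non_cat)
```

A value of 100 in the table therefore means every annotated value passed the corresponding check.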
Provider:
ConvLab
Summary of the original information

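Dataset overview

Dataset name: Schema-Guided Dialogue (SGD)

Dataset size: over 20,000 annotated multi-domain, task-oriented conversations between a human and a virtual assistant.

Domain coverage: 20 domains, such as banks, events, media, calendar, travel, and weather.

Intended uses: intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks.

Language: English

Dataset transformation:

  • Original act labels are lowercased and stored as intent.
  • A non-categorical count slot is added for each domain, with spans found by text matching.
  • Dialogue acts are categorized according to the intent.
  • Multiple values are concatenated with |.
  • active_intent, requested_slots, and service_call are retained.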
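Three of the transformation steps listed above (lowercasing the act, joining multiple values with |, and locating a span by text matching) can be sketched in a few lines. The helper names and the example act, values, and utterance are hypothetical, and the end-exclusive character offsets are an assumption of this sketch rather than a statement about preprocess.py.

```python
from typing import List, Optional, Tuple


def to_intent(act: str) -> str:
    """Lowercase an original act label to obtain the intent."""
    return act.lower()


def join_values(values: List[str]) -> str:
    """Concatenate multiple values of one slot with '|'."""
    return "|".join(values)


def find_span(utterance: str, value: str) -> Optional[Tuple[int, int]]:
    """Locate a value in an utterance by case-insensitive text matching.

    Returns (start, end) character offsets with an exclusive end,
    or None when the value does not occur in the utterance.
    """
    start = utterance.lower().find(value.lower())
    if start == -1:
        return None
    return start, start + len(value)


# Illustrative inputs only.
intent = to_intent("INFORM_COUNT")
joined = join_values(["Chinese", "Asian"])
span = find_span("I found 10 restaurants for you.", "10")
```

Plain substring matching picks the first occurrence only; a real preprocessing script may need extra handling when a value occurs several times in an utterance.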

Data splits:

  • Training set: 16,142 dialogues, 329,964 utterances
  • Validation set: 2,482 dialogues, 48,726 utterances
  • Test set: 4,201 dialogues, 84,594 utterances

Supported tasks: NLU, DST, Policy, NLG, E2E

License: CC BY-SA 4.0

Dataset details

Steps to transform the dataset:

  1. Download dstc8-schema-guided-dialogue-master.zip.
  2. Run python preprocess.py in the current directory to preprocess the data.
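The two steps above could be scripted roughly as follows. This is a sketch under assumptions: it presumes the script is run from the directory that already contains preprocess.py (in ConvLab-3 this appears to be the sgd folder under data/unified_datasets), and that preprocess.py looks for the archive under its original file name in that directory. The helper names are hypothetical.

```python
import subprocess
import sys
import urllib.request
from pathlib import Path

# Archive URL as given in the dataset card.
ZIP_URL = ("https://github.com/google-research-datasets/"
           "dstc8-schema-guided-dialogue/archive/refs/heads/master.zip")
ZIP_NAME = "dstc8-schema-guided-dialogue-master.zip"


def download_archive(target_dir: Path) -> Path:
    """Download the original SGD archive unless it is already present."""
    zip_path = target_dir / ZIP_NAME
    if not zip_path.exists():
        urllib.request.urlretrieve(ZIP_URL, str(zip_path))
    return zip_path


def run_preprocess(target_dir: Path) -> None:
    """Run preprocess.py with the current interpreter inside target_dir."""
    subprocess.run([sys.executable, "preprocess.py"], cwd=target_dir, check=True)


# Typical use, from the directory that holds preprocess.py:
#   here = Path.cwd()
#   download_archive(here)
#   run_preprocess(here)
```

Skipping the download when the file exists keeps repeated runs from fetching the archive again.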

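Detailed statistics:

  • Average utterances per dialogue: training 20.44, validation 19.63, test 20.14
  • Average tokens per utterance: training 9.75, validation 9.66, test 10.4
  • Average domains per dialogue: training 1.84, validation 1.84, test 2.02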
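The average-utterances figures follow directly from the dialogue and utterance counts of each split, which makes for a quick consistency check. The sketch below only redoes that arithmetic with the published counts; the variable and function names are illustrative.

```python
# (dialogues, utterances) per split, as published in the dataset card.
splits = {
    "train": (16142, 329964),
    "validation": (2482, 48726),
    "test": (4201, 84594),
}


def avg_utterances(dialogues: int, utterances: int) -> float:
    """Average number of utterances per dialogue, rounded to two decimals."""
    return round(utterances / dialogues, 2)


per_split = {name: avg_utterances(d, u) for name, (d, u) in splits.items()}
total_dialogues = sum(d for d, _ in splits.values())
total_utterances = sum(u for _, u in splits.values())
overall = avg_utterances(total_dialogues, total_utterances)

print(per_split)   # {'train': 20.44, 'validation': 19.63, 'test': 20.14}
print(total_dialogues, total_utterances, overall)   # 22825 463284 20.3
```

The totals match the "all" row of the Data Splits table (22,825 dialogues and 463,284 utterances).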

Annotations:

  • Dialogue acts, state, db_results, service_call, active_intent, requested_slots.

Citation:

@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}
