ConvLab/sgd

Name: ConvLab/sgd
Creator: ConvLab
Published: 2024-05-08 13:01:30
License: CC BY-SA 4.0

Source: Hugging Face (updated 2024-05-08, indexed 2024-03-04)

Download link:

https://hf-mirror.com/datasets/ConvLab/sgd


Resource description:

---
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: SGD
size_categories:
- 10K<n<100K
task_categories:
- conversational
---

# Dataset Card for Schema-Guided Dialogue

- **Repository:** https://github.com/google-research-datasets/dstc8-schema-guided-dialogue
- **Paper:** https://arxiv.org/pdf/1909.05855.pdf
- **Leaderboard:** None
- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com)

To use this dataset, you need to install [ConvLab-3](https://github.com/ConvLab/ConvLab-3) platform first. Then you can load the dataset via:

```
from convlab.util import load_dataset, load_ontology, load_database

dataset = load_dataset('sgd')
ontology = load_ontology('sgd')
database = load_database('sgd')
```

For more usage please refer to [here](https://github.com/ConvLab/ConvLab-3/tree/master/data/unified_datasets).

### Dataset Summary

The **Schema-Guided Dialogue (SGD)** dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 20 domains, such as banks, events, media, calendar, travel, and weather. For most of these domains, the dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. The wide range of available annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks for developing large-scale virtual assistants. Additionally, the dataset contains unseen domains and services in the evaluation set to quantify the performance in zero-shot or few-shot settings.

- **How to get the transformed data from original data:**
  - Download [dstc8-schema-guided-dialogue-master.zip](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue/archive/refs/heads/master.zip).
  - Run `python preprocess.py` in the current directory.
- **Main changes of the transformation:**
  - Lower case original `act` as `intent`.
  - Add `count` slot for each domain, non-categorical, find span by text matching.
  - Categorize `dialogue acts` according to the `intent`.
  - Concatenate multiple values using `|`.
  - Retain `active_intent`, `requested_slots`, `service_call`.
- **Annotations:**
  - dialogue acts, state, db_results, service_call, active_intent, requested_slots.

### Supported Tasks and Leaderboards

NLU, DST, Policy, NLG, E2E

### Languages

English

### Data Splits

| split      | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
| ---------- | --------- | ---------- | ------- | ---------- | ----------- | --------------------- | -------------------- | ---------------------------- | ------------------------------- |
| train      | 16142     | 329964     | 20.44   | 9.75       | 1.84        | 100                   | -                    | 100                          | 100                             |
| validation | 2482      | 48726      | 19.63   | 9.66       | 1.84        | 100                   | -                    | 100                          | 100                             |
| test       | 4201      | 84594      | 20.14   | 10.4       | 2.02        | 100                   | -                    | 100                          | 100                             |
| all        | 22825     | 463284     | 20.3    | 9.86       | 1.87        | 100                   | -                    | 100                          | 100                             |

45 domains: ['Banks_1', 'Buses_1', 'Buses_2', 'Calendar_1', 'Events_1', 'Events_2', 'Flights_1', 'Flights_2', 'Homes_1', 'Hotels_1', 'Hotels_2', 'Hotels_3', 'Media_1', 'Movies_1', 'Music_1', 'Music_2', 'RentalCars_1', 'RentalCars_2', 'Restaurants_1', 'RideSharing_1', 'RideSharing_2', 'Services_1', 'Services_2', 'Services_3', 'Travel_1', 'Weather_1', 'Alarm_1', 'Banks_2', 'Flights_3', 'Hotels_4', 'Media_2', 'Movies_2', 'Restaurants_2', 'Services_4', 'Buses_3', 'Events_3', 'Flights_4', 'Homes_2', 'Media_3', 'Messaging_1', 'Movies_3', 'Music_3', 'Payment_1', 'RentalCars_3', 'Trains_1']

- **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
- **non-cat slot span**: how many values of non-categorical slots have span annotation in percentage.

### Citation

```
@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}
```

### Licensing Information

[**CC BY-SA 4.0**](https://creativecommons.org/licenses/by-sa/4.0/)
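
The two coverage statistics defined under Data Splits, cat slot match and non-cat slot span, are plain percentages. The following is a minimal sketch of how such percentages could be computed. It assumes a hypothetical flat representation: the function names, the `(slot, value)` pairs, the `start`/`end` keys, and the toy ontology are illustrative and are not taken from ConvLab-3's preprocessing code.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def cat_slot_match(
    values: Iterable[Tuple[str, str]],
    possible_values: Dict[str, Set[str]],
) -> Optional[float]:
    """Percentage of categorical (slot, value) pairs whose value is listed in the ontology."""
    pairs = list(values)
    if not pairs:
        return None  # no values to score
    hits = sum(1 for slot, value in pairs if value in possible_values.get(slot, set()))
    return 100.0 * hits / len(pairs)


def non_cat_slot_span(annotations: Iterable[dict]) -> Optional[float]:
    """Percentage of non-categorical values that carry both a start and an end offset."""
    items = list(annotations)
    if not items:
        return None  # no values to score
    with_span = sum(1 for item in items if "start" in item and "end" in item)
    return 100.0 * with_span / len(items)


# Hypothetical toy data, not drawn from the dataset.
toy_ontology = {"price_range": {"cheap", "moderate", "expensive"}}
toy_values = [("price_range", "cheap"), ("price_range", "pricey")]
print(cat_slot_match(toy_values, toy_ontology))  # 50.0
```

A value of 100 in the table would correspond to every scored value passing the respective check.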

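Provided by:

ConvLab

Summary of original information

Dataset overview

Dataset name: Schema-Guided Dialogue (SGD)

Dataset size: more than 20,000 annotated multi-domain, task-oriented conversations between a human and a virtual assistant.

Domain coverage: 20 domains, such as banks, events, media, calendar, travel, and weather.

Intended uses: intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning.

Language: English

Dataset transformation:

The original act is lowercased and stored as intent.
A non-categorical count slot is added for each domain; its span is found by text matching.
Dialogue acts are categorized according to the intent.
Multiple values are concatenated using |.
active_intent, requested_slots, and service_call are retained.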
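
Two of the transformation changes listed above are easy to illustrate: lowercasing the original act to obtain the intent, and joining multiple values with `|`. This is a minimal sketch; the helper names (`act_to_intent`, `join_values`, `split_values`) and the sample values are hypothetical and are not taken from the ConvLab-3 `preprocess.py`.

```python
from typing import List

VALUE_SEPARATOR = "|"


def act_to_intent(act: str) -> str:
    """Lowercase an original act label to obtain the intent name."""
    return act.lower()


def join_values(values: List[str]) -> str:
    """Concatenate several values of one slot into a single string."""
    return VALUE_SEPARATOR.join(values)


def split_values(joined: str) -> List[str]:
    """Recover the individual values from a concatenated string."""
    return joined.split(VALUE_SEPARATOR)


print(act_to_intent("INFORM"))             # inform
print(join_values(["6 pm", "6:30 pm"]))    # 6 pm|6:30 pm
print(split_values("6 pm|6:30 pm"))        # ['6 pm', '6:30 pm']
```

Note that splitting on `|` only round-trips cleanly when no individual value contains the separator itself.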

Data splits:

Training set: 16,142 dialogues, 329,964 utterances
Validation set: 2,482 dialogues, 48,726 utterances
Test set: 4,201 dialogues, 84,594 utterances
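
As a quick consistency check, the three splits should add up to the "all" row of the dataset card's split table (22,825 dialogues, 463,284 utterances). A small sketch follows, with the counts copied from the list above and variable names chosen purely for illustration:

```python
# Split sizes copied from the split statistics above.
splits = {
    "train": {"dialogues": 16142, "utterances": 329964},
    "validation": {"dialogues": 2482, "utterances": 48726},
    "test": {"dialogues": 4201, "utterances": 84594},
}

total_dialogues = sum(s["dialogues"] for s in splits.values())
total_utterances = sum(s["utterances"] for s in splits.values())
print(total_dialogues, total_utterances)  # 22825 463284

# Share of dialogues that falls into each split.
shares = {
    name: 100.0 * s["dialogues"] / total_dialogues for name, s in splits.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of dialogues")
```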

Supported tasks: NLU, DST, Policy, NLG, E2E

License: CC BY-SA 4.0

Dataset details

Dataset transformation steps:

Download dstc8-schema-guided-dialogue-master.zip.
Run python preprocess.py to preprocess the data.

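Detailed statistics:

Average dialogue length (utterances per dialogue): train 20.44, validation 19.63, test 20.14
Average utterance length (tokens per utterance): train 9.75, validation 9.66, test 10.4
Average number of domains per dialogue: train 1.84, validation 1.84, test 2.02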
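
The reported averages can be cross-checked against the split counts. The sketch below assumes that average dialogue length is utterances divided by dialogues, that average utterance length is a per-utterance mean, and that the average number of domains is a per-dialogue mean. These are assumptions about how the statistics were computed, although they do reproduce the dataset card's "all" row (20.3, 9.86, 1.87).

```python
# (dialogues, utterances) per split, copied from the split statistics.
COUNTS = {"train": (16142, 329964), "validation": (2482, 48726), "test": (4201, 84594)}
AVG_TOKENS = {"train": 9.75, "validation": 9.66, "test": 10.4}
AVG_DOMAINS = {"train": 1.84, "validation": 1.84, "test": 2.02}


def weighted_mean(values: dict, weights: dict) -> float:
    """Mean of per-split values, weighted by the per-split counts."""
    total_weight = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total_weight


dialogues = {k: d for k, (d, _) in COUNTS.items()}
utterances = {k: u for k, (_, u) in COUNTS.items()}

# Per-split average dialogue length, derived from the raw counts.
avg_utt = {k: round(u / d, 2) for k, (d, u) in COUNTS.items()}
print(avg_utt)  # {'train': 20.44, 'validation': 19.63, 'test': 20.14}

# Overall figures for the whole dataset.
overall_utt = sum(utterances.values()) / sum(dialogues.values())
overall_tokens = weighted_mean(AVG_TOKENS, utterances)   # weighted by utterances
overall_domains = weighted_mean(AVG_DOMAINS, dialogues)  # weighted by dialogues
print(round(overall_utt, 2), round(overall_tokens, 2), round(overall_domains, 2))
# 20.3 9.86 1.87
```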

Annotations:

Dialogue acts, state, db_results, service_call, active_intent, requested_slots.

Citation:

@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}
