five

jpcorb20/multidogo

收藏
Hugging Face2022-10-20 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/jpcorb20/multidogo
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - crowdsourced language_creators: - crowdsourced language: - en license: - other multilinguality: - monolingual pretty_name: multidogo size_categories: - 10k<n<100k source_datasets: - original task_categories: - text-classification - sequence-modeling - structure-prediction - other task_ids: - intent-classification - dialogue-modeling - slot-filling - named-entity-recognition - other-other-my-task-description --- MultiDoGo dialog dataset: - paper: https://aclanthology.org/D19-1460/ - git repo: https://github.com/awslabs/multi-domain-goal-oriented-dialogues-dataset *Abstract* The need for high-quality, large-scale, goal-oriented dialogue datasets continues to grow as virtual assistants become increasingly wide-spread. However, publicly available datasets useful for this area are limited either in their size, linguistic diversity, domain coverage, or annotation granularity. In this paper, we present strategies toward curating and annotating large scale goal oriented dialogue data. We introduce the MultiDoGO dataset to overcome these limitations. With a total of over 81K dialogues harvested across six domains, MultiDoGO is over 8 times the size of MultiWOZ, the other largest comparable dialogue dataset currently available to the public. Over 54K of these harvested conversations are annotated for intent classes and slot labels. We adopt a Wizard-of-Oz approach wherein a crowd-sourced worker (the “customer”) is paired with a trained annotator (the “agent”). The data curation process was controlled via biases to ensure a diversity in dialogue flows following variable dialogue policies. We provide distinct class label tags for agents vs. customer utterances, along with applicable slot labels. We also compare and contrast our strategies on annotation granularity, i.e. turn vs. sentence level. Furthermore, we compare and contrast annotations curated by leveraging professional annotators vs the crowd. We believe our strategies for eliciting and annotating such a dialogue dataset scales across modalities and domains and potentially languages in the future. To demonstrate the efficacy of our devised strategies we establish neural baselines for classification on the agent and customer utterances as well as slot labeling for each domain. ## Licensing information Community Data License Agreement – Permissive, Version 1.0.
提供机构:
jpcorb20
原始信息汇总

数据集概述

基本信息

  • 名称: MultiDoGo
  • 语言: 英语 (en)
  • 许可证: 其他
  • 多语言性: 单语
  • 大小: 10k<n<100k
  • 数据来源: 原始数据

创建者

  • 标注创建者: 众包
  • 语言创建者: 众包

任务类别

  • 文本分类
  • 序列建模
  • 结构预测
  • 其他

任务ID

  • 意图分类
  • 对话建模
  • 槽填充
  • 命名实体识别
  • 其他-其他-我的任务描述
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作