Statcan Dialogue Dataset
Source: Mendeley Data | Updated: 2024-03-27 | Indexed: 2024-06-27
Download link:
https://borealisdata.ca/citation?persistentId=doi:10.5683/SP3/NR0BMY
Description:
Welcome to the data repository for requesting access to the Statcan Dialogue Dataset! Before requesting access, you can visit our website or read our EACL 2023 paper.

### Requesting Access
In order to use our dataset, you must agree to the terms of use and restrictions before requesting access (see below). We will manually review each request and either grant access or reach out to you for further information. To facilitate the process, make sure that:
1. Your Dataverse account is linked to your professional/research website, which we may review to ensure the dataset will be used for the intended purpose.
2. Your request is made with an academic (e.g. .edu) or professional (e.g. @servicenow.com) email. To do this, you have to set your primary email to your academic/professional email, or create a new Dataverse account.

If your academic institution's email does not end with .edu, or you are part of a professional group that does not have an email address, please contact us (see email in paper).

### Abstract
We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table.
Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.
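
The temporal data split mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `date` and `id` fields, the cutoff dates, and the `temporal_split` helper are hypothetical and do not reflect the dataset's actual schema or the authors' split boundaries. The idea is only that training, validation, and test conversations are separated by time, so the test set contains strictly later conversations than the ones seen during training.

```python
from datetime import date


def temporal_split(conversations, valid_start, test_start):
    """Split records chronologically.

    train: date < valid_start
    valid: valid_start <= date < test_start
    test:  date >= test_start
    """
    train, valid, test = [], [], []
    for conv in sorted(conversations, key=lambda c: c["date"]):
        if conv["date"] < valid_start:
            train.append(conv)
        elif conv["date"] < test_start:
            valid.append(conv)
        else:
            test.append(conv)
    return train, valid, test


# Toy records: field names and dates are illustrative, not the dataset's schema.
toy = [
    {"id": "c3", "date": date(2021, 9, 14)},
    {"id": "c1", "date": date(2020, 3, 2)},
    {"id": "c2", "date": date(2021, 1, 20)},
]

train, valid, test = temporal_split(toy, date(2021, 1, 1), date(2021, 6, 1))
print([c["id"] for c in train], [c["id"] for c in valid], [c["id"] for c in test])
```

With a split like this, a model tuned on the validation period is evaluated on a later period, which is the setting in which the abstract reports a performance drop.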
Created:
2023-06-28



