CANARD Dataset

Name: CANARD Dataset
Creator: Papers with Code
License: No description available

Indexed from paperswithcode.com on 2025-03-21

Download link:

https://paperswithcode.com/dataset/canard

Description:

CANARD is a dataset for question-in-context rewriting. It consists of questions, each given in a dialog context, together with a context-independent rewriting of the question. The context of each question is the dialog utterances that precede it. CANARD can be used to evaluate question rewriting models that handle important linguistic phenomena such as coreference and ellipsis resolution. CANARD is based on QuAC (Choi et al., 2018), a conversational reading comprehension dataset in which answers are spans selected from a given section of a Wikipedia article. Some questions in QuAC cannot be answered from their given sections; the answer 'I don't know.' is used for such questions. CANARD was constructed by crowdsourcing question rewritings on Amazon Mechanical Turk, with several automatic and manual quality controls applied to the data collection process. The dataset consists of 40,527 questions with different context lengths. More details are available in the authors' EMNLP 2019 paper. The dataset is distributed under the CC BY-SA 4.0 license.
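To make the record structure described above concrete, here is a minimal Python sketch of how one question-in-context rewriting example might be represented and turned into a model input. The class name, field names, separator string, and the sample record are all assumptions made for illustration; they are not taken from the CANARD release, whose actual file format should be checked against the official distribution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RewriteExample:
    """One question-in-context rewriting record (field names are hypothetical)."""
    history: List[str]  # dialog utterances that precede the question
    question: str       # the question as asked in context
    rewrite: str        # context-independent rewriting of the question


def build_model_input(example: RewriteExample, sep: str = " ||| ") -> str:
    """Join the preceding utterances and the question into one input string."""
    return sep.join(example.history + [example.question])


def exact_match(prediction: str, reference: str) -> bool:
    """Compare two rewrites, ignoring case and extra whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(prediction) == normalize(reference)


# Invented record for illustration only; it is NOT taken from CANARD.
example = RewriteExample(
    history=["Who wrote the novel?", "Jane Doe wrote it in 1950."],
    question="Did she write anything else?",
    rewrite="Did Jane Doe write anything else besides the novel?",
)

print(build_model_input(example))
print(exact_match("did jane doe write anything else besides the novel?",
                  example.rewrite))
```

The invented question contains a pronoun ("she") that only makes sense given the history, while the rewrite resolves it, which is the coreference phenomenon the description mentions. Exact match is used here only as a simple stand-in metric, not as the evaluation protocol of the dataset.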

Provided by:

Papers with Code
