CANARD (A Dataset for Question-in-Context Rewriting)

Name: CANARD (A Dataset for Question-in-Context Rewriting)
Creator: OpenDataLab
Published: 2026-05-24 11:30:30
License: Not specified

OpenDataLab | Updated 2026-05-24 | Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/CANARD

Description:

CANARD is a dataset for question-in-context rewriting. Each instance pairs a question asked within a dialog context with a context-independent rewrite of that question. The context of a question is the sequence of dialog utterances that precede it. CANARD can be used to evaluate question rewriting models that handle important linguistic phenomena such as coreference and ellipsis resolution. CANARD is based on QuAC (Choi et al., 2018), a conversational reading comprehension dataset in which answers are selected from a given section of a Wikipedia article. Some questions in QuAC cannot be answered from their given sections; for such questions, we use the answer "I don't know." CANARD was constructed by crowdsourcing question rewrites on Amazon Mechanical Turk. We applied several automatic and manual quality controls to ensure the quality of the data collection process. The dataset consists of 40,527 questions with varying context lengths. More details are available in our EMNLP 2019 paper. An example is provided below. The dataset is distributed under the CC BY-SA 4.0 license.
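
The structure described above (preceding utterances, an in-context question, and a context-independent rewrite) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dataset's official schema or the authors' code: the class name RewriteInstance, its field names, the separator string, and the exact_match helper are all hypothetical, and the sample record is invented for illustration rather than taken from CANARD. Exact match is used here only as a simple stand-in for comparing a predicted rewrite with a reference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RewriteInstance:
    """One question-in-context example (field names are illustrative)."""
    history: List[str]  # dialog utterances that precede the question
    question: str       # the question as asked in context
    rewrite: str        # the context-independent rewrite


def build_model_input(instance: RewriteInstance, sep: str = " ||| ") -> str:
    """Join the preceding utterances and the question into one string."""
    return sep.join(instance.history + [instance.question])


def exact_match(prediction: str, reference: str) -> bool:
    """Compare two strings, ignoring case and extra whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(prediction) == normalize(reference)


# Invented record for illustration only; not taken from the dataset.
example = RewriteInstance(
    history=["Who wrote the novel Dracula?", "Bram Stoker."],
    question="When did he write it?",
    rewrite="When did Bram Stoker write Dracula?",
)

model_input = build_model_input(example)
```

In this sketch the rewrite resolves the pronouns "he" and "it" from the history, which is the kind of coreference the description refers to. A rewriting model would receive model_input and be scored against example.rewrite.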

Provider:

OpenDataLab

Created:

2022-09-01

Collected Summary

Dataset Introduction