gaussalgo/Canard_Wiki-augmented

Name: gaussalgo/Canard_Wiki-augmented
Creator: gaussalgo
Published: 2023-04-12 13:35:37
License: no description provided

Hugging Face · Updated 2023-04-12 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/gaussalgo/Canard_Wiki-augmented

Resource description:

---
dataset_info:
  features:
  - name: History
    sequence: string
  - name: QuAC_dialog_id
    dtype: string
  - name: Question
    dtype: string
  - name: Question_no
    dtype: int64
  - name: Rewrite
    dtype: string
  - name: true_page_title
    dtype: string
  - name: true_contexts
    dtype: string
  - name: answer
    dtype: string
  - name: true_contexts_wiki
    dtype: string
  - name: extractive
    dtype: bool
  - name: retrieved_contexts
    sequence: string
  splits:
  - name: train
    num_bytes: 1353765609
    num_examples: 31526
  - name: test
    num_bytes: 252071528
    num_examples: 5571
  download_size: 231554886
  dataset_size: 1605837137
license: cc-by-sa-4.0
task_categories:
- question-answering
- conversational
- text2text-generation
language:
- en
pretty_name: Canard Wikipedia-augmented
size_categories:
- 10K<n<100K
---

# Dataset Card for Canard_Wiki-augmented

### Summary

This is a dataset of fact-retrieving conversations about Wikipedia articles, with every response grounded in a specific segment of text in the referenced Wikipedia article. It is an extended version of the [Canard](https://sites.google.com/view/qanta/projects/canard) and [QuAC](https://huggingface.co/datasets/quac) datasets, augmented with contexts from [English Wikipedia](https://huggingface.co/datasets/wikipedia).

### Supported Tasks

The dataset is intended for training a factually consistent conversational model that grounds all of its responses in the corresponding source(s). The data can also be used to evaluate an information retrieval (IR) system on the given queries, to study contextual disambiguation of queries within a conversation, and for similar tasks.
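The IR evaluation use case mentioned under Supported Tasks can be sketched roughly as follows. This is a minimal illustration, not part of the dataset's tooling: it assumes a retriever that returns ranked page titles, and it scores a query as a hit when the sample's `true_page_title` appears in the top `k`. The `recall_at_k` helper, the `toy_retriever` function, and the toy records are all hypothetical; only the `Rewrite` and `true_page_title` field names come from the dataset metadata.

```python
from typing import Callable, Iterable, List, Mapping


def recall_at_k(
    samples: Iterable[Mapping[str, object]],
    retrieve_titles: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Fraction of samples whose gold page title is among the top-k retrieved titles."""
    hits = 0
    total = 0
    for sample in samples:
        # Use the disambiguated question as the query (an assumption of this sketch).
        ranked = retrieve_titles(str(sample["Rewrite"]))[:k]
        hits += int(sample["true_page_title"] in ranked)
        total += 1
    return hits / total if total else 0.0


# Toy records that mimic only the two fields used here (not real dataset rows).
toy_samples = [
    {"Rewrite": "When was Page A founded?", "true_page_title": "Page A"},
    {"Rewrite": "Who leads Page B?", "true_page_title": "Page B"},
]


def toy_retriever(query: str) -> List[str]:
    """Hypothetical retriever standing in for a real IR system."""
    if "Page A" in query:
        return ["Page C", "Page A"]
    return ["Page C", "Page D"]


print(recall_at_k(toy_samples, toy_retriever, k=2))
```

With a real retriever, `toy_samples` would be replaced by a loaded split of the dataset; matching on page titles is only one possible relevance criterion.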

## Dataset Structure

The dataset can be loaded by choosing a split (`train` or `test`) and calling:

```python
import datasets

# Load the test split; use split="train" for the training split.
canard_augm_test = datasets.load_dataset("gaussalgo/Canard_Wiki-augmented", split="test")

print(canard_augm_test[0])  # print the first sample
```

### Data Instances

The samples of Canard_Wiki-augmented have this format:

```python
{'History': ['Anna Politkovskaya', 'The murder remains unsolved, 2016'],
 'QuAC_dialog_id': 'C_0aaa843df0bd467b96e5a496fc0b033d_1',
 'Question': 'Did they have any clues?',
 'Question_no': 1,
 'answer': 'Her colleagues at Novaya gazeta protested that until the instigator or sponsor of the crime was identified, arrested and prosecuted the case was not closed.',
 'Rewrite': 'Did investigators have any clues in the unresolved murder of Anna Politkovskaya?',
 'true_page_title': 'Anna Politkovskaya',
 'true_contexts': 'In September 2016 Vladimir Markin, official spokesman for (...)',
 'true_contexts_wiki': 'Anna Stepanovna Politkovskaya was a US-born Russian journalist (...)',
 'extractive': True,
 'retrieved_contexts': ['Clues was an indie rock band from Montreal, Canada formed by Alden Penner (...)',
                        'High Stakes is a British game show series hosted by Jeremy Kyle, in which (...)']}
```

### Data Fields

* **History**: History of the conversation from Canard. The first two entries of the conversation are always synthetic.
* **QuAC_dialog_id**: Dialogue ID mapping the conversation to the original QuAC dataset (*dialogue_id* in QuAC).
* **Question**: The user's current question, from Canard.
* **Question_no**: Position of the user's question within the conversation, originally from Canard.
* **answer**: Correctly extracted answer to the given question from the relevant Wikipedia article (*true_contexts*). Note that some of the questions are open, so the listed answer is not the only correct possibility.
* **Rewrite**: A rephrased version of *Question*, manually disambiguated from the context of *History* by the annotators of Canard.
* **true_page_title**: Title of the Wikipedia article containing *answer* (*wikipedia_page_title* in QuAC).
* **true_contexts**: An excerpt of the paragraph containing the answer, taken from the Wikipedia article titled *true_page_title*.
* **true_contexts_wiki**: The full contents of the Wikipedia article (*text* in the Wikipedia dataset) whose Wikipedia *title* matches *true_page_title*. Note that the Wikipedia dataset was retrieved on 2 April 2023.
* **extractive**: A flag indicating whether the *answer* in this sample can be found as an exact match in *true_contexts_wiki*.
* **retrieved_contexts**: "Distractor" contexts retrieved from the full Wikipedia dataset by the Okapi BM25 IR system, using the *Rewrite* question as the query.

### Data Splits

* The **train** split is aligned with the training splits of Canard and QuAC.
* The **test** split matches the validation split of QuAC and the test split of Canard (where the conversation IDs match).

## Licensing

This dataset is composed of [QuAC](https://huggingface.co/datasets/quac) (MIT), [Canard](https://sites.google.com/view/qanta/projects/canard) (CC BY-SA 4.0) and [Wikipedia](https://huggingface.co/datasets/wikipedia) (CC BY-SA 3.0). Canard_Wiki-augmented is therefore also licensed under CC BY-SA 4.0, which permits commercial use.

## Cite

If you use this dataset in your research, please cite the authors of the original datasets from which Canard_Wiki-augmented is derived: [QuAC](https://huggingface.co/datasets/quac) and [Canard](https://sites.google.com/view/qanta/projects/canard).
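
As a rough illustration of how the fields described under Data Fields fit together, the sketch below recomputes an `extractive`-style flag and assembles a grounded model input that mixes the true context with distractors. It is a minimal sketch under stated assumptions: the helper names `is_extractive` and `build_grounded_input`, the prompt layout, and the toy record are hypothetical, and the plain substring test is only an assumption about how an exact match might be checked, not a statement of how the dataset authors computed the flag.

```python
from typing import List, Mapping


def is_extractive(sample: Mapping[str, object]) -> bool:
    """Check whether `answer` occurs verbatim in `true_contexts_wiki` (plain substring test)."""
    return str(sample["answer"]) in str(sample["true_contexts_wiki"])


def build_grounded_input(sample: Mapping[str, object], max_distractors: int = 2) -> str:
    """Combine the true context, a few distractor contexts, the history and the question."""
    distractors: List[str] = list(sample["retrieved_contexts"])[:max_distractors]
    contexts = [str(sample["true_contexts"])] + distractors
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    history = " | ".join(sample["History"])
    return f"Contexts:\n{numbered}\nHistory: {history}\nQuestion: {sample['Question']}"


# Toy record that mimics the field names only (not a real dataset row).
toy_sample = {
    "History": ["Example Page", "Example section"],
    "Question": "When was it founded?",
    "answer": "founded in 1990",
    "true_contexts": "The club was founded in 1990 by volunteers.",
    "true_contexts_wiki": "Example Page is a club. The club was founded in 1990 by volunteers.",
    "retrieved_contexts": ["Unrelated passage one.", "Unrelated passage two.", "Unrelated passage three."],
}

print(is_extractive(toy_sample))
print(build_grounded_input(toy_sample))
```

In this sketch the true context is always listed first; for training, shuffling the context order would likely be preferable so that a model cannot rely on position.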

Provided by:

gaussalgo

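Summary of original information

Dataset overview

Dataset name

Canard Wikipedia-augmented

Dataset features

History: sequence of strings; the history of the conversation.
QuAC_dialog_id: string; dialogue ID mapping to the original QuAC dataset.
Question: string; the user's current question.
Question_no: integer; the position of the user's question in the conversation.
Rewrite: string; a rephrased version of the question.
true_page_title: string; title of the Wikipedia article containing the answer.
true_contexts: string; excerpt of the paragraph in the Wikipedia article that contains the answer.
answer: string; the correctly extracted answer to the question.
true_contexts_wiki: string; the full text of the Wikipedia article.
extractive: boolean; indicates whether the answer can be found as an exact match in true_contexts_wiki.
retrieved_contexts: sequence of strings; "distractor" contexts retrieved from the full Wikipedia dataset using the Okapi BM25 IR system.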

Dataset structure

Training set (train): 31526 examples, 1353765609 bytes.
Test set (test): 5571 examples, 252071528 bytes.

License

CC BY-SA 4.0

Language

English (en)

Task categories

Question answering
Conversational
Text-to-text generation

Dataset size category

10K<n<100K
