gaussalgo/Canard_Wiki-augmented
Source: Hugging Face. Updated 2023-04-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/gaussalgo/Canard_Wiki-augmented
Resource description:
---
dataset_info:
  features:
  - name: History
    sequence: string
  - name: QuAC_dialog_id
    dtype: string
  - name: Question
    dtype: string
  - name: Question_no
    dtype: int64
  - name: Rewrite
    dtype: string
  - name: true_page_title
    dtype: string
  - name: true_contexts
    dtype: string
  - name: answer
    dtype: string
  - name: true_contexts_wiki
    dtype: string
  - name: extractive
    dtype: bool
  - name: retrieved_contexts
    sequence: string
  splits:
  - name: train
    num_bytes: 1353765609
    num_examples: 31526
  - name: test
    num_bytes: 252071528
    num_examples: 5571
  download_size: 231554886
  dataset_size: 1605837137
license: cc-by-sa-4.0
task_categories:
- question-answering
- conversational
- text2text-generation
language:
- en
pretty_name: Canard Wikipedia-augmented
size_categories:
- 10K<n<100K
---
# Dataset Card for Canard_Wiki-augmented
### Summary
This is a dataset of fact-retrieving conversations about Wikipedia articles, with all responses grounded in a specific segment of text in the referenced Wikipedia article.
It extends the [Canard](https://sites.google.com/view/qanta/projects/canard)
and [QuAC](https://huggingface.co/datasets/quac) datasets,
augmenting them with contexts from [English Wikipedia](https://huggingface.co/datasets/wikipedia).
### Supported Tasks
The dataset is intended for training a factually consistent conversational model that grounds all of its responses in the corresponding source(s).
The data can also be used to evaluate an information retrieval (IR) system on the given queries, to study contextual disambiguation of queries within a conversation, etc.
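As a rough illustration of the IR-evaluation use case, the sketch below computes recall@k for a retriever. It assumes a hypothetical setup in which each query (for example, a *Rewrite* question) has one gold page title (*true_page_title*) and the retriever returns a ranked list of page titles. The function name `recall_at_k`, the query ids, and the toy titles are made up for illustration and are not part of the dataset or of any library.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold title appears among the top-k retrieved titles."""
    if not gold:
        return 0.0
    hits = sum(
        1 for query, title in gold.items() if title in ranked.get(query, [])[:k]
    )
    return hits / len(gold)


# Toy data (made up for illustration): query id -> gold page title
gold_titles = {"q1": "Page A", "q2": "Page B"}

# Hypothetical retriever output: query id -> ranked page titles
retrieved_titles = {
    "q1": ["Page C", "Page A", "Page D"],
    "q2": ["Page D", "Page C", "Page A"],
}

print(recall_at_k(retrieved_titles, gold_titles, k=1))  # 0.0
print(recall_at_k(retrieved_titles, gold_titles, k=2))  # 0.5
```

Other rank-based metrics (for example, mean reciprocal rank) could be computed from the same two mappings.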
## Dataset Structure
The dataset can be loaded by simply choosing a split (`train` or `test`) and calling:
```python
import datasets
canard_augm_test = datasets.load_dataset("gaussalgo/Canard_Wiki-augmented", split="test")
print(canard_augm_test[0]) # print the first sample
```
### Data Instances
The samples of Canard_Wiki-augmented have this format:
```python
{'History': ['Anna Politkovskaya', 'The murder remains unsolved, 2016'],
 'QuAC_dialog_id': 'C_0aaa843df0bd467b96e5a496fc0b033d_1',
 'Question': 'Did they have any clues?',
 'Question_no': 1,
 'answer': 'Her colleagues at Novaya gazeta protested that until the instigator or sponsor of the crime was identified, arrested and prosecuted the case was not closed.',
 'Rewrite': 'Did investigators have any clues in the unresolved murder of Anna Politkovskaya?',
 'true_page_title': 'Anna Politkovskaya',
 'true_contexts': 'In September 2016 Vladimir Markin, official spokesman for (...)',
 'true_contexts_wiki': 'Anna Stepanovna Politkovskaya was a US-born Russian journalist (...)',
 'extractive': True,
 'retrieved_contexts': ['Clues was an indie rock band from Montreal, Canada formed by Alden Penner (...)',
                        'High Stakes is a British game show series hosted by Jeremy Kyle, in which (...)']}
```
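For training a grounded conversational model, a sample has to be flattened into a single input string. The sketch below shows one possible way to do that; the helper `build_model_input`, the `" ||| "` separator, and the `history:` / `question:` / `context:` prefixes are assumptions made for illustration, not a format prescribed by the dataset.

```python
from typing import Any, Dict


def build_model_input(sample: Dict[str, Any], sep: str = " ||| ") -> str:
    """Join the conversation history, the current question and the grounding context."""
    history = sep.join(sample["History"])
    return (
        f"history: {history}\n"
        f"question: {sample['Question']}\n"
        f"context: {sample['true_contexts']}"
    )


# Abbreviated copy of the sample shown above (only the fields used here)
sample = {
    "History": ["Anna Politkovskaya", "The murder remains unsolved, 2016"],
    "Question": "Did they have any clues?",
    "true_contexts": "In September 2016 Vladimir Markin, official spokesman for (...)",
}

print(build_model_input(sample))
```

In such a setup the training target would be the sample's *answer* field, and *Rewrite* could replace *Question* when a self-contained query is preferred.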
### Data Fields
* **History**: History of the conversation from Canard. The first two entries of the conversation are always synthetic.
* **QuAC_dialog_id**: Dialogue ID mapping the conversation to the original QuAC dataset (*dialogue_id* in QuAC).
* **Question**: Current question of the user from Canard.
* **Question_no**: Position of the user's question within the conversation, originally from Canard.
* **answer**: Correctly extracted answer to a given question from a relevant Wikipedia article (*true_contexts*). Note that some of the questions are open-ended, so the listed answer is not the only correct one.
* **Rewrite**: A rephrased version of *Question*, manually disambiguated from the context of *History* by the annotators of Canard.
* **true_page_title**: Title of the Wikipedia article containing *answer*. Corresponds to *wikipedia_page_title* in QuAC.
* **true_contexts**: An excerpt of the paragraph with an answer from the Wikipedia article titled *true_page_title*.
* **true_contexts_wiki**: The full contents of the Wikipedia article (*text* in the Wikipedia dataset) whose Wikipedia *title* matches *true_page_title*. Note that the Wikipedia dataset was retrieved on April 2, 2023.
* **extractive**: A flag indicating whether the *answer* in this sample can be found as an exact match in *true_contexts_wiki*.
* **retrieved_contexts**: "Distractor" contexts retrieved from the full Wikipedia dataset by an Okapi BM25 IR system, using the *Rewrite* question as the query.
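To make the role of the distractors more concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the retrieval pipeline used to build the dataset: the tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the helper names `tokenize` and `bm25_scores` are assumptions chosen for illustration. The three toy documents are the opening words of contexts from the sample above.

```python
import math
from collections import Counter
from typing import List

PUNCT = ".,!?;:()'\""


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer that strips surrounding punctuation (a simplification)."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [t for t in tokens if t]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = tokenize(query)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Anna Stepanovna Politkovskaya was a US-born Russian journalist",
    "Clues was an indie rock band from Montreal, Canada formed by Alden Penner",
    "High Stakes is a British game show series hosted by Jeremy Kyle",
]
query = "Did investigators have any clues in the unresolved murder of Anna Politkovskaya?"

scores = bm25_scores(query, docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

In this toy run the article about the band Clues receives a non-zero score only because it shares the word "clues" with the query, which illustrates how a lexical retriever can surface topically unrelated distractors.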
### Data Splits
* **train** split is aligned with the training splits of Canard and QuAC.
* **test** split matches the validation split of QuAC and the test split of Canard (where the conversation ids match).
## Licensing
This dataset is composed of [QuAC](https://huggingface.co/datasets/quac) (MIT),
[Canard](https://sites.google.com/view/qanta/projects/canard) (CC BY-SA 4.0)
and [Wikipedia](https://huggingface.co/datasets/wikipedia) (CC BY-SA 3.0).
Canard_Wiki-augmented is therefore licensed under CC BY-SA 4.0 as well, which also permits commercial use.
## Cite
If you use this dataset in your research, please also cite the authors of the original datasets from which Canard_Wiki-augmented is derived:
[QuAC](https://huggingface.co/datasets/quac), [Canard](https://sites.google.com/view/qanta/projects/canard).
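Provider:
gaussalgo
Summary of original information
Dataset overview
Dataset name
- Canard Wikipedia-augmented
Dataset features
- History: sequence of strings; the history of the conversation.
- QuAC_dialog_id: string; dialogue ID mapping to the original QuAC dataset.
- Question: string; the user's current question.
- Question_no: integer; the position of the user's question.
- Rewrite: string; a rephrased version of the question.
- true_page_title: string; title of the Wikipedia article containing the answer.
- true_contexts: string; excerpt of the paragraph in the Wikipedia article that contains the answer.
- answer: string; the correctly extracted answer to the question.
- true_contexts_wiki: string; the full contents of the Wikipedia article.
- extractive: boolean; indicates whether the answer can be found as an exact match in `true_contexts_wiki`.
- retrieved_contexts: sequence of strings; "distractor" contexts retrieved from the full Wikipedia dataset using the Okapi BM25 IR system.
Dataset structure
- Training set (`train`): 31526 samples, 1353765609 bytes.
- Test set (`test`): 5571 samples, 252071528 bytes.
License
- CC BY-SA 4.0
Language
- English (`en`)
Task categories
- Question answering
- Conversational
- Text-to-text generation
Dataset size category
- 10K<n<100K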



