castorini/odqa-wiki-corpora

Name: castorini/odqa-wiki-corpora
Creator: castorini
Published: 2023-01-05 21:32:51
License: cc-by-sa-3.0

Source: Hugging Face. Updated: 2023-01-05. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/castorini/odqa-wiki-corpora

Resource description:

---
annotations_creators:
- no-annotation
language:
- en
language_creators: []
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
pretty_name: Open-Domain Question Answering Wikipedia Corpora
size_categories: []
source_datasets: []
tags: []
task_categories:
- question-answering
- text-retrieval
task_ids:
- open-domain-qa
---

# Dataset Card for Open-Domain Question Answering Wikipedia Corpora

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Source Data](#source-data)

## Dataset Description

### Dataset Summary

The Wikipedia corpus variants provided can serve as knowledge sources for question-answering systems based on a retriever–reader pipeline. These corpus variants and their corresponding experiments are described further in the paper entitled:

> Pre-Processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering.

## Dataset Structure

### Data Fields

The dataset consists of passages that have been segmented from Wikipedia articles. For each passage, the following fields are provided:

- `docid`: The passage id in the format of (X#Y) where passages from the same article share the same X, but Y denotes the segment id within the article
- `title`: The title of the article from where the passage comes
- `text`: The text content of the passage

### Data Splits

There are 6 corpus variants in total:

- `wiki-text-100w-karpukhin`: The original DPR Wikipedia corpus with non-overlapping passages, each 100 words long, from Karpukhin et al.,

  > Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. [Dense Passage Retrieval for Open-Domain Question Answering](https://www.aclweb.org/anthology/2020.emnlp-main.550/). _Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)_, pages 6769-6781, 2020.

- `wiki-text-100w-tamber`: Our replication of the above corpus
- `wiki-text-6-3-tamber`: A corpus similar to above i.e. without tables, infoboxes, and lists. Segmentation is done differently, with a passage size of 6 sentences and a stride of 3 sentences. Note, this means that passages are overlapped.
- `wiki-text-8-4-tamber`: Like wiki-text-6-3, but with a passage size of 8 sentences and a stride of 4 sentences.
- `wiki-all-6-3-tamber`: A corpus with tables, infoboxes, and lists included with a passage size of 6 sentences and a stride of 3 sentences.
- `wiki-all-8-4-tamber`: Like wiki-all-6-3, but with a passage size of 8 sentences and a stride of 4 sentences.

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

We start with downloading the full December 20, 2018 Wikipedia XML dump: `enwiki-20181220-pages-articles.xml` from the Internet Archive: https://archive.org/details/enwiki-20181220. This is then pre-processed by WikiExtractor: https://github.com/attardi/wikiextractor (making sure to modify the code to include lists as desired and replacing any tables with the string "TABLETOREPLACE") and DrQA: https://github.com/facebookresearch/DrQA/tree/main/scripts/retriever (again making sure to modify the code to not remove lists as desired). We then apply the [pre-processing script](https://github.com/castorini/pyserini/blob/master/docs/experiments-wiki-corpora.md) we make available in [Pyserini](https://github.com/castorini/pyserini) to generate the different corpus variants.

Provided by:

castorini

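Summary of Original Information

Dataset Overview

Dataset Name

Name: Open-Domain Question Answering Wikipedia Corpora
Alias: Open-Domain QA Wikipedia Corpora

Language and License

Language: English (en)
License: CC-BY-SA-3.0

Multilinguality

Type: Monolingual

Task Categories

Task categories:
- Question answering
- Text retrieval
Task ID: Open-domain question answering (open-domain-qa)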

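Dataset Structure

Data fields:
- docid: Passage ID in the format (X#Y), where passages from the same article share the same X and Y is the segment ID within that article.
- title: Title of the article the passage comes from.
- text: Text content of the passage.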
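
As an illustration of the docid convention, the following minimal sketch splits an ID into its article and segment parts and groups passages by article. It assumes the parentheses in "(X#Y)" are descriptive and that stored IDs look like `X#Y`; the helper names `parse_docid` and `group_by_article` and the sample records are hypothetical, not part of the dataset or of Pyserini.

```python
from collections import defaultdict


def parse_docid(docid):
    """Split a docid of the assumed form 'X#Y' into (article_id, segment_id)."""
    # rpartition splits on the last '#', so the segment id is always the tail.
    article_id, sep, segment_id = docid.rpartition("#")
    if not sep or not article_id or not segment_id:
        raise ValueError(f"docid is not in X#Y form: {docid!r}")
    return article_id, segment_id


def group_by_article(passages):
    """Group passage records (dicts with a 'docid' key) by their article id."""
    grouped = defaultdict(list)
    for passage in passages:
        article_id, _ = parse_docid(passage["docid"])
        grouped[article_id].append(passage)
    return dict(grouped)


# Made-up records that only mimic the three documented fields.
sample = [
    {"docid": "12#0", "title": "Anarchism", "text": "First segment."},
    {"docid": "12#1", "title": "Anarchism", "text": "Second segment."},
    {"docid": "25#0", "title": "Autism", "text": "Another article."},
]
by_article = group_by_article(sample)
```

The IDs are kept as strings because the source does not state whether X and Y are always numeric.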
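Data splits:
- wiki-text-100w-karpukhin: The original DPR Wikipedia corpus, with non-overlapping passages of 100 words each.
- wiki-text-100w-tamber: A replication of the above corpus.
- wiki-text-6-3-tamber: A similar corpus without tables, infoboxes, or lists; passages are 6 sentences long with a stride of 3 sentences, so consecutive passages overlap.
- wiki-text-8-4-tamber: Like wiki-text-6-3-tamber, but with a passage size of 8 sentences and a stride of 4 sentences.
- wiki-all-6-3-tamber: A corpus that includes tables, infoboxes, and lists, with a passage size of 6 sentences and a stride of 3 sentences.
- wiki-all-8-4-tamber: Like wiki-all-6-3-tamber, but with a passage size of 8 sentences and a stride of 4 sentences.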
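
The size/stride segmentation used by the 6-3 and 8-4 variants can be sketched as a sliding window over sentences. This is only an illustration under stated assumptions: the input is already split into sentences, the last window is allowed to be shorter than `size`, and the function name `sentence_windows` is hypothetical. The actual Pyserini script may split sentences and handle article tails differently.

```python
def sentence_windows(sentences, size=6, stride=3):
    """Group sentences into passages of `size` sentences, advancing by `stride`.

    With stride < size, consecutive passages share (size - stride) sentences.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + size]))
        # Stop once a window has reached the end of the article, so the tail
        # is not emitted again as a shorter duplicate window.
        if start + size >= len(sentences):
            break
    return passages


article = [f"S{i}." for i in range(10)]  # ten placeholder sentences
six_three = sentence_windows(article, size=6, stride=3)
eight_four = sentence_windows(article, size=8, stride=4)
```

For the ten placeholder sentences, the 6-3 setting yields three passages (starting at sentences 0, 3, and 6) and the 8-4 setting yields two (starting at sentences 0 and 4).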

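Dataset Creation

Source data:
- Initial data collection: The December 20, 2018 Wikipedia XML dump, enwiki-20181220-pages-articles.xml.
- Pre-processing tools: WikiExtractor and DrQA.
- Pre-processing steps: The pre-processing script provided in Pyserini generates the different corpus variants.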

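Collected Summary

Dataset Introduction

Construction

In open-domain question answering, the quality of the knowledge source strongly affects system performance. This dataset is built from the December 20, 2018 English Wikipedia XML dump. WikiExtractor first extracts the text, with its code modified to keep lists where desired and to replace tables with the placeholder string "TABLETOREPLACE". The DrQA retriever scripts then process the output further, likewise modified so that lists are not removed. Finally, the pre-processing script available in Pyserini generates six corpus variants, covering both non-overlapping 100-word passages and overlapping sentence-window passages, to serve as knowledge sources for retriever–reader pipelines.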

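Features

The dataset provides six Wikipedia corpus variants. Two use non-overlapping passages of a fixed 100-word length; the other four segment by sentence with a sliding window, so consecutive passages overlap and keep more local context together. The wiki-text variants exclude tables, infoboxes, and lists to focus on running text, while the wiki-all variants retain these structured elements. Every passage carries a document ID, the title of its source article, and its text, which makes it straightforward to index the passages for retrieval and to trace a retrieved passage back to its article.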
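
For contrast with the sentence-window variants, fixed-length non-overlapping segmentation can be sketched as below. This is a simplified illustration, not the DPR or Pyserini implementation: it assumes whitespace tokenization and simply emits whatever words remain as a shorter final passage, and the function name `word_chunks` is hypothetical.

```python
def word_chunks(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping passages of N words."""
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()  # assumption: words are whitespace-delimited
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# A synthetic 250-word article: two full passages plus a 50-word remainder.
article_text = " ".join(f"w{i}" for i in range(250))
chunks = word_chunks(article_text)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
```

Because the windows do not overlap, every word appears in exactly one passage, whereas a sentence near a window boundary in the 6-3 and 8-4 variants appears in two.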

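Usage

In practice, the dataset serves mainly as the knowledge source for building and evaluating open-domain question answering systems. A retriever encodes the passages as vectors with an embedding model and builds an index over them, so that relevant passages can be retrieved for a user question at query time. The retrieved passages are then passed to a reading-comprehension model that extracts the answer. Because several variants are provided, the effect of different pre-processing choices (whether tables are included, passage length, degree of overlap) on system performance can be compared. A typical workflow is to load the chosen corpus variant, build a retrieval index, and plug it into a retriever–reader framework such as DPR.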
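
The retrieval step of this workflow can be sketched with a toy in-memory index. This is only a stand-in: it scores passages with simple TF-IDF term overlap in pure Python rather than a dense encoder, it is not DPR or the Pyserini API, and the class name `ToyRetriever` and the three sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


class ToyRetriever:
    """Scores passages by summed TF-IDF of the question terms they contain."""

    def __init__(self, passages):
        self.passages = passages
        # Term frequencies per passage, computed over title plus text.
        self.term_counts = [
            Counter(tokenize(p["title"] + " " + p["text"])) for p in passages
        ]
        doc_freq = Counter()
        for counts in self.term_counts:
            doc_freq.update(counts.keys())
        n = len(passages)
        self.idf = {t: math.log(1 + n / df) for t, df in doc_freq.items()}

    def search(self, question, k=2):
        """Return the top-k (score, docid) pairs for the question."""
        terms = tokenize(question)
        scored = []
        for passage, counts in zip(self.passages, self.term_counts):
            score = sum(counts[t] * self.idf.get(t, 0.0) for t in terms)
            scored.append((score, passage["docid"]))
        # Highest score first; ties broken by docid for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:k]


corpus = [
    {"docid": "0#0", "title": "France", "text": "Paris is the capital of France."},
    {"docid": "1#0", "title": "Python", "text": "Python is a programming language."},
    {"docid": "2#0", "title": "Germany", "text": "Berlin is the capital of Germany."},
]
retriever = ToyRetriever(corpus)
hits = retriever.search("What is the capital of France?", k=2)
```

In a real pipeline the returned docids would be used to fetch the passage text for the reader; swapping in a different corpus variant only changes the `corpus` list that is indexed.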

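Background and Challenges

Background Overview

In open-domain question answering (ODQA), an effective knowledge source is central to system performance. The Open-Domain Question Answering Wikipedia Corpora dataset, published by the Castorini group on Hugging Face in January 2023, provides pre-processed Wikipedia text for question-answering systems built on a retriever–reader pipeline. It is derived from the full December 20, 2018 Wikipedia XML dump and offers several variants produced with different segmentation strategies (fixed word length or sentence sliding windows) to support methods such as dense passage retrieval. Its design follows the DPR setup of Karpukhin et al.: it includes the original DPR corpus, a replication of it, and four re-segmented variants, so that the effect of corpus pre-processing on retrieval and answer accuracy can be studied.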

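Current Challenges

The dataset addresses the knowledge-retrieval problem in open-domain question answering: how to locate, quickly and accurately, the text segments relevant to a question within a very large body of unstructured text. Building it involves several difficulties. Raw Wikipedia data mixes heterogeneous structures such as tables, infoboxes, and lists, so pre-processing must balance preserving meaning against removing noise. The segmentation strategy (overlapping sentence windows versus non-overlapping word chunks) directly affects retrieval efficiency and the coherence of retrieved answers, and the best parameters have to be determined experimentally. Keeping pre-processing consistent is also hard: existing pipelines must be replicated and then improved so that the resulting corpora are both high quality and comparable.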

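Common Scenarios

Classic Use Cases

In open-domain question answering (ODQA) research, castorini/odqa-wiki-corpora is typically used as the knowledge source for systems with a retriever–reader architecture. Because it provides several variants of the Wikipedia text, researchers can evaluate how different segmentation strategies affect retrieval. For example, when training a dense passage retrieval (DPR) model, one can compare non-overlapping 100-word passages with overlapping sentence-level passages to balance retrieval precision against contextual coherence.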

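Derived and Related Work

Related work includes the Dense Passage Retrieval (DPR) model of Karpukhin et al., whose original corpus of non-overlapping 100-word passages is included here as the wiki-text-100w-karpukhin variant. The pre-processing work by Tamber et al. goes further by exploring overlapping sentence windows and the retention of structured content such as tables, infoboxes, and lists. The script that generates these corpus variants is made available in the Pyserini toolkit.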

Recent Research on the Dataset