facebook/kilt_wikipedia

Name: facebook/kilt_wikipedia
Creator: facebook
Published: 2024-01-18 11:07:33
License: No description available

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/facebook/kilt_wikipedia

Resource description:

---
paperswithcode_id: null
pretty_name: KiltWikipedia
dataset_info:
  features:
  - name: kilt_id
    dtype: string
  - name: wikipedia_id
    dtype: string
  - name: wikipedia_title
    dtype: string
  - name: text
    sequence:
    - name: paragraph
      dtype: string
  - name: anchors
    sequence:
    - name: paragraph_id
      dtype: int32
    - name: start
      dtype: int32
    - name: end
      dtype: int32
    - name: text
      dtype: string
    - name: href
      dtype: string
    - name: wikipedia_title
      dtype: string
    - name: wikipedia_id
      dtype: string
  - name: categories
    dtype: string
  - name: wikidata_info
    struct:
    - name: description
      dtype: string
    - name: enwikiquote_title
      dtype: string
    - name: wikidata_id
      dtype: string
    - name: wikidata_label
      dtype: string
    - name: wikipedia_title
      dtype: string
    - name: aliases
      sequence:
      - name: alias
        dtype: string
  - name: history
    struct:
    - name: pageid
      dtype: int32
    - name: parentid
      dtype: int32
    - name: revid
      dtype: int32
    - name: pre_dump
      dtype: bool
    - name: timestamp
      dtype: string
    - name: url
      dtype: string
  config_name: '2019-08-01'
  splits:
  - name: full
    num_bytes: 29372535718
    num_examples: 5903530
  download_size: 37318876722
  dataset_size: 29372535718
---

# Dataset Card for "kilt_wikipedia"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/facebookresearch/KILT](https://github.com/facebookresearch/KILT)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 37.32 GB
- **Size of the generated dataset:** 29.37 GB
- **Total amount of disk used:** 66.69 GB

### Dataset Summary

KILT-Wikipedia: Wikipedia pre-processed for KILT.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### 2019-08-01

- **Size of downloaded dataset files:** 37.32 GB
- **Size of the generated dataset:** 29.37 GB
- **Total amount of disk used:** 66.69 GB

An example of 'full' looks as follows.

```
{
    "anchors": {
        "end": [],
        "href": [],
        "paragraph_id": [],
        "start": [],
        "text": [],
        "wikipedia_id": [],
        "wikipedia_title": []
    },
    "categories": "",
    "history": {
        "pageid": 0,
        "parentid": 0,
        "pre_dump": true,
        "revid": 0,
        "timestamp": "",
        "url": ""
    },
    "kilt_id": "",
    "text": {
        "paragraph": []
    },
    "wikidata_info": {
        "aliases": {
            "alias": []
        },
        "description": "",
        "enwikiquote_title": "",
        "wikidata_id": "",
        "wikidata_label": "",
        "wikipedia_title": ""
    },
    "wikipedia_id": "",
    "wikipedia_title": ""
}
```

### Data Fields

The data fields are the same among all splits.

#### 2019-08-01

- `kilt_id`: a `string` feature.
- `wikipedia_id`: a `string` feature.
- `wikipedia_title`: a `string` feature.
- `text`: a dictionary feature containing:
  - `paragraph`: a `string` feature.
- `anchors`: a dictionary feature containing:
  - `paragraph_id`: an `int32` feature.
  - `start`: an `int32` feature.
  - `end`: an `int32` feature.
  - `text`: a `string` feature.
  - `href`: a `string` feature.
  - `wikipedia_title`: a `string` feature.
  - `wikipedia_id`: a `string` feature.
- `categories`: a `string` feature.
- `wikidata_info`: a struct feature containing:
  - `description`: a `string` feature.
  - `enwikiquote_title`: a `string` feature.
  - `wikidata_id`: a `string` feature.
  - `wikidata_label`: a `string` feature.
  - `wikipedia_title`: a `string` feature.
  - `aliases`: a dictionary feature containing:
    - `alias`: a `string` feature.
- `history`: a struct feature containing:
  - `pageid`: an `int32` feature.
  - `parentid`: an `int32` feature.
  - `revid`: an `int32` feature.
  - `pre_dump`: a `bool` feature.
  - `timestamp`: a `string` feature.
  - `url`: a `string` feature.

### Data Splits

| name       |    full |
|------------|--------:|
| 2019-08-01 | 5903530 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@inproceedings{fb_kilt,
  author        = {Fabio Petroni and Aleksandra Piktus and Angela Fan and
                   Patrick Lewis and Majid Yazdani and Nicola De Cao and
                   James Thorne and Yacine Jernite and Vassilis Plachouras and
                   Tim Rockt{\"a}schel and Sebastian Riedel},
  title         = {{KILT:} a {B}enchmark for {K}nowledge {I}ntensive {L}anguage {T}asks},
  journal       = {CoRR},
  archivePrefix = {arXiv},
  year          = {2020},
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@yjernite](https://github.com/yjernite) for adding this dataset.

Field descriptions (from the site's translated copy of the dataset card):

- `kilt_id`: string.
- `wikipedia_id`: string.
- `wikipedia_title`: string.
- `text`: dictionary containing:
  - `paragraph`: string; the paragraph text.
- `anchors`: sequence containing:
  - `paragraph_id`: int32; ID of the paragraph that contains the anchor.
  - `start`: int32; start position of the anchor text within the paragraph.
  - `end`: int32; end position of the anchor text within the paragraph.
  - `text`: string; display text of the anchor.
  - `href`: string; link target of the anchor.
  - `wikipedia_title`: string; title of the Wikipedia article the anchor points to.
  - `wikipedia_id`: string; ID of the Wikipedia article the anchor points to.
- `categories`: string; category information for the article.
- `wikidata_info`: struct containing:
  - `description`: string; the Wikidata description.
  - `enwikiquote_title`: string; title of the English Wikiquote entry.
  - `wikidata_id`: string; ID of the Wikidata item.
  - `wikidata_label`: string; label of the Wikidata item.
  - `wikipedia_title`: string; title of the associated Wikipedia article.
  - `aliases`: sequence containing `alias`, a string holding one alias of the item.
- `history`: struct containing:
  - `pageid`: int32; Wikipedia page ID.
  - `parentid`: int32; ID of the parent of the current revision.
  - `revid`: int32; ID of the current revision.
  - `pre_dump`: bool; flag marking the revision as pre-dump.
  - `timestamp`: string; revision timestamp.
  - `url`: string; page URL.

Provider:

facebook

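Summary of original information

Dataset overview

Dataset name: KiltWikipedia

Dataset configuration name: 2019-08-01

Dataset size:

Download size: 37.32 GB
Generated dataset size: 29.37 GB

Number of examples: 5903530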
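As a quick sanity check, the rounded sizes above can be reproduced from the raw byte counts in the dataset card metadata. This is only a sketch: it assumes the card uses decimal gigabytes (1 GB = 10^9 bytes), which is what makes the figures line up, and the average record size is a derived figure that the card itself does not state.

```python
# Byte counts copied from the dataset card metadata for config '2019-08-01'.
DOWNLOAD_BYTES = 37_318_876_722
DATASET_BYTES = 29_372_535_718
NUM_EXAMPLES = 5_903_530


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


download_gb = to_gb(DOWNLOAD_BYTES)
dataset_gb = to_gb(DATASET_BYTES)
# "Total amount of disk used" is the download plus the generated dataset.
total_gb = to_gb(DOWNLOAD_BYTES + DATASET_BYTES)
# Derived figure, not stated on the card: mean size of one generated record.
avg_bytes_per_article = DATASET_BYTES / NUM_EXAMPLES

print(f"download:      {download_gb:.2f} GB")
print(f"generated:     {dataset_gb:.2f} GB")
print(f"total on disk: {total_gb:.2f} GB")
print(f"mean record:   {avg_bytes_per_article:.0f} bytes")
```

On this reading, an average article record occupies roughly 5 kB once generated, which is useful when estimating how much of the corpus fits in memory.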

Dataset structure

Data fields

kilt_id: string
wikipedia_id: string
wikipedia_title: string
text: dictionary, containing paragraph: string
anchors: dictionary, containing
- paragraph_id: integer
- start: integer
- end: integer
- text: string
- href: string
- wikipedia_title: string
- wikipedia_id: string
categories: string
wikidata_info: struct, containing
- description: string
- enwikiquote_title: string
- wikidata_id: string
- wikidata_label: string
- wikipedia_title: string
- aliases: sequence, containing alias: string
history: struct, containing
- pageid: integer
- parentid: integer
- revid: integer
- pre_dump: boolean
- timestamp: string
- url: string
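The `anchors` field is stored as parallel lists, one entry per hyperlink, as the empty example record in the dataset card suggests. The sketch below shows one way to walk those lists and recover each link's surface text from its paragraph. It assumes that `paragraph_id` is an index into `text["paragraph"]` and that `start` and `end` are character offsets with an exclusive end; neither is stated on the card. The sample record and the `anchor_spans` helper are made up for illustration.

```python
from typing import Any, Dict, List


def anchor_spans(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair every anchor with the paragraph slice it claims to cover."""
    paragraphs = record["text"]["paragraph"]
    anchors = record["anchors"]
    spans = []
    # The anchor sub-fields are parallel lists: position i describes link i.
    for pid, start, end, text, target in zip(
        anchors["paragraph_id"],
        anchors["start"],
        anchors["end"],
        anchors["text"],
        anchors["wikipedia_title"],
    ):
        surface = paragraphs[pid][start:end]  # assumption: end is exclusive
        spans.append(
            {
                "paragraph_id": pid,
                "surface": surface,
                "target_title": target,
                "matches": surface == text,  # offsets agree with anchor text
            }
        )
    return spans


# Hypothetical record in the documented shape; IDs and text are invented.
sample = {
    "wikipedia_title": "Example",
    "text": {"paragraph": ["Example\n", "Paris is the capital of France."]},
    "anchors": {
        "paragraph_id": [1, 1],
        "start": [0, 24],
        "end": [5, 30],
        "text": ["Paris", "France"],
        "href": ["Paris", "France"],
        "wikipedia_title": ["Paris", "France"],
        "wikipedia_id": ["0001", "0002"],
    },
}

spans = anchor_spans(sample)
for span in spans:
    print(span)
```

Checking `matches` on a handful of real records is a cheap way to confirm or refute the offset assumption before relying on it.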

Data splits

Name	Number of examples
full	5903530

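Collected summary

Dataset introduction

Construction

KILT-Wikipedia is built from the English Wikipedia snapshot of 1 August 2019. A preprocessing pipeline turns raw Wikipedia pages into structured records: besides the text of each article, it keeps the hyperlink anchors between pages, the category information, and the associated Wikidata entity. Keeping this metadata alongside the text gives knowledge retrieval and reasoning tasks a consistent and complete knowledge source.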

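Usage

Researchers can load the dataset with the Hugging Face `datasets` library. Its standardized fields give direct access to text, anchors, categories, and entity information, which suits knowledge-driven downstream tasks such as open-domain question answering, entity linking, and fact checking. Its scale makes it suitable for training or evaluating language models that depend on broad background knowledge. Cite the KILT paper when using it, and keep in mind that the snapshot is fixed in time: facts that changed after August 2019 are not reflected.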
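A minimal loading sketch follows. The repository ID `facebook/kilt_wikipedia`, the config name `2019-08-01`, and the split name `full` come from this page; `load_dataset` with `streaming=True` is the standard `datasets` call for iterating without downloading the full 37 GB first. Whether this call works unchanged depends on the installed `datasets` version, so treat it as an assumption to verify. The `stream_kilt_wikipedia` and `preview` helpers are illustrative names, not part of any library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def stream_kilt_wikipedia() -> Iterable[Dict[str, Any]]:
    """Open the 'full' split lazily instead of downloading it up front."""
    # Imported here so the helper below can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "facebook/kilt_wikipedia",  # repository ID from the download link
        "2019-08-01",               # config name from the dataset card
        split="full",
        streaming=True,
    )


def preview(records: Iterable[Dict[str, Any]], n: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield a compact summary of the first `n` records."""
    for record in islice(records, n):
        paragraphs = record["text"]["paragraph"]
        # Be tolerant of a missing struct; the card does not say if it can be null.
        wikidata = record.get("wikidata_info") or {}
        yield {
            "title": record["wikipedia_title"],
            "wikidata_id": wikidata.get("wikidata_id"),
            "num_paragraphs": len(paragraphs),
            "first_paragraph": paragraphs[0].strip() if paragraphs else "",
        }
```

With network access, `for row in preview(stream_kilt_wikipedia()): print(row)` prints the first three article summaries. `preview` accepts any iterable of records, so it can also be exercised on hand-written dictionaries.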

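Background and challenges

Background

High-quality benchmark datasets matter for advancing model understanding and generation in knowledge-intensive language tasks. KILT-Wikipedia was created in 2020 under the lead of Facebook's research organization, with the goal of giving such tasks a unified, structured evaluation benchmark. Built on the August 2019 Wikipedia snapshot, its preprocessing turns a large body of unstructured text into a structured knowledge base carrying entity links, category information, and revision-history metadata. It has supported work on open-domain question answering, entity linking, and fact verification, and gives later research a reliable basis for tracing knowledge back to its source.

Current challenges

KILT-Wikipedia targets a core problem in knowledge-intensive language tasks: how a model can accurately retrieve external knowledge and use it for reasoning and generation. Building it involved several difficulties. Wikipedia is an open platform that changes continuously, with a large volume of heterogeneously structured data, so a complex pipeline is needed to extract text, anchor links, the category system, and Wikidata entities while keeping them semantically consistent with one another. Temporal consistency is also essential: the dataset has to state its snapshot date explicitly to avoid mixing facts from different points in time. Handling multilingual aliases, disambiguating entities, and keeping each knowledge unit complete all place heavy demands on the data cleaning and alignment steps.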

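Common scenarios

Typical use

In research on knowledge-intensive language tasks, KILT-Wikipedia is commonly used as the reference knowledge base behind open-domain question answering, fact verification, and entity linking. Its structured text and metadata give models a precise basis for knowledge retrieval and reasoning, and let researchers evaluate how well a model understands and applies real-world knowledge.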
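To make the "knowledge base" role concrete, the sketch below flattens article records into passages keyed by `(wikipedia_id, paragraph index)` and ranks them with a toy word-overlap score. It is an illustration only: the overlap scoring is a stand-in chosen for brevity, not the retrieval method used by KILT or by any particular system, and using the paragraph's list position as its ID is an assumption. The helper names and sample records are hypothetical.

```python
import re
from typing import Any, Dict, Iterable, List, Tuple

# (wikipedia_id, paragraph index, paragraph text)
Passage = Tuple[str, int, str]


def tokenize(text: str) -> set:
    """Lowercase and split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def to_passages(records: Iterable[Dict[str, Any]]) -> List[Passage]:
    """Turn each non-empty paragraph into a retrievable unit with provenance."""
    passages = []
    for record in records:
        for pid, paragraph in enumerate(record["text"]["paragraph"]):
            text = paragraph.strip()
            if text:
                passages.append((record["wikipedia_id"], pid, text))
    return passages


def search(query: str, passages: List[Passage], k: int = 1) -> List[Passage]:
    """Rank passages by how many query words they share (toy scoring)."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(p[2])), p) for p in passages]
    scored = [item for item in scored if item[0] > 0]
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:k]]


# Hypothetical records; IDs and text are invented for the example.
records = [
    {"wikipedia_id": "0001", "text": {"paragraph": ["Paris\n", "Paris is the capital of France."]}},
    {"wikipedia_id": "0003", "text": {"paragraph": ["Berlin\n", "Berlin is the capital of Germany."]}},
]
passages = to_passages(records)
top = search("capital of France", passages, k=1)
print(top)
```

Returning the `(wikipedia_id, paragraph index)` pair with each hit is what allows an answer to be traced back to the exact paragraph that supports it.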

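Academic problems addressed

The dataset addresses the problem of knowledge integration in natural language processing. By combining Wikipedia article text, anchor links, and Wikidata entity information, it gives model training and evaluation a single knowledge framework. It has supported the development of knowledge-enhanced language models and helps address the limits of earlier models in factual consistency, interpretability, and generalization across tasks.

Practical applications

In practice, KILT-Wikipedia can serve as the knowledge backend for intelligent assistants, search engines, and academic research tools, improving the accuracy and depth of information retrieval. Its structured knowledge supports automatic document summarization, knowledge graph construction, and intelligent tutoring systems in education, and provides dependable data for applications that rely on authoritative knowledge.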

Recent research on the dataset