florin-hf/kilt_corpus_wiki_dump2019
Source: Hugging Face. Updated 2026-02-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/florin-hf/kilt_corpus_wiki_dump2019
Resource description:
---
dataset_info:
  features:
  - name: chunk_id
    dtype: string
  - name: doc_id
    dtype: string
  - name: chunk_order
    dtype: int64
  - name: is_first
    dtype: bool
  - name: is_last
    dtype: bool
  - name: text
    dtype: string
  - name: title
    dtype: string
  - name: wikipedia_id
    dtype: string
  - name: wikipedia_title
    dtype: string
  - name: categories
    dtype: string
  - name: section
    sequence: string
  - name: section_order
    dtype: int64
  - name: num_sections
    dtype: int64
  - name: pageid
    dtype: string
  - name: parentid
    dtype: string
  - name: pre_dump
    dtype: bool
  - name: revid
    dtype: string
  - name: timestamp
    dtype: string
  - name: url
    dtype: string
  - name: wikidata_id
    dtype: string
  - name: content_format
    dtype: string
  - name: chunk_type
    dtype: string
  splits:
  - name: train
    num_bytes: 29669574567
    num_examples: 27493072
  download_size: 11435522832
  dataset_size: 29669574567
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
language:
- en
size_categories:
- 10M<n<100M
tags:
- benchmark
- wikipedia
- retrieval
- RAG
---
# KILT Corpus in Chunks: Wikipedia Corpus Chunked for RAG
A preprocessed version of the [KILT Wikipedia corpus](https://huggingface.co/datasets/facebook/kilt_wikipedia) in which each article has been split into coherent, section-aware text chunks suitable for Retrieval-Augmented Generation (RAG) pipelines.
## Dataset Summary
This dataset provides a chunked view of the KILT Wikipedia snapshot (August 2019). Starting from full Wikipedia articles, each article is first segmented by its section structure, then each section is independently chunked using a sentence-aware splitter. This design ensures that chunks never cross section boundaries and that each chunk can be precisely located within its source article.
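
Because the train split is roughly 11 GB to download, streaming may be a convenient way to inspect a few records first. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable; the helper names `stream_chunks` and `to_passage` are hypothetical, and the passage layout is just one possible choice.

```python
from itertools import islice

DATASET_ID = "florin-hf/kilt_corpus_wiki_dump2019"

def stream_chunks(limit=3):
    """Yield the first `limit` records without downloading the full corpus."""
    # Imported lazily so that to_passage() works without the library installed.
    from datasets import load_dataset

    ds = load_dataset(DATASET_ID, split="train", streaming=True)
    yield from islice(ds, limit)

def to_passage(record):
    """Render a record as its heading path followed by the chunk text."""
    return f"{record['title']}\n\n{record['text']}"

# Example usage (needs network access and the `datasets` package):
# for rec in stream_chunks():
#     print(rec["chunk_id"], rec["wikipedia_title"])
```

Prepending the `title` path to `text` gives a retriever or reader the article and section context that the chunk text alone may lack.
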
## Processing Pipeline
1. **Source**: [facebook/kilt_wikipedia](https://huggingface.co/datasets/facebook/kilt_wikipedia) — a Wikipedia snapshot from 01 August 2019 used as the knowledge base for the KILT benchmark.
2. **Section splitting**: Each article is divided into its constituent sections/subsections before chunking. No chunk ever spans two different sections.
3. **Sentence-aware chunking**: Each section is chunked using [`SentenceSplitter`](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/) from [LlamaIndex](https://www.llamaindex.ai/), producing non-overlapping chunks of up to **256 tokens**. The splitter respects sentence boundaries, so chunks end between sentences rather than mid-sentence and remain semantically coherent.
4. **Chunk and position metadata** are added to each record (see schema below).
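
The steps above can be illustrated with a minimal, standard-library-only sketch. It is not the code used to build the dataset: a regex sentence split and whitespace word counts stand in for LlamaIndex's `SentenceSplitter` and its tokenizer, and the names `split_sentences`, `chunk_section`, `chunk_article`, and `max_tokens` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_section(text, max_tokens=256):
    """Greedily pack whole sentences into non-overlapping chunks.

    Token counts are approximated by whitespace-separated words. A single
    sentence longer than max_tokens becomes its own oversized chunk here.
    """
    chunks, current, used = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk_article(sections, max_tokens=256):
    """Chunk each (section_path, text) pair independently.

    Because sections are processed one at a time, no chunk spans two sections.
    """
    records = []
    for section_order, (path, text) in enumerate(sections):
        chunks = chunk_section(text, max_tokens)
        for chunk_order, chunk in enumerate(chunks):
            records.append({
                "section": path,
                "section_order": section_order,
                "chunk_order": chunk_order,
                "is_first": chunk_order == 0,
                "is_last": chunk_order == len(chunks) - 1,
                "text": chunk,
            })
    return records
```

Chunking per section, rather than over the whole article, is what makes `chunk_order`, `is_first`, and `is_last` section-relative in this sketch.
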
---
## Schema
Each record corresponds to one chunk and contains the following fields:
| Field | Type | Description |
|---|---|---|
| `chunk_id` | `string` | Unique chunk identifier in the format `doc-{doc_id}::s-{section_order}::chunk-{chunk_order}` |
| `doc_id` | `string` | Document identifier combining Wikipedia ID and revision info |
| `chunk_order` | `int` | Position of the chunk within its section (0-indexed) |
| `is_first` | `bool` | Whether this is the first chunk in the section |
| `is_last` | `bool` | Whether this is the last chunk in the section |
| `text` | `string` | The chunk text content |
| `title` | `string` | Hierarchical section path (e.g. `# Article\n## Section\n### Subsection`) |
| `section` | `list[str]` | List of section/subsection names for the chunk's location |
| `section_order` | `int` | Index of the section within the article (0-indexed) |
| `num_sections` | `int` | Total number of sections in the article |
| `wikipedia_id` | `string` | Original Wikipedia page ID |
| `wikipedia_title` | `string` | Wikipedia article title |
| `categories` | `string` | Comma-separated Wikipedia categories |
| `pageid` | `string` | Wikipedia page ID |
| `parentid` | `string` | Wikipedia parent revision ID |
| `revid` | `string` | Wikipedia revision ID |
| `timestamp` | `string` | Revision timestamp (ISO 8601) |
| `url` | `string` | Direct URL to the Wikipedia revision |
| `wikidata_id` | `string` | Wikidata entity ID |
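
As a sketch of how the positional fields might be used, the snippet below parses a `chunk_id` of the documented form and reassembles a section's text from its chunks. The helper names `parse_chunk_id` and `reassemble_sections` are hypothetical, and joining chunks with a single space is an assumption, since the schema does not record the original whitespace between chunks.

```python
import re
from collections import defaultdict

# Mirrors the documented layout: doc-{...}::s-{...}::chunk-{...}
CHUNK_ID_RE = re.compile(
    r"^doc-(?P<doc_id>.+)::s-(?P<section>\d+)::chunk-(?P<chunk>\d+)$"
)

def parse_chunk_id(chunk_id):
    """Split a chunk_id into (doc_id, section index, chunk index)."""
    m = CHUNK_ID_RE.match(chunk_id)
    if m is None:
        raise ValueError(f"unexpected chunk_id format: {chunk_id!r}")
    return m["doc_id"], int(m["section"]), int(m["chunk"])

def reassemble_sections(records):
    """Group chunks by (doc_id, section_order) and join them in chunk_order."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["doc_id"], rec["section_order"])].append(rec)

    out = {}
    for key, chunks in groups.items():
        chunks.sort(key=lambda r: r["chunk_order"])
        # is_first / is_last plus contiguous orders flag sections with gaps.
        complete = (
            chunks[0]["is_first"]
            and chunks[-1]["is_last"]
            and [c["chunk_order"] for c in chunks] == list(range(len(chunks)))
        )
        out[key] = {
            "text": " ".join(c["text"] for c in chunks),
            "complete": complete,
        }
    return out
```

The `complete` flag is useful after retrieval, where only some chunks of a section may have been returned.
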
## Example Record
```python
{
    'chunk_id': 'doc-39_4::s-1::chunk-0',
    'doc_id': '39_4',
    'chunk_order': 0,
    'is_first': True,
    'is_last': True,
    # Adjacent string literals are concatenated into a single text value.
    'text': "Albedo is not directly dependent on illumination because changing the amount "
            "of incoming light proportionally changes the amount of reflected light, except "
            "in circumstances where a change in illumination induces a change in the Earth's "
            "surface at that location (e.g. through albedo-temperature feedback). That said, "
            "albedo and illumination both vary by latitude. Albedo is highest near the poles "
            "and lowest in the subtropics, with a local maximum in the tropics.",
    'title': '# Albedo\n## Examples of terrestrial albedo effects\n### Illumination',
    'wikipedia_id': '39',
    'wikipedia_title': 'Albedo',
    'section': ['Examples of terrestrial albedo effects', 'Illumination'],
    'section_order': 4,
    'num_sections': 19,
    'url': 'https://en.wikipedia.org/w/index.php?title=Albedo&oldid=906500850',
    # ... (remaining fields omitted)
}
```
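
The hierarchical `title` field can be unpacked into its heading levels. The following is a small sketch, assuming every line of `title` starts with a run of `#` characters followed by a space, as in the record above; `parse_title` is a hypothetical helper, and the card does not state whether this layout holds for every record.

```python
def parse_title(title):
    """Return (article_title, [section names]) from a Markdown-style heading path."""
    levels = []
    for line in title.split("\n"):
        hashes, _, name = line.partition(" ")
        if not hashes or set(hashes) != {"#"}:
            raise ValueError(f"not a heading line: {line!r}")
        levels.append((len(hashes), name.strip()))
    article = levels[0][1]                       # first line is the article heading
    sections = [name for _, name in levels[1:]]  # deeper levels form the section path
    return article, sections

# Fields copied from the example record.
record = {
    "title": "# Albedo\n## Examples of terrestrial albedo effects\n### Illumination",
    "wikipedia_title": "Albedo",
    "section": ["Examples of terrestrial albedo effects", "Illumination"],
}
article, sections = parse_title(record["title"])
```

For this record, `article` matches `wikipedia_title` and `sections` matches the `section` list, so the two representations carry the same path.
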
---
## Source & License
Built upon [facebook/kilt_wikipedia](https://huggingface.co/datasets/facebook/kilt_wikipedia). Please refer to the original dataset for licensing terms. The additional processing and metadata fields are released under the same terms.
This processed corpus was used in the papers [Do RAG Systems Really Suffer From Positional Bias?](https://aclanthology.org/2025.emnlp-main.1422) and [Redefining Retrieval Evaluation in the Era of LLMs](https://arxiv.org/abs/2510.21440).
```bibtex
@inproceedings{petroni-etal-2021-kilt,
title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
author = {Petroni, Fabio and
Piktus, Aleksandra and
Fan, Angela and
Lewis, Patrick and
Yazdani, Majid and
De Cao, Nicola and
Thorne, James and
Jernite, Yacine and
Karpukhin, Vladimir and
Maillard, Jean and
Plachouras, Vassilis and
Rockt{\"a}schel, Tim and
Riedel, Sebastian},
editor = "Toutanova, Kristina and
Rumshisky, Anna and
Zettlemoyer, Luke and
Hakkani-Tur, Dilek and
Beltagy, Iz and
Bethard, Steven and
Cotterell, Ryan and
Chakraborty, Tanmoy and
Zhou, Yichao",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.200/",
doi = "10.18653/v1/2021.naacl-main.200",
pages = "2523--2544",
abstract = "Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at \url{https://github.com/facebookresearch/KILT}."
}
@misc{cuconasu2025ragsystemsreallysuffer,
title={Do RAG Systems Really Suffer From Positional Bias?},
author={Florin Cuconasu and Simone Filice and Guy Horowitz and Yoelle Maarek and Fabrizio Silvestri},
year={2025},
eprint={2505.15561},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.15561},
}
@misc{trappolini2025redefiningretrievalevaluationera,
title={Redefining Retrieval Evaluation in the Era of LLMs},
author={Giovanni Trappolini and Florin Cuconasu and Simone Filice and Yoelle Maarek and Fabrizio Silvestri},
year={2025},
eprint={2510.21440},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.21440},
}
```
Provider:
florin-hf



