florin-hf/kilt_corpus_wiki_dump2019

Name: florin-hf/kilt_corpus_wiki_dump2019
Creator: florin-hf
Published: 2026-02-17 16:55:45
License: No description available

Hugging Face · Updated 2026-02-17 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/florin-hf/kilt_corpus_wiki_dump2019

Description:

---
dataset_info:
  features:
  - name: chunk_id
    dtype: string
  - name: doc_id
    dtype: string
  - name: chunk_order
    dtype: int64
  - name: is_first
    dtype: bool
  - name: is_last
    dtype: bool
  - name: text
    dtype: string
  - name: title
    dtype: string
  - name: wikipedia_id
    dtype: string
  - name: wikipedia_title
    dtype: string
  - name: categories
    dtype: string
  - name: section
    sequence: string
  - name: section_order
    dtype: int64
  - name: num_sections
    dtype: int64
  - name: pageid
    dtype: string
  - name: parentid
    dtype: string
  - name: pre_dump
    dtype: bool
  - name: revid
    dtype: string
  - name: timestamp
    dtype: string
  - name: url
    dtype: string
  - name: wikidata_id
    dtype: string
  - name: content_format
    dtype: string
  - name: chunk_type
    dtype: string
  splits:
  - name: train
    num_bytes: 29669574567
    num_examples: 27493072
  download_size: 11435522832
  dataset_size: 29669574567
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
language:
- en
size_categories:
- 10M<n<100M
tags:
- benchmark
- wikipedia
- retrieval
- RAG
---

# KILT Corpus in Chunks: Wikipedia Corpus Chunked for RAG

A preprocessed version of the [KILT Wikipedia corpus](https://huggingface.co/datasets/facebook/kilt_wikipedia) where each article has been split into coherent, section-aware text chunks suitable for Retrieval-Augmented Generation (RAG) pipelines.

## Dataset Summary

This dataset provides a chunked view of the KILT Wikipedia snapshot (August 2019). Starting from full Wikipedia articles, each article is first segmented by its section structure, then each section is independently chunked using a sentence-aware splitter. This design ensures that chunks never cross section boundaries and that each chunk can be precisely located within its source article.

## Processing Pipeline

1. **Source**: [facebook/kilt_wikipedia](https://huggingface.co/datasets/facebook/kilt_wikipedia), a Wikipedia snapshot from 01 August 2019 used as the knowledge base for the KILT benchmark.
2. **Section splitting**: Each article is divided into its constituent sections/subsections before chunking. No chunk ever spans two different sections.
3. **Sentence-aware chunking**: Each section is chunked using [`SentenceSplitter`](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/) from [LlamaIndex](https://www.llamaindex.ai/), producing non-overlapping chunks of up to **256 tokens**. Sentence boundaries are respected, so chunks are always semantically coherent.
4. **Chunk and position metadata** are added to each record (see schema below).

---

## Schema

Each record corresponds to one chunk and contains the following fields:

| Field | Type | Description |
|---|---|---|
| `chunk_id` | `string` | Unique chunk identifier in the format `doc-{doc_id}::s-{section_order}::chunk-{chunk_order}` |
| `doc_id` | `string` | Document identifier combining Wikipedia ID and revision info |
| `chunk_order` | `int` | Position of the chunk within its section (0-indexed) |
| `is_first` | `bool` | Whether this is the first chunk in the section |
| `is_last` | `bool` | Whether this is the last chunk in the section |
| `text` | `string` | The chunk text content |
| `title` | `string` | Hierarchical section path (e.g. `# Article\n## Section\n### Subsection`) |
| `section` | `list[str]` | List of section/subsection names for the chunk's location |
| `section_order` | `int` | Index of the section within the article (0-indexed) |
| `num_sections` | `int` | Total number of sections in the article |
| `wikipedia_id` | `string` | Original Wikipedia page ID |
| `wikipedia_title` | `string` | Wikipedia article title |
| `categories` | `string` | Comma-separated Wikipedia categories |
| `pageid` | `string` | Wikipedia page ID |
| `parentid` | `string` | Wikipedia parent revision ID |
| `revid` | `string` | Wikipedia revision ID |
| `timestamp` | `string` | Revision timestamp (ISO 8601) |
| `url` | `string` | Direct URL to the Wikipedia revision |
| `wikidata_id` | `string` | Wikidata entity ID |

## Example Record

```python
{
    'chunk_id': 'doc-39_4::s-1::chunk-0',
    'doc_id': '39_4',
    'chunk_order': 0,
    'is_first': True,
    'is_last': True,
    'text': "Albedo is not directly dependent on illumination because changing the amount "
            "of incoming light proportionally changes the amount of reflected light, except "
            "in circumstances where a change in illumination induces a change in the Earth's "
            "surface at that location (e.g. through albedo-temperature feedback). That said, "
            "albedo and illumination both vary by latitude. Albedo is highest near the poles "
            "and lowest in the subtropics, with a local maximum in the tropics.",
    'title': '# Albedo\n## Examples of terrestrial albedo effects\n### Illumination',
    'wikipedia_id': '39',
    'wikipedia_title': 'Albedo',
    'section': ['Examples of terrestrial albedo effects', 'Illumination'],
    'section_order': 4,
    'num_sections': 19,
    'url': 'https://en.wikipedia.org/w/index.php?title=Albedo&oldid=906500850',
    ...
}
```

---

## Source & License

Built upon [facebook/kilt_wikipedia](https://huggingface.co/datasets/facebook/kilt_wikipedia). Please refer to the original dataset for licensing terms. The additional processing and metadata fields are released under the same terms.

This processed corpus was used in the papers [Do RAG Systems Really Suffer From Positional Bias?](https://aclanthology.org/2025.emnlp-main.1422) and [Redefining Retrieval Evaluation in the Era of LLMs](https://arxiv.org/abs/2510.21440).

```
@inproceedings{petroni-etal-2021-kilt,
    title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
    author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian},
    editor = "Toutanova, Kristina and Rumshisky, Anna and Zettlemoyer, Luke and Hakkani-Tur, Dilek and Beltagy, Iz and Bethard, Steven and Cotterell, Ryan and Chakraborty, Tanmoy and Zhou, Yichao",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.200/",
    doi = "10.18653/v1/2021.naacl-main.200",
    pages = "2523--2544",
    abstract = "Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at \url{https://github.com/facebookresearch/KILT}."
}

@misc{cuconasu2025ragsystemsreallysuffer,
    title={Do RAG Systems Really Suffer From Positional Bias?},
    author={Florin Cuconasu and Simone Filice and Guy Horowitz and Yoelle Maarek and Fabrizio Silvestri},
    year={2025},
    eprint={2505.15561},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.15561},
}

@misc{trappolini2025redefiningretrievalevaluationera,
    title={Redefining Retrieval Evaluation in the Era of LLMs},
    author={Giovanni Trappolini and Florin Cuconasu and Simone Filice and Yoelle Maarek and Fabrizio Silvestri},
    year={2025},
    eprint={2510.21440},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.21440},
}
```
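
As a usage illustration for the schema above, here is a minimal sketch of how the documented `chunk_id` format could be parsed and how the chunks of one section could be put back in reading order. It assumes records are plain dicts carrying the schema fields. The helper names `parse_chunk_id` and `rebuild_sections` are hypothetical, and joining chunk texts with a single space is an assumption, since the card does not say how whitespace between chunks was handled. In the example record the `s-` component of `chunk_id` (`1`) differs from the `section_order` field (`4`), so the sketch groups on the record fields rather than on the parsed ID.

```python
import re
from collections import defaultdict

# Documented format: doc-{doc_id}::s-{section_order}::chunk-{chunk_order}
CHUNK_ID_RE = re.compile(
    r"^doc-(?P<doc_id>.+)::s-(?P<section>\d+)::chunk-(?P<chunk>\d+)$"
)


def parse_chunk_id(chunk_id):
    """Split a chunk_id into (doc_id, section component, chunk component)."""
    match = CHUNK_ID_RE.match(chunk_id)
    if match is None:
        raise ValueError(f"unexpected chunk_id format: {chunk_id!r}")
    return match["doc_id"], int(match["section"]), int(match["chunk"])


def rebuild_sections(records):
    """Group chunk records by (wikipedia_id, section_order) and join their
    texts in chunk_order. The single-space separator is an assumption."""
    grouped = defaultdict(list)
    for record in records:
        key = (record["wikipedia_id"], record["section_order"])
        grouped[key].append(record)

    sections = {}
    for key, chunks in grouped.items():
        ordered = sorted(chunks, key=lambda r: r["chunk_order"])
        sections[key] = " ".join(chunk["text"] for chunk in ordered)
    return sections


if __name__ == "__main__":
    print(parse_chunk_id("doc-39_4::s-1::chunk-0"))
    toy_records = [  # made-up records, not taken from the dataset
        {"wikipedia_id": "39", "section_order": 4, "chunk_order": 1, "text": "second"},
        {"wikipedia_id": "39", "section_order": 4, "chunk_order": 0, "text": "first"},
    ]
    print(rebuild_sections(toy_records))
```

Because chunks are described as non-overlapping and confined to a single section, sorting by `chunk_order` within a `(wikipedia_id, section_order)` group should be enough to recover the section's reading order; `is_first` and `is_last` can then serve as a sanity check that a group is complete.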

Provider:

florin-hf
