s-nlp/kilt
Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/s-nlp/kilt
Resource description:
---
language:
- en
license: cc-by-sa-4.0
pretty_name: KILT Wikipedia knowledge source (paragraph-level)
tags:
- wikipedia
- kilt
- retrieval
- en
- paragraph
task_categories:
- text-retrieval
---
# KILT Wikipedia — paragraph-level (flattened)
## Dataset summary
This dataset is a **flattened** view of the [KILT knowledge source](https://github.com/facebookresearch/KILT) (`kilt_knowledgesource.json`): each row is **one Wikipedia paragraph** (one string in the original per-page `text` list), not one page per row.
- **Source corpus:** KILT Wikipedia knowledge source (2019/08/01 Wikipedia dump, per KILT README).
- **Split:** `train` only (full paragraph stream).
- **Rows (train):** `111,789,997`
- **Input shards used:** `112` JSONL files (`part-*.jsonl`).
- **Parquet chunks before Hub push:** `2236` (batch_rows=50000, Hub max_shard_size=500MB).
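The chunk count above is consistent with the row count and batch size. The short check below assumes that batches of `batch_rows` rows were filled sequentially over the whole paragraph stream rather than per input shard; the card does not say so explicitly, but the numbers agree under that assumption.

```python
# Figures copied from the dataset summary above.
TOTAL_ROWS = 111_789_997
BATCH_ROWS = 50_000

# Assumption: batches are filled sequentially over the whole stream,
# so only the final chunk can be partially full.
full_chunks, remainder = divmod(TOTAL_ROWS, BATCH_ROWS)
chunk_count = full_chunks + (1 if remainder else 0)

print(chunk_count, remainder)  # prints "2236 39997"
```

Under this assumption the final chunk holds 39,997 rows and every other chunk holds 50,000.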
## Data fields
| Column | Type | Description |
|--------|------|-------------|
| `wikipedia_id` | string | KILT Wikipedia page id |
| `wikipedia_title` | string | Page title |
| `text` | string | Single paragraph body |
| `_id` | string | Stable id: `{<page _id>}::p{<paragraph_index>}` |
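As an illustration of the `_id` format, the sketch below builds and splits such ids. The helper names `make_paragraph_id` and `split_paragraph_id` are hypothetical, and two things are assumed rather than stated in the card: that the braces and angle brackets in the pattern are placeholders and not literal characters, and that the paragraph index is a non-negative decimal integer.

```python
from typing import Tuple

SEPARATOR = "::p"  # taken from the `_id` pattern in the table above


def make_paragraph_id(page_id: str, paragraph_index: int) -> str:
    """Build a row id from a page id and a paragraph index (hypothetical helper)."""
    return f"{page_id}{SEPARATOR}{paragraph_index}"


def split_paragraph_id(row_id: str) -> Tuple[str, int]:
    """Split a row id back into (page id, paragraph index)."""
    # rpartition splits on the LAST separator, so a page id that itself
    # contains "::p" would still round-trip.
    page_id, sep, index = row_id.rpartition(SEPARATOR)
    if not sep or not index.isdigit():
        raise ValueError(f"not a paragraph id: {row_id!r}")
    return page_id, int(index)
```

Splitting the id this way lets retrieved paragraphs be grouped back by page, for example when evaluating retrieval at page level.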
## How it was built
1. Convert `kilt_knowledgesource.json` (JSONL, one JSON object per line) with `OSCAR_like_experiments/scripts/convert_kilt_knowledge_source_to_paragraph_jsonl.py`.
2. Upload with `OSCAR_like_experiments/scripts/push_kilt_paragraph_jsonl_to_hub.py` from directory: `/home/jovyan/rpt/OSCAR_like_experiments/scripts/kilt_knowledgesource`.
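The conversion script itself is not reproduced in this card. The following is a minimal sketch of what step 1 amounts to, not the actual script: it assumes each KILT page record carries `_id`, `wikipedia_id`, `wikipedia_title`, and a `text` list, that paragraphs are numbered from 0, and that no paragraphs are filtered out. The function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def flatten_page(page: Dict) -> Iterator[Dict[str, str]]:
    """Yield one output row per paragraph of a single KILT page record."""
    page_id = str(page["_id"])
    # Assumption: paragraph indices start at 0 and every paragraph is kept.
    for index, paragraph in enumerate(page["text"]):
        yield {
            "wikipedia_id": str(page["wikipedia_id"]),
            "wikipedia_title": page["wikipedia_title"],
            "text": paragraph,
            "_id": f"{page_id}::p{index}",
        }


def flatten_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Flatten a stream of JSONL lines (one page per line) into paragraph rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield from flatten_page(json.loads(line))
```

Passing an open file handle for `kilt_knowledgesource.json` to `flatten_jsonl` would stream rows without loading the whole knowledge source into memory.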
## Intended use
Building sparse or dense **retrieval indexes** (e.g. BM25 or SPLADE on the sparse side) where each document unit is a paragraph, matching the KILT-style chunking used in RAG pipelines.
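To make the paragraph-as-document idea concrete, here is a toy in-memory BM25 scorer over rows shaped like this dataset's schema. It is only an illustration: the class name `TinyBM25`, the regex tokenizer, and the parameter values `k1=1.2`, `b=0.75` are assumptions, and an index over the full 111,789,997 rows would need a real search engine rather than Python lists.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


class TinyBM25:
    """Toy Okapi BM25 over rows that have `_id` and `text` keys."""

    def __init__(self, rows: Iterable[Dict[str, str]], k1: float = 1.2, b: float = 0.75):
        self.rows = list(rows)
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(row["text"])) for row in self.rows]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_lens) / max(len(self.rows), 1)
        self.doc_freq: Counter = Counter()
        for tf in self.term_freqs:
            self.doc_freq.update(tf.keys())  # count each term once per paragraph

    def _idf(self, term: str) -> float:
        n, df = len(self.rows), self.doc_freq.get(term, 0)
        # Non-negative IDF variant, as used by Lucene-style BM25.
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """Return up to `top_k` (row `_id`, score) pairs, best first."""
        terms = tokenize(query)
        scored = []
        for row, tf, dl in zip(self.rows, self.term_freqs, self.doc_lens):
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f == 0:
                    continue
                norm = f + self.k1 * (1 - self.b + self.b * dl / self.avg_len)
                score += self._idf(term) * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((row["_id"], score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

Because each row is a single paragraph, the returned `_id` values point directly at the passage to hand to a reader model, with no further chunking step.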
## Limitations
- Text is **English Wikipedia** as packaged in KILT; formatting/markup follows KILT preprocessing.
- Not an official Meta/Facebook dataset release; this is a **derived** redistribution, so users must comply with the Wikipedia and KILT terms.
## Citation
Please cite **KILT** (and Wikipedia as appropriate):
```bibtex
@inproceedings{petroni-etal-2021-kilt,
title = {{KILT}: a Benchmark for Knowledge Intensive Language Tasks},
author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and others},
booktitle = {NAACL-HLT},
year = {2021},
}
```
Repository: [facebookresearch/KILT](https://github.com/facebookresearch/KILT)
## Dataset card contact
Dataset repo: `s-nlp/kilt`
Provider:
s-nlp



