
s-nlp/kilt

Hugging Face | Updated 2026-03-21 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/s-nlp/kilt
Resource description:
---
language:
- en
license: cc-by-sa-4.0
pretty_name: KILT Wikipedia knowledge source (paragraph-level)
tags:
- wikipedia
- kilt
- retrieval
- en
- paragraph
task_categories:
- text-retrieval
---

# KILT Wikipedia: paragraph-level (flattened)

## Dataset summary

This dataset is a **flattened** view of the [KILT knowledge source](https://github.com/facebookresearch/KILT) (`kilt_knowledgesource.json`): each row is **one Wikipedia paragraph** (one string in the original per-page `text` list), not one page per row.

- **Source corpus:** KILT Wikipedia knowledge source (2019/08/01 Wikipedia dump, per KILT README).
- **Split:** `train` only (full paragraph stream).
- **Rows (train):** `111,789,997`
- **Input shards used:** `112` JSONL file(s) (`part-*.jsonl`).
- **Parquet chunks before Hub push:** `2236` (batch_rows=50000, Hub max_shard_size=500MB).

## Data fields

| Column | Type | Description |
|--------|------|-------------|
| `wikipedia_id` | string | KILT Wikipedia page id |
| `wikipedia_title` | string | Page title |
| `text` | string | Single paragraph body |
| `_id` | string | Stable id: `{<page _id>}::p{<paragraph_index>}` |

## How it was built

1. Convert `kilt_knowledgesource.json` (JSONL, one JSON object per line) with `OSCAR_like_experiments/scripts/convert_kilt_knowledge_source_to_paragraph_jsonl.py`.
2. Upload with `OSCAR_like_experiments/scripts/push_kilt_paragraph_jsonl_to_hub.py` from directory: `/home/jovyan/rpt/OSCAR_like_experiments/scripts/kilt_knowledgesource`.

## Intended use

Sparse / dense **retrieval indexing** (e.g. BM25, SPLADE) where each document unit is a paragraph, matching the KILT-style chunking used in RAG pipelines.

## Limitations

- Text is **English Wikipedia** as packaged in KILT; formatting/markup follows KILT preprocessing.
- Not an official Meta/Facebook dataset release; this is a **derived** redistribution, so comply with Wikipedia and KILT terms.
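## Illustrative sketch: page-to-paragraph flattening

The conversion script named under "How it was built" is not reproduced in this card, so the following is only a minimal sketch of what such a flattening step might look like. It assumes that each KILT page record carries `_id`, `wikipedia_id`, `wikipedia_title`, and a `text` list of paragraph strings, and that the row id is formed as `<page _id>::p<paragraph_index>` with a 0-based index. The helper names `flatten_page` and `split_row_id` and the sample record are hypothetical, and the real script may differ (for example in how it treats empty paragraphs or index origin).

```python
import json
from typing import Dict, Iterator, Tuple


def flatten_page(page: Dict) -> Iterator[Dict[str, str]]:
    """Yield one row per paragraph of a single KILT page record.

    Assumption: the paragraph index is 0-based and every entry of
    page["text"] becomes its own row, unchanged.
    """
    page_id = str(page["_id"])
    for index, paragraph in enumerate(page["text"]):
        yield {
            "wikipedia_id": str(page["wikipedia_id"]),
            "wikipedia_title": page["wikipedia_title"],
            "text": paragraph,
            # Assumed reading of the `{<page _id>}::p{<paragraph_index>}` template.
            "_id": f"{page_id}::p{index}",
        }


def split_row_id(row_id: str) -> Tuple[str, int]:
    """Recover (page id, paragraph index) from a flattened row id."""
    page_id, _, suffix = row_id.rpartition("::p")
    return page_id, int(suffix)


# Made-up record standing in for one line of kilt_knowledgesource.json.
sample_line = json.dumps(
    {
        "_id": "42",
        "wikipedia_id": "42",
        "wikipedia_title": "Example page",
        "text": ["Example page\n", "First body paragraph.\n"],
    }
)

rows = list(flatten_page(json.loads(sample_line)))
for row in rows:
    print(row["_id"], repr(row["text"]))
```

In a full run, every line of the JSONL knowledge source would be parsed and passed through a function like `flatten_page`. A helper like `split_row_id` lets a retrieval pipeline map a paragraph hit back to its page.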
## Citation

Please cite **KILT** (and Wikipedia as appropriate):

```bibtex
@inproceedings{petroni-etal-2021-kilt,
  title = {{KILT}: a Benchmark for Knowledge Intensive Language Tasks},
  author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and others},
  booktitle = {NAACL-HLT},
  year = {2021},
}
```

Repository: [facebookresearch/KILT](https://github.com/facebookresearch/KILT)

## Dataset card contact

Dataset repo: `s-nlp/kilt`
Provider:
s-nlp