kilt

Name: kilt
Creator: s-nlp
Published: 2026-03-21 16:57:38
License: No description available

Source: Hugging Face. Updated: 2026-03-21. Indexed: 2026-03-23.

Download link:

https://huggingface.co/datasets/s-nlp/kilt

Description:

The KILT Wikipedia passage-level dataset is a flattened view of the KILT knowledge source (`kilt_knowledgesource.json`): each row is a single Wikipedia passage (one string from a page's original `text` list) rather than a whole page. It is derived from the August 1, 2019 Wikipedia dump, contains only a train split, and holds 111,789,997 passage rows; the original input consists of 112 JSONL files. Each row has four fields: `wikipedia_id` (KILT Wikipedia page ID), `wikipedia_title` (page title), `text` (the body of a single passage), and `_id` (a stable ID in the format `{<page _id>}::p{<paragraph_index>}`). The dataset is suited to building sparse or dense retrieval indexes (e.g., BM25, SPLADE) in which each document unit is a passage, matching the KILT-style chunking used in RAG pipelines. The text is KILT-packaged English Wikipedia content in the KILT preprocessing format. This is not an official Meta/Facebook release but a derived redistribution, and users must comply with the terms of both Wikipedia and KILT.
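The flattening described above can be illustrated with a minimal Python sketch. It is not the script used to build this dataset. It assumes a page record carries `_id`, `wikipedia_id`, `wikipedia_title`, and a `text` list, that the braces in the `_id` format are placeholders rather than literal characters, and that the paragraph index is zero-based. The function names `flatten_page` and `split_passage_id` and the sample record are hypothetical.

```python
from typing import Dict, Iterator, Tuple


def flatten_page(page: Dict) -> Iterator[Dict[str, str]]:
    """Yield one passage row per string in the page's `text` list."""
    for idx, paragraph in enumerate(page["text"]):
        yield {
            # Assumed reading of the format: "<page _id>::p<paragraph_index>"
            "_id": f"{page['_id']}::p{idx}",
            "wikipedia_id": page["wikipedia_id"],
            "wikipedia_title": page["wikipedia_title"],
            "text": paragraph,
        }


def split_passage_id(passage_id: str) -> Tuple[str, int]:
    """Recover (page _id, paragraph index) from a passage `_id`."""
    # rpartition splits on the last "::p", so a page id that itself
    # contained "::p" would still be handled.
    page_id, _, suffix = passage_id.rpartition("::p")
    return page_id, int(suffix)


# Made-up page record for illustration only (not real dataset content).
example_page = {
    "_id": "12",
    "wikipedia_id": "12",
    "wikipedia_title": "Example page",
    "text": ["Example page\n", "First paragraph of the page.\n"],
}

rows = list(flatten_page(example_page))
for row in rows:
    print(row["_id"], repr(row["text"]))
```

Because the passage `_id` embeds the page `_id`, a retrieval hit can be mapped back to its source page with `split_passage_id` without a separate lookup table.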

Provider:

s-nlp

Created:

2026-03-21