singletongue/wikipedia-paragraphs

Name: singletongue/wikipedia-paragraphs
Creator: singletongue
Published: 2026-03-13 13:10:44
License: No description provided

Hugging Face | Updated 2026-03-13 | Indexed 2025-10-25

Download link:

https://hf-mirror.com/datasets/singletongue/wikipedia-paragraphs

Resource overview:

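This dataset is generated from Wikipedia and is intended to provide data for natural language processing research. Each entry contains cleaned paragraph text and wikilink information extracted from a Wikipedia page, together with useful metadata such as categories, templates, and the associated Wikidata QID. The dataset is organized into multiple configurations, each corresponding to a specific Wikipedia dump. Each configuration contains a train split, for which the byte count and the example count are recorded. The README also provides an example of how to use the dataset.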

The wikipedia-paragraphs dataset is a collection of paragraph text and wikilink information extracted from Wikipedia pages, along with metadata such as categories, templates, and the associated Wikidata QID. It is designed for natural language processing (NLP) research and includes multiple language configurations. Each configuration corresponds to a specific Wikipedia dump and contains a train split, for which the number of bytes and the number of examples are listed. The dataset is structured around individual Wikipedia pages, which can be articles, categories, or templates.
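The configuration and split layout described above can be explored with a short script. The following is a minimal sketch, not the usage example from the dataset's README: it assumes the Hugging Face `datasets` library is installed, and the helper names `peek` and `describe` are hypothetical. Because this listing does not give the configuration names or the record fields, the sketch discovers both at run time instead of hard-coding them. The optional `endpoint` argument is an assumption based on the mirror URL above; it sets the `HF_ENDPOINT` environment variable read by `huggingface_hub`.

```python
import os

REPO_ID = "singletongue/wikipedia-paragraphs"


def describe(record):
    """Map each field name in a record to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


def peek(repo_id=REPO_ID, n=3, endpoint=None):
    """Return (config_name, first n records) from the train split.

    Uses the first configuration listed for the repository. Streaming is
    enabled so that a full Wikipedia dump is not downloaded just to look
    at a few records.
    """
    if endpoint:
        # Assumption: a mirror such as "https://hf-mirror.com".
        # Must be set before the Hugging Face libraries are imported.
        os.environ["HF_ENDPOINT"] = endpoint

    # Imported lazily so that `describe` works without the library installed.
    from datasets import get_dataset_config_names, load_dataset

    configs = get_dataset_config_names(repo_id)
    stream = load_dataset(repo_id, configs[0], split="train", streaming=True)
    return configs[0], list(stream.take(n))
```

Calling `peek()` requires network access and returns the chosen configuration name with a few records; passing one of those records to `describe` shows which fields (for example the paragraph and metadata fields mentioned above) are actually present and how they are typed.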

Provided by:

singletongue
