mamei16/wikipedia_paragraphs

Name: mamei16/wikipedia_paragraphs
Creator: mamei16
Published: 2025-10-16 12:26:01
License: No description available

Source: Hugging Face
Updated: 2025-10-16
Indexed: 2025-10-25

Download link:

https://hf-mirror.com/datasets/mamei16/wikipedia_paragraphs


Description:

This dataset consists of English Wikipedia articles, tokenized by splitting on paragraph breaks and spaces. Each token carries a binary ner_tag that indicates whether the token is followed by a paragraph break in the original text. In two special cases the text is not split: when the paragraph break is preceded by a colon, and when the paragraph is too short. The dataset is intended for training chunking models used in RAG (retrieval-augmented generation) applications.
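The labeling scheme described above can be illustrated with a minimal Python sketch. This is not the dataset author's preprocessing code. The function names, the use of a blank line as the paragraph break, the `min_tokens` threshold, and the reading of "too short" as applying to the paragraph before the break are all assumptions made for illustration; the listing does not specify any of them.

```python
from typing import List, Tuple


def label_paragraph_breaks(text: str, min_tokens: int = 3) -> Tuple[List[str], List[int]]:
    """Split text into tokens and tag each token 1 if a paragraph break follows it.

    Assumptions (not confirmed by the dataset listing): paragraphs are separated
    by blank lines, and `min_tokens` is a hypothetical "too short" threshold.
    """
    tokens: List[str] = []
    tags: List[int] = []
    # Split on blank lines, then on whitespace; drop empty paragraphs.
    paragraphs = [p.split() for p in text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]
    for i, words in enumerate(paragraphs):
        tokens.extend(words)
        tags.extend([0] * len(words))
        is_last = i == len(paragraphs) - 1
        ends_with_colon = words[-1].endswith(":")
        too_short = len(words) < min_tokens
        # Mark a break only when none of the special cases applies.
        if not (is_last or ends_with_colon or too_short):
            tags[-1] = 1
    return tokens, tags


def rebuild_chunks(tokens: List[str], tags: List[int]) -> List[str]:
    """Group tokens back into chunks, closing a chunk after each token tagged 1."""
    chunks: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        current.append(token)
        if tag == 1:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "Intro sentence one here.\n\nThe options are:\n\nfirst option listed here\n\nEnd."
    toks, labels = label_paragraph_breaks(sample)
    for chunk in rebuild_chunks(toks, labels):
        print(chunk)
```

In this sketch the paragraph ending in a colon is merged with the paragraph that follows it, which mirrors the colon rule in the description. A chunking model trained on such labels would predict the tag sequence, and `rebuild_chunks` shows how predicted tags could be turned back into text chunks.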

Provided by:

mamei16
