chenzhe0000/wikitext_cleaned

Name: chenzhe0000/wikitext_cleaned
Creator: chenzhe0000
Published: 2026-04-09 07:26:19
License: No description available

Source: Hugging Face | Updated: 2026-04-09 | Indexed: 2026-04-12

Download link:

https://hf-mirror.com/datasets/chenzhe0000/wikitext_cleaned

Resource description:

---
license: mit
---

# Dataset Card for Cleaned & Chunked WikiText Corpus

## 📌 Dataset Description

This dataset is a cleaned and chunked version of a WikiText-style corpus, designed for language model pretraining and evaluation.

The preprocessing pipeline follows a lightweight data curation paradigm:

> **cleaning → deduplication → normalization → token-aware chunking**

Specifically, the preprocessing includes:

- Removal of special and noisy characters
- Cleaning of formatting artifacts (e.g., HTML tags, irregular symbols)
- Deduplication to reduce redundant or highly similar text samples
- Text normalization (whitespace, encoding, etc.)
- Chunking of long documents into smaller segments suitable for model training

Each sample corresponds to a **cleaned text chunk**, rather than a full original document.

---

## 📊 Data Structure

Each sample in the dataset follows this structure:

```json
{
  "uid": "string",
  "content": "string",
  "meta_data": {
    "index": "int",
    "total": "int",
    "length": "int"
  }
}
```
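The pipeline described above (cleaning, deduplication, normalization, chunking) can be illustrated with a minimal Python sketch. This is not the dataset author's actual code. The function names, the whitespace-based token count, the exact-match deduplication, the SHA-1 derived `uid`, and the reading of `index`, `total`, and `length` as chunk position, chunk count, and character length are all assumptions made for illustration.

```python
import hashlib
import html
import re
import unicodedata


def clean_text(text: str) -> str:
    """Strip HTML-like tags and non-printable characters, normalize whitespace."""
    text = html.unescape(text)                   # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML-like tags
    text = unicodedata.normalize("NFKC", text)   # unify Unicode forms
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


def deduplicate(docs):
    """Keep the first occurrence of each case-insensitive exact duplicate.

    Assumption: near-duplicate detection (e.g. MinHash) is out of scope here.
    """
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha1(doc.lower().encode("utf-8")).hexdigest()
        if doc and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def chunk_document(doc: str, max_tokens: int = 128):
    """Split a document into chunks of at most max_tokens whitespace tokens.

    Assumption: whitespace tokens stand in for a real tokenizer.
    """
    tokens = doc.split()
    pieces = [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
    records = []
    for index, piece in enumerate(pieces):
        # Hypothetical uid scheme: hash of the document text plus chunk index.
        uid = hashlib.sha1(f"{doc}|{index}".encode("utf-8")).hexdigest()[:16]
        records.append({
            "uid": uid,
            "content": piece,
            "meta_data": {
                "index": index,          # assumed: position of this chunk
                "total": len(pieces),    # assumed: chunks in the document
                "length": len(piece),    # assumed: length in characters
            },
        })
    return records


def build_records(raw_docs, max_tokens: int = 128):
    """Run clean -> deduplicate -> chunk over a list of raw documents."""
    cleaned = [clean_text(doc) for doc in raw_docs]
    records = []
    for doc in deduplicate(cleaned):
        records.extend(chunk_document(doc, max_tokens))
    return records
```

Calling `build_records(["<p>Some raw text</p>"], max_tokens=128)` returns a list of dictionaries with the same keys as the structure shown above. A real pipeline would likely swap the whitespace split for the target model's tokenizer.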

Provider:

chenzhe0000
