chenzhe0000/wikitext_cleaned

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/chenzhe0000/wikitext_cleaned
Resource summary:

---
license: mit
---

# Dataset Card for Cleaned & Chunked WikiText Corpus

## 📌 Dataset Description

This dataset is a cleaned and chunked version of a WikiText-style corpus, designed for language model pretraining and evaluation.

The preprocessing pipeline follows a lightweight data curation paradigm:

> **cleaning → deduplication → normalization → token-aware chunking**

Specifically, the preprocessing includes:

- Removal of special and noisy characters
- Cleaning of formatting artifacts (e.g., HTML tags, irregular symbols)
- Deduplication to reduce redundant or highly similar text samples
- Text normalization (whitespace, encoding, etc.)
- Chunking of long documents into smaller segments suitable for model training

Each sample corresponds to a **cleaned text chunk**, rather than a full original document.

---

## 📊 Data Structure

Each sample in the dataset follows this structure:

```json
{
  "uid": "string",
  "content": "string",
  "meta_data": {
    "index": "int",
    "total": "int",
    "length": "int"
  }
}
```
Provider:
chenzhe0000