chenzhe0000/wikitext_cleaned
Source: Hugging Face (updated 2026-04-09, indexed 2026-04-12)
Download link:
https://hf-mirror.com/datasets/chenzhe0000/wikitext_cleaned
Resource description:
---
license: mit
---
# Dataset Card for Cleaned & Chunked WikiText Corpus
## 📌 Dataset Description
This dataset is a cleaned and chunked version of a WikiText-style corpus, designed for language model pretraining and evaluation.
The preprocessing pipeline follows a lightweight data curation paradigm:
> **cleaning → deduplication → normalization → token-aware chunking**
Specifically, the preprocessing includes:
- Removal of special and noisy characters
- Cleaning of formatting artifacts (e.g., HTML tags, irregular symbols)
- Deduplication to reduce redundant or highly similar text samples
- Text normalization (whitespace, encoding, etc.)
- Chunking of long documents into smaller segments suitable for model training
Each sample corresponds to a **cleaned text chunk**, rather than a full original document.
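
The four stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used to build the dataset: the function names are hypothetical, exact-match hashing stands in for whatever similarity-based deduplication was applied, and whitespace-separated words stand in for real tokenizer tokens.

```python
import hashlib
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for HTML-like tags
CTRL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # control characters
WS_RE = re.compile(r"\s+")


def clean(text):
    """Replace HTML-like tags with a space and drop control characters."""
    return CTRL_RE.sub("", TAG_RE.sub(" ", text))


def deduplicate(docs):
    """Drop exact duplicates, keeping the first occurrence.

    Assumption: exact matching only; near-duplicate detection is not shown.
    """
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def normalize(text):
    """Apply Unicode NFKC normalization and collapse whitespace runs."""
    return WS_RE.sub(" ", unicodedata.normalize("NFKC", text)).strip()


def chunk(text, max_tokens=128):
    """Split text into chunks of at most max_tokens whitespace tokens.

    Assumption: a real pipeline would count tokens with the model tokenizer.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def run_pipeline(docs, max_tokens=128):
    """Run cleaning, deduplication, normalization, then chunking."""
    cleaned = [clean(doc) for doc in docs]
    unique = deduplicate(cleaned)
    normalized = [normalize(doc) for doc in unique]
    chunks = []
    for doc in normalized:
        chunks.extend(chunk(doc, max_tokens))
    return chunks
```

For example, `run_pipeline(["<p>Hello   world</p>", "<p>Hello   world</p>", "a b c d e"], max_tokens=2)` drops the repeated document and splits the remaining text into chunks of at most two words each.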
---
## 📊 Data Structure
Each sample in the dataset follows this structure:
```json
{
  "uid": "string",
  "content": "string",
  "meta_data": {
    "index": "int",
    "total": "int",
    "length": "int"
  }
}
```
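
A record can be checked against this layout with a small type validator. The sketch below is a hypothetical helper, not part of the dataset tooling, and it checks field types only: the card does not define what `index`, `total`, and `length` measure, so no semantic checks are attempted.

```python
META_INT_FIELDS = ("index", "total", "length")


def validate_sample(sample):
    """Return a list of type problems found in one record (empty if none)."""
    problems = []
    for key in ("uid", "content"):
        if not isinstance(sample.get(key), str):
            problems.append(f"{key} should be a string")
    meta = sample.get("meta_data")
    if not isinstance(meta, dict):
        problems.append("meta_data should be an object")
        return problems
    for key in META_INT_FIELDS:
        value = meta.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"meta_data.{key} should be an int")
    return problems
```

An empty list means the record matches the expected types; otherwise each entry names the offending field.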
Provider: chenzhe0000



