Zhantas/Cleaned-Uzbek_Wikipedia
Source: Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Zhantas/Cleaned-Uzbek_Wikipedia
Resource description:
---
language:
- uz
license: cc-by-sa-4.0
task_categories:
- text-generation
tags:
- wikipedia
- uzbek
- latin
- cleaned
- corpus
- multilingual
size_categories:
- 100K<n<1M
---
# Cleaned-Uzbek_Wikipedia
A cleaned, denoised corpus of 329,766 articles (about 53.4 million words) derived from the Uzbek Wikipedia. It is curated for pre-training and fine-tuning Large Language Models (LLMs) on the modern Uzbek language.
## 🌟 Key Feature: Latin-Centric Corpus
Unlike many existing Uzbek datasets, this corpus is predominantly in the **Latin script (89.42%)**, the official writing system of Uzbekistan and the modern written standard. This makes it a suitable foundation for training Latin-script Uzbek language models.
## 📊 Dataset Benchmark
### General Statistics
| Metric | Value |
| :--- | :--- |
| **Total Articles** | 329,766 |
| **Total Characters** | 447,507,793 |
| **Total Words** | 53,382,386 |
| **Avg. Words per Article** | 161.88 |
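
The card does not say how words were counted. As a rough sketch, assuming whitespace tokenization and character counts taken with `len()`, the totals in the table could be recomputed along these lines (`corpus_stats` is a hypothetical helper, not part of the dataset's tooling):

```python
def corpus_stats(texts):
    """Recompute article, character, and word totals for a list of strings.

    Assumption: a "word" is any whitespace-delimited token; the card does
    not document the tokenization that was actually used.
    """
    n_articles = len(texts)
    n_chars = sum(len(t) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    avg_words = n_words / n_articles if n_articles else 0.0
    return {
        "articles": n_articles,
        "characters": n_chars,
        "words": n_words,
        "avg_words_per_article": round(avg_words, 2),
    }


# Toy input, not taken from the dataset.
sample = ["Toshkent O'zbekiston poytaxti.", "Samarqand qadimiy shahar."]
stats = corpus_stats(sample)

# Consistency check on the published totals (words / articles).
published_avg = round(53_382_386 / 329_766, 2)  # 161.88, matching the table
```

Applied to the real corpus, the same function would take `dataset['train']['text']` as input; the result may differ slightly from the table if the author counted words differently.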
### Script Distribution
| Script | Percentage |
| :--- | :--- |
| **Latin** | **89.42%** |
| **Digits** | 4.34% |
| **Other/Special** | 5.73% |
| **Cyrillic** | 0.48% |
| **Arabic** | 0.03% |
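
The card does not describe how these percentages were measured. One plausible approach, sketched here under the assumption that each character is bucketed by basic Unicode range, is shown below. The bucket names, the ranges, and both function names are illustrative choices rather than the author's method.

```python
from collections import Counter


def script_of(ch):
    """Classify one character into a coarse script bucket.

    Assumption: buckets follow basic Unicode block ranges. Whitespace,
    punctuation, and modifier letters such as U+02BB (used in Uzbek
    Latin "oʻ" and "gʻ") all fall into "other" in this sketch.
    """
    if ch.isdigit():
        return "digits"
    code = ord(ch)
    if ch.isascii() and ch.isalpha():
        return "latin"
    if 0x00C0 <= code <= 0x024F and ch.isalpha():
        return "latin"  # Latin-1 Supplement and Latin Extended letters
    if 0x0400 <= code <= 0x052F:
        return "cyrillic"
    if 0x0600 <= code <= 0x06FF:
        return "arabic"
    return "other"


def script_distribution(text):
    """Return the percentage of characters in each script bucket."""
    counts = Counter(script_of(ch) for ch in text)
    total = sum(counts.values())
    if not total:
        return {}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Toy input, not taken from the dataset.
dist = script_distribution("abc123")
```

Whether whitespace counts toward "Other/Special" changes the result noticeably, so the figures in the table should not be expected to match this sketch exactly.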
### Length Distribution
| Metric | Value |
| :--- | :--- |
| **Median Length** | 396 chars |
| **Mean Length** | 1,357.05 chars |
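
A median (396) far below the mean (1,357.05) suggests a right-skewed distribution: many short articles and a smaller number of very long ones. The toy example below illustrates the effect using the standard-library `statistics` module; `length_summary` is a hypothetical helper and the data is synthetic.

```python
import statistics


def length_summary(texts):
    """Return the median and mean character length of a list of strings."""
    lengths = [len(t) for t in texts]
    return {
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
    }


# Synthetic data: four short stubs and one long article.
toy = ["a" * 100] * 4 + ["a" * 5000]
summary = length_summary(toy)
```

Here the single long article pulls the mean well above the median, the same pattern the table shows for the full corpus.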
## 🛠 Cleaning Pipeline
To reduce noise in the training data, the raw Wikipedia content was processed with the following pipeline:
1. **Wikicode Stripping:** Used `mwparserfromhell` to remove all MediaWiki markup, including infoboxes, templates, and internal wiki-links.
2. **Namespace Filtering:** Only articles from the main namespace (`ns=0`) were retained. All redirects, talk pages, and technical/administrative pages were discarded.
3. **Header & Noise Removal:**
   * Removed Wikipedia service section headers (e.g., "Kategoriyalar", "Havolalar", "Manbalar", "Demografiyasi") in both Latin and Cyrillic.
   * Stripped HTML tags and leftover structural artifacts.
4. **Syntax Cleanup:** Removed residual brackets `[[ ]]` and cleaned up broken punctuation resulting from template removal.
5. **Text Normalization:** Standardized whitespace, removed redundant line breaks, and ensured clean paragraph separation.
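
The author's pipeline code is not published. The sketch below approximates steps 2 through 5 with the Python standard library only; step 1 (wikicode stripping with `mwparserfromhell`) is not reproduced. The regular expressions, the header list (limited to the four Latin-script examples named above), and the function names are all assumptions. It parses an in-memory XML string, so a full dump would need a streaming parser instead.

```python
import re
import xml.etree.ElementTree as ET

# Section headers named in the card; the author's full list is not published.
SERVICE_HEADERS = {"kategoriyalar", "havolalar", "manbalar", "demografiyasi"}


def local(tag):
    """Drop the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_main_articles(xml_string):
    """Step 2: yield (title, wikitext) for ns=0 pages that are not redirects."""
    root = ET.fromstring(xml_string)
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        fields = {local(el.tag): el for el in page.iter()}
        ns = fields.get("ns")
        if ns is None or ns.text != "0" or "redirect" in fields:
            continue
        text = fields.get("text")
        yield fields["title"].text, (text.text or "") if text is not None else ""


def clean_text(text):
    """Steps 3-5, applied to text already stripped of wikicode."""
    text = re.sub(r"<[^>]+>", "", text)              # leftover HTML tags
    text = text.replace("[[", "").replace("]]", "")  # residual brackets
    text = re.sub(r"\(\s*\)", "", text)              # empty parentheses left by templates
    kept = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse runs of spaces
        if line.strip("= ").lower() in SERVICE_HEADERS:
            continue                                 # drop service header lines
        kept.append(line)
    text = "\n".join(kept)
    text = re.sub(r"[ \t]+([,.;:])", r"\1", text)    # no space before punctuation
    return re.sub(r"\n{3,}", "\n\n", text).strip()   # at most one blank line


# Toy input, not taken from the dataset.
raw = "Toshkent  [[shahar]] (  ) <b>poytaxt</b> .\n\n\n\n== Manbalar ==\nMatn"
cleaned = clean_text(raw)
```

Regex-based cleanup like this is deliberately conservative; nested templates and tables are the cases where a real parser such as `mwparserfromhell` is needed.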
## 🚀 Usage
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Zhantas/Cleaned-Uzbek_Wikipedia")

# Access an article
print(dataset['train'][0]['title'])
print(dataset['train'][0]['text'])
```
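
Because the median article is short, it may be useful to drop stubs before training. The predicate below is a minimal sketch that works on any record with a `text` field; `is_substantial` and its threshold are hypothetical and not something the dataset author recommends.

```python
def is_substantial(record, min_chars=396):
    """Return True if the article text has at least `min_chars` characters.

    The default of 396 is the median length reported in the card; it is an
    arbitrary illustrative threshold, not a recommendation from the author.
    """
    return len(record["text"]) >= min_chars


# Toy records with the same field names as the dataset.
records = [
    {"title": "Stub", "text": "Qisqa maqola."},
    {"title": "Long", "text": "Uzun maqola. " * 50},
]
kept = [r["title"] for r in records if is_substantial(r)]
```

With the loaded dataset, the same predicate can be passed to `dataset['train'].filter(is_substantial)`.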
## 📜 License
This dataset is released under the CC-BY-SA-4.0 license.
Provided by:
Zhantas



