Zhantas/Cleaned-Uzbek_Wikipedia

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Zhantas/Cleaned-Uzbek_Wikipedia
Description:
---
language:
- uz
license: cc-by-sa-4.0
task: text-generation
tags:
- wikipedia
- uzbek
- latin
- cleaned
- corpus
- multilingual
size_categories:
- 100K<n<1M
---

# Cleaned-Uzbek_Wikipedia

A large, cleaned, and denoised corpus derived from the Uzbek Wikipedia. The dataset is curated for pre-training and fine-tuning Large Language Models (LLMs) on the modern Uzbek language.

## 🌟 Key Feature: Latin-Centric Corpus

Unlike many existing Uzbek datasets, this corpus is predominantly in the **Latin script (89.42%)**, reflecting the modern linguistic standard and official writing system of Uzbekistan. It is intended as a foundation for training Latin-script Uzbek language models.

## 📊 Dataset Benchmark

### General Statistics

| Metric | Value |
| :--- | :--- |
| **Total Articles** | 329,766 |
| **Total Characters** | 447,507,793 |
| **Total Words** | 53,382,386 |
| **Avg. Words per Article** | 161.88 |

### Script Distribution

| Script | Percentage |
| :--- | :--- |
| **Latin** | **89.42%** |
| **Digits** | 4.34% |
| **Other/Special** | 5.73% |
| **Cyrillic** | 0.48% |
| **Arabic** | 0.03% |

### Length Distribution

| Metric | Value |
| :--- | :--- |
| **Median Length** | 396 chars |
| **Mean Length** | 1,357.05 chars |

## 🛠 Cleaning Pipeline

To reduce noise in the training data, the following pipeline was applied:

1. **Wikicode Stripping:** Used `mwparserfromhell` to remove all MediaWiki markup, including infoboxes, templates, and internal wiki-links.
2. **Namespace Filtering:** Only articles from the main namespace (`ns=0`) were retained. All redirects, talk pages, and technical/administrative pages were discarded.
3. **Header & Noise Removal:**
    * Automated removal of Wikipedia service headers (e.g., "Kategoriyalar", "Havolalar", "Manbalar", "Demografiyasi") in both Latin and Cyrillic.
    * Stripping of HTML tags and leftover structural artifacts.
4. **Syntax Cleanup:** Removed residual brackets `[[ ]]` and cleaned up broken punctuation resulting from template removal.
5. **Text Normalization:** Standardized whitespace, removed redundant line breaks, and ensured clean paragraph separation.

## 🚀 Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Zhantas/Cleaned-Uzbek_Wikipedia")

# Access an article
print(dataset['train'][0]['title'])
print(dataset['train'][0]['text'])
```

## 📜 License

This dataset is released under the CC-BY-SA-4.0 license.
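
## 🧪 Sketch: Cleaning Steps 3 to 5

Steps 3 to 5 of the cleaning pipeline described above can be approximated with the Python standard library alone. The following is a minimal sketch under stated assumptions, not the author's actual code. The names `clean_text` and `SERVICE_HEADERS` and all regular expressions are hypothetical and chosen for illustration. Only the four Latin header names come from the card; the Cyrillic spellings are not listed there and are omitted. The sketch drops a header line itself, not the section beneath it, because the card does not say which was done. It expects input whose wikicode has already been stripped (step 1), for instance with `mwparserfromhell.parse(raw).strip_code()`.

```python
import re

# Header names quoted in the card (Latin forms only; Cyrillic forms are
# not given there, so this set is an assumption-limited subset).
SERVICE_HEADERS = {"Kategoriyalar", "Havolalar", "Manbalar", "Demografiyasi"}

# Illustrative patterns, not taken from the original pipeline.
HTML_TAG = re.compile(r"<[^>]+>")                  # leftover HTML tags
WIKI_BRACKETS = re.compile(r"\[\[|\]\]")           # residual [[ and ]]
EMPTY_PARENS = re.compile(r"\(\s*[,;]?\s*\)")      # "( )" left by removed templates
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")  # "word ." -> "word."


def clean_text(text: str) -> str:
    """Apply a rough version of steps 3-5 to already-stripped article text."""
    paragraphs = []
    for line in text.splitlines():
        line = HTML_TAG.sub("", line)                # step 3: HTML tags
        line = WIKI_BRACKETS.sub("", line)           # step 4: residual brackets
        line = EMPTY_PARENS.sub("", line)            # step 4: broken punctuation
        line = SPACE_BEFORE_PUNCT.sub(r"\1", line)   # step 4: stray spaces
        line = re.sub(r"[ \t]+", " ", line).strip()  # step 5: whitespace
        # step 3: skip empty lines and service headers such as "== Manbalar =="
        if not line or line.strip("= ") in SERVICE_HEADERS:
            continue
        paragraphs.append(line)
    # step 5: exactly one blank line between paragraphs
    return "\n\n".join(paragraphs)
```

Calling `clean_text(article_text)` on a stripped article returns paragraphs separated by a single blank line. The patterns are deliberately simple; for example, `HTML_TAG` would also remove legitimate text between a literal `<` and `>`, so a production pipeline would need stricter rules.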
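
## 🔎 Sketch: Measuring Script Distribution

The script distribution reported above (Latin, Digits, Other/Special, Cyrillic, Arabic) can be estimated for any text sample with a per-character count. The following is a rough sketch, assuming that each character is bucketed by the first word of its Unicode name and that every character, whitespace included, counts toward the total. The card does not state how its percentages were computed, so the numbers this code produces need not match the table. The function names `script_of` and `script_distribution` are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Assign one character to a coarse bucket using its Unicode name."""
    if ch.isdigit():
        return "Digits"
    name = unicodedata.name(ch, "")  # "" for unnamed characters such as "\n"
    for script in ("LATIN", "CYRILLIC", "ARABIC"):
        if name.startswith(script):
            return script.capitalize()
    # Whitespace, punctuation, symbols, and all other scripts end up here.
    return "Other/Special"


def script_distribution(text: str) -> dict[str, float]:
    """Return the percentage of characters per bucket, rounded to 2 places."""
    counts = Counter(script_of(ch) for ch in text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: round(100 * n / total, 2) for bucket, n in counts.items()}
```

To apply it to the dataset, one could concatenate the `text` field of a sample of articles and pass the result to `script_distribution`. Note that the modifier letter `ʻ` used in Uzbek digraphs such as `oʻ` is not named `LATIN ...` in Unicode, so this sketch counts it as Other/Special.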
Provided by:
Zhantas