Zhantas/Cleaned-Uzbek_Wikipedia

Name: Zhantas/Cleaned-Uzbek_Wikipedia
Creator: Zhantas
Published: 2026-04-08 01:24:00
License: No description available

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Zhantas/Cleaned-Uzbek_Wikipedia


Resource description:

---
language:
- uz
license: cc-by-sa-4.0
task: text-generation
tags:
- wikipedia
- uzbek
- latin
- cleaned
- corpus
- multilingual
size_categories:
- 100K<n<1M
---

# Cleaned-Uzbek_Wikipedia

A large, denoised corpus derived from the Uzbek Wikipedia. This dataset is curated for the pre-training and fine-tuning of Large Language Models (LLMs) focusing on the modern Uzbek language.

## 🌟 Key Feature: Latin-Centric Corpus

Unlike many existing Uzbek datasets, this corpus is predominantly in the **Latin script (89.42%)**, reflecting the modern linguistic standard and official writing system of Uzbekistan. It is intended as a foundation for building Latin-script Uzbek language models.

## 📊 Dataset Benchmark

### General Statistics

| Metric | Value |
| :--- | :--- |
| **Total Articles** | 329,766 |
| **Total Characters** | 447,507,793 |
| **Total Words** | 53,382,386 |
| **Avg. Words per Article** | 161.88 |

### Script Distribution

| Script | Percentage |
| :--- | :--- |
| **Latin** | **89.42%** |
| **Digits** | 4.34% |
| **Other/Special** | 5.73% |
| **Cyrillic** | 0.48% |
| **Arabic** | 0.03% |

### Length Distribution

| Metric | Value |
| :--- | :--- |
| **Median Length** | 396 chars |
| **Mean Length** | 1,357.05 chars |

## 🛠 Cleaning Pipeline

To reduce noise in the training data, the following pipeline was applied:

1. **Wikicode Stripping:** Used `mwparserfromhell` to remove all MediaWiki markup, including infoboxes, templates, and internal wiki-links.
2. **Namespace Filtering:** Only articles from the main namespace (`ns=0`) were retained. All redirects, talk pages, and technical/administrative pages were discarded.
3. **Header & Noise Removal:**
    * Automated removal of Wikipedia service headers (e.g., "Kategoriyalar", "Havolalar", "Manbalar", "Demografiyasi") in both Latin and Cyrillic.
    * Stripping of HTML tags and leftover structural artifacts.
4. **Syntax Cleanup:** Removed residual brackets `[[ ]]` and cleaned up broken punctuation resulting from template removal.
5. **Text Normalization:** Standardized whitespace, removed redundant line breaks, and ensured clean paragraph separation.

## 🚀 Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Zhantas/Cleaned-Uzbek_Wikipedia")

# Access an article
print(dataset['train'][0]['title'])
print(dataset['train'][0]['text'])
```

## 📜 License

This dataset is released under the CC-BY-SA-4.0 license.
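For readers who want to apply similar post-processing to their own text, steps 3 to 5 of the cleaning pipeline might look roughly like the sketch below. This is not the author's code: the function name `clean_article`, the regular expressions, and the header set (limited to the four Latin examples the card names) are all assumptions made for illustration, and the `mwparserfromhell` stripping and namespace filtering of steps 1 and 2 are left out.

```python
import re

# Assumed header list: only the four Latin-script examples named in the card.
SERVICE_HEADERS = {"Kategoriyalar", "Havolalar", "Manbalar", "Demografiyasi"}

HTML_TAG = re.compile(r"<[^>]+>")                          # naive tag matcher
WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")  # [[target|label]]
STRAY_BRACKETS = re.compile(r"\[\[|\]\]")                  # unmatched leftovers
EMPTY_PARENS = re.compile(r"\(\s*[,;]?\s*\)")              # "( )" left by templates
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")


def clean_article(text: str) -> str:
    """Illustrative cleanup: tags, brackets, service headers, whitespace."""
    text = HTML_TAG.sub("", text)
    text = WIKI_LINK.sub(r"\1", text)    # keep the visible label of a link
    text = STRAY_BRACKETS.sub("", text)
    text = EMPTY_PARENS.sub("", text)

    paragraphs = []
    for line in text.splitlines():
        line = " ".join(line.split())            # collapse runs of whitespace
        line = SPACE_BEFORE_PUNCT.sub(r"\1", line)
        if not line or line in SERVICE_HEADERS:  # drop blanks and headers
            continue
        paragraphs.append(line)
    # One blank line between paragraphs, none at the start or end.
    return "\n\n".join(paragraphs)
```

Calling `clean_article(raw_text)` on a single article returns its paragraphs joined by blank lines. The tag pattern is deliberately simple and would also remove literal text such as `a < b > c`, so a real pipeline would likely need something more careful.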

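The card does not say how the Script Distribution table was computed. One plausible way to produce a comparable breakdown is to classify each character by its Unicode name, as in the sketch below. The function names `script_of` and `script_distribution` are hypothetical, and counting whitespace and punctuation as "Other/Special" is an assumption, so the percentages it yields may not match the published ones exactly.

```python
import unicodedata
from collections import Counter
from typing import Iterable


def script_of(ch: str) -> str:
    """Assign one character to a coarse script bucket."""
    if ch.isdigit():
        return "Digits"
    if ch.isalpha():
        # Unicode names start with the script, e.g. "LATIN SMALL LETTER A".
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC", "ARABIC"):
            if name.startswith(script):
                return script.capitalize()
    # Whitespace, punctuation, symbols, and letters of any other script.
    return "Other/Special"


def script_distribution(texts: Iterable[str]) -> dict[str, float]:
    """Return the percentage of characters per bucket over all texts."""
    counts = Counter(script_of(ch) for text in texts for ch in text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: round(100 * n / total, 2) for script, n in counts.items()}
```

With the dataset loaded as in the Usage section, something like `script_distribution(row["text"] for row in dataset["train"])` would scan the whole corpus. Note that the Uzbek apostrophe-like letter `ʻ` (U+02BB) has a Unicode name that does not begin with `LATIN`, so this sketch files it under "Other/Special".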
Provider:

Zhantas
