Zhantas/Cleaned-Kyrgyz_Wikipedia
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Zhantas/Cleaned-Kyrgyz_Wikipedia
Resource description:
---
language:
- ky
license: cc-by-sa-4.0
task: text-generation
tags:
- wikipedia
- kyrgyz
- cleaned
- corpus
- multilingual
size_categories:
- 10K<n<100K
---
# Cleaned-Kyrgyz_Wikipedia
A high-quality, denoised corpus derived from the Kyrgyz Wikipedia. This dataset is specifically prepared for pre-training and fine-tuning Large Language Models (LLMs) in the Kyrgyz language.
## 📊 Dataset Benchmark
### General Statistics
| Metric | Value |
| :--- | :--- |
| **Total Articles** | 76,519 |
| **Total Characters** | 92,945,532 |
| **Total Words** | 11,340,634 |
| **Avg. Words per Article** | 148.21 |
| **Language Purity Index** | **91.71% (Cyrillic)** |
### Character Length Distribution
| Metric | Value |
| :--- | :--- |
| **Min Length** | 101 chars |
| **Max Length** | 210,307 chars |
| **Median Length** | 712 chars |
| **Mean Length** | 1,214.67 chars |
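The figures in the two tables above can be recomputed from the article texts. The following is a minimal sketch, not the author's script: the function name `corpus_stats` is hypothetical, and it assumes that a "word" is a whitespace-delimited token and that length is counted in Unicode code points. The card does not state which definitions were used.

```python
from statistics import mean, median

def corpus_stats(texts):
    """Summarize character and word counts for a list of article strings."""
    lengths = [len(t) for t in texts]
    # Assumption: words are whitespace-delimited tokens.
    words = [len(t.split()) for t in texts]
    return {
        "articles": len(lengths),
        "total_chars": sum(lengths),
        "total_words": sum(words),
        "avg_words": round(sum(words) / len(words), 2),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "median_len": median(lengths),
        "mean_len": round(mean(lengths), 2),
    }

# Tiny in-memory sample; for the real corpus, pass the list of 'text' values.
stats = corpus_stats(["aa bb", "c", "dd ee ff"])
```

As a consistency check on the reported values, 11,340,634 words over 76,519 articles gives the stated average of 148.21 words per article.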
### Script Distribution
| Script | Percentage |
| :--- | :--- |
| **Cyrillic (Кириллица)** | 76.04% |
| **Whitespace** | 13.01% |
| **Punctuation** | 4.08% |
| **Digits** | 3.37% |
| **Latin (Латиница)** | 2.73% |
| **Arabic (Арабский)** | 0.01% |
| **Other/Special** | 0.76% |
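A distribution like the one above can be approximated by bucketing each character with the standard `unicodedata` module. This is only a sketch under assumptions: the bucketing rules and the names `classify`, `script_counts`, and `purity` are illustrative, and the card does not describe the exact method. The reported Language Purity Index of 91.71% appears consistent with the Cyrillic share after whitespace and punctuation are excluded, which is the definition `purity` uses here.

```python
import unicodedata
from collections import Counter

def classify(ch):
    """Map one character to a coarse bucket (illustrative rules, not the author's)."""
    if ch.isspace():
        return "whitespace"
    if ch.isdigit():
        return "digits"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    name = unicodedata.name(ch, "")
    for script in ("CYRILLIC", "LATIN", "ARABIC"):
        if name.startswith(script):
            return script.lower()
    return "other"

def script_counts(text):
    """Count characters per bucket."""
    return Counter(classify(ch) for ch in text)

def purity(counts, script="cyrillic"):
    """Share of `script` among characters that are not whitespace or punctuation."""
    total = sum(counts.values())
    denom = total - counts["whitespace"] - counts["punctuation"]
    return counts[script] / denom if denom else 0.0

counts = script_counts("Кыргыз тили, 2024 (Kyrgyz).")

# Same ratio applied to the published percentages.
reported = round(76.04 / (100 - 13.01 - 4.08) * 100, 2)  # 91.71
```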
## 🛠 Cleaning Pipeline
To ensure maximum data quality for LLM training, the following pipeline was applied:
1. **Wikicode Removal:** Used `mwparserfromhell` to strip all MediaWiki syntax (templates, infoboxes, and internal links).
2. **Namespace Filtering:** Only articles from the main namespace (`ns=0`) were kept. All redirects, talk pages, and technical pages were discarded.
3. **Noise Stripping:**
    * Removal of all HTML tags.
    * Removal of Wikipedia service headers (e.g., "Category:", "References", "External links").
    * Cleanup of leftover brackets `[[ ]]` and structural artifacts.
4. **Text Normalization:** Standardized whitespace, removed redundant line breaks, and cleaned up trailing punctuation.
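Steps 3 and 4 above can be sketched with the standard `re` module. This is an illustrative approximation, not the pipeline the author ran: the patterns and the name `clean_text` are assumptions, the service headers are the English examples quoted above (a Kyrgyz dump would likely need localized names), and the trailing-punctuation cleanup is left out because the card does not specify it. Steps 1 and 2 (wikicode parsing and namespace filtering) are not covered.

```python
import re

# Assumption: any <...> span is an HTML tag; this is deliberately crude.
HTML_TAG = re.compile(r"<[^>]+>")
# [[target|label]] or [[target]]; group 1 is the visible text.
WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
STRAY_BRACKETS = re.compile(r"\[\[|\]\]")
# Lines consisting only of a service header (English examples from the card).
SERVICE_LINE = re.compile(
    r"^[ \t]*(?:Category:.*|References|External links)[ \t]*$", re.MULTILINE
)

def clean_text(raw):
    """Strip tags, leftover link brackets, and service lines; normalize whitespace."""
    text = HTML_TAG.sub("", raw)
    text = WIKI_LINK.sub(r"\1", text)       # keep the link label only
    text = STRAY_BRACKETS.sub("", text)     # unmatched leftovers
    text = SERVICE_LINE.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()

raw = "<b>Bishkek</b>  is the [[Kyrgyzstan|capital]].\n\n\n\nReferences\n[[Category:Cities]]\n"
cleaned = clean_text(raw)
```

Unwrapping links before removing service lines matters here: `[[Category:Cities]]` first becomes `Category:Cities`, which the service-line pattern then drops.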
## 🚀 Usage
```python
from datasets import load_dataset

dataset = load_dataset("Zhantas/Cleaned-Kyrgyz_Wikipedia")

# Access an article
print(dataset['train'][0]['title'])
print(dataset['train'][0]['text'])
```
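Because article lengths range from 101 to over 210,000 characters, it may be useful to drop very short entries before training. The helper below is a hypothetical sketch: the names `drop_short_articles` and `to_training_text` and the 200-character threshold are assumptions, not recommendations from the dataset author. It works on any iterable of dicts with `title` and `text` keys, which is what iterating over `dataset['train']` yields.

```python
def drop_short_articles(records, min_chars=200):
    """Yield records whose 'text' has at least min_chars characters."""
    for record in records:
        if len(record["text"]) >= min_chars:
            yield record

def to_training_text(record):
    """Join title and body into one string (one possible layout, not a standard)."""
    return f"{record['title']}\n\n{record['text']}"

# In-memory stand-in for dataset['train'].
records = [
    {"title": "A", "text": "x" * 50},
    {"title": "B", "text": "y" * 300},
]
kept = [to_training_text(r) for r in drop_short_articles(records)]
```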
## 📜 License
This dataset is released under the CC-BY-SA-4.0 license.
Provider:
Zhantas



