
Zhantas/Cleaned-Kyrgyz_Wikipedia

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Zhantas/Cleaned-Kyrgyz_Wikipedia
Resource description:
---
language:
- ky
license: cc-by-sa-4.0
task: text-generation
tags:
- wikipedia
- kyrgyz
- cleaned
- corpus
- multilingual
size_categories:
- 10K<n<100K
---

# Cleaned-Kyrgyz_Wikipedia

A high-quality, denoised corpus derived from the Kyrgyz Wikipedia. This dataset is specifically prepared for pre-training and fine-tuning Large Language Models (LLMs) in the Kyrgyz language.

## 📊 Dataset Benchmark

### General Statistics

| Metric | Value |
| :--- | :--- |
| **Total Articles** | 76,519 |
| **Total Characters** | 92,945,532 |
| **Total Words** | 11,340,634 |
| **Avg. Words per Article** | 148.21 |
| **Language Purity Index** | **91.71% (Cyrillic)** |

### Character Length Distribution

| Metric | Value |
| :--- | :--- |
| **Min Length** | 101 chars |
| **Max Length** | 210,307 chars |
| **Median Length** | 712 chars |
| **Mean Length** | 1,214.67 chars |

### Script Distribution

| Script | Percentage |
| :--- | :--- |
| **Cyrillic (Кириллица)** | 76.04% |
| **Whitespace** | 13.01% |
| **Punctuation** | 4.08% |
| **Digits** | 3.37% |
| **Latin (Латиница)** | 2.73% |
| **Arabic (Арабский)** | 0.01% |
| **Other/Special** | 0.76% |

## 🛠 Cleaning Pipeline

To ensure maximum data quality for LLM training, the following pipeline was applied:

1. **Wikicode Removal:** Used `mwparserfromhell` to strip all MediaWiki syntax (templates, infoboxes, and internal links).
2. **Namespace Filtering:** Only articles from the main namespace (`ns=0`) were kept. All redirects, talk pages, and technical pages were discarded.
3. **Noise Stripping:**
    * Removal of all HTML tags.
    * Removal of Wikipedia service headers (e.g., "Category:", "References", "External links").
    * Cleanup of leftover brackets `[[ ]]` and structural artifacts.
4. **Text Normalization:** Standardized whitespace, removed redundant line breaks, and cleaned up trailing punctuation.
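The noise-stripping and whitespace-normalization steps (3 and 4) could look roughly like the sketch below. The dataset card does not publish its actual rules, so the regular expressions and the `clean_text` name are assumptions chosen for illustration. Step 1 (wikicode removal with `mwparserfromhell`), step 2 (namespace filtering), and the trailing-punctuation cleanup are not covered.

```python
import re

# Hypothetical patterns: the real pipeline's rules are not published.
HTML_TAG = re.compile(r"<[^>]+>")
SERVICE_LINE = re.compile(
    r"^[ \t]*(?:Category:.*|References|External links)[ \t]*$",
    re.MULTILINE,
)
BRACKETS = re.compile(r"\[\[|\]\]")


def clean_text(text: str) -> str:
    """Apply an illustrative version of steps 3 and 4 to one article."""
    text = HTML_TAG.sub("", text)        # drop HTML tags, keep their inner text
    text = SERVICE_LINE.sub("", text)    # drop lines that are only service headers
    text = BRACKETS.sub("", text)        # drop leftover [[ and ]] markers
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text) # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


if __name__ == "__main__":
    raw = "<b>Бишкек</b> is  the [[capital]].\n\n\n\nCategory:Cities\nReferences\n"
    print(clean_text(raw))
```

A pattern-based pass like this is easy to audit, but it only removes what the patterns anticipate; a production pipeline would likely need more rules than the three shown here.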
## 🚀 Usage

```python
from datasets import load_dataset

dataset = load_dataset("Zhantas/Cleaned-Kyrgyz_Wikipedia")

# Access an article
print(dataset['train'][0]['title'])
print(dataset['train'][0]['text'])
```

## 📜 License

This dataset is released under the CC-BY-SA-4.0 license.
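To sanity-check the Script Distribution table against loaded articles, a per-character classifier along the following lines may help. The category rules and the names `script_of`, `script_shares`, and `purity` are assumptions, not the author's method. The card also does not define the Language Purity Index; the `purity` function uses one reading that happens to match the published numbers, namely the Cyrillic share of all characters that are neither whitespace nor punctuation (76.04 / (100 − 13.01 − 4.08) ≈ 91.71%).

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Assign one character to a coarse bucket (assumed rules)."""
    if ch.isspace():
        return "whitespace"
    if ch.isdigit():
        return "digits"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    name = unicodedata.name(ch, "")
    for script in ("CYRILLIC", "LATIN", "ARABIC"):
        if name.startswith(script):
            return script.lower()
    return "other"


def script_shares(text: str) -> dict:
    """Return the fraction of characters that falls into each bucket."""
    counts = Counter(script_of(ch) for ch in text)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def purity(shares: dict) -> float:
    """Cyrillic share among non-whitespace, non-punctuation characters.

    This definition is inferred from the published figures, not documented.
    """
    denom = 1.0 - shares.get("whitespace", 0.0) - shares.get("punctuation", 0.0)
    return shares.get("cyrillic", 0.0) / denom if denom else 0.0


if __name__ == "__main__":
    # Shares copied from the Script Distribution table.
    reported = {"cyrillic": 0.7604, "whitespace": 0.1301, "punctuation": 0.0408}
    print(f"{purity(reported) * 100:.2f}%")
```

To apply it to real data, pass an article's `text` field (for example `dataset['train'][0]['text']`) to `script_shares`. Exact percentages may differ from the table if the author bucketed characters differently.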
Provider:
Zhantas