kurumikz/Cleaned-Kazakh-Wikipedia
Source: Hugging Face · Updated 2026-03-03 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kurumikz/Cleaned-Kazakh-Wikipedia
Resource description:
---
license: odc-by
language:
- kk
tags:
- nlp
- kazakh
- wikipedia
- dataset
- llm-training
pretty_name: Kazakh-Wiki-Clean-228K
size_categories:
- 100K<n<1M
---
# 📚 Kazakh-Wiki-Clean-228K
A cleaned and structured collection of **228,810 articles** from the Kazakh Wikipedia, curated for Large Language Model (LLM) pre-training, fine-tuning, and Natural Language Processing (NLP) tasks.
---
## 📊 Dataset Summary
| Property | Value |
|---|---|
| **Total Articles** | 228,810 |
| **Source** | Wikimedia Foundation (kkwiki) |
| **Language** | Kazakh (Cyrillic script) |
| **Format** | JSON Lines (`.jsonl`) |
| **Article Scope** | Namespace 0 (Main Articles only) |
---
## 🔧 Preprocessing Methodology
The dataset was produced by a custom stream-based extraction pipeline, designed to preserve data integrity, that runs in three stages.
### 1. Filtering
- Retained primary namespace articles (NS 0) only
- Excluded redirects, talk pages, and administrative categories
- Discarded articles with a final length below **200 characters** to ensure content quality
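The filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the author's pipeline: the function name `keep_article` and its parameters are hypothetical, and it assumes the namespace and redirect flag have already been read from the dump and that the length check is applied to the cleaned text.

```python
MIN_CHARS = 200  # threshold stated in the dataset card


def keep_article(namespace: int, is_redirect: bool, cleaned_text: str) -> bool:
    """Return True if a page passes the three filtering rules (hypothetical helper)."""
    if namespace != 0:
        # Talk pages, categories, and other non-article pages live outside NS 0.
        return False
    if is_redirect:
        # Redirects carry no article body of their own.
        return False
    # The length check runs on the final, cleaned text.
    return len(cleaned_text) >= MIN_CHARS
```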
### 2. Markup Removal
- Recursive stripping of nested MediaWiki templates, infoboxes, and tables
- Removal of HTML tags and technical metadata
- Conversion of internal wiki-links to plain text
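One common way to strip nested markup with regular expressions is to delete the innermost construct repeatedly until nothing is left. The sketch below illustrates that idea under simplifying assumptions; the patterns and the names `strip_nested` and `strip_markup` are illustrative and are not taken from the original pipeline. It leaves multi-pipe links such as file embeds untouched and keeps the text between HTML tags.

```python
import re

# Innermost constructs only: no nested opener inside the match.
INNERMOST_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
INNERMOST_TABLE = re.compile(r"\{\|(?:(?!\{\|).)*?\|\}", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")
# [[target|label]] -> label, [[target]] -> target
WIKI_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def strip_nested(pattern: re.Pattern, text: str) -> str:
    """Delete innermost matches repeatedly until no match remains."""
    while True:
        text, count = pattern.subn("", text)
        if count == 0:
            return text


def strip_markup(wikitext: str) -> str:
    """Remove templates, tables, and HTML tags; flatten internal links."""
    text = strip_nested(INNERMOST_TEMPLATE, wikitext)
    text = strip_nested(INNERMOST_TABLE, text)
    text = HTML_TAG.sub("", text)
    return WIKI_LINK.sub(r"\1", text)


sample = "{{Infobox|name={{lang|kk|X}}}}[[Алматы|Қала]] <b>ірі</b> [[орталық]]."
print(strip_markup(sample))  # Қала ірі орталық.
```

Removing from the inside out avoids writing a full parser, at the cost of one extra pass per nesting level.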
### 3. Normalization
- Removal of technical commands (e.g., `__NOTOC__`, `__NOEDITSECTION__`)
- Cleanup of structural artifacts such as empty parentheses and residual brackets
- Normalization of whitespace and punctuation
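The normalization steps above might look roughly like the following. This is a sketch under assumptions: the regular expressions and the function name `normalize` are illustrative, the magic-word pattern assumes upper-case ASCII names such as the two listed, and the exact whitespace and punctuation rules used for the dataset are not documented.

```python
import re

MAGIC_WORD = re.compile(r"__[A-Z]+__")               # e.g. __NOTOC__
EMPTY_PAIR = re.compile(r"\(\s*\)|\[\s*\]")          # "( )" or "[]"
STRAY_LINK_BRACKETS = re.compile(r"\[\[|\]\]")       # leftover [[ or ]]
SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([,.;:!?])")
MULTI_SPACE = re.compile(r"[ \t]+")
MULTI_NEWLINE = re.compile(r"\n{3,}")


def normalize(text: str) -> str:
    """Apply the cleanup steps in order; each step is a plain substitution."""
    text = MAGIC_WORD.sub("", text)
    text = EMPTY_PAIR.sub("", text)
    text = STRAY_LINK_BRACKETS.sub("", text)
    text = SPACE_BEFORE_PUNCT.sub(r"\1", text)  # "word ," -> "word,"
    text = MULTI_SPACE.sub(" ", text)           # collapse runs of spaces/tabs
    text = MULTI_NEWLINE.sub("\n\n", text)      # keep at most one blank line
    return text.strip()


print(normalize("__NOTOC__Абай ( ) Құнанбайұлы ,  ақын ."))
# Абай Құнанбайұлы, ақын.
```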
---
## 🗂️ Data Structure
The dataset is stored in JSONL format. Each line is one JSON object representing a single article (shown here pretty-printed for readability):
```json
{
  "title": "Article Title",
  "text": "Extracted and cleaned text content..."
}
```
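Records in this layout can be read one line at a time with the standard library. The helper below is a minimal sketch: the name `iter_articles` is hypothetical, and it assumes every non-empty line is a JSON object with the `title` and `text` fields shown above.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"title", "text"} dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        record = json.loads(line)
        yield {"title": record["title"], "text": record["text"]}


# In-memory demo; with a real file, pass the open file object instead:
#   with open("articles.jsonl", encoding="utf-8") as f:  # hypothetical filename
#       for article in iter_articles(f): ...
demo = ['{"title": "Алматы", "text": "Алматы қаласы..."}\n']
for article in iter_articles(demo):
    print(article["title"], len(article["text"]))
```

Streaming line by line keeps memory use flat regardless of how many articles the file holds.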
---
## 🚀 Intended Applications
- **Language Modeling** — Pre-training and fine-tuning of transformer-based models
- **Embeddings** — Development of Kazakh-language semantic representations
- **RAG Systems** — Knowledge base for Retrieval-Augmented Generation pipelines
- **Linguistic Analysis** — Statistical and structural research on the Kazakh language
---
## ⚖️ License and Attribution
This dataset is distributed under the **[Open Data Commons Attribution License (ODC-By)](https://opensource.org/licenses/Odc-by)**.
**Attribution:**
Users must credit the original author when using or redistributing this dataset.
> Created by **kurumikz**
Provider:
kurumikz



