Abzalbek89/corpus_clean
Hugging Face | Updated 2026-04-15 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Abzalbek89/corpus_clean
Description:
---
language:
- kk
license: apache-2.0
task_categories:
- text-generation
- fill-mask
size_categories:
- 1M<n<10M
tags:
- kazakh
- corpus
- nlp
- language-modeling
- cleaned
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 1502583
  - name: validation
    num_examples: 15177
---
# Kazakh Cleaned Corpus
Cleaned and deduplicated Kazakh-language text corpus built from multiple open sources. Designed for pretraining and fine-tuning Kazakh language models.
## Dataset Summary
| | Count |
|---|---|
| **Train** | 1,502,583 |
| **Validation** | 15,177 |
| **Total** | 1,517,760 |
The dataset was assembled from the [kz-transformers/multidomain-kazakh-dataset](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) collection and processed through a multi-stage cleaning pipeline.
## Sources
| Source | Raw texts | Clean texts | Description |
|---|---|---|---|
| `md_oscar` | 269,047 | 239,807 | OSCAR web crawl |
| `md_kazakhNews` | 2,107,986 | 129,349 | Kazakh news articles |
| `md_kazakhBooks` | 8,423 | 20,482* | Kazakh books |
| `md_leipzig` | 1,706,485 | 1,128,122 | Leipzig corpora collection |
| **Total** | **4,091,941** | **1,517,760** | |
\* Books are split into chunks of ≤50K characters, so the clean count exceeds the raw count.
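
The book chunking mentioned in the footnote could look roughly like the sketch below. The function name `chunk_text` and the choice to cut at the last space inside each window are assumptions made for illustration; the card only states the ≤50K character limit, not how split points are chosen.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 50_000) -> List[str]:
    """Split text into consecutive chunks of at most max_chars characters.

    Hypothetical helper: prefers to cut at the last space inside the window
    so words stay whole, and falls back to a hard cut when there is no space.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks
```

Under this scheme one long book yields several rows, which is consistent with `md_kazakhBooks` having more clean texts than raw texts.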
## Cleaning Pipeline
Each text passes through 9 sequential stages (ordered fast → slow):
1. **OSCAR dict fix** — unwraps texts stored as `{'text': '...'}` Python dict literals
2. **NFC normalization** — Unicode NFC, control character removal, whitespace normalization
3. **Minimum length** — ≥50 characters, ≥10 words
4. **Kazakh character check** — must contain at least one Kazakh-specific character (Ә, Ғ, Қ, Ң, Ө, Ұ, Ү, Һ, І, etc.)
5. **Script profile** — Cyrillic ≥60%, Latin ≤25%
6. **Junk filter** — URL density ≤5/1K chars, ≤5 HTML tags, special char ratio ≤40%, no boilerplate patterns
7. **Gzip repetition** — compression ratio ≥0.20 (filters repetitive/degenerate text)
8. **FastText LID** — language identification: `kk` confidence ≥0.50, gap to nearest rival ≥0.10
9. **Exact dedup** — MD5-based deduplication across all sources (1,033 duplicates removed)
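
The steps above that need only the Python standard library can be sketched as follows. This is an illustrative approximation, not the pipeline's actual code: all function names are hypothetical, the script shares are computed over alphabetic characters only, the gzip ratio is taken as compressed size divided by original size, and the LID rule operates on a plain `{language: probability}` mapping rather than on a real FastText model. The thresholds are the ones listed above.

```python
import gzip
import hashlib
import unicodedata
from typing import Dict, Iterable, Iterator

_UPPER = "ӘҒҚҢӨҰҮҺІ"
KAZAKH_LETTERS = set(_UPPER + _UPPER.lower())


def normalize(text: str) -> str:
    """Step 2: NFC, drop non-whitespace control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    return " ".join(text.split())


def long_enough(text: str, min_chars: int = 50, min_words: int = 10) -> bool:
    """Step 3: minimum character and word counts."""
    return len(text) >= min_chars and len(text.split()) >= min_words


def has_kazakh_letter(text: str) -> bool:
    """Step 4: at least one Kazakh-specific letter, in either case."""
    return any(ch in KAZAKH_LETTERS for ch in text)


def script_profile_ok(text: str, min_cyr: float = 0.60, max_lat: float = 0.25) -> bool:
    """Step 5: Cyrillic and Latin shares among alphabetic characters (assumed denominator)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cyrillic = sum("\u0400" <= ch <= "\u04ff" for ch in letters)
    latin = sum(ch.isascii() for ch in letters)
    return cyrillic / len(letters) >= min_cyr and latin / len(letters) <= max_lat


def gzip_ratio(text: str) -> float:
    """Step 7: compressed size / original size (assumed direction of the ratio)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)


def lid_accepts(scores: Dict[str, float], target: str = "kk",
                min_conf: float = 0.50, min_gap: float = 0.10) -> bool:
    """Step 8 decision rule on {language: probability} scores from any LID model."""
    conf = scores.get(target, 0.0)
    rival = max((p for lang, p in scores.items() if lang != target), default=0.0)
    return conf >= min_conf and conf - rival >= min_gap


def dedup_exact(texts: Iterable[str]) -> Iterator[str]:
    """Step 9: keep the first occurrence of each text, keyed by its MD5 digest."""
    seen = set()
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text
```

A text would be kept only if every check returns `True` after `normalize`, with `gzip_ratio(text) >= 0.20` for step 7. Running the cheap string checks before compression and language identification matches the fast-to-slow ordering described above.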
### Rejection Statistics
| Reason | Count |
|---|---|
| `no_kaz_chars` | 1,888,456 |
| `too_few_words` | 323,914 |
| `too_short` | 255,993 |
| `lid_rejected` | 74,315 |
| `junk` | 45,297 |
| `script_profile` | 19,871 |
| `gzip_repetition` | 10,706 |
| `dedup` | 1,033 |
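
A tally like the one above can be produced by recording, for each dropped text, the first check it fails. The sketch below assumes that is how the counts were gathered, which the card does not state. It covers only three of the reasons, reuses the reason names from the table, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

_UPPER = "ӘҒҚҢӨҰҮҺІ"
KAZAKH_LETTERS = set(_UPPER + _UPPER.lower())

# (reason, predicate) pairs in pipeline order; a predicate returns True on pass.
CHECKS = [
    ("too_short", lambda t: len(t) >= 50),
    ("too_few_words", lambda t: len(t.split()) >= 10),
    ("no_kaz_chars", lambda t: any(ch in KAZAKH_LETTERS for ch in t)),
]


def first_rejection(text: str) -> Optional[str]:
    """Return the name of the first failed check, or None if the text passes all."""
    for reason, passes in CHECKS:
        if not passes(text):
            return reason
    return None


def tally_rejections(texts: Iterable[str]) -> Counter:
    """Count dropped texts per rejection reason."""
    reasons = (first_rejection(t) for t in texts)
    return Counter(r for r in reasons if r is not None)
```

With first-failure counting each dropped text contributes to exactly one row, so the order of the checks affects how rejections are distributed across reasons.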
## Fields
- **`text`** (`string`) — cleaned text content
- **`source`** (`string`) — origin dataset identifier (e.g., `md_oscar`, `md_leipzig`)
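
Because every row carries its `source`, per-source subsets and counts are straightforward to derive. The helpers below are a minimal sketch with hypothetical names. They accept any iterable of rows shaped like `{"text": ..., "source": ...}`, which is what iterating a `datasets` split yields, and the demo rows are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_source(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count rows per value of the `source` field."""
    return Counter(row["source"] for row in rows)


def texts_from_source(rows: Iterable[Dict[str, str]], source: str) -> List[str]:
    """Collect the `text` of every row whose `source` matches."""
    return [row["text"] for row in rows if row["source"] == source]


# Made-up rows in the dataset's schema, for demonstration only.
demo_rows = [
    {"text": "бірінші мәтін", "source": "md_oscar"},
    {"text": "екінші мәтін", "source": "md_leipzig"},
    {"text": "үшінші мәтін", "source": "md_leipzig"},
]
print(count_by_source(demo_rows))
print(texts_from_source(demo_rows, "md_oscar"))
```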
## Usage
```python
from datasets import load_dataset
ds = load_dataset("Abzalbek89/corpus_clean")
# Access splits
train = ds["train"]
val = ds["validation"]
print(train[0]["text"][:200])
print(train[0]["source"])
```
## License
Apache 2.0
Provider:
Abzalbek89



