
Abzalbek89/corpus_clean

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Abzalbek89/corpus_clean
Resource description:
---
language:
- kk
license: apache-2.0
task_categories:
- text-generation
- fill-mask
size_categories:
- 1M<n<10M
tags:
- kazakh
- corpus
- nlp
- language-modeling
- cleaned
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 1502583
  - name: validation
    num_examples: 15177
---

# Kazakh Cleaned Corpus

Cleaned and deduplicated Kazakh-language text corpus built from multiple open sources. Designed for pretraining and fine-tuning Kazakh language models.

## Dataset Summary

| Split | Count |
|---|---|
| **Train** | 1,502,583 |
| **Validation** | 15,177 |
| **Total** | 1,517,760 |

The dataset was assembled from the [kz-transformers/multidomain-kazakh-dataset](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) collection and processed through a multi-stage cleaning pipeline.

## Sources

| Source | Raw texts | Clean texts | Description |
|---|---|---|---|
| `md_oscar` | 269,047 | 239,807 | OSCAR web crawl |
| `md_kazakhNews` | 2,107,986 | 129,349 | Kazakh news articles |
| `md_kazakhBooks` | 8,423 | 20,482* | Kazakh books |
| `md_leipzig` | 1,706,485 | 1,128,122 | Leipzig corpora collection |
| **Total** | **4,091,941** | **1,517,760** | |

\* Books are split into chunks of ≤50K characters, so the clean count exceeds raw count.

## Cleaning Pipeline

Each text passes through 9 sequential filters (ordered fast → slow):

1. **OSCAR dict fix**: unwraps texts stored as `{'text': '...'}` Python dict literals
2. **NFC normalization**: Unicode NFC, control character removal, whitespace normalization
3. **Minimum length**: ≥50 characters, ≥10 words
4. **Kazakh character check**: must contain at least one Kazakh-specific character (Ә, Ғ, Қ, Ң, Ө, Ұ, Ү, Һ, І, etc.)
5. **Script profile**: Cyrillic ≥60%, Latin ≤25%
6. **Junk filter**: URL density ≤5/1K chars, ≤5 HTML tags, special char ratio ≤40%, no boilerplate patterns
7. **Gzip repetition**: compression ratio ≥0.20 (filters repetitive/degenerate text)
8. **FastText LID**: language identification with `kk` confidence ≥0.50 and gap to nearest rival ≥0.10
9. **Exact dedup**: MD5-based deduplication across all sources (1,033 duplicates removed)

### Rejection Statistics

| Reason | Count |
|---|---|
| `no_kaz_chars` | 1,888,456 |
| `too_few_words` | 323,914 |
| `too_short` | 255,993 |
| `lid_rejected` | 74,315 |
| `junk` | 45,297 |
| `script_profile` | 19,871 |
| `gzip_repetition` | 10,706 |
| `dedup` | 1,033 |

## Fields

- **`text`** (`string`): cleaned text content
- **`source`** (`string`): origin dataset identifier (e.g., `md_oscar`, `md_leipzig`)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Abzalbek89/corpus_clean")

# Access splits
train = ds["train"]
val = ds["validation"]

print(train[0]["text"][:200])
print(train[0]["source"])
```

## License

Apache 2.0
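
The cheap character-level filters in the cleaning pipeline (steps 2 to 5) can be approximated with the standard library alone. The following is a minimal sketch, not the dataset author's code: the function names `normalize`, `script_profile`, and `reject_reason` are hypothetical, the Kazakh letter set is limited to the nine letters the card names (its list ends with "etc."), the script percentages are computed over alphabetic characters only, and the mapping of the `too_short` and `too_few_words` labels to the character and word thresholds is an assumption.

```python
import re
import unicodedata
from typing import Optional, Tuple

# The nine Kazakh-specific letters named in the card, plus their lowercase forms.
_KAZAKH_UPPER = "ӘҒҚҢӨҰҮҺІ"
KAZAKH_CHARS = set(_KAZAKH_UPPER + _KAZAKH_UPPER.lower())


def normalize(text: str) -> str:
    """NFC-normalize, drop non-whitespace control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    return re.sub(r"\s+", " ", text).strip()


def script_profile(text: str) -> Tuple[float, float]:
    """Return (cyrillic_share, latin_share) over alphabetic characters (assumed basis)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0, 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    latin = sum(1 for ch in letters if ch.isascii())
    return cyrillic / len(letters), latin / len(letters)


def reject_reason(text: str) -> Optional[str]:
    """Return a rejection label in the style of the card's statistics, or None if kept."""
    if len(text) < 50:
        return "too_short"
    if len(text.split()) < 10:
        return "too_few_words"
    if not any(ch in KAZAKH_CHARS for ch in text):
        return "no_kaz_chars"
    cyrillic_share, latin_share = script_profile(text)
    if cyrillic_share < 0.60 or latin_share > 0.25:
        return "script_profile"
    return None
```

Calling `reject_reason(normalize(raw_text))` mirrors the fast-to-slow ordering: the length checks run before the per-character scans.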
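
The gzip repetition filter (step 7) and the MD5-based exact deduplication (step 9) can be illustrated as follows. This is a sketch under stated assumptions: the card does not define "compression ratio", so it is taken here as compressed bytes divided by raw UTF-8 bytes, and the helper names `gzip_ratio`, `is_repetitive`, and `dedup_exact` are hypothetical.

```python
import gzip
import hashlib
from typing import Iterable, Iterator


def gzip_ratio(text: str) -> float:
    """Compressed size / raw size in bytes (assumed definition of the ratio)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)


def is_repetitive(text: str, min_ratio: float = 0.20) -> bool:
    """Highly repetitive text compresses very well, so its ratio falls below the threshold."""
    return gzip_ratio(text) < min_ratio


def dedup_exact(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text the first time its MD5 digest is seen, preserving order."""
    seen = set()
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Storing digests rather than full texts keeps memory use bounded per document, which matters when deduplicating across several million raw texts.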
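
The FastText LID rule (step 8) combines two thresholds: a minimum `kk` confidence and a minimum gap to the nearest rival language. The sketch below shows one way to read that rule as a pure function over `(label, probability)` pairs, so it runs without a model file. The function name `accept_kazakh` is hypothetical, the `__label__kk` format follows the usual fastText label convention, and treating the "nearest rival" as the highest-scoring non-`kk` label is an assumption.

```python
from typing import Iterable, Tuple


def accept_kazakh(
    predictions: Iterable[Tuple[str, float]],
    min_conf: float = 0.50,
    min_gap: float = 0.10,
    target: str = "__label__kk",
) -> bool:
    """Accept when the target score is high enough and clearly ahead of any rival."""
    scores = dict(predictions)
    target_score = scores.get(target, 0.0)
    # Highest score among all other languages; 0.0 if there is no rival.
    rival_score = max(
        (p for label, p in scores.items() if label != target),
        default=0.0,
    )
    return target_score >= min_conf and (target_score - rival_score) >= min_gap


# A confident prediction passes; a near tie with a related language does not.
confident = [("__label__kk", 0.90), ("__label__ky", 0.05)]
near_tie = [("__label__kk", 0.52), ("__label__ky", 0.45)]
print(accept_kazakh(confident), accept_kazakh(near_tie))
```

The gap condition is what separates this from a plain confidence cutoff: a text scoring 0.52 for `kk` would pass a 0.50 threshold alone but is rejected when a rival sits at 0.45.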
Provided by:
Abzalbek89