afkfatih/turkish-cpt-dataset

Name: afkfatih/turkish-cpt-dataset
Creator: afkfatih
Published: 2026-03-02 00:05:19
License: no description provided

Source: Hugging Face. Updated 2026-03-02. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/afkfatih/turkish-cpt-dataset

Description:

---
language:
- tr
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- turkish
- continual-pretraining
- CPT
- wikipedia
- fineweb
- c4
size_categories:
- 1B<n<10B
---

# Turkish CPT Dataset

A high-quality Turkish + English dataset for Continued Pre-Training (CPT) of language models.

## Dataset Summary

| Property | Value |
|---|---|
| Total examples | 1,908,378 |
| Total tokens | ~2.19B |
| Turkish ratio | ~80% |
| English ratio | ~20% |
| Languages | Turkish, English |

## Sources

| Source | Language | Examples | Description |
|---|---|---|---|
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (tr) | TR | ~534K | Turkish Wikipedia |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (en) | EN | ~134K | English Wikipedia (20% replay) |
| [habanoz/c4_tr_fineweb_plus](https://huggingface.co/datasets/habanoz/c4_tr_fineweb_plus) | TR | ~500K | Filtered Turkish web text |
| [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (tur_Latn) | TR | ~500K | High-quality Turkish web data |
| [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | EN | ~300K | High-quality English web data |

## Cleaning Pipeline

Applied industry-standard cleaning steps:

- **UTF-8 NFC normalization**: Unicode noise removal
- **Whitespace normalization**: excess newlines, tabs, spaces
- **URL removal**: web boilerplate
- **Alphanumeric ratio filter**: spam/symbol detection (min 50%)
- **Repetitive line filter**: boilerplate detection (min 30% unique lines)
- **Minimum 50 tokens**: very short text removal
- **Maximum 100K tokens**: abnormally long document removal

60,357 examples removed (3.1%), 1.3% token loss.

## English Replay

20% English data is mixed in following best practices from continual pretraining research to prevent **catastrophic forgetting** of the base model's reasoning capabilities.
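The replay mixing described here could be sketched roughly as follows. This is only an illustration, not the author's actual procedure: it assumes the 20% share is measured in documents rather than tokens (the card does not say which), and `mix_with_replay` and its parameters are hypothetical names.

```python
import random


def mix_with_replay(target_docs, replay_docs, replay_ratio=0.2, seed=0):
    """Return a shuffled mix in which about `replay_ratio` of the documents
    come from `replay_docs` (hypothetical helper, document-level ratio)."""
    if not 0 <= replay_ratio < 1:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_target + n_replay) = replay_ratio for n_replay.
    n_replay = round(len(target_docs) * replay_ratio / (1 - replay_ratio))
    # Never ask for more replay documents than are available.
    n_replay = min(n_replay, len(replay_docs))
    mixed = list(target_docs) + rng.sample(list(replay_docs), n_replay)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    turkish = [f"tr-{i}" for i in range(80)]
    english = [f"en-{i}" for i in range(100)]
    mixed = mix_with_replay(turkish, english, replay_ratio=0.2)
    share = sum(doc.startswith("en-") for doc in mixed) / len(mixed)
    print(len(mixed), share)
```

Sampling the replay set down to the target ratio, rather than duplicating target documents up to it, keeps every Turkish document exactly once; a token-level ratio would need document lengths as well.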
## Usage

```python
from datasets import load_dataset

dataset = load_dataset("afkfatih/turkish-cpt-dataset", split="train")
```

## Intended Use

This dataset is intended for CPT of instruction-tuned or base language models to improve Turkish language understanding and generation while preserving English capabilities.

## License

Dataset components inherit their original licenses. Compiled dataset released under CC-BY-4.0.
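For readers who want to apply comparable filtering to their own data, the steps listed under Cleaning Pipeline might look something like the sketch below. It is not the pipeline actually used for this dataset. The thresholds (50% alphanumeric, 30% unique lines, 50 to 100K tokens) come from the card, but the whitespace-based token count, the URL pattern, the way each ratio is computed, and all function names are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed URL pattern; the card does not specify how URLs were detected.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def normalize(text):
    """NFC-normalize, drop URLs, and collapse excess whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub("", text)
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)      # runs of spaces -> one space
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line
    return text.strip()


def alnum_ratio(text):
    """Share of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def unique_line_ratio(text):
    """Share of non-empty lines that are distinct."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0
    return len(set(lines)) / len(lines)


def keep(text, min_tokens=50, max_tokens=100_000,
         min_alnum=0.5, min_unique=0.3):
    """Apply the length and ratio filters to an already normalized text."""
    n_tokens = len(text.split())  # assumption: whitespace-delimited tokens
    return (min_tokens <= n_tokens <= max_tokens
            and alnum_ratio(text) >= min_alnum
            and unique_line_ratio(text) >= min_unique)


def clean_corpus(docs):
    """Yield normalized documents that pass every filter."""
    for doc in docs:
        cleaned = normalize(doc)
        if keep(cleaned):
            yield cleaned
```

With a subword tokenizer the token bounds would behave differently from the whitespace count used here, so the thresholds are best treated as starting points rather than exact reproductions.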
