
afkfatih/turkish-cpt-dataset

Hugging Face · Updated 2026-03-02 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/afkfatih/turkish-cpt-dataset
Resource description:
---
language:
- tr
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- turkish
- continual-pretraining
- CPT
- wikipedia
- fineweb
- c4
size_categories:
- 1B<n<10B
---

# Turkish CPT Dataset

A high-quality Turkish + English dataset for Continued Pre-Training (CPT) of language models.

## Dataset Summary

| Property | Value |
|---|---|
| Total examples | 1,908,378 |
| Total tokens | ~2.19B |
| Turkish ratio | ~80% |
| English ratio | ~20% |
| Languages | Turkish, English |

## Sources

| Source | Language | Examples | Description |
|---|---|---|---|
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (tr) | TR | ~534K | Turkish Wikipedia |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (en) | EN | ~134K | English Wikipedia (20% replay) |
| [habanoz/c4_tr_fineweb_plus](https://huggingface.co/datasets/habanoz/c4_tr_fineweb_plus) | TR | ~500K | Filtered Turkish web text |
| [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (tur_Latn) | TR | ~500K | High-quality Turkish web data |
| [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | EN | ~300K | High-quality English web data |

## Cleaning Pipeline

Applied industry-standard cleaning steps:

- **UTF-8 NFC normalization**: Unicode noise removal
- **Whitespace normalization**: excess newlines, tabs, spaces
- **URL removal**: web boilerplate
- **Alphanumeric ratio filter**: spam/symbol detection (min 50%)
- **Repetitive line filter**: boilerplate detection (min 30% unique lines)
- **Minimum 50 tokens**: very short text removal
- **Maximum 100K tokens**: abnormally long document removal

60,357 examples removed (3.1%), 1.3% token loss.

## English Replay

20% English data is mixed in following best practices from continual pretraining research to prevent **catastrophic forgetting** of the base model's reasoning capabilities.

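The dataset card does not publish the cleaning code, so the following is only a minimal sketch of how the steps listed under Cleaning Pipeline might be combined. The thresholds (50% alphanumeric, 30% unique lines, 50 to 100K tokens) come from that list; everything else is an assumption made for illustration: the function names `normalize`, `keep`, and `clean`, the URL regex, the use of whitespace-separated words as a stand-in for tokenizer tokens, and the choice to measure the alphanumeric ratio over non-whitespace characters.

```python
import re
import unicodedata

# Hypothetical URL pattern; the card does not say which one was used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Thresholds from the Cleaning Pipeline list; how each is measured is assumed.
MIN_ALNUM_RATIO = 0.5
MIN_UNIQUE_LINE_RATIO = 0.3
MIN_TOKENS = 50
MAX_TOKENS = 100_000


def normalize(text: str) -> str:
    """NFC-normalize, strip URLs, and collapse excess whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub("", text)
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)      # runs of spaces -> one space
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


def keep(text: str) -> bool:
    """Apply the length, alphanumeric-ratio, and repetition filters."""
    tokens = text.split()  # whitespace words stand in for real tokenizer tokens
    if not MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
        return False
    non_space = [c for c in text if not c.isspace()]
    alnum_ratio = sum(c.isalnum() for c in non_space) / len(non_space)
    if alnum_ratio < MIN_ALNUM_RATIO:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(set(lines)) / len(lines) < MIN_UNIQUE_LINE_RATIO:
        return False
    return True


def clean(text: str):
    """Return the normalized text, or None if any filter rejects it."""
    cleaned = normalize(text)
    return cleaned if keep(cleaned) else None
```

Calling `clean(doc)` on each raw document and dropping the `None` results gives one pass of this sketch. Because it counts whitespace-separated words rather than tokenizer tokens, its removal rate would not be expected to match the 3.1% reported above.
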
## Usage

```python
from datasets import load_dataset

dataset = load_dataset("afkfatih/turkish-cpt-dataset", split="train")
```

## Intended Use

This dataset is intended for CPT of instruction-tuned or base language models to improve Turkish language understanding and generation while preserving English capabilities.

## License

Dataset components inherit their original licenses. Compiled dataset released under CC-BY-4.0.

Provider:
afkfatih