afkfatih/turkish-cpt-dataset
Source: Hugging Face. Updated: 2026-03-02. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/afkfatih/turkish-cpt-dataset
Resource description:
---
language:
- tr
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- turkish
- continual-pretraining
- CPT
- wikipedia
- fineweb
- c4
size_categories:
- 1B<n<10B
---
# Turkish CPT Dataset
A high-quality Turkish + English dataset for Continued Pre-Training (CPT) of language models.
## Dataset Summary
| Property | Value |
|---|---|
| Total examples | 1,908,378 |
| Total tokens | ~2.19B |
| Turkish ratio | ~80% |
| English ratio | ~20% |
| Languages | Turkish, English |
## Sources
| Source | Language | Examples | Description |
|---|---|---|---|
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (tr) | TR | ~534K | Turkish Wikipedia |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (en) | EN | ~134K | English Wikipedia (20% replay) |
| [habanoz/c4_tr_fineweb_plus](https://huggingface.co/datasets/habanoz/c4_tr_fineweb_plus) | TR | ~500K | Filtered Turkish web text |
| [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (tur_Latn) | TR | ~500K | High-quality Turkish web data |
| [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | EN | ~300K | High-quality English web data |
## Cleaning Pipeline
The following cleaning steps were applied to every document:
- **UTF-8 NFC normalization** — unicode noise removal
- **Whitespace normalization** — excess newlines, tabs, spaces
- **URL removal** — web boilerplate
- **Alphanumeric ratio filter** — spam/symbol detection (min 50%)
- **Repetitive line filter** — boilerplate detection (min 30% unique lines)
- **Minimum 50 tokens** — very short text removal
- **Maximum 100K tokens** — abnormally long document removal
Cleaning removed 60,357 examples (3.1%), for a token loss of 1.3%.
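The steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the function names, the URL pattern, the choice to ignore whitespace when computing the alphanumeric ratio, and the use of whitespace-separated words as "tokens" are all assumptions, since the card does not name a tokenizer or publish the filter code. Only the thresholds (50%, 30%, 50 tokens, 100K tokens) come from the list above.

```python
import re
import unicodedata

# ASSUMPTION: a simple URL pattern; the original pipeline's pattern is not published.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def normalize(text: str) -> str:
    """NFC-normalize, strip URLs, and collapse excess whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub("", text)
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)      # runs of spaces -> one space
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def alnum_ratio(text: str) -> float:
    """Share of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalnum() for c in chars) / len(chars) if chars else 0.0


def unique_line_ratio(text: str) -> float:
    """Share of non-empty lines that are distinct."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    return len(set(lines)) / len(lines) if lines else 0.0


def keep(text: str, min_tokens: int = 50, max_tokens: int = 100_000,
         min_alnum: float = 0.5, min_unique: float = 0.3) -> bool:
    """Apply the length, alphanumeric-ratio, and repetition filters."""
    # ASSUMPTION: whitespace-separated words stand in for tokenizer tokens.
    n_tokens = len(text.split())
    return (min_tokens <= n_tokens <= max_tokens
            and alnum_ratio(text) >= min_alnum
            and unique_line_ratio(text) >= min_unique)


def clean(text: str):
    """Return the normalized text, or None if any filter rejects it."""
    cleaned = normalize(text)
    return cleaned if keep(cleaned) else None
```

Normalization runs before filtering so that the token count and ratios are measured on the text that would actually be kept.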
## English Replay
About 20% of the data is English. Mixing in English "replay" data follows common practice in continual pretraining research and is meant to reduce **catastrophic forgetting** of the base model's reasoning capabilities.
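As a worked example of what a 20% replay share implies, the helper below computes how many replay examples are needed so that they make up a given fraction of the final mix. The function name is hypothetical, and the card does not say whether the ratio was measured in examples or tokens, so counting examples here is an assumption.

```python
def replay_examples_needed(n_target: int, replay_ratio: float) -> int:
    """Replay examples needed so they form `replay_ratio` of the final mix.

    Solves replay / (n_target + replay) = replay_ratio for replay.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    return round(n_target * replay_ratio / (1.0 - replay_ratio))


# ~534K Turkish Wikipedia articles with a 20% English share
n_english = replay_examples_needed(534_000, 0.20)
```

With roughly 534K Turkish Wikipedia articles this gives 133,500, which is in line with the ~134K English Wikipedia examples listed in the Sources table.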
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("afkfatih/turkish-cpt-dataset", split="train")
```
## Intended Use
This dataset is intended for CPT of instruction-tuned or base language models to improve Turkish language understanding and generation while preserving English capabilities.
## License
Dataset components inherit their original licenses. The compiled dataset is released under CC-BY-4.0.
Provider:
afkfatih



