cturan/turkish-synthetic-corpus
Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/cturan/turkish-synthetic-corpus
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- tr
size_categories:
- 1M<n<10M
pretty_name: Turkish Synthetic Corpus
---
# Turkish Synthetic Corpus
A synthetic Turkish text corpus with **1,871,131** documents, designed for Turkish language model training.
## About
Inspired by [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus). Questions and prompts were sourced from the SmolLM Corpus pipeline; a language model then generated localized Turkish responses and documents around them. All credit for the original corpus design and methodology goes to the HuggingFace SmolLM team.
The resulting dataset covers a wide range of topics (science, history, culture, economics, coding, fiction), written in natural Turkish prose across varying registers.
## Data Fields
| Field | Type | Description |
|---|---|---|
| `id` | `string` | Unique document ID |
| `text` | `string` | Turkish document text |
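
The two fields above can be read with a short script. This is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed, that the repository id matches the title of this card, and that the split is named `train` (the card does not state split names). `is_valid_record` and `stream_corpus` are hypothetical helper names introduced for illustration.

```python
from typing import Any, Iterator, Mapping


def is_valid_record(record: Mapping[str, Any]) -> bool:
    """Check that a record has the two string fields listed in the table."""
    return (
        isinstance(record.get("id"), str)
        and isinstance(record.get("text"), str)
    )


def stream_corpus(split: str = "train") -> Iterator[Mapping[str, Any]]:
    """Stream records without downloading the full corpus first.

    Assumptions: the `datasets` package is installed, and the split is
    named "train" (the card does not state split names).
    """
    # Imported lazily so the validation helper works without the package.
    from datasets import load_dataset

    dataset = load_dataset(
        "cturan/turkish-synthetic-corpus", split=split, streaming=True
    )
    for record in dataset:
        # Skip anything that does not match the documented schema.
        if is_valid_record(record):
            yield record
```

Streaming is used here because the corpus holds close to two million documents; calling `stream_corpus()` requires network access, while `is_valid_record` can be used on any in-memory record.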
## Intended Use
- Turkish LLM pretraining and continued pretraining
- Tokenizer training
- Language modeling benchmarks
> Content is synthetic. Factual accuracy is not guaranteed.
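
For the tokenizer-training use case, many trainers accept an iterator of text batches rather than a single in-memory list. The sketch below shows one way to produce such batches from records with the `text` field. The helper name `text_batches`, the default batch size, and the choice to skip whitespace-only documents are assumptions made for illustration, not part of the dataset.

```python
from typing import Any, Iterable, Iterator, List, Mapping


def text_batches(
    records: Iterable[Mapping[str, Any]], batch_size: int = 1000
) -> Iterator[List[str]]:
    """Yield lists of document texts, at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    batch: List[str] = []
    for record in records:
        text = record["text"].strip()
        if not text:
            # Assumption: whitespace-only documents add nothing to a vocabulary.
            continue
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

Because the function is a generator, it can be passed any iterable of records, including a streamed dataset, without holding the whole corpus in memory.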
## Citation
If you use this dataset, please also credit the original SmolLM Corpus:
```bibtex
@software{benallal2024smollmcorpus,
author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
title = {SmolLM-Corpus},
month = jul,
year = 2024,
url = {https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus}
}
```
Provider:
cturan



