five

ogulcanaydogan/Turkish-LLM-v10-Training

收藏
Hugging Face2026-03-12 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/ogulcanaydogan/Turkish-LLM-v10-Training
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - tr license: apache-2.0 size_categories: - 100K<n<1M task_categories: - text-generation tags: - turkish - sft - instruction-tuning - low-resource - nlp pretty_name: Turkish LLM Training Dataset v10 --- # Turkish LLM Training Dataset v10 A curated corpus of **144,022 Turkish instruction-completion pairs** used to train the [Turkish LLM Family](https://huggingface.co/collections/ogulcanaydogan/turkish-llm-family-69b303b4ef1c36caffca4e94). ## Dataset Description This dataset was created to address the scarcity of high-quality Turkish instruction-following data for language model fine-tuning. It covers a broad range of topics including: - **Science & Technology** (physics, chemistry, biology, computer science) - **History & Geography** (Turkish and world history, geography) - **General Knowledge** (culture, society, daily life) - **Mathematics** (arithmetic, algebra, problem-solving) - **Language & Literature** (Turkish grammar, literature analysis) ## Dataset Structure | Field | Type | Description | |-------|------|-------------| | `prompt` | string | The instruction or question in Turkish | | `completion` | string | The expected response in Turkish | ### Statistics | Metric | Value | |--------|-------| | Total examples | 144,022 | | Format | JSONL | | Language | Turkish (tr) | | Average prompt length | ~50 tokens | | Average completion length | ~150 tokens | ## Usage ```python from datasets import load_dataset dataset = load_dataset("ogulcanaydogan/Turkish-LLM-v10-Training") print(dataset["train"][0]) # {'prompt': 'Çocuklar kaç süt dişi kaybeder?', 'completion': 'Çocuklar büyüdükçe 20 süt dişini kaybederler...'} ``` ## Models Trained on This Data - [Turkish-LLM-14B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct) — 14.7B parameters - [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) — 7B parameters ## License Apache 2.0 ## Citation ```bibtex @misc{aydogan2026turkishllm, title={Turkish LLM Family: Open-Source Turkish Language Models}, author={Aydogan, Ogulcan}, year={2026}, url={https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct} } ``` --- # Türkçe ## Türkçe LLM Eğitim Verisi v10 [Turkish LLM Ailesi](https://huggingface.co/collections/ogulcanaydogan/turkish-llm-family-69b303b4ef1c36caffca4e94) modellerinin eğitiminde kullanılan **144.022 Türkçe talimat-cevap çifti** içeren küratörlü bir veri seti. ### Kapsam - Fen ve teknoloji, tarih ve coğrafya, genel kültür, matematik, dil ve edebiyat konularını kapsar. ### Kullanım ```python from datasets import load_dataset dataset = load_dataset("ogulcanaydogan/Turkish-LLM-v10-Training") ```
提供机构:
ogulcanaydogan
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作