Sungur-Dataset
ModelScope community · Updated 2025-11-27 · Indexed 2025-11-03
Download link: https://modelscope.cn/datasets/suayptalha/Sungur-Dataset
Description:
<img src="./Sungur.png"/>
# Sungur-Dataset
## 📖 Overview
**Sungur-Dataset** is a large-scale, instruction–response style dataset designed to improve the **reasoning capabilities of Turkish language models**.
The dataset was created by merging **four publicly available reasoning datasets** into a unified format, resulting in **41.1K samples** covering multiple domains such as **mathematics, medicine, and general reasoning**.
This dataset is ideal for **Supervised Fine-Tuning (SFT)** in Turkish.
---
## 📊 Dataset Composition
Sungur-Dataset integrates the following sources:
* **[ituperceptron/turkish_medical_reasoning]**
* **[ituperceptron/turkish-general-reasoning-28k]**
* **[duxx/reasoning_dataset_turkish]**
* **[SoAp9035/r1-reasoning-tr]**
All datasets were reformatted into a **chat-style structure**:
```json
[
{"role": "user", "content": "Question/Prompt"},
{"role": "assistant", "content": "Answer (with reasoning if available)"}
]
```
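The reformatting step can be sketched as follows. This is a minimal illustration, not the author's actual conversion script: the helper name `to_chat_messages` and its parameters are hypothetical, and the `<think>` layout is assumed from the example record on this card.

```python
from typing import Optional


def to_chat_messages(question: str, answer: str, reasoning: Optional[str] = None) -> list:
    """Build a two-turn chat sample from one question/answer pair (hypothetical helper)."""
    if reasoning:
        # Assumed layout: reasoning trace first, then a blank line, then the answer.
        content = f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"
    else:
        # No trace available: the assistant turn holds only the answer.
        content = answer.strip()
    return [
        {"role": "user", "content": question.strip()},
        {"role": "assistant", "content": content},
    ]


messages = to_chat_messages("2 + 2 kaçtır?", "4", reasoning="İki ile ikinin toplamı dörttür.")
print(messages)
```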
---
## 🔍 Key Features
* **Size:** 41.1K reasoning samples
* **Languages:** Turkish (native + translated prompts)
* **Domains:** Math, Medical, General reasoning, and more
* **Structure:** Instruction–response pairs with optional `<think>...</think>` reasoning traces
* **Use Cases:**
* Instruction fine-tuning of LLMs
* Enhancing reasoning ability in Turkish models
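
Because the `<think>...</think>` trace is optional, downstream code may want to separate it from the final answer. The sketch below shows one way to do that with a regular expression. The function name `split_reasoning` is hypothetical, and the code assumes at most one trace per assistant message.

```python
import re
from typing import Optional, Tuple

# Non-greedy match so only the first <think>...</think> span is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None when no trace is present."""
    match = THINK_RE.search(content)
    if match is None:
        return None, content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched span is treated as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


sample = "<think>\nÖncelikle kardiyak nedenler ekarte edilmelidir.\n</think>\n\nİlk yapılacak tetkik: EKG."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```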
---
## 📦 Example
```json
{
"messages": [
{"role": "user", "content": "Bir hasta göğüs ağrısıyla acile başvuruyor. İlk yapılacak tetkik nedir?"},
{"role": "assistant", "content": "<think>\nÖncelikle kardiyak nedenler ekarte edilmelidir. Bu yüzden en acil test EKG'dir.\n</think>\n\nİlk yapılacak tetkik: EKG."}
],
"source": "ituperceptron/turkish_medical_reasoning"
}
```
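A quick structural check can catch malformed records before training. The sketch below assumes every record follows the shape of the example above (a `messages` list that starts with a `user` turn and ends with an `assistant` turn, plus a string `source`). The real dataset may contain other variations, and `is_valid_record` is a hypothetical helper.

```python
def is_valid_record(record: dict) -> bool:
    """Check that a record matches the shape of the card's example (assumed schema)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        # Only the two roles shown on the card are accepted in this sketch.
        if message.get("role") not in {"user", "assistant"}:
            return False
        if not isinstance(message.get("content"), str):
            return False
    starts_with_user = messages[0]["role"] == "user"
    ends_with_assistant = messages[-1]["role"] == "assistant"
    return starts_with_user and ends_with_assistant and isinstance(record.get("source"), str)


record = {
    "messages": [
        {"role": "user", "content": "Soru"},
        {"role": "assistant", "content": "Cevap"},
    ],
    "source": "ituperceptron/turkish_medical_reasoning",
}
print(is_valid_record(record))
```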
---
## 🚀 Usage
```python
from datasets import load_dataset
ds = load_dataset("suayptalha/Sungur-Dataset", split="train")
print(ds[0])
```
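Since each record carries a `source` field (as in the example record), it can be used to inspect or subset the data. The sketch below works on any iterable of dict records, so it runs on the in-memory sample shown here and should also work on the loaded `ds`, whose rows iterate as dicts. The helper names are hypothetical.

```python
from collections import Counter


def count_by_source(records) -> Counter:
    """Count records per originating dataset."""
    return Counter(record.get("source", "unknown") for record in records)


def filter_by_source(records, source: str) -> list:
    """Keep only the records that came from one source dataset."""
    return [record for record in records if record.get("source") == source]


# Small in-memory stand-in for the real dataset rows.
records = [
    {"messages": [], "source": "ituperceptron/turkish_medical_reasoning"},
    {"messages": [], "source": "duxx/reasoning_dataset_turkish"},
    {"messages": [], "source": "ituperceptron/turkish_medical_reasoning"},
]
counts = count_by_source(records)
medical = filter_by_source(records, "ituperceptron/turkish_medical_reasoning")
print(counts.most_common())
```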
---
## 🙏 Acknowledgements
This dataset was made possible by integrating and reformatting several open-source datasets.
Special thanks to the following contributors and projects:
* **[ituperceptron](https://huggingface.co/ituperceptron)** for releasing the *Turkish Medical Reasoning* and *Turkish General Reasoning* datasets.
* **[duxx](https://huggingface.co/duxx)** for creating the *Turkish Reasoning Dataset*.
* **[SoAp9035](https://huggingface.co/SoAp9035)** for publishing *R1-Reasoning-TR*.
## 📌 Citation
If you use **Sungur-Dataset**, please cite it as:
```bibtex
@misc{sungur_collection_2025,
title = {Sungur (Hugging Face Collection)},
author = {Şuayp Talha Kocabay},
year = {2025},
howpublished = {\url{https://huggingface.co/collections/suayptalha/sungur-68dcd094da7f8976cdc5898e}},
note = {Turkish LLM family and dataset collection}
}
```
---
license: apache-2.0
---
Provider: maas
Created: 2025-10-02



