
Sungur-Dataset

ModelScope community. Updated 2025-11-27, indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/suayptalha/Sungur-Dataset
Description:
<img src="./Sungur.png"/>

# Sungur-Dataset

## 📖 Overview

**Sungur-Dataset** is a large-scale, instruction–response style dataset designed to improve the **reasoning capabilities of Turkish language models**. It was created by merging **four publicly available reasoning datasets** into a unified format, resulting in **41.1K samples** covering multiple domains such as **mathematics, medicine, and general reasoning**.

The dataset is intended for **Supervised Fine-Tuning (SFT)** in Turkish.

---

## 📊 Dataset Composition

Sungur-Dataset integrates the following sources:

* **[ituperceptron/turkish_medical_reasoning]**
* **[ituperceptron/turkish-general-reasoning-28k]**
* **[duxx/reasoning_dataset_turkish]**
* **[SoAp9035/r1-reasoning-tr]**

All datasets were reformatted into a **chat-style structure**:

```json
[
  {"role": "user", "content": "Question/Prompt"},
  {"role": "assistant", "content": "Answer (with reasoning if available)"}
]
```

---

## 🔍 Key Features

* **Size:** 41.1K reasoning samples
* **Languages:** Turkish (native + translated prompts)
* **Domains:** Math, Medical, General reasoning, and more
* **Structure:** Instruction–response pairs with optional `<think>...</think>` reasoning traces
* **Use Cases:**
  * Instruction fine-tuning of LLMs
  * Enhancing reasoning ability in Turkish models

---

## 📦 Example

```json
{
  "messages": [
    {"role": "user", "content": "Bir hasta göğüs ağrısıyla acile başvuruyor. İlk yapılacak tetkik nedir?"},
    {"role": "assistant", "content": "<think>\nÖncelikle kardiyak nedenler ekarte edilmelidir. Bu yüzden en acil test EKG'dir.\n</think>\n\nİlk yapılacak tetkik: EKG."}
  ],
  "source": "ituperceptron/turkish_medical_reasoning"
}
```

---

## 🚀 Usage

```python
from datasets import load_dataset

ds = load_dataset("suayptalha/Sungur-Dataset", split="train")
print(ds[0])
```

---

## 🙏 Acknowledgements

This dataset was made possible by integrating and reformatting several open-source datasets. Special thanks to the following contributors and projects:

* **[ituperceptron](https://huggingface.co/ituperceptron)** for releasing the *Turkish Medical Reasoning* and *Turkish General Reasoning* datasets.
* **[duxx](https://huggingface.co/duxx)** for creating the *Turkish Reasoning Dataset*.
* **[SoAp9035](https://huggingface.co/SoAp9035)** for publishing *R1-Reasoning-TR*.

## 📌 Citation

If you use **Sungur-Dataset**, please cite it as:

```
@misc{sungur_collection_2025,
  title        = {Sungur (Hugging Face Collection)},
  author       = {Şuayp Talha Kocabay},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/collections/suayptalha/sungur-68dcd094da7f8976cdc5898e}},
  note         = {Turkish LLM family and dataset collection}
}
```

---

License: apache-2.0
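
Because the `<think>...</think>` trace is optional, downstream code often needs to separate the reasoning from the final answer. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `split_reasoning` is hypothetical, and it assumes each assistant message holds at most one `<think>` block, as in the example record above.

```python
import re
from typing import Optional, Tuple

# Matches a single <think>...</think> block; DOTALL lets the trace span lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> Tuple[Optional[str], str]:
    """Hypothetical helper: return (reasoning, answer).

    reasoning is None when the message carries no <think> trace.
    """
    match = THINK_RE.search(content)
    if match is None:
        return None, content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the final answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


# The example record from this README, used here as offline test data.
example = {
    "messages": [
        {"role": "user",
         "content": "Bir hasta göğüs ağrısıyla acile başvuruyor. İlk yapılacak tetkik nedir?"},
        {"role": "assistant",
         "content": "<think>\nÖncelikle kardiyak nedenler ekarte edilmelidir. "
                    "Bu yüzden en acil test EKG'dir.\n</think>\n\n"
                    "İlk yapılacak tetkik: EKG."},
    ],
    "source": "ituperceptron/turkish_medical_reasoning",
}

reasoning, answer = split_reasoning(example["messages"][-1]["content"])
print("reasoning:", reasoning)
print("answer:", answer)
```

The same function could be applied to every record after `load_dataset`, for instance to train on answers only or to filter out samples without a trace. That assumes each record exposes a `messages` list shaped like the example.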

Provider:
maas
Created:
2025-10-02