Turkish-DPO-Pairs

Name: Turkish-DPO-Pairs
Creator: maas
Published: 2025-12-05 16:34:39
License: 暂无描述

魔搭社区2025-12-05 更新2025-05-17 收录

下载链接：

https://modelscope.cn/datasets/suayptalha/Turkish-DPO-Pairs

下载链接

链接失效反馈

官方服务：

资源简介：

Original dataset: [malhajar/orca_dpo_pairs-tr](https://huggingface.co/datasets/malhajar/orca_dpo_pairs-tr) # Dataset Card for "suayptalha/Turkish-DPO-Pairs" This Dataset is part of a series of datasets aimed at advancing Turkish LLM Developments by establishing rigid Turkish dataset collection to enhance the performance of LLM's Produced in the Turkish Language. suayptalha/Turkish-DPO-Pairs is the processed version where only the content of messages with the role as "assistant" is extracted from both the chosen and rejected fields of malhajar/orca_dpo_pairs-tr which is a translated version of [`HuggingFaceH4/orca_dpo_pairs`]( https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs) ### Dataset Summary This is a pre-processed version of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) translated to Turkish. The original OpenOrca dataset is a collection of augmented FLAN data that aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707). It has been instrumental in generating high-performing preference-tuned model checkpoints and serves as a valuable resource for all NLP researchers and developers! # Dataset Summary The OrcaDPO Pair dataset is a subset of the OpenOrca dataset suitable for DPO preference tuning. ### Usage To load the dataset, run: ```python from datasets import load_dataset ds = load_dataset("suayptalha/Turkish-DPO-Pairs") ``` <a name="languages"></a> # Languages The language of the data is primarily Turkish. <a name="dataset-structure"></a> # Citation ```bibtex @misc{OpenOrca, title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces}, author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {\url{https://https://huggingface.co/Open-Orca/OpenOrca}}, } ```

原始数据集：[malhajar/orca_dpo_pairs-tr](https://huggingface.co/datasets/malhajar/orca_dpo_pairs-tr) # 「suayptalha/Turkish-DPO-Pairs」数据集卡片本数据集属于一系列旨在推动土耳其语大语言模型（LLM）发展的数据集之一，通过构建严谨的土耳其语数据集集合，以提升土耳其语产出的大语言模型的性能。 suayptalha/Turkish-DPO-Pairs 是经过预处理的版本，仅从 malhajar/orca_dpo_pairs-tr 的「选定（chosen）」与「拒绝（rejected）」字段中提取角色为「助手（assistant）」的消息内容。而 malhajar/orca_dpo_pairs-tr 是 [`HuggingFaceH4/orca_dpo_pairs`](https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs) 的土耳其语翻译版本。 ## 数据集概述本数据集是经土耳其语翻译后的 [OpenOrca 数据集](https://huggingface.co/datasets/Open-Orca/OpenOrca) 的预处理版本。原始的 OpenOrca 数据集是经过增强的 FLAN 数据集合，尽可能贴合 [Orca 论文](https://arxiv.org/abs/2306.02707) 中概述的数据分布。该数据集对生成高性能的偏好微调模型检查点起到了关键作用，同时也是所有自然语言处理（NLP）研究者与开发者的宝贵资源！ ## 数据集概述 OrcaDPO 配对数据集是 OpenOrca 数据集的子集，专门用于 DPO 偏好微调任务。 ### 使用方法若要加载该数据集，请运行以下代码： python from datasets import load_dataset ds = load_dataset("suayptalha/Turkish-DPO-Pairs") <a name="languages"></a> # 语言本数据集的主要语言为土耳其语。 <a name="dataset-structure"></a> # 引用 bibtex @misc{OpenOrca, title = "OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces", author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {url{https://https://huggingface.co/Open-Orca/OpenOrca}}, }

提供机构：

maas

创建时间：

2025-05-16

5,000+

优质数据集

54 个

任务类型

进入经典数据集