Turkish-DPO-Pairs
收藏魔搭社区2025-12-05 更新2025-05-17 收录
下载链接:
https://modelscope.cn/datasets/suayptalha/Turkish-DPO-Pairs
下载链接
链接失效反馈官方服务:
资源简介:
Original dataset: [malhajar/orca_dpo_pairs-tr](https://huggingface.co/datasets/malhajar/orca_dpo_pairs-tr)
# Dataset Card for "suayptalha/Turkish-DPO-Pairs"
This Dataset is part of a series of datasets aimed at advancing Turkish LLM Developments by establishing rigid Turkish dataset collection to enhance the performance of LLM's Produced in the Turkish Language.
suayptalha/Turkish-DPO-Pairs is the processed version where only the content of messages with the role as "assistant" is extracted from both the chosen and rejected fields of malhajar/orca_dpo_pairs-tr which is a translated version of [`HuggingFaceH4/orca_dpo_pairs`]( https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs)
### Dataset Summary
This is a pre-processed version of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) translated to Turkish.
The original OpenOrca dataset is a collection of augmented FLAN data that aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
It has been instrumental in generating high-performing preference-tuned model checkpoints and serves as a valuable resource for all NLP researchers and developers!
# Dataset Summary
The OrcaDPO Pair dataset is a subset of the OpenOrca dataset suitable for DPO preference tuning.
### Usage
To load the dataset, run:
```python
from datasets import load_dataset
ds = load_dataset("suayptalha/Turkish-DPO-Pairs")
```
<a name="languages"></a>
# Languages
The language of the data is primarily Turkish.
<a name="dataset-structure"></a>
# Citation
```bibtex
@misc{OpenOrca,
title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://https://huggingface.co/Open-Orca/OpenOrca}},
}
```
原始数据集:[malhajar/orca_dpo_pairs-tr](https://huggingface.co/datasets/malhajar/orca_dpo_pairs-tr)
# 「suayptalha/Turkish-DPO-Pairs」数据集卡片
本数据集属于一系列旨在推动土耳其语大语言模型(LLM)发展的数据集之一,通过构建严谨的土耳其语数据集集合,以提升土耳其语产出的大语言模型的性能。
suayptalha/Turkish-DPO-Pairs 是经过预处理的版本,仅从 malhajar/orca_dpo_pairs-tr 的「选定(chosen)」与「拒绝(rejected)」字段中提取角色为「助手(assistant)」的消息内容。而 malhajar/orca_dpo_pairs-tr 是 [`HuggingFaceH4/orca_dpo_pairs`](https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs) 的土耳其语翻译版本。
## 数据集概述
本数据集是经土耳其语翻译后的 [OpenOrca 数据集](https://huggingface.co/datasets/Open-Orca/OpenOrca) 的预处理版本。
原始的 OpenOrca 数据集是经过增强的 FLAN 数据集合,尽可能贴合 [Orca 论文](https://arxiv.org/abs/2306.02707) 中概述的数据分布。该数据集对生成高性能的偏好微调模型检查点起到了关键作用,同时也是所有自然语言处理(NLP)研究者与开发者的宝贵资源!
## 数据集概述
OrcaDPO 配对数据集是 OpenOrca 数据集的子集,专门用于 DPO 偏好微调任务。
### 使用方法
若要加载该数据集,请运行以下代码:
python
from datasets import load_dataset
ds = load_dataset("suayptalha/Turkish-DPO-Pairs")
<a name="languages"></a>
# 语言
本数据集的主要语言为土耳其语。
<a name="dataset-structure"></a>
# 引用
bibtex
@misc{OpenOrca,
title = "OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces",
author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {url{https://https://huggingface.co/Open-Orca/OpenOrca}},
}
提供机构:
maas
创建时间:
2025-05-16



