five

Turkish-DPO-Pairs

收藏
魔搭社区2025-12-05 更新2025-05-17 收录
下载链接:
https://modelscope.cn/datasets/suayptalha/Turkish-DPO-Pairs
下载链接
链接失效反馈
官方服务:
资源简介:
Original dataset: [malhajar/orca_dpo_pairs-tr](https://huggingface.co/datasets/malhajar/orca_dpo_pairs-tr) # Dataset Card for "suayptalha/Turkish-DPO-Pairs" This Dataset is part of a series of datasets aimed at advancing Turkish LLM Developments by establishing rigid Turkish dataset collection to enhance the performance of LLM's Produced in the Turkish Language. suayptalha/Turkish-DPO-Pairs is the processed version where only the content of messages with the role as "assistant" is extracted from both the chosen and rejected fields of malhajar/orca_dpo_pairs-tr which is a translated version of [`HuggingFaceH4/orca_dpo_pairs`]( https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs) ### Dataset Summary This is a pre-processed version of the [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) translated to Turkish. The original OpenOrca dataset is a collection of augmented FLAN data that aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707). It has been instrumental in generating high-performing preference-tuned model checkpoints and serves as a valuable resource for all NLP researchers and developers! # Dataset Summary The OrcaDPO Pair dataset is a subset of the OpenOrca dataset suitable for DPO preference tuning. ### Usage To load the dataset, run: ```python from datasets import load_dataset ds = load_dataset("suayptalha/Turkish-DPO-Pairs") ``` <a name="languages"></a> # Languages The language of the data is primarily Turkish. <a name="dataset-structure"></a> # Citation ```bibtex @misc{OpenOrca, title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces}, author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {\url{https://https://huggingface.co/Open-Orca/OpenOrca}}, } ```

原始数据集:[malhajar/orca_dpo_pairs-tr](https://huggingface.co/datasets/malhajar/orca_dpo_pairs-tr) # 「suayptalha/Turkish-DPO-Pairs」数据集卡片 本数据集属于一系列旨在推动土耳其语大语言模型(LLM)发展的数据集之一,通过构建严谨的土耳其语数据集集合,以提升土耳其语产出的大语言模型的性能。 suayptalha/Turkish-DPO-Pairs 是经过预处理的版本,仅从 malhajar/orca_dpo_pairs-tr 的「选定(chosen)」与「拒绝(rejected)」字段中提取角色为「助手(assistant)」的消息内容。而 malhajar/orca_dpo_pairs-tr 是 [`HuggingFaceH4/orca_dpo_pairs`](https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs) 的土耳其语翻译版本。 ## 数据集概述 本数据集是经土耳其语翻译后的 [OpenOrca 数据集](https://huggingface.co/datasets/Open-Orca/OpenOrca) 的预处理版本。 原始的 OpenOrca 数据集是经过增强的 FLAN 数据集合,尽可能贴合 [Orca 论文](https://arxiv.org/abs/2306.02707) 中概述的数据分布。该数据集对生成高性能的偏好微调模型检查点起到了关键作用,同时也是所有自然语言处理(NLP)研究者与开发者的宝贵资源! ## 数据集概述 OrcaDPO 配对数据集是 OpenOrca 数据集的子集,专门用于 DPO 偏好微调任务。 ### 使用方法 若要加载该数据集,请运行以下代码: python from datasets import load_dataset ds = load_dataset("suayptalha/Turkish-DPO-Pairs") <a name="languages"></a> # 语言 本数据集的主要语言为土耳其语。 <a name="dataset-structure"></a> # 引用 bibtex @misc{OpenOrca, title = "OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces", author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {url{https://https://huggingface.co/Open-Orca/OpenOrca}}, }
提供机构:
maas
创建时间:
2025-05-16
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作