MLDataScientist/DPO-uz-9k

Name: MLDataScientist/DPO-uz-9k
Creator: MLDataScientist
Published: 2024-05-25 23:30:08
License: apache-2.0

Hugging Face, updated 2024-05-25, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/MLDataScientist/DPO-uz-9k

Description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- uz
tags:
- dpo
- rlhf
pretty_name: DPO Uzbek 9k
size_categories:
- 1K<n<10K
---

This is a DPO dataset translated into Uzbek, containing 9k chat pairs. The original English dataset comes from [DPO-En-Zh-20k](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k/tree/9ad5f7428419d3cf78493cf3f4be832cf5346ba8) (commit 9ad5f7428419d3cf78493cf3f4be832cf5346ba8, file `dpo_en.json`).

I translated 10k pairs of chat examples into Uzbek using the NLLB 3.3B model. After the translation was complete, I used a local [lilac](https://lilacai-lilac.hf.space/) instance to remove records containing coding examples, since NLLB is not good at translating text that contains code.

Note that each prompt has two answers. In DPO, the first answer should be the 'selected' response and the second answer should be the 'rejected' response.

Below is the translate function I used with NLLB in Python, along with other data pipeline functions:

```
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the NLLB checkpoint from the current directory.
tokenizer = AutoTokenizer.from_pretrained(".")
model = AutoModelForSeq2SeqLM.from_pretrained(
    ".",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda").eval()

def translate(article):
    # Tokenize a string (or a list of strings) and move the tensors to the GPU.
    inputs = tokenizer(article, return_tensors="pt", padding=True).to("cuda")
    # Force the decoder to start with the Uzbek (Latin script) language token.
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["uzn_Latn"],
        max_new_tokens=512,
        temperature = 0
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```

The translation process took ~20h on my local PC with one RTX 3090. Translation quality is comparable to Google Translate, but it falls short of human translation quality. We still lack human-written chat examples in Uzbek. For this reason, I am translating some chat datasets into Uzbek with NLLB 3.3B.

---

This is what the original English dataset contains:

- 4,000 examples of [argilla/distilabel-capybara-dpo-7k-binarized](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) with chosen score>=4.
- 3,000 examples of [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) with chosen score>=8.
- 3,000 examples of [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) with chosen score>=4.
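The note above that each prompt carries two answers, the first preferred and the second rejected, can be made concrete with a small sketch. The field names `prompt` and `answers` and the sample sentences are hypothetical, chosen only for illustration; the actual column names in the dataset files may differ and should be checked before use.

```python
from typing import Dict


def to_dpo_pair(record: Dict[str, object]) -> Dict[str, str]:
    """Map a record with two answers onto prompt/chosen/rejected fields.

    The keys "prompt" and "answers" are assumptions for this sketch,
    not confirmed column names of the dataset.
    """
    answers = record["answers"]
    if not isinstance(answers, list) or len(answers) != 2:
        raise ValueError("expected exactly two answers per prompt")
    return {
        "prompt": str(record["prompt"]),
        "chosen": str(answers[0]),    # first answer: preferred response
        "rejected": str(answers[1]),  # second answer: rejected response
    }


# Made-up record, used only to show the mapping.
example = {
    "prompt": "Salom, qalaysiz?",
    "answers": ["Yaxshi, rahmat!", "Bilmayman."],
}
pair = to_dpo_pair(example)
```

Raising an error when a record does not have exactly two answers keeps a malformed row from silently producing a pair with a missing side.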

Provider:

MLDataScientist

Summary of original information

Dataset overview

Basic information

License: Apache-2.0
Task category: text generation
Language: Uzbek
Tags: DPO, RLHF
Pretty name: DPO Uzbek 9k
Size category: 1K<n<10K

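Dataset description

Content: about 9,000 Uzbek chat pairs, translated from the original English dataset DPO-En-Zh-20k.
Translation process: 10,000 pairs of chat examples were translated with the NLLB 3.3B model; records containing coding examples were then removed using a local lilac instance.
Response structure: each prompt has two answers; the first is the selected response and the second is the rejected response.
Translation quality: comparable to Google Translate, but below human translation quality.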
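The removal of records containing coding examples was done with lilac; the exact filter is not documented here. As a rough stand-in, the sketch below uses a plain regular-expression heuristic instead of lilac. The patterns, the function names, and the `prompt`/`chosen`/`rejected` field names are all assumptions for illustration, and a heuristic like this will produce both false positives and misses.

```python
import re
from typing import Dict, Iterable, List, Sequence

# Hypothetical patterns that often signal source code in a chat message.
CODE_PATTERNS = [
    re.compile(r"```"),                                        # fenced code block
    re.compile(r"^\s*def\s+\w+\s*\(", re.MULTILINE),           # Python function
    re.compile(r"^\s*(import\s+\w+|from\s+\w+\s+import\s)", re.MULTILINE),
    re.compile(r"#include\s*<"),                               # C/C++ include
    re.compile(r"\b(function|const|let|var)\s+\w+\s*[=(]"),    # JavaScript
]


def looks_like_code(text: str) -> bool:
    """Return True if any heuristic pattern matches the text."""
    return any(p.search(text) for p in CODE_PATTERNS)


def drop_code_records(
    records: Iterable[Dict[str, str]],
    fields: Sequence[str] = ("prompt", "chosen", "rejected"),  # assumed names
) -> List[Dict[str, str]]:
    """Keep only records in which none of the given fields looks like code."""
    return [
        r for r in records
        if not any(looks_like_code(str(r.get(f, ""))) for f in fields)
    ]


# Made-up records, used only to show the filter.
records = [
    {"prompt": "Write a poem", "chosen": "Roses are red", "rejected": "No"},
    {"prompt": "Sort a list", "chosen": "```python\nsorted(xs)\n```",
     "rejected": "def f(x):\n    return x"},
]
kept = drop_code_records(records)
```

Filtering before translation rather than after would also avoid spending GPU time on records that are later discarded, though the source describes the filter as running after translation.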

Contents of the original English dataset

4,000 examples from argilla/distilabel-capybara-dpo-7k-binarized with chosen score>=4.
3,000 examples from argilla/distilabel-intel-orca-dpo-pairs with chosen score>=8.
3,000 examples from argilla/ultrafeedback-binarized-preferences-cleaned with chosen score>=4.
