MLDataScientist/DPO-uz-9k
Source: Hugging Face. Updated 2024-05-25; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/MLDataScientist/DPO-uz-9k
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- uz
tags:
- dpo
- rlhf
pretty_name: DPO Uzbek 9k
size_categories:
- 1K<n<10K
---
This is a DPO dataset translated into Uzbek, with 9k chat pairs.
The original English dataset comes from [DPO-En-Zh-20k](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k/tree/9ad5f7428419d3cf78493cf3f4be832cf5346ba8) (commit 9ad5f7428419d3cf78493cf3f4be832cf5346ba8, file `dpo_en.json`).
I translated 10k pairs of chat examples into Uzbek using the NLLB 3.3B model.
After the translation was complete, I used a local [lilac](https://lilacai-lilac.hf.space/) instance to remove records containing coding examples, since NLLB is not good at translating text that includes code.
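For readers without a Lilac setup, the filtering step can be approximated with a simple heuristic. The sketch below is not the Lilac workflow used for this dataset; the regular expression, the function names, and the field names `prompt`, `chosen`, and `rejected` are all assumptions made for illustration.

```python
import re

# Hypothetical heuristic, NOT the Lilac-based filtering used for this dataset.
# It flags text that contains a fenced code block or a few common code idioms.
CODE_PATTERN = re.compile(
    r"```"                                        # markdown code fence
    r"|^\s*(def|class|import)\s+\w+"              # common Python statements
    r"|#include\s*<"                              # C/C++ includes
    r"|\b(function|var|const|let)\s+\w+\s*[=(]",  # JavaScript declarations
    re.MULTILINE,
)


def looks_like_code(text):
    """Return True if the text appears to contain source code."""
    return CODE_PATTERN.search(text) is not None


def drop_code_records(records, text_fields=("prompt", "chosen", "rejected")):
    """Keep only records in which none of the text fields looks like code.

    The field names are assumed; adjust them to the actual schema.
    """
    return [
        record
        for record in records
        if not any(looks_like_code(record.get(field, "")) for field in text_fields)
    ]
```

A heuristic like this will produce both false positives and false negatives, so a manual review of the flagged records is still advisable.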
Note that each prompt has two answers. For DPO, the first answer should be treated as the 'selected' (chosen) response and the second as the 'rejected' response.
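The ordering convention can be made explicit by converting each record into named fields before training. This is a minimal sketch: the input keys `prompt` and `answer` are assumed names, not confirmed from the dataset files, and should be adjusted to the actual schema.

```python
def to_dpo_record(example, prompt_key="prompt", answers_key="answer"):
    """Map a record with a two-element answer list to prompt/chosen/rejected.

    The key names are assumptions for illustration. The first answer is
    treated as the chosen response and the second as the rejected one.
    """
    answers = example[answers_key]
    if len(answers) != 2:
        raise ValueError(f"expected exactly 2 answers, got {len(answers)}")
    chosen, rejected = answers
    return {
        "prompt": example[prompt_key],
        "chosen": chosen,
        "rejected": rejected,
    }
```

Raising on records that do not have exactly two answers keeps a malformed pair from silently entering the training set.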
Below is the Python `translate` function I used with NLLB, along with the model and tokenizer setup:
```
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "." loads the NLLB checkpoint from the current working directory.
tokenizer = AutoTokenizer.from_pretrained(".")
model = AutoModelForSeq2SeqLM.from_pretrained(
    ".", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda").eval()


def translate(article):
    """Translate a string or a list of strings into Uzbek (Latin script)."""
    inputs = tokenizer(article, return_tensors="pt", padding=True).to("cuda")
    translated_tokens = model.generate(
        **inputs,
        # Force the decoder to start with the Uzbek (Latin) language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("uzn_Latn"),
        max_new_tokens=512,
        do_sample=False,  # deterministic decoding, no sampling
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```
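Because the tokenizer is called with `padding=True`, `translate` can take a list of strings as well as a single string. A minimal sketch of batching a large list of texts through it follows; the helper names and the default `batch_size=8` are assumptions, not the settings used for this dataset.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(texts, translate_fn, batch_size=8):
    """Translate `texts` in batches and return the results in input order.

    `translate_fn` is any callable that maps a list of strings to a list of
    translated strings, such as the `translate` function above.
    """
    results = []
    for batch in batched(texts, batch_size):
        results.extend(translate_fn(batch))
    return results
```

Passing the translation function in as an argument lets the batching logic be checked with a stand-in function, without loading the model. A suitable batch size depends on available GPU memory and on how long the texts are.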
The translation took about 20 hours on my local PC with a single RTX 3090.
Translation quality is comparable to Google Translate, but it does not match human translation. Human-written chat examples in Uzbek are still scarce, which is why I am translating some chat datasets into Uzbek with NLLB 3.3B.
---
This is what the original English dataset contains:
- 4,000 examples of [argilla/distilabel-capybara-dpo-7k-binarized](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) with chosen score>=4.
- 3,000 examples of [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) with chosen score>=8.
- 3,000 examples of [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) with chosen score>=4.
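Provided by:
MLDataScientist
Summary of original information
Dataset overview
Basic information
- License: Apache-2.0
- Task category: text generation
- Language: Uzbek
- Tags: DPO, RLHF
- Pretty name: DPO Uzbek 9k
- Size category: 1K<n<10K
Dataset description
- Content: 9,000 pairs of Uzbek chat data, translated from the original English dataset DPO-En-Zh-20k.
- Translation process: 10,000 pairs of chat examples were translated with the NLLB 3.3B model; records containing coding examples were then removed using a Lilac instance.
- Response structure: each prompt has two answers; the first is the selected response and the second is the rejected response.
- Translation quality: comparable to Google Translate, but below human translation quality.
Original English dataset contents
- 4,000 examples from argilla/distilabel-capybara-dpo-7k-binarized with chosen score>=4.
- 3,000 examples from argilla/distilabel-intel-orca-dpo-pairs with chosen score>=8.
- 3,000 examples from argilla/ultrafeedback-binarized-preferences-cleaned with chosen score>=4.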



