
MLDataScientist/DPO-uz-9k

Source: Hugging Face | Updated 2024-05-25 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/MLDataScientist/DPO-uz-9k
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- uz
tags:
- dpo
- rlhf
pretty_name: DPO Uzbek 9k
size_categories:
- 1K<n<10K
---

This is a DPO dataset translated into Uzbek, containing 9k chat pairs. The original English dataset comes from [DPO-En-Zh-20k](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k/tree/9ad5f7428419d3cf78493cf3f4be832cf5346ba8) (commit 9ad5f7428419d3cf78493cf3f4be832cf5346ba8, file `dpo_en.json`).

I translated 10k pairs of chat examples into Uzbek using the NLLB 3.3B model. After the translation was completed, I used a local [lilac](https://lilacai-lilac.hf.space/) instance to remove records with coding examples, since NLLB is not good at translating text that contains code.

Note that each prompt has two answers. The first answer should be the 'selected' response and the second answer should be the 'rejected' response in DPO.

Below is the translate function I used with NLLB in Python, along with other data pipeline functions:

```
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the NLLB checkpoint from the current directory.
tokenizer = AutoTokenizer.from_pretrained(".")
model = AutoModelForSeq2SeqLM.from_pretrained(
    ".",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda").eval()


def translate(article):
    # `article` may be a single string or a list of strings (padded as a batch).
    inputs = tokenizer(article, return_tensors="pt", padding=True).to("cuda")
    translated_tokens = model.generate(
        **inputs,
        # Force the decoder to start in Uzbek (Latin script).
        forced_bos_token_id=tokenizer.lang_code_to_id["uzn_Latn"],
        max_new_tokens=512,
        temperature=0,
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```

The translation process took ~20h on my local PC with one RTX 3090. Translation quality is comparable to Google Translate, but it is not comparable to human translation quality. We still lack human chat examples in Uzbek. For this reason, I am translating some chat datasets into Uzbek with NLLB 3.3B.

---

This is what the original English dataset contains:

- 4,000 examples of [argilla/distilabel-capybara-dpo-7k-binarized](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) with chosen score>=4.
- 3,000 examples of [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) with chosen score>=8.
- 3,000 examples of [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) with chosen score>=4.
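The description says that records with coding examples were removed with a local lilac instance. The sketch below is not that workflow. It is a plain regex heuristic that illustrates the same idea in a few lines. The record layout (`prompt` and `answers` keys) and the pattern list are assumptions made for illustration, not the dataset's actual schema.

```python
import re

FENCE = "`" * 3  # markdown code fence marker

# Hypothetical heuristic patterns; a stand-in for the lilac-based filtering.
CODE_PATTERNS = [
    re.compile(re.escape(FENCE)),  # fenced code block
    re.compile(
        r"^\s*(def\s+\w+\s*\(|class\s+\w+\s*[:(]"
        r"|from\s+\w+(\.\w+)*\s+import\s|import\s+\w+)",
        re.MULTILINE,
    ),  # Python-like statements at the start of a line
    re.compile(r"#include\s*<|\bpublic\s+static\s+void\b|\bconsole\.log\("),  # C / Java / JS
]


def looks_like_code(text: str) -> bool:
    """Return True if any heuristic pattern matches the text."""
    return any(p.search(text) for p in CODE_PATTERNS)


def drop_code_records(records):
    """Keep only records whose prompt and answers all look code-free.

    Each record is assumed to be a dict with a 'prompt' string and an
    'answers' list of strings (hypothetical field names).
    """
    kept = []
    for rec in records:
        texts = [rec["prompt"], *rec["answers"]]
        if not any(looks_like_code(t) for t in texts):
            kept.append(rec)
    return kept
```

A heuristic like this is best run on the English source text before translation, since a translation model may alter identifiers and punctuation that the patterns depend on.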
Provider:
MLDataScientist
Summary of original information

Dataset overview

Basic information

  • License: Apache-2.0
  • Task category: text generation
  • Language: Uzbek
  • Tags: DPO, RLHF
  • Pretty name: DPO Uzbek 9k
  • Size category: 1K<n<10K

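Dataset description

  • Content: 9,000 pairs of Uzbek chat data, translated from the original English dataset DPO-En-Zh-20k.
  • Translation process: 10,000 pairs of chat examples were translated with the NLLB 3.3B model; records containing coding examples were then removed using a lilac instance.
  • Response structure: each prompt has two answers; the first is the selected response and the second is the rejected response.
  • Translation quality: comparable to Google Translate, but below human translation quality.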
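
The response structure described above (first answer preferred, second answer rejected) can be sketched as a small mapping step. This is a minimal illustration, assuming each record is available as a prompt string plus a list of exactly two answers. The function name `to_dpo_pair` and the output keys `prompt`, `chosen`, and `rejected` are assumptions chosen to match a common preference-data convention, not the dataset's confirmed column names.

```python
from typing import Dict, List


def to_dpo_pair(prompt: str, answers: List[str]) -> Dict[str, str]:
    """Map a prompt with two answers to a preference pair.

    Assumes the ordering stated in the dataset description:
    answers[0] is the preferred response, answers[1] the rejected one.
    """
    if len(answers) != 2:
        raise ValueError(f"expected exactly 2 answers, got {len(answers)}")
    chosen, rejected = answers
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Raising on anything other than two answers keeps a malformed record from silently producing a pair with the wrong preference order.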

Contents of the original English dataset
