lightblue/mitsu_tophalf_borda

Name: lightblue/mitsu_tophalf_borda
Creator: lightblue
Published: 2024-05-30 06:45:46
License: 暂无描述

Hugging Face2024-05-30 更新2024-06-12 收录

下载链接：

https://hf-mirror.com/datasets/lightblue/mitsu_tophalf_borda

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: prompt dtype: string - name: chosen list: - name: content dtype: string - name: role dtype: string - name: rejected list: - name: content dtype: string - name: role dtype: string splits: - name: test num_bytes: 216048.79407407407 num_examples: 68 - name: train num_bytes: 4289220 num_examples: 1350 download_size: 4766811 dataset_size: 4505268.794074073 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* license: cc-by-nc-4.0 --- # Mitsu <img width=400 src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Ypd0x0ZyVCJs7rkd5xA_O.png" alt="Mitsu - a honey bee in its comb"/> [[Paper]](https://arxiv.org/abs/2405.18952) [[Model]](https://huggingface.co/lightblue/suzume-llama-3-8B-multilingual-orpo-borda-half) This is a multilingual preference dataset generated using human written prompts and responses from 7 LLMs. We evaluate each set of responses 5 times using GPT4. Note that this model has a non-commerical license as we used the Command R and Command R+ models to create this data. We are currently working on a developing a commerically usable model, so stay tuned for that! # Dataset details This is the ORPO training dataset derived from the [lightblue/mitsu](https://huggingface.co/datasets/lightblue/mitsu). This dataset contains the prompts corresponding to the 50\% most consistely ranked responses by GPT, with the highest/lowest ranked responses used as the positive and negative responses for each prompt. # How we made this: We made this dataset using our Repeated Ranking method, which entails the following steps: 1. Sample responses from [lightblue/tagengo-gpt4](https://huggingface.co/datasets/lightblue/tagengo-gpt4), stratifying by language by sampling 100 per language 2. Generate responses for each prompt using each of the following models: * gpt-35-turbo-instruct (0914) * gpt-4 (0125-Preview) * Nexusflow/Starling-LM-7B-beta * Qwen/Qwen1.5-32B-Chat * Qwen/Qwen1.5-72B-Chat * CohereForAI/c4ai-command-r-v01 * CohereForAI/c4ai-command-r-plus 3. Evaluate the responses using gpt-4 (0125-Preview) 5 times, randomly shuffling the order that the responses are given in each time 4. Calculate the agreement between the rankings using Kendall's W The full code for creating this dataset can be [found on our repo](https://github.com/lightblue-tech/suzume/tree/main/mitsu/data_creation). # How to use it: We process this dataset into datasets usable for DPO/PPO/ORPO training using the [code available on our repo](https://github.com/lightblue-tech/suzume/blob/main/mitsu/data_creation/response_rank_process.ipynb). Processed versions of this dataset can be found at: * [All prompt dataset](https://huggingface.co/datasets/lightblue/mitsu_full_borda) * [Prompts with top 75% most repeated consistent evaluations](https://huggingface.co/datasets/lightblue/mitsu_top75_borda) * [Prompts with top 50% most repeated consistent evaluations (recommended for training)](https://huggingface.co/datasets/lightblue/mitsu_tophalf_borda) * [Prompts with top 25% most repeated consistent evaluations](https://huggingface.co/datasets/lightblue/mitsu_top25_borda) # Dataset results We conducted experiments by training our [lightblue/suzume-llama-3-8B-multilingual](https://huggingface.co/lightblue/suzume-llama-3-8B-multilingual) using this dataset with ORPO training. We also conduct experiments where we sample varying fractions of the dataset, ordered by the consistency of the 5 rankings that the evaluator model gave (as described in the diagram below). <img width=800 src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Ccz6V2G7zCmfZWXuHK0x3.png" alt="Diagram describing our repeated ranking methodology"/> We train using the top 75%, 50%, and 25% most consistently ranked responses, and compare that to training on all responses. We find that training on less data can actually result in greater down stream accuracy for down-stream tasks, such as the MT-Bench scores in 6 languages that we test on: <img width=700 src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/JahHDC6xcgbz3Ej2ZrWjQ.png" alt="MT-Bench results for our ORPO experiments"/> # How to cite ```tex @article{devine2024sure, title={Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets}, author={Devine, Peter}, journal={arXiv preprint arXiv:2405.18952}, year={2024} } ``` # Developer Peter Devine - ([ptrdvn](https://huggingface.co/ptrdvn))

提供机构：

lightblue

原始信息汇总

数据集概述

数据集信息

特征（Features）：
- prompt：字符串类型
- chosen：列表类型，包含
  - content：字符串类型
  - role：字符串类型
- rejected：列表类型，包含
  - content：字符串类型
  - role：字符串类型
分割（Splits）：
- test：包含68个示例，总字节数为216048.79407407407
- train：包含1350个示例，总字节数为4289220
下载大小（Download Size）：4766811字节
数据集大小（Dataset Size）：4505268.794074073字节

配置（Configs）

默认配置（config_name: default）：
- 训练数据文件（split: train）：路径为data/train-*
- 测试数据文件（split: test）：路径为data/test-*

许可证（License）

cc-by-nc-4.0

数据集创建方法

使用Repeated Ranking方法：
1. 从lightblue/tagengo-gpt4数据集中按语言抽样100个响应
2. 使用多种模型生成响应
3. 使用gpt-4 (0125-Preview)评估响应，每次随机排序
4. 使用Kendalls W计算排名一致性

数据集使用

用于DPO/PPO/ORPO训练：
- 处理代码可在GitHub仓库找到
- 处理后的数据集版本：

实验结果

使用lightblue/suzume-llama-3-8B-multilingual进行ORPO训练：
- 比较了不同比例的响应训练效果
- 发现使用较少但更一致的数据可以提高下游任务的准确性

引用信息

tex @article{devine2024sure, title={Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets}, author={Devine, Peter}, journal={arXiv preprint arXiv:2405.18952}, year={2024} }

5,000+

优质数据集

54 个

任务类型

进入经典数据集