distilabel-intel-orca-dpo-pairs

Name: distilabel-intel-orca-dpo-pairs
Creator: maas
Published: 2026-04-29 00:10:11
License: No description available

ModelScope community. Updated: 2026-04-29. Indexed: 2024-06-08.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/distilabel-intel-orca-dpo-pairs


Description:

<p align="right">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# distilabel Orca Pairs for DPO

The dataset is a "distilabeled" version of the widely used dataset [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs). The original dataset has been used by hundreds of open-source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be greatly improved.

Continuing with our mission to build the best alignment datasets for open-source LLMs and the community, we spent a few hours improving it with [distilabel](https://github.com/argilla-io/distilabel).

This was our main intuition: the original dataset simply assumes that the gpt4/3.5-turbo response is always the best one. We know from UltraFeedback that's not always the case. Moreover, DPO fine-tuning benefits from the diversity of preference pairs.

Additionally, we have added a new column indicating whether the question in the dataset is part of the train set of gsm8k (there were no examples from the test set). See the reproduction section for more details.

## Using this dataset

This dataset is useful for preference tuning and we recommend using it instead of the original. It's already prepared in the "standard" chosen, rejected format with additional information for further filtering and experimentation.

The main changes are:

1. ~2K pairs have been swapped: the rejected response becomes the chosen one. We have kept the original chosen and rejected responses in two new columns, `original_*`, for reproducibility purposes.
2. ~4K pairs have been identified as `tie`: equally bad or good.
3. Chosen scores have been added: you can now filter based on a threshold (see our distilabeled Hermes 2.5 model for an example).
4. We have kept the ratings and rationales generated with gpt-4-turbo and distilabel so you can prepare the data differently if you want.
5. We have added a column to indicate if the input is part of the gsm8k train set.

In our experiments, we got very good results by reducing the size of the dataset by more than 50%. Here's an example of how to achieve that:

```python
from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# use this:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Drop ties, low-scoring chosen responses, and inputs found in the gsm8k train set
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
```

This results in `5,922` instead of `12,859` samples (54% reduction) and leads to better performance than the same model tuned with 100% of the samples in the original dataset.

> We'd love to hear about your experiments! If you want to try this out, consider joining our [Slack community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g) and let's build some open datasets and models together.

## Reproducing the dataset

In this section, we outline the steps to reproduce this dataset.
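
The reproduction steps rest on one piece of bookkeeping: each pair is shuffled before it is shown to the judge, and the recorded `order` is later used to map the two ratings back to the original chosen/rejected labels. A minimal sketch of that bookkeeping, using only the standard library and hypothetical ratings instead of judge calls, might look like the following. The function name `status_from_ratings` and the label-tagging approach are assumptions made for illustration, not the code used to build the dataset.

```python
import random


def shuffle_and_track(chosen, rejected, rng):
    """Shuffle a pair and remember which position holds which label."""
    # Tag each text with its label so the order survives the shuffle
    pair = [("chosen", chosen), ("rejected", rejected)]
    rng.shuffle(pair)
    return {
        "generations": [text for _, text in pair],
        "order": [label for label, _ in pair],
    }


def status_from_ratings(order, rating):
    """Map positional ratings back to tie / unchanged / swapped."""
    # Missing or identical ratings count as a tie
    if rating is None or rating[0] == rating[1]:
        return "tie"
    best = 0 if rating[0] > rating[1] else 1
    # If the best-rated position held the original chosen text, nothing changes
    return "unchanged" if order[best] == "chosen" else "swapped"


rng = random.Random(0)  # seeded only to make the sketch repeatable
record = shuffle_and_track("a careful answer", "a sloppy answer", rng)
print(record["order"], record["generations"])

# Hypothetical ratings, given in the same positional order as `generations`
print(status_from_ratings(["rejected", "chosen"], [9.0, 4.0]))  # swapped
print(status_from_ratings(["chosen", "rejected"], [9.0, 4.0]))  # unchanged
print(status_from_ratings(["chosen", "rejected"], [7.0, 7.0]))  # tie
```

Tagging each text with its label before shuffling keeps `order` correct even if the two responses happen to be identical strings, a case where comparing texts after the shuffle cannot tell them apart.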

### Rate original dataset pairs

Build a preference dataset with distilabel using the original dataset:

```python
import random

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import JudgeLMTask


# Shuffle 'chosen' and 'rejected' to avoid positional bias and keep track of the order
def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}


dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# This shuffles the pairs to mitigate positional bias
dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))

# We use our JudgeLM implementation to rate the original pairs
labeler = OpenAILLM(
    task=JudgeLMTask(),
    model="gpt-4-1106-preview",
    num_threads=16,
    max_new_tokens=512,
)

dataset = dataset.rename_columns({"question": "input"})

distipipe = Pipeline(
    labeller=labeler
)

# This computes ratings and natural language critiques for each pair
ds = distipipe.generate(dataset=dataset, num_generations=2)
```

If you want to further filter and curate the dataset, you can push the dataset to [Argilla](https://github.com/argilla-io/argilla) as follows:

```python
rg_dataset = ds.to_argilla()

rg_dataset.push_to_argilla(name="your_dataset_name", workspace="your_workspace_name")
```

You get a nice UI with a lot of pre-computed metadata to explore and curate the dataset:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/IoK4nFObadhJpkVmWALZP.png)

The resulting dataset is now much more useful: we know which response is preferred (by gpt-4-turbo), which ones have low scores, and we even have natural language explanations. But what did we find? Was our intuition confirmed?

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/-V8wY1DYzrtwM9LbGrBXq.png)

The above chart shows the following:

* ~4,000 pairs were given the same rating (a tie).
* ~7,000 pairs were correct according to our AI judge (`unchanged`).
* ~2,000 times the rejected response was preferred (`swapped`).

Now the next question is: can we build better models with this new knowledge? The answer is the "distilabeled Hermes" model, check it out!

### Post-processing to add useful information

Swap rejected and chosen, and add chosen scores and status:

```python
import numpy as np


def add_status(r):
    status = "unchanged"
    # Check for missing ratings first, so np.argmax is never called on None
    if r["rating"] is None or r["rating"][0] == r["rating"][1]:
        status = "tie"
    else:
        highest_rated_idx = np.argmax(r["rating"])
        # Compare to the index of the chosen response
        if r["order"][highest_rated_idx] != "chosen":
            status = "swapped"
    return {"status": status}


def swap(r):
    chosen = r["chosen"]
    rejected = r["rejected"]
    # The chosen score is the highest of the two ratings, when ratings exist
    if r["rating"] is not None:
        chosen_score = r["rating"][np.argmax(r["rating"])]
    else:
        chosen_score = None
    if r["status"] == "swapped":
        chosen = r["rejected"]
        rejected = r["chosen"]
    return {
        "chosen": chosen,
        "rejected": rejected,
        "original_chosen": r["chosen"],
        "original_rejected": r["rejected"],
        "chosen_score": chosen_score
    }


updated = ds.map(add_status).map(swap)
```

### gsm8k "decontamination"

This is the basic approach we used for finding duplicated examples. We didn't find any from the test sets.
We experimented with lower thresholds, but below 0.8 they introduced false positives:

```python
import pandas as pd

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from datasets import load_dataset

nltk.download('punkt')

# Load the datasets
source_dataset = load_dataset("gsm8k", "main", split="train")
source_dataset_socratic = load_dataset("gsm8k", "socratic", split="train")
#target_dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
target_dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Extract the 'question' column from each dataset.
# Both gsm8k configs go into one list so indices line up with the processed list.
source_questions = list(source_dataset['question']) + list(source_dataset_socratic['question'])
target_questions = list(target_dataset['input'])

# Function to preprocess the text
def preprocess(text):
    return nltk.word_tokenize(text.lower())

# Preprocess the questions
source_questions_processed = [preprocess(q) for q in source_questions]
target_questions_processed = [preprocess(q) for q in target_questions]

# Vectorize the questions
vectorizer = TfidfVectorizer()
source_vec = vectorizer.fit_transform([' '.join(q) for q in source_questions_processed])
target_vec = vectorizer.transform([' '.join(q) for q in target_questions_processed])

# Calculate cosine similarity
similarity_matrix = cosine_similarity(source_vec, target_vec)

# Determine matches based on a threshold:
# checked manually and below 0.8 there are only false positives
threshold = 0.8
matching_pairs = []
for i, row in enumerate(similarity_matrix):
    for j, similarity in enumerate(row):
        if similarity >= threshold:
            matching_pairs.append((source_questions[i], target_questions[j], similarity))

# Create a DataFrame from the matching pairs
df = pd.DataFrame(matching_pairs, columns=['Source Question', 'Target Question', 'Similarity Score'])

# Create a set of matching target questions
matching_target_questions = set(df['Target Question'])

# Add a column to the target dataset indicating whether each question is matched
target_dataset = target_dataset.map(lambda example: {"in_gsm8k_train": example['input'] in matching_target_questions})
```

Result:

```
False    12780
True        79
Name: in_gsm8k_train
```
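
The decontamination step above fits TF-IDF weights on the gsm8k questions, projects the target questions onto the same vocabulary, and flags any target whose cosine similarity to some source question reaches the threshold. The following is a minimal standard-library sketch of that idea on toy strings; it is not the code used to build the dataset. The whitespace tokenizer, the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `flag_matches`) and the toy questions are all assumptions made for illustration, and the smoothed IDF formula is chosen here to resemble scikit-learn's default behaviour.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplification: lowercase whitespace split instead of nltk.word_tokenize
    return text.lower().split()


def tfidf_vectors(source_docs, target_docs):
    """Fit IDF on the source docs only, then weight both source and target."""
    n = len(source_docs)
    doc_freq = Counter()
    for doc in source_docs:
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}

    def vec(doc):
        # Tokens unseen in the source vocabulary are dropped
        counts = Counter(t for t in tokenize(doc) if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    return [vec(d) for d in source_docs], [vec(d) for d in target_docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_matches(source_docs, target_docs, threshold=0.8):
    """Return one boolean per target: does it match any source question?"""
    src, tgt = tfidf_vectors(source_docs, target_docs)
    return [any(cosine(s, t) >= threshold for s in src) for t in tgt]


# Toy data, invented for this sketch
source = [
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "A train travels 60 miles in 2 hours. What is its speed?",
]
target = [
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Translate this sentence into French.",
]

print(flag_matches(source, target))  # [True, False]
```

Returning one flag per target question maps directly onto a boolean column such as `in_gsm8k_train`, and avoids the membership lookup against a list of matched strings.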


Provider: maas

Created: 2024-05-09
