openeurollm/open-perfectblend-decontaminated

Name: openeurollm/open-perfectblend-decontaminated
Creator: openeurollm
Published: 2026-03-29 19:45:51
License: 暂无描述

Hugging Face2026-03-29 更新2026-04-05 收录

下载链接：

https://hf-mirror.com/datasets/openeurollm/open-perfectblend-decontaminated

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: conversations list: - name: from dtype: string - name: value dtype: string - name: source dtype: string splits: - name: train num_bytes: 2947802727 num_examples: 1418925 download_size: 2947802727 dataset_size: 2947802727 configs: - config_name: default data_files: - split: train path: data/train-* license: apache-2.0 decontamination: source_dataset: mlabonne/open-perfectblend benchmarks: - path: HuggingFaceH4/MATH-500 subset: default split: test - path: HuggingFaceH4/aime_2024 subset: default split: train - path: math-ai/aime25 subset: default split: test - path: math-ai/amc23 subset: default split: test - path: daman1209arora/jeebench subset: default split: test - path: Idavidrein/gpqa subset: gpqa_diamond split: train - path: ali-elganzory/livecodebench-code_generation_lite subset: release_v6 split: test - path: openai/openai_humaneval subset: openai_humaneval split: test - path: google-research-datasets/mbpp subset: full split: train+test+validation+prompt - path: google/IFEval subset: default split: train - path: tatsu-lab/alpaca_eval subset: alpaca_eval split: eval - path: lmarena-ai/arena-hard-auto subset: default split: train contamination_stats: - subset: default split: train total: 1884616 removed: 1984 --- ## Decontamination This dataset is a decontaminated version of [mlabonne/open-perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend). ### Benchmarks used - **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test) - **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train) - **AIME25**: `math-ai/aime25` (subset=default, split=test) - **AMC23**: `math-ai/amc23` (subset=default, split=test) - **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test) - **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train) - **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test) - **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test) - **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt) - **IFEval**: `google/IFEval` (subset=default, split=train) - **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval) - **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train) (data_files=['data/arena-hard-v2.0/question.jsonl']) ### Decontamination settings <table> <thead> <tr><th>Parameter</th><th>Value</th></tr> </thead> <tbody> <tr><td>N-gram size</td><td>8</td></tr> <tr><td>Match threshold</td><td>0.5</td></tr> </tbody> </table> ### Split and benchmark details <table> <thead> <tr> <th>Subset</th> <th>Split</th> <th>Docs in split (dataset)</th> <th>Benchmark</th> <th>Contaminated (dataset)</th> <th>Contamination rate (dataset)</th> <th>Docs (benchmark)</th> <th>Contaminated (benchmark)</th> <th>Contamination rate (benchmark)</th> </tr> </thead> <tbody> <tr> <td rowspan="12">default</td> <td rowspan="12">train</td> <td rowspan="12">1,884,616</td> <td>MATH500</td> <td>803</td> <td>0.0426%</td> <td>500</td> <td>91</td> <td>18.20%</td> </tr> <tr> <td>AIME24</td> <td>0</td> <td>0.0000%</td> <td>30</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>AIME25</td> <td>2</td> <td>0.0001%</td> <td>30</td> <td>1</td> <td>3.33%</td> </tr> <tr> <td>AMC23</td> <td>3</td> <td>0.0002%</td> <td>40</td> <td>2</td> <td>5.00%</td> </tr> <tr> <td>JEEBench</td> <td>0</td> <td>0.0000%</td> <td>515</td> <td>0</td> <td>0.0000%</td> </tr> <tr> <td>GPQADiamond</td> <td>8</td> <td>0.0004%</td> <td>198</td> <td>6</td> <td>3.03%</td> </tr> <tr> <td>LiveCodeBench</td> <td>2</td> <td>0.0001%</td> <td>1055</td> <td>2</td> <td>0.1896%</td> </tr> <tr> <td>HumanEval</td> <td>374</td> <td>0.0198%</td> <td>164</td> <td>117</td> <td>71.34%</td> </tr> <tr> <td>MBPP</td> <td>468</td> <td>0.0248%</td> <td>974</td> <td>227</td> <td>23.31%</td> </tr> <tr> <td>IFEval</td> <td>10</td> <td>0.0005%</td> <td>541</td> <td>5</td> <td>0.9242%</td> </tr> <tr> <td>AlpacaEval</td> <td>309</td> <td>0.0164%</td> <td>805</td> <td>85</td> <td>10.56%</td> </tr> <tr> <td>Arena-Hard-v2.0</td> <td>10</td> <td>0.0005%</td> <td>750</td> <td>6</td> <td>0.8000%</td> </tr> </tbody> </table> ### Dataset summary <table> <thead> <tr><th>Metric</th><th>Value</th></tr> </thead> <tbody> <tr><td>Total documents in dataset</td><td>1,884,616</td></tr> <tr><td>Contaminated documents (removed)</td><td>1,984</td></tr> <tr><td>Documents after decontamination</td><td>1,882,632</td></tr> <tr><td>Contamination rate (dataset)</td><td>0.1053%</td></tr> </tbody> </table> --- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/O0draAVUeUZI9qRMglywA.png) # 🎨 Open-PerfectBlend Open-PerfectBlend is an open-source reproduction of the instruction dataset introduced in the paper ["The Perfect Blend: Redefining RLHF with Mixture of Judges"](https://arxiv.org/abs/2409.20370). It's a solid general-purpose instruction dataset with chat, math, code, and instruction-following data. ## Data source ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/rQ7db032OcjTZ2i2cpvY7.png) Here is the list of the datasets used in this mix: | Dataset | # Samples | |------|------| | [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) | 395,000 | | [openbmb/UltraInteract_sft](https://huggingface.co/datasets/openbmb/UltraInteract_sft) | 288,579 | | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 207,865 | | [microsoft/orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) | 200,035 | | [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) | 187,405 | | [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 111,272 | | [Post-training-Data-Flywheel/AutoIF-instruct-61k](https://huggingface.co/datasets/Post-training-Data-Flywheel/AutoIF-instruct-61k) | 61,492 | | [mlabonne/lmsys-arena-human-preference-55k-sharegpt](https://huggingface.co/datasets/mlabonne/lmsys-arena-human-preference-55k-sharegpt) | 57,362 | The deduplication process removed 88.1k samples across all datasets. All of these datasets use either an Apache 2.0 or MIT license. Thanks to OpenBMB, MetaMath, Hugging Face, Microsoft, theblackcat102, Post-training-Data-Flywheel, and LMSYS for the data! ## Comparison Here is the extract from the paper with the dataset mixture: ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/QObNW7erIVb_WM8C3hxDo.png) There are two main differences with the dataset described in the paper: * Instruction-following data comes from another source because Meta didn't release their dataset. * The harmful intent hasn't been released either, so I didn't add any data in this category.

提供机构：

openeurollm

5,000+

优质数据集

54 个

任务类型

进入经典数据集