openeurollm/open-perfectblend-decontaminated
收藏Hugging Face2026-03-29 更新2026-04-05 收录
下载链接:
https://hf-mirror.com/datasets/openeurollm/open-perfectblend-decontaminated
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: conversations
list:
- name: from
dtype: string
- name: value
dtype: string
- name: source
dtype: string
splits:
- name: train
num_bytes: 2947802727
num_examples: 1418925
download_size: 2947802727
dataset_size: 2947802727
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: apache-2.0
decontamination:
source_dataset: mlabonne/open-perfectblend
benchmarks:
- path: HuggingFaceH4/MATH-500
subset: default
split: test
- path: HuggingFaceH4/aime_2024
subset: default
split: train
- path: math-ai/aime25
subset: default
split: test
- path: math-ai/amc23
subset: default
split: test
- path: daman1209arora/jeebench
subset: default
split: test
- path: Idavidrein/gpqa
subset: gpqa_diamond
split: train
- path: ali-elganzory/livecodebench-code_generation_lite
subset: release_v6
split: test
- path: openai/openai_humaneval
subset: openai_humaneval
split: test
- path: google-research-datasets/mbpp
subset: full
split: train+test+validation+prompt
- path: google/IFEval
subset: default
split: train
- path: tatsu-lab/alpaca_eval
subset: alpaca_eval
split: eval
- path: lmarena-ai/arena-hard-auto
subset: default
split: train
contamination_stats:
- subset: default
split: train
total: 1884616
removed: 1984
---
## Decontamination
This dataset is a decontaminated version of [mlabonne/open-perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend).
### Benchmarks used
- **MATH500**: `HuggingFaceH4/MATH-500` (subset=default, split=test)
- **AIME24**: `HuggingFaceH4/aime_2024` (subset=default, split=train)
- **AIME25**: `math-ai/aime25` (subset=default, split=test)
- **AMC23**: `math-ai/amc23` (subset=default, split=test)
- **JEEBench**: `daman1209arora/jeebench` (subset=default, split=test)
- **GPQADiamond**: `Idavidrein/gpqa` (subset=gpqa_diamond, split=train)
- **LiveCodeBench**: `ali-elganzory/livecodebench-code_generation_lite` (subset=release_v6, split=test)
- **HumanEval**: `openai/openai_humaneval` (subset=openai_humaneval, split=test)
- **MBPP**: `google-research-datasets/mbpp` (subset=full, split=train+test+validation+prompt)
- **IFEval**: `google/IFEval` (subset=default, split=train)
- **AlpacaEval**: `tatsu-lab/alpaca_eval` (subset=alpaca_eval, split=eval)
- **Arena-Hard-v2.0**: `lmarena-ai/arena-hard-auto` (subset=default, split=train) (data_files=['data/arena-hard-v2.0/question.jsonl'])
### Decontamination settings
<table>
<thead>
<tr><th>Parameter</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>N-gram size</td><td>8</td></tr>
<tr><td>Match threshold</td><td>0.5</td></tr>
</tbody>
</table>
### Split and benchmark details
<table>
<thead>
<tr>
<th>Subset</th>
<th>Split</th>
<th>Docs in split (dataset)</th>
<th>Benchmark</th>
<th>Contaminated (dataset)</th>
<th>Contamination rate (dataset)</th>
<th>Docs (benchmark)</th>
<th>Contaminated (benchmark)</th>
<th>Contamination rate (benchmark)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="12">default</td>
<td rowspan="12">train</td>
<td rowspan="12">1,884,616</td>
<td>MATH500</td>
<td>803</td>
<td>0.0426%</td>
<td>500</td>
<td>91</td>
<td>18.20%</td>
</tr>
<tr>
<td>AIME24</td>
<td>0</td>
<td>0.0000%</td>
<td>30</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>AIME25</td>
<td>2</td>
<td>0.0001%</td>
<td>30</td>
<td>1</td>
<td>3.33%</td>
</tr>
<tr>
<td>AMC23</td>
<td>3</td>
<td>0.0002%</td>
<td>40</td>
<td>2</td>
<td>5.00%</td>
</tr>
<tr>
<td>JEEBench</td>
<td>0</td>
<td>0.0000%</td>
<td>515</td>
<td>0</td>
<td>0.0000%</td>
</tr>
<tr>
<td>GPQADiamond</td>
<td>8</td>
<td>0.0004%</td>
<td>198</td>
<td>6</td>
<td>3.03%</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>2</td>
<td>0.0001%</td>
<td>1055</td>
<td>2</td>
<td>0.1896%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>374</td>
<td>0.0198%</td>
<td>164</td>
<td>117</td>
<td>71.34%</td>
</tr>
<tr>
<td>MBPP</td>
<td>468</td>
<td>0.0248%</td>
<td>974</td>
<td>227</td>
<td>23.31%</td>
</tr>
<tr>
<td>IFEval</td>
<td>10</td>
<td>0.0005%</td>
<td>541</td>
<td>5</td>
<td>0.9242%</td>
</tr>
<tr>
<td>AlpacaEval</td>
<td>309</td>
<td>0.0164%</td>
<td>805</td>
<td>85</td>
<td>10.56%</td>
</tr>
<tr>
<td>Arena-Hard-v2.0</td>
<td>10</td>
<td>0.0005%</td>
<td>750</td>
<td>6</td>
<td>0.8000%</td>
</tr>
</tbody>
</table>
### Dataset summary
<table>
<thead>
<tr><th>Metric</th><th>Value</th></tr>
</thead>
<tbody>
<tr><td>Total documents in dataset</td><td>1,884,616</td></tr>
<tr><td>Contaminated documents (removed)</td><td>1,984</td></tr>
<tr><td>Documents after decontamination</td><td>1,882,632</td></tr>
<tr><td>Contamination rate (dataset)</td><td>0.1053%</td></tr>
</tbody>
</table>
---

# 🎨 Open-PerfectBlend
Open-PerfectBlend is an open-source reproduction of the instruction dataset introduced in the paper ["The Perfect Blend: Redefining RLHF with Mixture of Judges"](https://arxiv.org/abs/2409.20370).
It's a solid general-purpose instruction dataset with chat, math, code, and instruction-following data.
## Data source

Here is the list of the datasets used in this mix:
| Dataset | # Samples |
|------|------|
| [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) | 395,000 |
| [openbmb/UltraInteract_sft](https://huggingface.co/datasets/openbmb/UltraInteract_sft) | 288,579 |
| [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 207,865 |
| [microsoft/orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) | 200,035 |
| [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) | 187,405 |
| [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 111,272 |
| [Post-training-Data-Flywheel/AutoIF-instruct-61k](https://huggingface.co/datasets/Post-training-Data-Flywheel/AutoIF-instruct-61k) | 61,492 |
| [mlabonne/lmsys-arena-human-preference-55k-sharegpt](https://huggingface.co/datasets/mlabonne/lmsys-arena-human-preference-55k-sharegpt) | 57,362 |
The deduplication process removed 88.1k samples across all datasets. All of these datasets use either an Apache 2.0 or MIT license.
Thanks to OpenBMB, MetaMath, Hugging Face, Microsoft, theblackcat102, Post-training-Data-Flywheel, and LMSYS for the data!
## Comparison
Here is the extract from the paper with the dataset mixture:

There are two main differences with the dataset described in the paper:
* Instruction-following data comes from another source because Meta didn't release their dataset.
* The harmful intent hasn't been released either, so I didn't add any data in this category.
提供机构:
openeurollm



