iknow-lab/JudgeBias-DPO-RefFree-testset
Source: Hugging Face. Updated 2026-03-12; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/iknow-lab/JudgeBias-DPO-RefFree-testset
Description:
---
language:
- en
size_categories:
- 1K<n<10K
task_categories:
- text-generation
tags:
- dpo
- preference
- llm-as-a-judge
- debiasing
- materials-science
- evaluation
- benchmark
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: chosen
    dtype: string
  - name: rejected
    dtype: string
  - name: score_chosen
    dtype: float64
  - name: score_rejected
    dtype: float64
  - name: score_delta
    dtype: float64
  - name: anchor_score
    dtype: float64
  - name: sample_id
    dtype: int64
  - name: Material_Name
    dtype: string
  - name: domain
    dtype: string
  - name: process
    dtype: string
  - name: perturbation_type
    dtype: string
  - name: perturbation_category
    dtype: string
  - name: perturbation_rate
    dtype: float64
  - name: chosen_model
    dtype: string
  - name: rejected_model
    dtype: string
  - name: dataset_name
    dtype: string
  splits:
  - name: test
    num_examples: 1000
  config_name: default
configs:
- config_name: default
  data_files:
  - split: test
    path: test.parquet
---
# JudgeBias-DPO-RefFree-testset
A fixed **evaluation benchmark** (1,000 samples) for assessing DPO-trained LLM judges on materials science synthesis recipe evaluation in a **reference-free** setting.
## Purpose
This test set supports two evaluation methods:
### 1. Reward Accuracy (Log-Probability)
Use `prompt`, `chosen`, and `rejected` to compute implicit reward accuracy without generation:
```python
reward_chosen = log P(chosen | prompt)
reward_rejected = log P(rejected | prompt)
accuracy = mean(reward_chosen > reward_rejected)
```
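The pseudocode above can be sketched as runnable Python once per-token log-probabilities are available from the judge model. This is a minimal sketch under stated assumptions: the helper names `sequence_logprob` and `reward_accuracy`, and the 0/1 mask that marks response tokens, are illustrative and not part of the dataset or of any library. The model forward pass that produces the log-probabilities is left out.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float], response_mask: Sequence[int]) -> float:
    """Sum per-token log-probs over response tokens only (mask == 1).

    Hypothetical helper: prompt tokens carry mask 0, so the sum
    approximates log P(response | prompt).
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have equal length")
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)


def reward_accuracy(logp_chosen: Sequence[float], logp_rejected: Sequence[float]) -> float:
    """Fraction of pairs in which the chosen response scores strictly higher."""
    if len(logp_chosen) != len(logp_rejected) or not logp_chosen:
        raise ValueError("need two non-empty sequences of equal length")
    wins = sum(c > r for c, r in zip(logp_chosen, logp_rejected))
    return wins / len(logp_chosen)


# Toy numbers, not taken from the dataset: first token is the prompt (mask 0).
chosen = [
    sequence_logprob([-0.5, -0.2, -0.1], [0, 1, 1]),
    sequence_logprob([-0.4, -2.0, -1.5], [0, 1, 1]),
]
rejected = [
    sequence_logprob([-0.5, -1.0, -0.9], [0, 1, 1]),
    sequence_logprob([-0.4, -0.3, -0.2], [0, 1, 1]),
]
print(reward_accuracy(chosen, rejected))
```

The strict `>` mirrors the formula above, so a tie between `chosen` and `rejected` counts as a miss.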
### 2. Generation-Based Evaluation (Anchor-Referenced)
Generate judge responses from `prompt`, parse `overall_score`, and compare against `anchor_score`:
- **Representational samples** (`perturbation_category == "represent"`):
  - `repr_bias = mean(|generated_score - anchor_score|)` (lower is better)
- **Error samples** (`perturbation_category == "error"`):
  - `error_sensitivity = mean(anchor_score - generated_score)` (higher is better)
> `score_chosen` should NOT be used as ground truth — it is the relatively better score within a pair, not an absolute reference. Use `anchor_score` (repr consensus median) instead.
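The two anchor-referenced metrics can be sketched as follows, assuming the judge's `overall_score` has already been parsed from each generated response. The key name `generated_score` and the function name `anchor_metrics` are hypothetical; only `perturbation_category` and `anchor_score` are actual dataset columns.

```python
from statistics import mean
from typing import Iterable, Mapping


def anchor_metrics(rows: Iterable[Mapping]) -> dict:
    """Compute repr_bias and error_sensitivity from scored rows.

    Each row needs `perturbation_category`, `anchor_score`, and a
    `generated_score` key (hypothetical name for the parsed overall_score).
    """
    repr_gaps, error_drops = [], []
    for row in rows:
        diff = row["anchor_score"] - row["generated_score"]
        if row["perturbation_category"] == "represent":
            repr_gaps.append(abs(diff))   # absolute deviation from the anchor
        elif row["perturbation_category"] == "error":
            error_drops.append(diff)      # signed drop below the anchor
    return {
        "repr_bias": mean(repr_gaps) if repr_gaps else None,
        "error_sensitivity": mean(error_drops) if error_drops else None,
    }


# Toy rows, not taken from the dataset.
rows = [
    {"perturbation_category": "represent", "anchor_score": 4.0, "generated_score": 3.5},
    {"perturbation_category": "represent", "anchor_score": 4.0, "generated_score": 4.5},
    {"perturbation_category": "error", "anchor_score": 4.0, "generated_score": 2.0},
    {"perturbation_category": "error", "anchor_score": 3.0, "generated_score": 2.0},
]
print(anchor_metrics(rows))
```

Note that `error_sensitivity` is signed: a judge that scores an error-perturbed recipe above its anchor lowers the metric, whereas `repr_bias` penalizes deviation in either direction.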
## Composition
Stratified sample from [JudgeBias-DPO-RefFree-subset-10k](https://huggingface.co/datasets/iknow-lab/JudgeBias-DPO-RefFree-subset-10k). The nine source datasets are balanced at roughly 110 samples each (98 to 131 per dataset).
| Dataset | Category | Samples |
|---|---|---|
| `action_antonym_100pct` | error | 106 |
| `all_error_perturbation_15pct` | error | 98 |
| `element_substitution_100pct` | error | 131 |
| `equipment_substitution_100pct` | error | 102 |
| `numerical_perturbation_100pct` | error | 119 |
| `llm_representational_perturbation_15pct` | represent | 114 |
| `llm_to_formula_100pct` | represent | 117 |
| `llm_to_iupac_100pct` | represent | 98 |
| `llm_to_name_100pct` | represent | 115 |

| Metric | Value |
|---|---|
| Total | 1,000 |
| Error / Representational | 556 (56%) / 444 (44%) |
| Unique samples | 933 |
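As a quick consistency check, the per-dataset counts in the first table can be tallied by category and compared with the summary figures. The sketch below only re-adds the numbers printed above; the constant name `COMPOSITION` is illustrative.

```python
from collections import Counter

# (dataset_name, category, samples), copied from the composition table.
COMPOSITION = [
    ("action_antonym_100pct", "error", 106),
    ("all_error_perturbation_15pct", "error", 98),
    ("element_substitution_100pct", "error", 131),
    ("equipment_substitution_100pct", "error", 102),
    ("numerical_perturbation_100pct", "error", 119),
    ("llm_representational_perturbation_15pct", "represent", 114),
    ("llm_to_formula_100pct", "represent", 117),
    ("llm_to_iupac_100pct", "represent", 98),
    ("llm_to_name_100pct", "represent", 115),
]

per_category = Counter()
for _name, category, n_samples in COMPOSITION:
    per_category[category] += n_samples

total = sum(per_category.values())
print(dict(per_category), total)  # {'error': 556, 'represent': 444} 1000
```

The same tally should be reproducible on the loaded data by counting the `dataset_name` and `perturbation_category` columns.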
## Usage
```python
from datasets import load_dataset
testset = load_dataset("iknow-lab/JudgeBias-DPO-RefFree-testset", split="test")
```
Provider:
iknow-lab



