iknow-lab/JudgeBias-DPO-RefFree-testset

Name: iknow-lab/JudgeBias-DPO-RefFree-testset
Creator: iknow-lab
Published: 2026-03-12 05:35:40
License: No description available

Source: Hugging Face. Updated 2026-03-12; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/iknow-lab/JudgeBias-DPO-RefFree-testset


Description:

---
language:
- en
size_categories:
- 1K<n<10K
task_categories:
- text-generation
tags:
- dpo
- preference
- llm-as-a-judge
- debiasing
- materials-science
- evaluation
- benchmark
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: chosen
    dtype: string
  - name: rejected
    dtype: string
  - name: score_chosen
    dtype: float64
  - name: score_rejected
    dtype: float64
  - name: score_delta
    dtype: float64
  - name: anchor_score
    dtype: float64
  - name: sample_id
    dtype: int64
  - name: Material_Name
    dtype: string
  - name: domain
    dtype: string
  - name: process
    dtype: string
  - name: perturbation_type
    dtype: string
  - name: perturbation_category
    dtype: string
  - name: perturbation_rate
    dtype: float64
  - name: chosen_model
    dtype: string
  - name: rejected_model
    dtype: string
  - name: dataset_name
    dtype: string
  splits:
  - name: test
    num_examples: 1000
  config_name: default
configs:
- config_name: default
  data_files:
  - split: test
    path: test.parquet
---

# JudgeBias-DPO-RefFree-testset

A fixed **evaluation benchmark** (1,000 samples) for assessing DPO-trained LLM judges on materials science synthesis recipe evaluation in a **reference-free** setting.

## Purpose

This test set supports two evaluation methods:

### 1. Reward Accuracy (Log-Probability)

Use `prompt`, `chosen`, and `rejected` to compute implicit reward accuracy without generation:

```python
# Pseudocode: log P(y | prompt) is the judge model's log-probability of response y
reward_chosen = log P(chosen | prompt)
reward_rejected = log P(rejected | prompt)
accuracy = mean(reward_chosen > reward_rejected)
```

### 2. Generation-Based Evaluation (Anchor-Referenced)

Generate judge responses from `prompt`, parse `overall_score`, and compare against `anchor_score`:

- **Representational samples** (`perturbation_category == "represent"`):
  - `repr_bias = mean(|generated_score - anchor_score|)` (lower is better)
- **Error samples** (`perturbation_category == "error"`):
  - `error_sensitivity = mean(anchor_score - generated_score)` (higher is better)

> `score_chosen` should NOT be used as ground truth: it is only the better score within a pair, not an absolute reference. Use `anchor_score` (the representational consensus median) instead.

## Composition

Stratified sample from [JudgeBias-DPO-RefFree-subset-10k](https://huggingface.co/datasets/iknow-lab/JudgeBias-DPO-RefFree-subset-10k). The 9 source datasets are balanced at roughly 110 samples each.

| Dataset | Category | Samples |
|---|---|---|
| `action_antonym_100pct` | error | 106 |
| `all_error_perturbation_15pct` | error | 98 |
| `element_substitution_100pct` | error | 131 |
| `equipment_substitution_100pct` | error | 102 |
| `numerical_perturbation_100pct` | error | 119 |
| `llm_representational_perturbation_15pct` | represent | 114 |
| `llm_to_formula_100pct` | represent | 117 |
| `llm_to_iupac_100pct` | represent | 98 |
| `llm_to_name_100pct` | represent | 115 |

| Metric | Value |
|---|---|
| Total | 1,000 |
| Error / Representational | 556 (56%) / 444 (44%) |
| Unique samples | 933 |

## Usage

```python
from datasets import load_dataset

testset = load_dataset("iknow-lab/JudgeBias-DPO-RefFree-testset", split="test")
```
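
The two evaluation methods described under Purpose reduce to simple aggregations once the model outputs are available. The following is a minimal sketch, assuming the per-sample log-probabilities and the parsed `overall_score` values have already been computed elsewhere. The function names `reward_accuracy` and `anchor_metrics` are illustrative and not part of the dataset or any library; only the column names `perturbation_category` and `anchor_score` come from the dataset schema.

```python
from statistics import mean


def reward_accuracy(logp_chosen, logp_rejected):
    """Fraction of pairs where log P(chosen | prompt) > log P(rejected | prompt).

    Both arguments are equal-length sequences of floats, assumed to be
    precomputed by the caller (this sketch does not run a model).
    """
    return mean(1.0 if c > r else 0.0 for c, r in zip(logp_chosen, logp_rejected))


def anchor_metrics(rows, generated_scores):
    """Compute repr_bias and error_sensitivity against anchor_score.

    rows: iterable of dicts with 'perturbation_category' and 'anchor_score'.
    generated_scores: parsed overall_score per row, in the same order.
    """
    repr_diffs, error_diffs = [], []
    for row, generated in zip(rows, generated_scores):
        if row["perturbation_category"] == "represent":
            # Absolute deviation from the anchor: lower is better.
            repr_diffs.append(abs(generated - row["anchor_score"]))
        elif row["perturbation_category"] == "error":
            # Signed drop below the anchor: higher is better.
            error_diffs.append(row["anchor_score"] - generated)
    return {
        "repr_bias": mean(repr_diffs) if repr_diffs else None,
        "error_sensitivity": mean(error_diffs) if error_diffs else None,
    }


# Toy values for illustration only; these are not taken from the dataset.
example_rows = [
    {"perturbation_category": "represent", "anchor_score": 4.0},
    {"perturbation_category": "error", "anchor_score": 4.0},
]
example_metrics = anchor_metrics(example_rows, [3.5, 2.0])
```

Rows loaded with `load_dataset` behave like dicts when iterated, so `anchor_metrics(testset, scores)` should work directly as long as `scores` follows the dataset order. Returning `None` for an empty category is a design choice of this sketch, so that a filtered subset with only one category does not raise.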

Provided by: iknow-lab
