ibm-research/900K-Judgements
Source: Hugging Face · Updated: 2026-03-18 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/ibm-research/900K-Judgements
Resource description:
---
license: cdla-permissive-2.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- llm-evaluation
- llm-as-a-judge
- pairwise-comparison
- model-evaluation
- benchmark
size_categories:
- 100K<n<1M
---
# 900K Judgements: A Large-Scale LLM-as-a-Judge Evaluation Dataset
## Dataset Description
This dataset contains approximately 900,000 pairwise comparison judgements from multiple LLM judges evaluating model responses. The data was collected as part of the paper ["Mediocrity is the key for LLM as a Judge Anchor Selection"](https://arxiv.org/abs/2603.16848), which investigates the impact of anchor selection in LLM-as-a-judge pairwise evaluation.
### Dataset Summary
- **Total Evaluations:** ~900K pairwise judgements
- **Judge Models:** 5 different LLM judges
- **Evaluation Format:** Pairwise comparisons with confidence levels
- **Domain:** Open-ended text generation evaluation
- **Base Datasets:** Arena-Hard-v2.0 and AlpacaEval
- **Evaluated Models:** 22 models on Arena-Hard-v2.0, 11 models on AlpacaEval
### Paper Abstract
> The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
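The cost argument and the problem with extreme anchors can be illustrated with a small sketch. The helper names and the scoring scheme (win = 1, tie = 0.5, loss = 0) are assumptions made for illustration and are not taken from the paper.

```python
def all_pairs_count(n_models: int) -> int:
    """Unordered pairs when every model is compared with every other model."""
    return n_models * (n_models - 1) // 2


def anchored_count(n_models: int) -> int:
    """Pairs when every model is compared only against a single anchor."""
    return n_models - 1


def win_rate_vs_anchor(outcomes):
    """Average score of one model against the anchor.

    `outcomes` holds 'win' / 'tie' / 'loss' from the model's point of view.
    The 1 / 0.5 / 0 scoring is an illustrative assumption.
    """
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    outcomes = list(outcomes)
    return sum(score[o] for o in outcomes) / len(outcomes)


# 22 models: quadratic all-pairs cost vs. linear anchor-based cost
print(all_pairs_count(22), anchored_count(22))  # 231 21

# Against a mid-range anchor, two models can be told apart ...
print(win_rate_vs_anchor(["win", "win", "tie", "loss"]))    # 0.625
print(win_rate_vs_anchor(["win", "loss", "loss", "loss"]))  # 0.25

# ... but against an anchor that beats everyone, both collapse to the same value
print(win_rate_vs_anchor(["loss"] * 4), win_rate_vs_anchor(["loss"] * 4))  # 0.0 0.0
```

The last line mirrors the abstract's point: when an anchor wins or loses almost every comparison, the resulting win rates carry little information about how the other models rank relative to each other.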
## Dataset Structure
### Data Fields
Each row in the dataset represents a single pairwise comparison judgement with the following fields:
- **`uid`** (string): Unique identifier for the comparison instance (format: `{instance_id}_{model_a}_{model_b}`). UIDs that start with `eval_` come from AlpacaEval.
- **`order_randomized`** (string): Order in which the two models were presented in the prompt after randomization (`"original"` or `"swapped"`)
- **`model_a_in_prompt`** (string): Name of the first model presented to the judge
- **`model_b_in_prompt`** (string): Name of the second model presented to the judge
- **`raw_judgment`** (string): Complete raw text output from the judge model
- **`extracted_verdict`** (string): Parsed verdict from the judgement (e.g., `"A>>B"`, `"A>B"`, `"A=B"`, `"B>A"`, `"B>>A"`)
- **`confidence`** (string): Confidence level of the judgement (`"significantly"`, `"slightly"`, or `"tie"`)
- **`final_verdict`** (string): Verdict mapped back to original model names (e.g., `"model1>>model2"`)
- **`timestamp`** (string): ISO 8601 timestamp of when the judgement was made
- **`judge_model`** (string): Identifier of the LLM judge that made the evaluation
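The relationship between `extracted_verdict`, the two `model_*_in_prompt` fields, and `final_verdict` can be sketched as follows. `to_final_verdict` is a hypothetical helper, and the assumption that `final_verdict` is obtained by substituting the model names for `A` and `B` is inferred from the field descriptions above; the dataset's actual mapping code may differ.

```python
def to_final_verdict(extracted_verdict: str,
                     model_a_in_prompt: str,
                     model_b_in_prompt: str) -> str:
    """Replace the positional labels A/B with the names shown to the judge.

    Hypothetical helper: assumes the verdict has the form <label><op><label>
    with op in {">>", ">", "="}.
    """
    names = {"A": model_a_in_prompt, "B": model_b_in_prompt}
    verdict = extracted_verdict.strip()
    # Check ">>" before ">" so that "A>>B" is not split on a single ">"
    for op in (">>", ">", "="):
        if op in verdict:
            left, right = verdict.split(op)
            return f"{names[left]}{op}{names[right]}"
    raise ValueError(f"Unrecognized verdict: {extracted_verdict!r}")


print(to_final_verdict("A>>B", "model1", "model2"))  # model1>>model2
print(to_final_verdict("B>A", "model1", "model2"))   # model2>model1
```

Because `final_verdict` is expressed in model names rather than prompt positions, it is the safer field to compare across rows whose prompt order may differ.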
### Verdict Format
Verdicts follow a standardized format:
- `A>>B` or `B>>A`: The model on the left is significantly better than the model on the right
- `A>B` or `B>A`: The model on the left is slightly better than the model on the right
- `A=B`: Models are tied/equivalent
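As a rough illustration of this format, the sketch below splits a verdict string into a winner and a confidence level, using the same labels as the `confidence` field. `parse_verdict` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# One positional label, an operator, then the other positional label
_VERDICT_RE = re.compile(r"^([AB])(>>|>|=)([AB])$")
_CONFIDENCE = {">>": "significantly", ">": "slightly", "=": "tie"}


def parse_verdict(verdict: str):
    """Return (winner, confidence); winner is None for a tie.

    Raises ValueError for strings that do not follow the verdict format,
    including ones that name the same position on both sides.
    """
    match = _VERDICT_RE.match(verdict.strip())
    if match is None or match.group(1) == match.group(3):
        raise ValueError(f"Malformed verdict: {verdict!r}")
    left, op, _right = match.groups()
    winner = None if op == "=" else left
    return winner, _CONFIDENCE[op]


print(parse_verdict("A>>B"))  # ('A', 'significantly')
print(parse_verdict("B>A"))   # ('B', 'slightly')
print(parse_verdict("A=B"))   # (None, 'tie')
```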
## Dataset Creation
### Source Data
The evaluations are based on two widely-used benchmarks:
- **Arena-Hard-v2.0:** Contains challenging user queries designed to test advanced model capabilities
- **AlpacaEval:** A comprehensive evaluation suite for instruction-following models
These datasets provide diverse and representative test cases for evaluating open-ended text generation.
#### Evaluated Models
**Arena-Hard-v2.0 (22 models):**
- Gemma 3 27B Instruct
- Qwen3 30B A3B
- o1
- o3 Mini
- Claude 3.7 Sonnet thinking 16k
- Athene V2 Chat
- Claude 3.5 Sonnet
- o3 Mini High
- GPT-4.5 (Preview)
- QwQ 32B
- GPT-4.1
- GPT-4.1 Mini
- GPT-4.1 Nano
- Qwen3 32B
- o4 Mini
- DeepSeek-R1
- Llama 3.1 Nemotron 70B Instruct
- Qwen2.5 72B Instruct
- Gemini 2.5 Flash
- Qwen3 235B A22B
- Llama 4 Maverick Instruct
- o3
**AlpacaEval (11 models):**
- Mixtral 8x22B Instruct
- Qwen2 72B Instruct
- GPT-3.5 Turbo
- Claude 3.5 Sonnet
- Yi 34B Chat
- GPT-4 Turbo
- Llama 3.1 405B Instruct
- Guanaco 65B
- GPT-4o
- GPT-4 Turbo (Preview)
- Falcon 40B Instruct
### Judge Models
The dataset includes evaluations from 5 different LLM judge models, providing diverse perspectives on model performance. Each judge evaluated the same model pairs, allowing for inter-judge agreement analysis.
The five judge models are:
1. **DeepSeek-V3** - Advanced reasoning model from DeepSeek
2. **GPT-OSS 120B** - Large-scale open-source GPT model (120B parameters)
3. **GPT-OSS 20B** - Smaller open-source GPT model (20B parameters)
4. **Qwen3 235B-A22B Instruct** - Instruction-tuned Qwen3 model (235B parameters)
5. **Qwen3 8B** - Compact Qwen3 model (8B parameters)
### Annotation Process
1. **Model Response Collection:** Responses were collected from 22 models for Arena-Hard-v2.0 and 11 models for AlpacaEval
2. **Pairwise Comparison:** Each model pair was evaluated by multiple judge models
3. **Order Randomization:** Model order was randomized to control for position bias
4. **Structured Output:** Judges provided verdicts with confidence levels
5. **Quality Control:** Duplicate evaluations were removed based on (`uid`, `judge_model`) pairs
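Steps 3 and 5 above can be sketched in plain Python. Both helpers are hypothetical: the actual randomization scheme and the rule for which duplicate is kept are not described in this card, so the coin flip and the keep-first policy below are assumptions.

```python
import random


def present_pair(model_a: str, model_b: str, rng: random.Random):
    """Randomly decide which model is shown first to the judge.

    Returns (order_randomized, model_a_in_prompt, model_b_in_prompt),
    mirroring the field names of the dataset.
    """
    if rng.random() < 0.5:
        return "original", model_a, model_b
    return "swapped", model_b, model_a


def deduplicate(rows):
    """Keep the first row seen for each (uid, judge_model) pair.

    Keep-first is an assumption; the card does not say which copy was kept.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["uid"], row["judge_model"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"uid": "q1_m1_m2", "judge_model": "judge-x", "extracted_verdict": "A>B"},
    {"uid": "q1_m1_m2", "judge_model": "judge-x", "extracted_verdict": "A>B"},
    {"uid": "q1_m1_m2", "judge_model": "judge-y", "extracted_verdict": "B>A"},
]
print(len(deduplicate(rows)))  # 2
print(present_pair("m1", "m2", random.Random(0))[0] in ("original", "swapped"))  # True
```

With pandas, the same deduplication would be `df.drop_duplicates(subset=["uid", "judge_model"])`, which also keeps the first occurrence by default.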
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("ibm-research/900K-Judgements")
# Access the data
df = dataset['train'].to_pandas()
print(f"Total evaluations: {len(df)}")
print(f"Judge models: {df['judge_model'].unique()}")
```
### Example: Analyzing Judge Agreement
```python
import pandas as pd
from datasets import load_dataset
# Load dataset
dataset = load_dataset("ibm-research/900K-Judgements")
df = dataset['train'].to_pandas()
# Find the majority verdict per comparison across judges.
# `final_verdict` uses the original model names, so it stays comparable
# even if the prompt order differed between judges.
agreement_df = df.groupby('uid').agg({
    'final_verdict': lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None,
    'judge_model': 'count'
}).rename(columns={'final_verdict': 'majority_verdict', 'judge_model': 'num_judges'})
print(f"Comparisons evaluated by multiple judges: {(agreement_df['num_judges'] > 1).sum()}")
```
### Example: Filtering by Judge Model
```python
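# Get evaluations from a specific judge (assumes `df` from the loading example above).
# Check df['judge_model'].unique() for the exact identifier strings.
judge_name = "deepseek-ai/DeepSeek-V3"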
judge_evals = df[df['judge_model'] == judge_name]
print(f"Evaluations by {judge_name}: {len(judge_evals)}")
```
### Example: Analyzing Confidence Levels
```python
# Distribution of confidence levels
confidence_dist = df['confidence'].value_counts()
print("Confidence distribution:")
print(confidence_dist)
# Significant verdicts only
significant = df[df['confidence'] == 'significantly']
print(f"\nSignificant verdicts: {len(significant)} ({len(significant)/len(df)*100:.1f}%)")
```
## Considerations for Using the Data
### Biases and Limitations
1. **Judge Model Biases:** Different judge models may have inherent biases toward certain response styles or models
2. **Position Bias:** Despite randomization, some position bias may remain
3. **Prompt Sensitivity:** Judge verdicts can be sensitive to prompt formatting
4. **Domain Coverage:** Evaluations are based on Arena-Hard-v2.0 and AlpacaEval, which may not cover all use cases
5. **Temporal Effects:** Model capabilities and judge behavior may change over time
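A rough diagnostic for residual position bias is to check how often the first-presented response wins among decided (non-tie) verdicts. The sketch below is an illustrative check on `extracted_verdict` strings, not a procedure taken from the paper; `first_position_win_rate` is a hypothetical helper.

```python
from collections import Counter


def first_position_win_rate(verdicts):
    """Share of decided verdicts won by position A (the first response shown).

    Verdicts are winner-first strings such as "A>B" or "B>>A"; ties ("A=B")
    are ignored. Returns NaN if there are no decided verdicts.
    """
    winners = Counter(v.strip()[0] for v in verdicts if "=" not in v)
    decided = winners["A"] + winners["B"]
    return winners["A"] / decided if decided else float("nan")


sample = ["A>B", "A>>B", "A>B", "B>A", "A=B"]
print(first_position_win_rate(sample))  # 0.75
```

If presentation order is randomized over many comparisons, a value far from 0.5 would suggest that a judge favors one position. The threshold for "far" depends on sample size and is left open here.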
### Ethical Considerations
- **Model Evaluation Fairness:** Results should not be used as the sole metric for model quality
- **Judge Reliability:** Multiple judges should be consulted for critical decisions
- **Transparency:** The limitations of LLM-as-a-judge evaluation should be clearly communicated
## Citation
If you use this dataset in your research, please cite:
```
@misc{donyehiya2026mediocritykeyllmjudge,
  title={Mediocrity is the key for LLM as a Judge Anchor Selection},
  author={Shachar Don-Yehiya and Asaf Yehudai and Leshem Choshen and Omri Abend},
  year={2026},
  eprint={2603.16848},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.16848},
}
```
## License
This dataset is released under the Community Data License Agreement – Permissive, Version 2.0 (CDLA-Permissive-2.0).
## Contact
For questions or issues regarding this dataset, please open an issue on the dataset repository or contact the authors.
## Acknowledgments
This research was conducted at IBM Research. We thank the creators of Arena-Hard-v2.0 and AlpacaEval for providing the base evaluation frameworks.
Provided by:
ibm-research



