JiaqiXue/R2-Bench
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/JiaqiXue/R2-Bench
Resource description:
---
license: mit
task_categories:
- text-generation
tags:
- llm-routing
- benchmark
- quality-prediction
- token-budget
pretty_name: R2-Bench
size_categories:
- 10K<n<100K
---
# R2-Bench
R2-Bench is a benchmark dataset for evaluating LLM routing with joint model and token budget optimization. It contains 30,968 queries evaluated across 10 LLMs at 16 token budget levels, with LLM-judge quality scores.
> Associated with **R2-Router** ([code](https://github.com/jqxue1999/router/tree/release-routerarena-public)), under review at ICML 2026.
## Dataset Structure
```
data/
├── meta-llama/
│   ├── Llama-3.1-70B-Instruct/
│   │   ├── 10_judge.csv
│   │   ├── 20_judge.csv
│   │   ├── ...
│   │   └── 8000_judge.csv
│   └── Llama-3.2-3B-Instruct/
├── Qwen/
│   ├── Qwen3-235B-A22B-Instruct-2507/
│   ├── Qwen3-Next-80B-A3B-Instruct/
│   ├── Qwen3-30B-A3B-Instruct-2507/
│   ├── Qwen2.5-Math-7B-Instruct/
│   ├── Qwen2.5-Math-1.5B-Instruct/
│   └── Qwen3-0.6B/
└── zai-org/
    ├── GLM-4.5-Air/
    └── GLM-4.6/
```
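The layout above maps each (model, budget) pair to one CSV file. A minimal sketch of building those paths follows; the helper names `judge_csv_path` and `all_judge_csv_paths` are hypothetical and not part of any released package, and the sketch assumes every model directory contains one file per budget level listed in the Token Budgets section.

```python
from pathlib import Path

# Budget levels as listed in the "Token Budgets" section of this card.
BUDGETS = [10, 20, 30, 40, 50, 80, 100, 150, 200, 300,
           500, 800, 1200, 2000, 4000, 8000]

# Model directories (organization/model) as shown in the tree above.
MODELS = [
    "meta-llama/Llama-3.1-70B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct",
    "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "Qwen/Qwen2.5-Math-7B-Instruct",
    "Qwen/Qwen2.5-Math-1.5B-Instruct",
    "Qwen/Qwen3-0.6B",
    "zai-org/GLM-4.5-Air",
    "zai-org/GLM-4.6",
]


def judge_csv_path(root, model, budget):
    """Return <root>/<org>/<model>/<budget>_judge.csv (hypothetical helper)."""
    if model not in MODELS:
        raise ValueError(f"unknown model directory: {model}")
    if budget not in BUDGETS:
        raise ValueError(f"unknown budget level: {budget}")
    return Path(root) / model / f"{budget}_judge.csv"


def all_judge_csv_paths(root="data"):
    """List the expected path for every (model, budget) pair."""
    return [judge_csv_path(root, m, b) for m in MODELS for b in BUDGETS]


if __name__ == "__main__":
    print(judge_csv_path("data", "Qwen/Qwen3-0.6B", 100).as_posix())
    print(len(all_judge_csv_paths()))
```

The paths are only constructed here, not checked for existence, so `Path.exists()` is worth calling before reading a file.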
## Models (10)
| Model | Provider |
|-------|----------|
| Qwen3-235B-A22B-Instruct-2507 | Qwen |
| Qwen3-Next-80B-A3B-Instruct | Qwen |
| Qwen3-30B-A3B-Instruct-2507 | Qwen |
| Qwen2.5-Math-7B-Instruct | Qwen |
| Qwen2.5-Math-1.5B-Instruct | Qwen |
| Qwen3-0.6B | Qwen |
| Llama-3.1-70B-Instruct | Meta |
| Llama-3.2-3B-Instruct | Meta |
| GLM-4.5-Air | Z-AI |
| GLM-4.6 | Z-AI |
## Token Budgets (16)
10, 20, 30, 40, 50, 80, 100, 150, 200, 300, 500, 800, 1200, 2000, 4000, 8000
Each model is evaluated at every budget level. The LLM is instructed to respond within the given token budget via a system prompt.
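Because the budget is communicated as an instruction in the prompt, it may be worth checking how closely responses follow it. The sketch below computes the fraction of responses whose token count exceeds a budget; `budget_overrun_rate` is a hypothetical helper, and whether overruns actually occur in this dataset is not stated in this card. The counts would come from the `actual_token_count` column described under File Format.

```python
def budget_overrun_rate(token_counts, budget):
    """Fraction of responses whose token count is strictly above `budget`."""
    counts = list(token_counts)
    if not counts:
        raise ValueError("token_counts is empty")
    over = sum(1 for c in counts if c > budget)
    return over / len(counts)


if __name__ == "__main__":
    # Synthetic counts for illustration only, not taken from the dataset.
    print(budget_overrun_rate([50, 100, 101, 200], budget=100))
```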
## File Format
Each `{budget}_judge.csv` file contains:
| Column | Description |
|--------|-------------|
| `prompts_id` | Unique query ID |
| `key` | Content hash of the query |
| `original_prompt` | The original query text |
| `templated_prompt` | The prompt as sent to the LLM (with budget instruction) |
| `golden_answer` | Reference answer for judging |
| `response` | LLM's generated response |
| `actual_token_count` | Actual number of tokens in the response |
| `judge_raw` | Raw judge output (JSON with score and justification) |
| `correctness_score` | Judge's correctness score (0.0 to 1.0) |
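The `judge_raw` column is described as JSON holding a score and a justification, so it can be decoded with the standard `json` module. The sketch below is hedged: the key names `"score"` and `"justification"` are assumptions (the card does not give the exact schema), and `parse_judge_raw` is a hypothetical helper that returns `(None, None)` for missing or malformed cells instead of raising.

```python
import json


def parse_judge_raw(raw, score_key="score", justification_key="justification"):
    """Decode one judge_raw cell into (score, justification).

    The default key names are assumptions; adjust them after inspecting
    a few real rows.
    """
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: cell is None/NaN; ValueError: cell is not valid JSON.
        return None, None
    if not isinstance(payload, dict):
        return None, None
    return payload.get(score_key), payload.get(justification_key)


if __name__ == "__main__":
    # Synthetic example string, not taken from the dataset.
    print(parse_judge_raw('{"score": 1.0, "justification": "matches reference"}'))
```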
## Statistics
- **Queries**: 30,968
- **Models**: 10
- **Budgets**: 16
- **Total evaluations**: ~4.95M (30,968 × 10 × 16)
- **Dataset size**: ~25 GB
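The total above can be checked with a few lines of arithmetic. This assumes every query appears in every (model, budget) file, which the card implies but does not state outright.

```python
n_queries, n_models, n_budgets = 30_968, 10, 16

n_files = n_models * n_budgets          # one CSV per (model, budget) pair
total_evaluations = n_queries * n_files

print(n_files)            # 160
print(total_evaluations)  # 4954880, i.e. about 4.95M
```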
## Usage
```python
import pandas as pd
# Load a specific model + budget
df = pd.read_csv("data/Qwen/Qwen3-235B-A22B-Instruct-2507/100_judge.csv")
print(f"Queries: {len(df)}")
print(f"Mean score: {df['correctness_score'].mean():.3f}")
print(f"Mean tokens: {df['actual_token_count'].mean():.0f}")
```
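A common follow-up is to compare budgets for the same model. The sketch below summarizes mean score and mean response length per budget; `budget_curve` is a hypothetical helper, and it works on DataFrames that are already loaded, so it runs without the dataset on disk.

```python
import pandas as pd


def budget_curve(frames):
    """Summarize quality and length per budget.

    `frames` maps a budget level to a DataFrame in the {budget}_judge.csv
    format (only `correctness_score` and `actual_token_count` are used).
    """
    rows = []
    for budget, df in sorted(frames.items()):
        rows.append({
            "budget": budget,
            "mean_score": df["correctness_score"].mean(),
            "mean_tokens": df["actual_token_count"].mean(),
        })
    return pd.DataFrame(rows)


if __name__ == "__main__":
    # Synthetic frames for illustration only, not real dataset values.
    demo = {
        100: pd.DataFrame({"correctness_score": [1.0, 0.0],
                           "actual_token_count": [90, 110]}),
        10: pd.DataFrame({"correctness_score": [0.0, 0.0],
                          "actual_token_count": [8, 12]}),
    }
    print(budget_curve(demo))
```

With the dataset downloaded, `frames` could be built by calling `pd.read_csv` on each `{budget}_judge.csv` under one model directory, passing `usecols=["correctness_score", "actual_token_count"]` to limit memory use.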
## Usage with R2-Router
The judge scores in this dataset are the training labels for R2-Router's Ridge regression predictors:
```python
from r2_router import R2Router
# The checkpoints in the r2-router repo were trained on this data
router = R2Router.from_pretrained("./r2_router")
```
## Citation
```bibtex
@inproceedings{r2router2026,
title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
author={Anonymous},
booktitle={International Conference on Machine Learning (ICML)},
year={2026}
}
```
## License
MIT License
Provider:
JiaqiXue



