JiaqiXue/R2-Bench

Name: JiaqiXue/R2-Bench
Creator: JiaqiXue
Published: 2026-04-06 03:33:53
License: MIT

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/JiaqiXue/R2-Bench


Description:

---
license: mit
task_categories:
- text-generation
tags:
- llm-routing
- benchmark
- quality-prediction
- token-budget
pretty_name: R2-Bench
size_categories:
- 10K<n<100K
---

# R2-Bench

R2-Bench is a benchmark dataset for evaluating LLM routing with joint model and token budget optimization. It contains 30,968 queries evaluated across 10 LLMs at 16 token budget levels, with LLM-judge quality scores.

> Associated with **R2-Router** ([code](https://github.com/jqxue1999/router/tree/release-routerarena-public)), under review at ICML 2026.

## Dataset Structure

```
data/
├── meta-llama/
│   ├── Llama-3.1-70B-Instruct/
│   │   ├── 10_judge.csv
│   │   ├── 20_judge.csv
│   │   ├── ...
│   │   └── 8000_judge.csv
│   └── Llama-3.2-3B-Instruct/
├── Qwen/
│   ├── Qwen3-235B-A22B-Instruct-2507/
│   ├── Qwen3-Next-80B-A3B-Instruct/
│   ├── Qwen3-30B-A3B-Instruct-2507/
│   ├── Qwen2.5-Math-7B-Instruct/
│   ├── Qwen2.5-Math-1.5B-Instruct/
│   └── Qwen3-0.6B/
└── zai-org/
    ├── GLM-4.5-Air/
    └── GLM-4.6/
```

## Models (10)

| Model | Provider |
|-------|----------|
| Qwen3-235B-A22B-Instruct-2507 | Qwen |
| Qwen3-Next-80B-A3B-Instruct | Qwen |
| Qwen3-30B-A3B-Instruct-2507 | Qwen |
| Qwen2.5-Math-7B-Instruct | Qwen |
| Qwen2.5-Math-1.5B-Instruct | Qwen |
| Qwen3-0.6B | Qwen |
| Llama-3.1-70B-Instruct | Meta |
| Llama-3.2-3B-Instruct | Meta |
| GLM-4.5-Air | Z-AI |
| GLM-4.6 | Z-AI |

## Token Budgets (16)

10, 20, 30, 40, 50, 80, 100, 150, 200, 300, 500, 800, 1200, 2000, 4000, 8000

Each model is evaluated at every budget level. The LLM is instructed to respond within the given token budget via a system prompt.

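The directory layout and budget list above imply one CSV per model and budget. The following is a minimal sketch that enumerates those paths. It assumes every model directory follows the `{budget}_judge.csv` naming shown for Llama-3.1-70B-Instruct (the tree elides the other directories' contents), and the helper name `judge_paths` is hypothetical, not part of any released tooling.

```python
from pathlib import Path

# Organization -> model directories, copied from the directory tree above.
MODELS = {
    "meta-llama": ["Llama-3.1-70B-Instruct", "Llama-3.2-3B-Instruct"],
    "Qwen": [
        "Qwen3-235B-A22B-Instruct-2507",
        "Qwen3-Next-80B-A3B-Instruct",
        "Qwen3-30B-A3B-Instruct-2507",
        "Qwen2.5-Math-7B-Instruct",
        "Qwen2.5-Math-1.5B-Instruct",
        "Qwen3-0.6B",
    ],
    "zai-org": ["GLM-4.5-Air", "GLM-4.6"],
}

# The 16 token budgets listed above.
BUDGETS = [10, 20, 30, 40, 50, 80, 100, 150, 200, 300,
           500, 800, 1200, 2000, 4000, 8000]


def judge_paths(root="data"):
    """Yield (org, model, budget, path) for every model/budget pair.

    Assumption: all model directories use the `{budget}_judge.csv` naming.
    """
    for org, models in MODELS.items():
        for model in models:
            for budget in BUDGETS:
                yield org, model, budget, Path(root) / org / model / f"{budget}_judge.csv"


paths = list(judge_paths())
print(len(paths))  # 10 models x 16 budgets = 160 files
```

Keeping the model and budget lists as plain data makes it easy to restrict a download or an analysis to a subset, for example only the Qwen models at budgets of 500 tokens or fewer.
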
## File Format

Each `{budget}_judge.csv` file contains:

| Column | Description |
|--------|-------------|
| `prompts_id` | Unique query ID |
| `key` | Content hash of the query |
| `original_prompt` | The original query text |
| `templated_prompt` | The prompt as sent to the LLM (with budget instruction) |
| `golden_answer` | Reference answer for judging |
| `response` | LLM's generated response |
| `actual_token_count` | Actual number of tokens in the response |
| `judge_raw` | Raw judge output (JSON with score and justification) |
| `correctness_score` | Judge's correctness score (0.0 to 1.0) |

## Statistics

- **Queries**: 30,968
- **Models**: 10
- **Budgets**: 16
- **Total evaluations**: ~4.95M (30,968 × 10 × 16)
- **Dataset size**: ~25 GB

## Usage

```python
import pandas as pd

# Load a specific model + budget
df = pd.read_csv("data/Qwen/Qwen3-235B-A22B-Instruct-2507/100_judge.csv")
print(f"Queries: {len(df)}")
print(f"Mean score: {df['correctness_score'].mean():.3f}")
print(f"Mean tokens: {df['actual_token_count'].mean():.0f}")
```

## Usage with R2-Router

These are the training labels for R2-Router's Ridge regression predictors:

```python
from r2_router import R2Router

# The checkpoints in the r2-router repo were trained on this data
router = R2Router.from_pretrained("./r2_router")
```

## Citation

```bibtex
@inproceedings{r2router2026,
  title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## License

MIT License
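
Returning to the file format and usage described above: because every budget level has its own CSV with the same columns, a quality-versus-budget curve for one model can be built by looping over the files. The sketch below is one possible way to do that, not part of the released code. The helper name `score_by_budget` is hypothetical, and it assumes the `correctness_score` and `actual_token_count` columns are present as documented.

```python
from pathlib import Path

import pandas as pd


def score_by_budget(model_dir, budgets):
    """Summarize mean judge score and mean response length per budget.

    Budgets whose CSV is missing under `model_dir` are skipped, so the
    function also works on a partial download.
    """
    rows = []
    for budget in budgets:
        path = Path(model_dir) / f"{budget}_judge.csv"
        if not path.exists():
            continue
        # Read only the two numeric columns; the prompt and response
        # text columns are not needed for this summary.
        df = pd.read_csv(path, usecols=["correctness_score", "actual_token_count"])
        rows.append({
            "budget": budget,
            "mean_score": df["correctness_score"].mean(),
            "mean_tokens": df["actual_token_count"].mean(),
        })
    return pd.DataFrame(rows, columns=["budget", "mean_score", "mean_tokens"])
```

For example, `score_by_budget("data/Qwen/Qwen3-0.6B", [10, 100, 1000])` would return one row each for the 10 and 100 budgets if those files exist locally, and silently skip 1000, which is not one of the 16 budget levels.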

Provider:

JiaqiXue
