
JiaqiXue/R2-Bench

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/JiaqiXue/R2-Bench
Resource description:
---
license: mit
task_categories:
- text-generation
tags:
- llm-routing
- benchmark
- quality-prediction
- token-budget
pretty_name: R2-Bench
size_categories:
- 10K<n<100K
---

# R2-Bench

R2-Bench is a benchmark dataset for evaluating LLM routing with joint model and token budget optimization. It contains 30,968 queries evaluated across 10 LLMs at 16 token budget levels, with LLM-judge quality scores.

> Associated with **R2-Router** ([code](https://github.com/jqxue1999/router/tree/release-routerarena-public)), under review at ICML 2026.

## Dataset Structure

```
data/
├── meta-llama/
│   ├── Llama-3.1-70B-Instruct/
│   │   ├── 10_judge.csv
│   │   ├── 20_judge.csv
│   │   ├── ...
│   │   └── 8000_judge.csv
│   └── Llama-3.2-3B-Instruct/
├── Qwen/
│   ├── Qwen3-235B-A22B-Instruct-2507/
│   ├── Qwen3-Next-80B-A3B-Instruct/
│   ├── Qwen3-30B-A3B-Instruct-2507/
│   ├── Qwen2.5-Math-7B-Instruct/
│   ├── Qwen2.5-Math-1.5B-Instruct/
│   └── Qwen3-0.6B/
└── zai-org/
    ├── GLM-4.5-Air/
    └── GLM-4.6/
```

## Models (10)

| Model | Provider |
|-------|----------|
| Qwen3-235B-A22B-Instruct-2507 | Qwen |
| Qwen3-Next-80B-A3B-Instruct | Qwen |
| Qwen3-30B-A3B-Instruct-2507 | Qwen |
| Qwen2.5-Math-7B-Instruct | Qwen |
| Qwen2.5-Math-1.5B-Instruct | Qwen |
| Qwen3-0.6B | Qwen |
| Llama-3.1-70B-Instruct | Meta |
| Llama-3.2-3B-Instruct | Meta |
| GLM-4.5-Air | Z-AI |
| GLM-4.6 | Z-AI |

## Token Budgets (16)

10, 20, 30, 40, 50, 80, 100, 150, 200, 300, 500, 800, 1200, 2000, 4000, 8000

Each model is evaluated at every budget level. The LLM is instructed to respond within the given token budget via a system prompt.
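
The directory layout and budget list above suggest one `{budget}_judge.csv` file per model and budget. The sketch below enumerates those paths. It assumes every model directory holds all 16 budget files (the tree only spells them out for one model) and that the dataset has been downloaded to a local `data/` folder. The helper names `judge_csv_path` and `all_judge_csv_paths` are hypothetical and not part of any released code.

```python
from pathlib import Path

# Model directories as listed in the Dataset Structure tree.
MODELS = [
    "meta-llama/Llama-3.1-70B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct",
    "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "Qwen/Qwen2.5-Math-7B-Instruct",
    "Qwen/Qwen2.5-Math-1.5B-Instruct",
    "Qwen/Qwen3-0.6B",
    "zai-org/GLM-4.5-Air",
    "zai-org/GLM-4.6",
]

# The 16 token budget levels from the Token Budgets section.
BUDGETS = [10, 20, 30, 40, 50, 80, 100, 150, 200, 300,
           500, 800, 1200, 2000, 4000, 8000]


def judge_csv_path(model: str, budget: int, root: str = "data") -> Path:
    """Hypothetical helper: path of the judge CSV for one model and budget."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if budget not in BUDGETS:
        raise ValueError(f"unknown budget: {budget}")
    return Path(root) / model / f"{budget}_judge.csv"


def all_judge_csv_paths(root: str = "data") -> list[Path]:
    """All model x budget combinations (assumed layout, not verified)."""
    return [judge_csv_path(m, b, root) for m in MODELS for b in BUDGETS]
```

Calling `len(all_judge_csv_paths())` gives 160, that is 10 models × 16 budgets. Checking `path.exists()` on each entry is a quick way to confirm whether a local download matches the assumed layout.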

## File Format

Each `{budget}_judge.csv` file contains:

| Column | Description |
|--------|-------------|
| `prompts_id` | Unique query ID |
| `key` | Content hash of the query |
| `original_prompt` | The original query text |
| `templated_prompt` | The prompt as sent to the LLM (with budget instruction) |
| `golden_answer` | Reference answer for judging |
| `response` | LLM's generated response |
| `actual_token_count` | Actual number of tokens in the response |
| `judge_raw` | Raw judge output (JSON with score and justification) |
| `correctness_score` | Judge's correctness score (0.0 to 1.0) |

## Statistics

- **Queries**: 30,968
- **Models**: 10
- **Budgets**: 16
- **Total evaluations**: ~4.95M (30,968 × 10 × 16)
- **Dataset size**: ~25 GB

## Usage

```python
import pandas as pd

# Load a specific model + budget
df = pd.read_csv("data/Qwen/Qwen3-235B-A22B-Instruct-2507/100_judge.csv")
print(f"Queries: {len(df)}")
print(f"Mean score: {df['correctness_score'].mean():.3f}")
print(f"Mean tokens: {df['actual_token_count'].mean():.0f}")
```

## Usage with R2-Router

These are the training labels for R2-Router's Ridge regression predictors:

```python
from r2_router import R2Router

# The checkpoints in the r2-router repo were trained on this data
router = R2Router.from_pretrained("./r2_router")
```

## Citation

```bibtex
@inproceedings{r2router2026,
  title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## License

MIT License
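
As a further illustration of the file format described above, here is a minimal standard-library sketch that summarizes one `{budget}_judge.csv` file without pandas. It relies only on the documented columns `correctness_score`, `actual_token_count`, and `judge_raw`. The function names are hypothetical, the keys inside the `judge_raw` JSON are deliberately not assumed, and skipping rows with empty values is a defensive choice rather than a known property of the data.

```python
import csv
import json
from statistics import mean

# Responses can be long; raise the csv module's per-field limit defensively.
csv.field_size_limit(10_000_000)


def parse_judge_raw(raw):
    """Decode a `judge_raw` cell; return None if it is not valid JSON."""
    try:
        return json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None


def summarize_judge_csv(path):
    """Return row count, mean correctness score, and mean response tokens."""
    scores, tokens = [], []
    # newline="" lets the csv module handle quoted multi-line responses.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Skip rows with an empty score or token count (defensive; it is
            # not known whether such rows occur in the released files).
            if not row["correctness_score"] or not row["actual_token_count"]:
                continue
            scores.append(float(row["correctness_score"]))
            tokens.append(float(row["actual_token_count"]))
    if not scores:
        raise ValueError(f"no scored rows found in {path}")
    return {
        "n": len(scores),
        "mean_score": mean(scores),
        "mean_tokens": mean(tokens),
    }
```

Running `summarize_judge_csv` over the same model at several budgets gives a rough quality-versus-length curve, which is the trade-off the benchmark is built to expose.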
Provider:
JiaqiXue