R-HORIZON-Math500
Source: ModelScope community · Updated 2026-01-08 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/meituan-longcat/R-HORIZON-Math500
Description:
<div align="center">
<h1>
<img src="https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/problem-solving.png" alt="logo" width="60" style="vertical-align:middle; margin-right:10px;">
R-HORIZON
</h1>
<div>
How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
</div>
</div>
<br>
<p align="center">
📃 <a href="https://arxiv.org/abs/2510.08189" target="_blank">Paper</a> • 🌐 <a href="https://reasoning-horizon.github.io/" target="_blank">Project Page</a> • 🤗 <a href="https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data" target="_blank">Dataset</a>
</p>
R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

## 🔥 Releases
**[2025-10-09]**
- 🎉 **R-HORIZON Benchmark** is now available! Test your LRMs on complex multi-horizon reasoning tasks.
- 🤗 **Training and evaluation datasets** are available on Hugging Face: [R-HORIZON Dataset](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data)
- 📄 **Paper released** on arXiv: [R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?](https://arxiv.org/abs/2510.08189)
## 🌟 Overview
Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.
**Key challenges in current paradigms:**
- **Limited evaluation scope**: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- **Limited effective reasoning length**: Models struggle to maintain performance as reasoning chains grow longer
- **Poor thinking budget allocation**: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems
To address these limitations, we introduce **R-HORIZON**, which:
- Transforms isolated problems into **complex multi-step reasoning scenarios** through query composition
- Establishes the **R-HORIZON Benchmark** comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables **reinforcement learning with verified rewards (RLVR)** using long-horizon reasoning data

## 📖 Table of Contents
- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)
## 📊 R-HORIZON Benchmark
We evaluate more than 20 state-of-the-art LRMs on the R-HORIZON Benchmark and observe significant performance degradation as the reasoning horizon increases.

**Key findings from our benchmark evaluation:**
- **Universal performance degradation**: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
- **Model size matters**: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
- **Task-dependent degradation**: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.
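
One way to put the degradation figures above in context is to compare them with a simple reference point: the accuracy a model would reach if it solved each of the n composed problems independently at its single-problem accuracy p, with the composed instance counted as correct only when every answer is right. That reference is p^n. Both the independence assumption and the all-correct scoring rule are assumptions of this sketch, and the function name is hypothetical.

```python
def independent_baseline(single_acc: float, n: int) -> float:
    """Accuracy (in %) expected if n problems were solved independently.

    Assumes an instance counts as correct only when all n answers are correct.
    """
    return 100.0 * (single_acc / 100.0) ** n


# Figures quoted above: DeepSeek-R1 on AIME25, single problem vs. 5 problems.
single, observed, n = 87.3, 24.6, 5
baseline = independent_baseline(single, n)

print(f"independent baseline: {baseline:.1f}%")
print(f"observed:             {observed:.1f}%")
print(f"gap:                  {baseline - observed:.1f} points")
```

Under these assumptions, an observed score below the baseline suggests that the drop is not explained by compounding single-problem errors alone.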
## 🚀 Training with R-HORIZON
Training with R-HORIZON composed data yields substantial improvements on both single-horizon and multi-horizon reasoning tasks.

**Training results highlights:**
- **Dual Performance Gains**: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).
- **Scalable Complexity**: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on Math500 (n=8).
| Models | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B | 93.6 | 11.8 | 48.3 | 16.4 | 33.3 | 3.5 | 90.2 | 48.8 |
| Baseline (n=1) | **95.6** | 8.4 | 57.9 | 16.7 | 47.9 | 5.1 | **95.9** | 55.0 |
| R-HORIZON (n=2) | 95.4 | 21.4 | **65.4** | 34.1 | **49.6** | **10.0** | 94.1 | **80.6** |
| R-HORIZON (n=4) | 94.6 | **50.6** | 62.9 | **34.8** | 45.4 | 8.1 | 91.9 | 79.1 |
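
The gains quoted in the highlights can be recomputed from the table as differences against the Baseline (n=1) row. The short sketch below copies a few columns from the table and does that arithmetic; the `scores` dictionary and the `gain` helper are illustrative names, not part of the repository.

```python
# Scores copied from the table above (accuracy in %).
scores = {
    "Baseline (n=1)":  {"AIME24 (Origin)": 57.9, "AIME24 (n=2)": 16.7, "MATH500 (n=8)": 8.4},
    "R-HORIZON (n=2)": {"AIME24 (Origin)": 65.4, "AIME24 (n=2)": 34.1, "MATH500 (n=8)": 21.4},
    "R-HORIZON (n=4)": {"AIME24 (Origin)": 62.9, "AIME24 (n=2)": 34.8, "MATH500 (n=8)": 50.6},
}


def gain(model: str, column: str, reference: str = "Baseline (n=1)") -> float:
    """Point difference between a model and the reference row for one column."""
    return round(scores[model][column] - scores[reference][column], 1)


for column in ("AIME24 (Origin)", "AIME24 (n=2)", "MATH500 (n=8)"):
    print(column, gain("R-HORIZON (n=2)", column), gain("R-HORIZON (n=4)", column))
```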
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON
# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon
# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
# Install additional dependencies
pip install -r requirements.txt
```
### Benchmark Evaluation
1. Download the R-HORIZON Benchmark
```bash
# Download benchmark datasets
python ./evaluation/data/download.py
```
2. Modify config.json under evaluation directory
```json
{
  "inference": {
    // model_key (e.g. r1-distill-qwen7b) is for run.sh
    "r1-distill-qwen7b": {
      // the ip and port used in vllm server
      "base_url": "http://{Your IP and Port}/v1/completions",
      "api_key": "EMPTY",
      // model_name is corresponding to the modelname in vllm server
      "model_name": "{vllm's modelname}",
      "params": {
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 10,
        "max_tokens": 65536
      },
      "prompt_prefix": "<|im_start|>user:\n",
      "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n"
    }
  },
  "extract": {
    "gpt-4.1": {
      "model_name": "gpt-4.1",
      "base_url": "{OpenAI's baseurl}",
      "api_key": "{Your API key}",
      "params": {
        "temperature": 0.0,
        "max_tokens": 16000
      }
    }
  }
}
```
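
To illustrate how the fields in this config fit together, here is a minimal Python sketch that parses a config containing whole-line `//` comments and assembles a completions-style request body from `prompt_prefix`, `prompt_suffix`, `model_name`, and `params`. How the evaluation scripts actually read the file is not shown in this README, so `load_config` and `build_request` are hypothetical helpers, and the URL and model name in the example are placeholders.

```python
import json
import re


def load_config(text: str) -> dict:
    """Parse config text after dropping whole-line `//` comments."""
    # Only lines that begin with `//` are removed, so URLs such as http://... survive.
    cleaned = re.sub(r"^[ \t]*//.*$", "", text, flags=re.MULTILINE)
    return json.loads(cleaned)


def build_request(config: dict, model_key: str, question: str) -> dict:
    """Assemble a completions-style request body for one question."""
    entry = config["inference"][model_key]
    prompt = entry["prompt_prefix"] + question + entry["prompt_suffix"]
    return {"model": entry["model_name"], "prompt": prompt, **entry["params"]}


# Placeholder values: the URL and model name below are made up for the example.
example = """
{
  "inference": {
    // model_key used by run.sh
    "r1-distill-qwen7b": {
      "base_url": "http://localhost:8000/v1/completions",
      "api_key": "EMPTY",
      "model_name": "my-model",
      "params": {"temperature": 1.0, "top_p": 0.95, "top_k": 10, "max_tokens": 65536},
      "prompt_prefix": "<|im_start|>user:\\n",
      "prompt_suffix": "\\n<|im_end|>\\n<|im_start|>assistant:\\n"
    }
  }
}
"""

body = build_request(load_config(example), "r1-distill-qwen7b", "What is 2 + 2?")
print(json.dumps(body, indent=2))
```

The resulting dictionary is what would be posted to the `base_url` of the selected `model_key`; sending the request is left out so the sketch runs without a server.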
3. Run a vllm server
```bash
vllm serve {modelname} \
  --host {ip} \
  --port {port} \
  --served-model-name {modelname} \
  --dtype auto --pipeline-parallel-size 1 --tensor-parallel-size 1 --trust-remote-code \
  --enable-chunked-prefill --max-model-len 131072 --max-num-batched-tokens 10240 \
  --max-num-seqs 256 --gpu-memory-utilization 0.85 --disable-custom-all-reduce \
  --enable-reasoning --reasoning-parser deepseek_r1
```
4. Evaluate your model
The bash example below runs the evaluation. `{model_key}` must match a key defined under `inference` in config.json.
```bash
sh evaluation/run.sh {input_file} {output_dir} {model_key}
# example
sh evaluation/run.sh evaluation/data/R-HORIZON-Math500/Math500-combined-n2.jsonl evaluation/result r1-distill-qwen7b
```
### Training with R-HORIZON datasets
1. Download composed training data
```python
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="meituan-longcat/R-HORIZON-training-data",
repo_type="dataset",
local_dir="./training/data",
)
```
2. Launch training
```bash
# Train with R-HORIZON using GRPO algorithm
bash ./training/scripts/train/skywork-or1-rlvr-math-training-7b-40k.sh
```
## Dataset
### Dataset Construction
Step 1: Filter Samples with Valid Integers
```bash
# Purpose: Retain samples containing valid integers in input text and pure integer targets, excluding ambiguous numeric expressions (e.g., floats, fractions, LaTeX commands).
python step1_filt_integer_samples.py
```
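
As a rough illustration of the filtering rule in Step 1, the sketch below keeps a sample only when its input contains a standalone integer and its target is a pure integer. It is a simplified heuristic written for this README, not the logic of `step1_filt_integer_samples.py`: the regular expressions, the `keep_sample` name, and the `input`/`target` field names are assumptions, and LaTeX handling is not attempted.

```python
import re

# A run of digits that is not part of a decimal (3.14) or a fraction (1/2).
INTEGER_IN_TEXT = re.compile(r"(?<!\d)(?<!\d\.)(?<!/)\d+(?!\d)(?!\.\d)(?!/)")
# The whole target must be an optionally signed integer.
PURE_INTEGER = re.compile(r"^-?\d+$")


def keep_sample(sample: dict) -> bool:
    """Keep samples with an integer in the input and a pure-integer target."""
    has_integer = INTEGER_IN_TEXT.search(sample["input"]) is not None
    target_is_integer = PURE_INTEGER.match(str(sample["target"]).strip()) is not None
    return has_integer and target_is_integer


samples = [
    {"input": "A box holds 12 apples. How many apples are in 3 boxes?", "target": "36"},
    {"input": "Find x if x = 3.5 + y.", "target": "7"},
    {"input": "There are 5 cats.", "target": "2.5"},
]
kept = [s for s in samples if keep_sample(s)]
print(len(kept), "of", len(samples), "samples kept")
```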
Step 2: Identify Key Variables
```bash
# Purpose: Select "key variables" (critical integers that significantly affect problem outcomes).
# Configure API credentials in the script first (replace YOUR_API_KEY).
python step2_select_key_variable.py
```
Step 3: Combine into Chained Reasoning Problems
```bash
# Purpose: Generate multi-horizon chained problems where each step's key variable depends on the previous step's answer.
python step3_combine_problems.py
```
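
The chaining idea in Step 3 can be sketched as follows: the key variable of problem k is replaced by a `[variablek]` placeholder and defined in terms of `[answerk-1]`, so the problem can only be solved after the previous one. This is a minimal sketch under stated assumptions, not the code in `step3_combine_problems.py`: the additive-offset linking rule, the `Problem` class, and the `compose` function are all hypothetical, and the naive string replacement assumes the key value appears once and unambiguously in the text.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    text: str       # problem statement containing the key integer
    key_value: int  # original value of the key variable
    answer: int     # original integer answer


def compose(problems: list[Problem]) -> dict:
    """Chain problems so each key variable is defined from the previous answer."""
    parts = []
    for k, p in enumerate(problems, start=1):
        if k == 1:
            parts.append(f"Problem 1: {p.text}")
            continue
        prev = problems[k - 2]
        # Offset chosen so that [variablek] equals the original key value
        # whenever the previous problem is answered correctly.
        offset = p.key_value - prev.answer
        masked = p.text.replace(str(p.key_value), f"[variable{k}]", 1)
        parts.append(
            f"Problem {k}: {masked}\n"
            f"Here [variable{k}] = [answer{k - 1}] + ({offset})."
        )
    return {
        "input": "\n\n".join(parts),
        "target": [p.answer for p in problems],
        "num_problems": len(problems),
    }


chained = compose([
    Problem("What is 7 times 6?", key_value=7, answer=42),
    Problem("A train travels 50 km per hour for 3 hours. How far does it go?",
            key_value=50, answer=150),
])
print(chained["input"])
```

Because the offset restores the original key value, the original answers remain valid as targets for the composed instance.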
### Dataset on Hugging Face Hub
The R-HORIZON training datasets and evaluation benchmark are available on Hugging Face Hub:
| Dataset Type | Dataset Name | Hugging Face Link |
|--------------|-------------------------------|-----------------------------------------------------------------------------------|
| Evaluation | R-HORIZON-Math500 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Math500) |
| Evaluation | R-HORIZON-AIME24 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME24) |
| Evaluation | R-HORIZON-AIME25 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME25) |
| Evaluation | R-HORIZON-AMC23 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AMC23) |
| Evaluation | R-HORIZON-Websearch | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Websearch) |
| Training | R-HORIZON-training-data | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data) |
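
For readers who prefer to fetch the evaluation sets directly, the sketch below loops over the evaluation repositories listed in the table and downloads each with `snapshot_download`, mirroring the call used for the training data. The `evaluation/data/<dataset name>` layout is inferred from the example path in the evaluation command, and `local_dir_for` and `download_eval_sets` are hypothetical helpers rather than part of the repository.

```python
EVAL_REPOS = [
    "meituan-longcat/R-HORIZON-Math500",
    "meituan-longcat/R-HORIZON-AIME24",
    "meituan-longcat/R-HORIZON-AIME25",
    "meituan-longcat/R-HORIZON-AMC23",
    "meituan-longcat/R-HORIZON-Websearch",
]


def local_dir_for(repo_id: str, root: str = "./evaluation/data") -> str:
    """Map a Hub repo id to a local directory named after the dataset."""
    return f"{root}/{repo_id.split('/')[-1]}"


def download_eval_sets(root: str = "./evaluation/data") -> list[str]:
    """Download every evaluation dataset listed above; returns local paths."""
    # Imported lazily so the repo list can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=local_dir_for(repo_id, root),
        )
        for repo_id in EVAL_REPOS
    ]


# Show where each dataset would be stored; call download_eval_sets() to fetch.
for repo in EVAL_REPOS:
    print(repo, "->", local_dir_for(repo))
```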
### Dataset Structure
```json
{
  "input": "[1-N linked problems + solving instructions (with [variablek]/[answerk] placeholders)]",
  "instanceId": "[Unique ID for this instance]",
  "origin_instanceIds": "[List of original problem IDs]",
  "target": "[List of final answers, e.g., [answer1, answer2]]",
  "num_problems": "[Total problems, e.g., 2]",
  "selected_variables": [
    {
      "number": "[Key variable from problem]",
      "context": "[Context of the number]",
      "text": "[Text of the number]",
      "is_independent": "[true/false]",
      "is_in_math_env": "[true/false]"
    }
  ]
}
```
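
A record with this structure can be read and sanity-checked with a few lines of Python. The sketch below parses one JSONL line and verifies that the number of targets matches `num_problems`. The concrete field types are an assumption (for example, `target` is handled both as a list and as a JSON-encoded string), the sample values are made up for illustration, and `parse_record` is a hypothetical helper.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check the fields described above."""
    record = json.loads(line)
    targets = record["target"]
    # Assumption: `target` may be stored as a list or as a JSON-encoded string.
    if isinstance(targets, str):
        targets = json.loads(targets)
    if len(targets) != int(record["num_problems"]):
        raise ValueError(
            f"{record['instanceId']}: target count does not match num_problems"
        )
    record["target"] = targets
    return record


# Made-up sample line that follows the structure above.
sample_line = json.dumps({
    "input": "Problem 1: ... Problem 2: ... [variable2] = [answer1] + (8).",
    "instanceId": "demo-0001",
    "origin_instanceIds": ["a", "b"],
    "target": ["42", "150"],
    "num_problems": 2,
    "selected_variables": [],
})

record = parse_record(sample_line)
print(record["instanceId"], record["num_problems"], record["target"])
```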
## Citation
If you find R-HORIZON helpful for your research, please cite our paper:
```bibtex
@misc{lu2025rhorizonfarlargereasoning,
title={R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?},
author={Yi Lu and Jianing Wang and Linsen Guo and Wei He and Hongyin Tang and Tao Gui and Xuanjing Huang and Xuezhi Cao and Wei Wang and Xunliang Cai},
year={2025},
eprint={2510.08189},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.08189},
}
```
Provider:
maas
Created:
2025-11-03



