R-HORIZON-Websearch
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/meituan-longcat/R-HORIZON-Websearch
<div align="center">
<h1>
<img src="https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/problem-solving.png" alt="logo" width="60" style="vertical-align:middle; margin-right:10px;">
R-HORIZON
</h1>
<div>
How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
</div>
</div>
<br>
<p align="center">
📃 <a href="https://arxiv.org/abs/2510.08189" target="_blank">Paper</a> • 🌐 <a href="https://reasoning-horizon.github.io/" target="_blank">Project Page</a> • 🤗 <a href="https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data" target="_blank">Dataset</a>
</p>
R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

## 🔥 Releases
**[2025-10-09]**
- 🎉 **R-HORIZON Benchmark** is now available! Test your LRMs on complex multi-horizon reasoning tasks.
- 🤗 **Training and evaluation datasets** are available on Hugging Face: [R-HORIZON Dataset](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data)
- 📄 **Paper released** on arXiv: [R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?](https://arxiv.org/abs/2510.08189)
## 🌟 Overview
Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.
**Key challenges in current paradigms:**
- **Limited evaluation scope**: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- **Limited effective reasoning length**: Models struggle to maintain performance as reasoning chains grow longer
- **Poor thinking budget allocation**: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems
To address these limitations, we introduce **R-HORIZON**, which:
- Transforms isolated problems into **complex multi-step reasoning scenarios** through query composition
- Establishes the **R-HORIZON Benchmark** comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables **reinforcement learning with verified rewards (RLVR)** using long-horizon reasoning data

## 📖 Table of Contents
- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)
## 📊 R-HORIZON Benchmark
We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase.

**Key findings from our benchmark evaluation:**
- **Universal performance degradation**: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
- **Model size matters**: Larger models are more resilient to multi-horizon challenges. R1-Qwen-7B drops from 93.6% (single problem) to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
- **Task-dependent degradation**: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.
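
To make the size of these drops concrete, the figures quoted in the findings can be converted into an absolute drop (in points) and a relative drop (as a share of single-problem accuracy). This is only an arithmetic sketch: the helper name `degradation` is illustrative and not part of the R-HORIZON codebase, and the inputs are the numbers reported above.

```python
def degradation(single: float, composed: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = single - composed
    relative = 100.0 * absolute / single
    return round(absolute, 1), round(relative, 1)


# DeepSeek-R1 on AIME25: 87.3% (single problem) -> 24.6% (5 problems)
print(degradation(87.3, 24.6))
# R1-Qwen-7B: 93.6% (single problem) -> 0% (16 problems)
print(degradation(93.6, 0.0))
```

For DeepSeek-R1 this works out to a 62.7-point drop, which is roughly 72% of its single-problem accuracy.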
## 🚀 Training with R-HORIZON
Training with R-HORIZON composed data yields substantial improvements on both single and multi-horizon reasoning tasks.

**Training results highlights:**
- **Dual Performance Gains**: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2, from 16.7 to 34.1) and single-problem performance (+7.5 points on the original AIME24, from 57.9 to 65.4), relative to the n=1 baseline.
- **Scalable Complexity**: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on MATH500 (n=8).
| Models | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B | 93.6 | 11.8 | 48.3 | 16.4 | 33.3 | 3.5 | 90.2 | 48.8 |
| Baseline (n=1) | **95.6** | 8.4 | 57.9 | 16.7 | 47.9 | 5.1 | **95.9** | 55.0 |
| R-HORIZON (n=2) | 95.4 | 21.4 | **65.4** | 34.1 | **49.6** | **10.0** | 94.1 | **80.6** |
| R-HORIZON (n=4) | 94.6 | **50.6** | 62.9 | **34.8** | 45.4 | 8.1 | 91.9 | 79.1 |
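
The gains quoted in the highlights can be reproduced from the table as differences against the `Baseline (n=1)` row. The snippet below is a small arithmetic check under that reading; the dictionary layout and the `gain` helper are illustrative, and only three of the table's columns are copied in.

```python
# Accuracy (%) copied from the table above; only the columns used here.
scores = {
    "Baseline (n=1)": {"AIME24 (Origin)": 57.9, "AIME24 (n=2)": 16.7, "MATH500 (n=8)": 8.4},
    "R-HORIZON (n=2)": {"AIME24 (Origin)": 65.4, "AIME24 (n=2)": 34.1, "MATH500 (n=8)": 21.4},
    "R-HORIZON (n=4)": {"AIME24 (Origin)": 62.9, "AIME24 (n=2)": 34.8, "MATH500 (n=8)": 50.6},
}


def gain(model: str, benchmark: str, reference: str = "Baseline (n=1)") -> float:
    """Point difference between a model and the reference row on one benchmark."""
    return round(scores[model][benchmark] - scores[reference][benchmark], 1)


print(gain("R-HORIZON (n=2)", "AIME24 (n=2)"))     # multi-horizon gain
print(gain("R-HORIZON (n=2)", "AIME24 (Origin)"))  # single-problem gain
print(gain("R-HORIZON (n=4)", "MATH500 (n=8)"))    # long-chain gain with n=4 training
```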
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON
# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon
# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
# Install additional dependencies
pip install -r requirements.txt
```
### Benchmark Evaluation
1. Download the R-HORIZON Benchmark
```bash
# Download benchmark datasets
python ./evaluation/data/download.py
```
2. Edit `config.json` in the `evaluation` directory
```json
{
  "inference": {
    // model_key (e.g. r1-distill-qwen7b) is for run.sh
    "r1-distill-qwen7b": {
      // the ip and port used in vllm server
      "base_url": "http://{Your IP and Port}/v1/completions",
      "api_key": "EMPTY",
      // model_name is corresponding to the modelname in vllm server
      "model_name": "{vllm's modelname}",
      "params": {
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 10,
        "max_tokens": 65536
      },
      "prompt_prefix": "<|im_start|>user:\n",
      "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n"
    }
  },
  "extract": {
    "gpt-4.1": {
      "model_name": "gpt-4.1",
      "base_url": "{OpenAI's baseurl}",
      "api_key": "{Your API key}",
      "params": {
        "temperature": 0.0,
        "max_tokens": 16000
      }
    }
  }
}
```
3. Start a vLLM server
```bash
# Replace {modelname}, {ip}, and {port} with your own values.
vllm serve {modelname} \
    --host {ip} \
    --port {port} \
    --served-model-name {modelname} \
    --dtype auto --pipeline-parallel-size 1 --tensor-parallel-size 1 --trust-remote-code \
    --enable-chunked-prefill --max-model-len 131072 --max-num-batched-tokens 10240 \
    --max-num-seqs 256 --gpu-memory-utilization 0.85 --disable-custom-all-reduce \
    --enable-reasoning --reasoning-parser deepseek_r1
```
4. Evaluate your model
The example below uses bash; `model_key` is the key defined in `config.json`.
```bash
sh evaluation/run.sh {input_file} {output_dir} {model_key}
# example
sh evaluation/run.sh evaluation/data/R-HORIZON-Math500/Math500-combined-n2.jsonl evaluation/result r1-distill-qwen7b
```
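
To illustrate how the `inference` entry in `config.json` could map onto a completion request, here is a minimal sketch. It is an assumption about usage, not a description of what `run.sh` does: the functions `load_config` and `build_request` are hypothetical, the host, port, and model name in the example are placeholders, and the `//` comment stripping is included only because the sample config above contains comments that plain JSON parsers reject.

```python
import json


def load_config(text: str) -> dict:
    """Parse config text after dropping whole-line // comments.

    Only lines that start with // are removed, so URLs such as
    http://... inside string values are left untouched.
    """
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def build_request(config: dict, model_key: str, question: str) -> tuple[str, dict]:
    """Return (url, JSON payload) for one completion request (hypothetical helper)."""
    entry = config["inference"][model_key]
    prompt = entry["prompt_prefix"] + question + entry["prompt_suffix"]
    # Assumption: sampling params are forwarded to the server unchanged.
    payload = {"model": entry["model_name"], "prompt": prompt, **entry["params"]}
    return entry["base_url"], payload


# Placeholder values for illustration only.
example = r"""
{
  "inference": {
    // model_key used by run.sh
    "r1-distill-qwen7b": {
      "base_url": "http://127.0.0.1:8000/v1/completions",
      "api_key": "EMPTY",
      "model_name": "my-model",
      "params": {"temperature": 1.0, "top_p": 0.95, "top_k": 10, "max_tokens": 65536},
      "prompt_prefix": "<|im_start|>user:\n",
      "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n"
    }
  }
}
"""

url, payload = build_request(load_config(example), "r1-distill-qwen7b", "What is 1 + 1?")
print(url)
print(payload["prompt"])
```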
### Training with R-HORIZON datasets
1. Download composed training data
```python
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="meituan-longcat/R-HORIZON-training-data",
repo_type="dataset",
local_dir="./training/data",
)
```
2. Launch training
```bash
# Train with R-HORIZON using GRPO algorithm
bash ./training/scripts/train/skywork-or1-rlvr-math-training-7b-40k.sh
```
## Dataset
### Dataset Construction
Step 1: Filter Samples with Valid Integers
```bash
# Purpose: Retain samples containing valid integers in input text and pure integer targets, excluding ambiguous numeric expressions (e.g., floats, fractions, LaTeX commands).
python step1_filt_integer_samples.py
```
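
The filtering rule in Step 1 can be pictured with a simplified heuristic. This is a sketch only and does not reproduce `step1_filt_integer_samples.py`: the field names `input` and `target`, the function `keep_sample`, and both regular expressions are assumptions, and LaTeX commands are not handled here.

```python
import re

# Target must be a bare (optionally negative) integer.
INT_TARGET = re.compile(r"^-?\d+$")
# An integer in the text that is not part of a decimal, a fraction, or an identifier.
INT_IN_TEXT = re.compile(r"(?<![\w.\\/])\d+(?![\w/]|\.\d)")


def keep_sample(sample: dict) -> bool:
    """Keep samples with a pure-integer target and at least one standalone integer in the input."""
    target = str(sample["target"]).strip()
    return bool(INT_TARGET.match(target)) and bool(INT_IN_TEXT.search(sample["input"]))


samples = [
    {"input": "Tom has 12 apples and eats 5. How many remain?", "target": "7"},
    {"input": "Compute 0.5 + 1.5", "target": "2"},       # only decimals in the input
    {"input": "What is half of 3?", "target": "3/2"},    # fractional target
]
print([keep_sample(s) for s in samples])
```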
Step 2: Identify Key Variables
```bash
# Purpose: Select "key variables" (critical integers that significantly affect problem outcomes).
# Configure API credentials in the script first (replace YOUR_API_KEY).
python step2_select_key_variable.py
```
Step 3: Combine into Chained Reasoning Problems
```bash
# Purpose: Generate multi-horizon chained problems where each step's key variable depends on the previous step's answer.
python step3_combine_problems.py
```
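
The chaining idea in Step 3 can be sketched as follows. This is a hedged illustration, not the logic of `step3_combine_problems.py`: the function `compose`, the per-problem fields `text`, `answer`, and `key_variable`, and the choice to link problems by a constant offset (so that the placeholder evaluates to the original number when the previous answer is correct) are all assumptions made for the example.

```python
import re


def compose(problems: list[dict]) -> dict:
    """Chain problems so that problem k's key variable is derived from answer k-1.

    Each problem dict is assumed to hold "text" (str), "answer" (int), and
    "key_variable" (int, a number that appears verbatim in "text").
    """
    parts = []
    for k, prob in enumerate(problems, start=1):
        text = prob["text"]
        if k > 1:
            prev_answer = problems[k - 2]["answer"]
            offset = prob["key_variable"] - prev_answer
            # Replace the first standalone occurrence of the key variable.
            pattern = rf"(?<!\d){prob['key_variable']}(?!\d)"
            text = re.sub(pattern, f"[variable{k}]", text, count=1)
            sign = "+" if offset >= 0 else "-"
            text += f" Here [variable{k}] = [answer{k - 1}] {sign} {abs(offset)}."
        parts.append(f"Problem {k}: {text}")
    return {
        "input": "\n".join(parts),
        "target": [p["answer"] for p in problems],
        "num_problems": len(problems),
    }


chained = compose([
    {"text": "What is 3 + 4?", "answer": 7, "key_variable": 3},
    {"text": "What is 10 * 2?", "answer": 20, "key_variable": 10},
])
print(chained["input"])
```

With this construction the second problem can only be solved after the first, because its key number is hidden behind `[variable2]` and defined in terms of `[answer1]`.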
### Dataset on Hugging Face Hub
The R-HORIZON training datasets and evaluation benchmark are available on Hugging Face Hub:
| Dataset Type | Dataset Name | Hugging Face Link |
|--------------|-------------------------------|-----------------------------------------------------------------------------------|
| Evaluation | R-HORIZON-Math500 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Math500) |
| Evaluation | R-HORIZON-AIME24 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME24) |
| Evaluation | R-HORIZON-AIME25 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME25) |
| Evaluation | R-HORIZON-AMC23 | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AMC23) |
| Evaluation | R-HORIZON-Websearch | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Websearch) |
| Training | R-HORIZON-training-data | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data) |
### Dataset Structure
```json
{
  "input": "[1-N linked problems + solving instructions (with [variablek]/[answerk] placeholders)]",
  "instanceId": "[Unique ID for this instance]",
  "origin_instanceIds": "[List of original problem IDs]",
  "target": "[List of final answers, e.g., [answer1, answer2]]",
  "num_problems": "[Total problems, e.g., 2]",
  "selected_variables": [
    {
      "number": "[Key variable from problem]",
      "context": "[Context of the number]",
      "text": "[Text of the number]",
      "is_independent": "[true/false]",
      "is_in_math_env": "[true/false]"
    }
  ]
}
```
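
A record with this structure could be read and sanity-checked roughly as shown below. The sketch assumes one JSON object per line (the benchmark files use a `.jsonl` extension); the helpers `parse_record` and `placeholders` and the sample record are hypothetical, and no assumption is made about the concrete value types of each field.

```python
import json
import re

REQUIRED_KEYS = {
    "input", "instanceId", "origin_instanceIds",
    "target", "num_problems", "selected_variables",
}
PLACEHOLDER = re.compile(r"\[(?:variable|answer)\d+\]")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check that all documented fields are present."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


def placeholders(record: dict) -> list[str]:
    """List the [variablek]/[answerk] placeholders in order of appearance."""
    return PLACEHOLDER.findall(record["input"])


# Made-up record for illustration only.
line = json.dumps({
    "input": "Problem 1: What is 3 + 4? Problem 2: What is [variable2] * 2? "
             "Here [variable2] = [answer1] + 3.",
    "instanceId": "demo-0",
    "origin_instanceIds": ["a", "b"],
    "target": [7, 20],
    "num_problems": 2,
    "selected_variables": [],
})
print(placeholders(parse_record(line)))
```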
## Citation
If you find R-HORIZON helpful for your research, please cite our paper:
```bibtex
@misc{lu2025rhorizonfarlargereasoning,
title={R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?},
author={Yi Lu and Jianing Wang and Linsen Guo and Wei He and Hongyin Tang and Tao Gui and Xuanjing Huang and Xuezhi Cao and Wei Wang and Xunliang Cai},
year={2025},
eprint={2510.08189},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.08189},
}
```
Provider: maas
Created: 2025-11-03



