R-HORIZON-Websearch

Name: R-HORIZON-Websearch
Creator: maas
Published: 2025-12-05 16:55:53
License: No description provided

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/meituan-longcat/R-HORIZON-Websearch

Resource description:

<div align="center">
<h1>
<img src="https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/problem-solving.png" alt="logo" width="60" style="vertical-align:middle; margin-right:10px;"> R-HORIZON
</h1>
<div>
How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
</div>
</div>

<br>

<p align="center">
📃 <a href="https://arxiv.org/abs/2510.08189" target="_blank">Paper</a> • 🌐 <a href="https://reasoning-horizon.github.io/" target="_blank">Project Page</a> • 🤗 <a href="https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data" target="_blank">Dataset</a>
</p>

R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/mainfig.png)

## 🔥 Releases

**[2025-10-09]**

- 🎉 **R-HORIZON Benchmark** is now available! Test your LRMs on complex multi-horizon reasoning tasks.
- 🤗 **Training and evaluation datasets** are available on Hugging Face: [R-HORIZON Dataset](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data)
- 📄 **Paper released** on arXiv: [R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?](https://arxiv.org/abs/2510.08189)

## 🌟 Overview

Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.
**Key challenges in current paradigms:**

- **Limited evaluation scope**: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- **Limited effective reasoning length**: Models struggle to maintain performance as reasoning chains grow longer
- **Poor thinking budget allocation**: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems

To address these limitations, we introduce **R-HORIZON**, which:

- Transforms isolated problems into **complex multi-step reasoning scenarios** through query composition
- Establishes the **R-HORIZON Benchmark** comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables **reinforcement learning with verified rewards (RLVR)** using long-horizon reasoning data

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/method_fig.png)

## 📖 Table of Contents

- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)

## 📊 R-HORIZON Benchmark

We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase:

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/result_fig.png)

**Key findings from our benchmark evaluation:**

- **Universal performance degradation**: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
- **Model size matters**: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
- **Task-dependent degradation**: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.

## 🚀 Training with R-HORIZON

Training with R-HORIZON-composed data yields substantial improvements on both single and multi-horizon reasoning tasks:

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/skywork_n1_n2_comparison.png)

**Training result highlights:**

- **Dual Performance Gains**: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).
- **Scalable Complexity**: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on MATH500 (n=8).
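The two gains quoted in the highlights can be recomputed from the results table in this section by subtracting the `Baseline (n=1)` row from the `R-HORIZON (n=2)` row. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and the `gain` helper are illustrative assumptions, and only the four scores are taken from the table.

```python
# Scores (accuracy, %) copied from the results table.
# The dictionary layout and the helper name are illustrative assumptions.
SCORES = {
    "Baseline (n=1)": {"AIME24 (Origin)": 57.9, "AIME24 (n=2)": 16.7},
    "R-HORIZON (n=2)": {"AIME24 (Origin)": 65.4, "AIME24 (n=2)": 34.1},
}


def gain(benchmark, model="R-HORIZON (n=2)", reference="Baseline (n=1)"):
    """Return the improvement of `model` over `reference`, in points."""
    delta = SCORES[model][benchmark] - SCORES[reference][benchmark]
    # Round to one decimal to hide floating-point noise.
    return round(delta, 1)


print(gain("AIME24 (n=2)"))     # 17.4
print(gain("AIME24 (Origin)"))  # 7.5
```

Both values match the figures quoted in the highlights.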
All values are accuracy (%).

| Models          | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B      | 93.6             | 11.8          | 48.3            | 16.4         | 33.3            | 3.5          | 90.2           | 48.8        |
| Baseline (n=1)  | **95.6**         | 8.4           | 57.9            | 16.7         | 47.9            | 5.1          | **95.9**       | 55.0        |
| R-HORIZON (n=2) | 95.4             | 21.4          | **65.4**        | 34.1         | **49.6**        | **10.0**     | 94.1           | **80.6**    |
| R-HORIZON (n=4) | 94.6             | **50.6**      | 62.9            | **34.8**     | 45.4            | 8.1          | 91.9           | 79.1        |

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON

# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# Install additional dependencies
pip install -r requirements.txt
```

### Benchmark Evaluation

1. Download the R-HORIZON Benchmark

   ```bash
   # Download benchmark datasets
   python ./evaluation/data/download.py
   ```

2. Modify `config.json` under the `evaluation` directory. Standard JSON does not permit `//` comments, so strip them if your parser rejects the file.

   ```json
   {
     "inference": {
       // model_key (e.g. r1-distill-qwen7b) is for run.sh
       "r1-distill-qwen7b": {
         // the ip and port used in vllm server
         "base_url": "http://{Your IP and Port}/v1/completions",
         "api_key": "EMPTY",
         // model_name is corresponding to the modelname in vllm server
         "model_name": "{vllm's modelname}",
         "params": {
           "temperature": 1.0,
           "top_p": 0.95,
           "top_k": 10,
           "max_tokens": 65536
         },
         "prompt_prefix": "<|im_start|>user:\n",
         "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n"
       }
     },
     "extract": {
       "gpt-4.1": {
         "model_name": "gpt-4.1",
         "base_url": "{OpenAI's baseurl}",
         "api_key": "{Your API key}",
         "params": {
           "temperature": 0.0,
           "max_tokens": 16000
         }
       }
     }
   }
   ```

3. Run a vLLM server

   ```bash
   vllm serve {modelname} \
     --host {ip} \
     --port {port} \
     --served-model-name {modelname} \
     --dtype auto --pipeline-parallel-size 1 --tensor-parallel-size 1 --trust-remote-code \
     --enable-chunked-prefill --max-model-len 131072 --max-num-batched-tokens 10240 \
     --max-num-seqs 256 --gpu-memory-utilization 0.85 --disable-custom-all-reduce \
     --enable-reasoning --reasoning-parser deepseek_r1
   ```

4. Evaluate your model

   Here is a bash example; `model_key` is the key defined in `config.json`.

   ```bash
   sh evaluation/run.sh {input_file} {output_dir} {model_key}
   # example
   sh evaluation/run.sh evaluation/data/R-HORIZON-Math500/Math500-combined-n2.jsonl evaluation/result r1-distill-qwen7b
   ```

### Training with R-HORIZON datasets

1. Download composed training data

   ```python
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="meituan-longcat/R-HORIZON-training-data",
       repo_type="dataset",
       local_dir="./training/data",
   )
   ```

2. Launch training

   ```bash
   # Train with R-HORIZON using GRPO algorithm
   bash ./training/scripts/train/skywork-or1-rlvr-math-training-7b-40k.sh
   ```

## Dataset

### Dataset Construction

Step 1: Filter Samples with Valid Integers

```bash
# Purpose: Retain samples containing valid integers in input text and pure integer targets, excluding ambiguous numeric expressions (e.g., floats, fractions, LaTeX commands).
python step1_filt_integer_samples.py
```

Step 2: Identify Key Variables

```bash
# Purpose: Select "key variables" (critical integers that significantly affect problem outcomes)
# Configure API credentials in the script (replace YOUR_API_KEY)
python step2_select_key_variable.py
```

Step 3: Combine into Chained Reasoning Problems

```bash
# Purpose: Generate multi-horizon chained problems where each step's key variable depends on the previous step's answer.
python step3_combine_problems.py
```

### Dataset on Hugging Face Hub

The R-HORIZON training datasets and evaluation benchmark are available on Hugging Face Hub:

| Dataset Type | Dataset Name            | Hugging Face Link                                                                |
|--------------|-------------------------|----------------------------------------------------------------------------------|
| Evaluation   | R-HORIZON-Math500       | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Math500)        |
| Evaluation   | R-HORIZON-AIME24        | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME24)         |
| Evaluation   | R-HORIZON-AIME25        | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME25)         |
| Evaluation   | R-HORIZON-AMC23         | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AMC23)          |
| Evaluation   | R-HORIZON-Websearch     | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Websearch)      |
| Training     | R-HORIZON-training-data | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data)  |

### Dataset Structure

```json
{
  "input": "[1-N linked problems + solving instructions (with [variablek]/[answerk] placeholders)]",
  "instanceId": "[Unique ID for this instance]",
  "origin_instanceIds": "[List of original problem IDs]",
  "target": "[List of final answers, e.g., [answer1, answer2]]",
  "num_problems": "[Total problems, e.g., 2]",
  "selected_variables": [
    {
      "number": "[Key variable from problem]",
      "context": "[Context of the number]",
      "text": "[Text of the number]",
      "is_independent": "[true/false]",
      "is_in_math_env": "[true/false]"
    }
  ]
}
```

## Citation

If you find R-HORIZON helpful for your research, please cite our paper:

```bibtex
@misc{lu2025rhorizonfarlargereasoning,
  title={R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?},
  author={Yi Lu and Jianing Wang and Linsen Guo and Wei He and Hongyin Tang and Tao Gui and Xuanjing Huang and Xuezhi Cao and Wei Wang and Xunliang Cai},
  year={2025},
  eprint={2510.08189},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.08189},
}
```
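Step 3 of Dataset Construction chains problems so that each step's key variable depends on the previous step's answer, and Dataset Structure describes the resulting record. The sketch below shows one simple way such a dependency could be built. It is not the logic of `step3_combine_problems.py`: the `compose` function, the per-problem keys (`id`, `text`, `answer`, `key_variable`), and the "previous answer plus a constant offset" rule are all assumptions made for illustration. Only the `[variablek]`/`[answerk]` placeholders and the output field names come from the documented structure.

```python
import re


def compose(problems):
    """Chain problems so each key variable is defined via the previous answer.

    Hypothetical input: a list of dicts with keys "id", "text",
    "answer" (int) and "key_variable" (an int that appears in "text").
    """
    parts, targets = [], []
    for k, prob in enumerate(problems, start=1):
        text = prob["text"]
        if k > 1:
            prev_answer = problems[k - 2]["answer"]
            # Offset chosen so the placeholder resolves to the original value.
            offset = prob["key_variable"] - prev_answer
            # Match the key variable only as a standalone number.
            pattern = r"(?<!\d)" + re.escape(str(prob["key_variable"])) + r"(?!\d)"
            text, n_subs = re.subn(pattern, f"[variable{k}]", text, count=1)
            if n_subs != 1:
                raise ValueError(f"key variable not found in problem {k}")
            text += f" Here [variable{k}] = [answer{k - 1}] + ({offset})."
        parts.append(f"Problem {k}: {text}")
        targets.append(prob["answer"])
    return {
        "input": "\n".join(parts),
        "origin_instanceIds": [p["id"] for p in problems],
        "target": targets,
        "num_problems": len(problems),
    }


record = compose([
    {"id": "p1", "text": "What is 3 + 4?", "answer": 7, "key_variable": 3},
    {"id": "p2", "text": "A rectangle has width 5 and height 2. What is its area?",
     "answer": 10, "key_variable": 5},
])
print(record["input"])
```

In this toy pair, the second problem's width becomes `[variable2] = [answer1] + (-2)`. The area can then only be computed after the first problem is solved (7 + (-2) = 5), while the ground-truth targets stay `[7, 10]`.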

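The `inference` entry of `config.json` in Benchmark Evaluation pairs a `/v1/completions` URL with sampling `params` and a `prompt_prefix`/`prompt_suffix`. The sketch below shows how such an entry could be turned into a request body for an OpenAI-compatible completions endpoint. How `evaluation/run.sh` actually consumes the config is not shown in this README, so the `build_completion_request` helper and the `my-model` name are hypothetical.

```python
import json


def build_completion_request(model_cfg, question):
    """Build a JSON body for an OpenAI-compatible /v1/completions call.

    Assumption: the prompt is prefix + question + suffix, and every entry
    in "params" is passed through as a top-level sampling field.
    """
    prompt = model_cfg["prompt_prefix"] + question + model_cfg["prompt_suffix"]
    body = {"model": model_cfg["model_name"], "prompt": prompt}
    body.update(model_cfg["params"])
    return body


# Values mirror the config.json example; "my-model" is a placeholder name.
cfg = {
    "model_name": "my-model",
    "params": {"temperature": 1.0, "top_p": 0.95, "top_k": 10, "max_tokens": 65536},
    "prompt_prefix": "<|im_start|>user:\n",
    "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n",
}

body = build_completion_request(cfg, "What is 3 + 4?")
print(json.dumps(body, indent=2))
```

The body would be sent with an HTTP POST to the configured `base_url`. The sketch stops before the network call so that it runs without a server.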

Provider: maas

Created: 2025-11-03
