R-HORIZON-Math500

Name: R-HORIZON-Math500
Creator: maas
Published: 2026-01-08 10:46:34
License: not specified

ModelScope community. Updated: 2026-01-08. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/meituan-longcat/R-HORIZON-Math500


Description:

<div align="center">
<h1>
<img src="https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/problem-solving.png" alt="logo" width="60" style="vertical-align:middle; margin-right:10px;"> R-HORIZON
</h1>
<div>
How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
</div>
</div>

<br>

<p align="center">
📃 <a href="https://arxiv.org/abs/2510.08189" target="_blank">Paper</a> • 🌐 <a href="https://reasoning-horizon.github.io/" target="_blank">Project Page</a> • 🤗 <a href="https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data" target="_blank">Dataset</a>
</p>

R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/mainfig.png)

## 🔥 Releases

**[2025-10-09]**

- 🎉 **R-HORIZON Benchmark** is now available! Test your LRMs on complex multi-horizon reasoning tasks.
- 🤗 **Training and evaluation datasets** are available on Hugging Face: [R-HORIZON Dataset](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data)
- 📄 **Paper released** on arXiv: [R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?](https://arxiv.org/abs/2510.08189)

## 🌟 Overview

Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.

**Key challenges in current paradigms:**

- **Limited evaluation scope**: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- **Limited effective reasoning length**: Models struggle to maintain performance as reasoning chains grow longer
- **Poor thinking budget allocation**: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems

To address these limitations, we introduce **R-HORIZON**, which:

- Transforms isolated problems into **complex multi-step reasoning scenarios** through query composition
- Establishes the **R-HORIZON Benchmark** comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables **reinforcement learning with verified rewards (RLVR)** using long-horizon reasoning data

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/method_fig.png)

## 📖 Table of Contents

- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)

## 📊 R-HORIZON Benchmark

We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase:

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/result_fig.png)

**Key findings from our benchmark evaluation:**

- **Universal performance degradation**: Even the most powerful models suffer severe drops as problem count increases.
  For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
- **Model size matters**: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
- **Task-dependent degradation**: Code generation tasks show steeper performance declines than mathematics tasks. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.

## 🚀 Training with R-HORIZON

Training on R-HORIZON composed data yields substantial improvements on both single and multi-horizon reasoning tasks:

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/skywork_n1_n2_comparison.png)

**Training result highlights:**

- **Dual Performance Gains**: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).
- **Scalable Complexity**: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on MATH500 (n=8).

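The point gains quoted in the highlights can be reproduced from the AIME24 columns of the training results table in this section. The snippet below is only a quick arithmetic check; it assumes the gains are measured against the `Baseline (n=1)` row, and the `scores` dictionary and `gain` helper are illustrative names, not part of the R-HORIZON code.

```python
# Scores (%) copied from the training results table in this README.
scores = {
    "Baseline (n=1)": {"AIME24 (Origin)": 57.9, "AIME24 (n=2)": 16.7},
    "R-HORIZON (n=2)": {"AIME24 (Origin)": 65.4, "AIME24 (n=2)": 34.1},
}


def gain(setting: str) -> float:
    """Point gain of R-HORIZON (n=2) over the n=1 baseline for one column."""
    trained = scores["R-HORIZON (n=2)"][setting]
    baseline = scores["Baseline (n=1)"][setting]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(trained - baseline, 1)


print(gain("AIME24 (n=2)"))     # multi-horizon gain
print(gain("AIME24 (Origin)"))  # single-problem gain
```

Both values match the +17.4 and +7.5 points stated in the highlights.
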
| Models          | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B      | 93.6             | 11.8          | 48.3            | 16.4         | 33.3            | 3.5          | 90.2           | 48.8        |
| Baseline (n=1)  | **95.6**         | 8.4           | 57.9            | 16.7         | 47.9            | 5.1          | **95.9**       | 55.0        |
| R-HORIZON (n=2) | 95.4             | 21.4          | **65.4**        | 34.1         | **49.6**        | **10.0**     | 94.1           | **80.6**    |
| R-HORIZON (n=4) | 94.6             | **50.6**      | 62.9            | **34.8**     | 45.4            | 8.1          | 91.9           | 79.1        |

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON

# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# Install additional dependencies
pip install -r requirements.txt
```

### Benchmark Evaluation

1. Download the R-HORIZON Benchmark

```bash
# Download benchmark datasets
python ./evaluation/data/download.py
```

2. Modify `config.json` under the `evaluation` directory

```json
{
    "inference": {
        // model_key (e.g. r1-distill-qwen7b) is the key passed to run.sh
        "r1-distill-qwen7b": {
            // the IP and port used by the vLLM server
            "base_url": "http://{Your IP and Port}/v1/completions",
            "api_key": "EMPTY",
            // model_name must match the model name served by vLLM
            "model_name": "{vllm's modelname}",
            "params": {
                "temperature": 1.0,
                "top_p": 0.95,
                "top_k": 10,
                "max_tokens": 65536
            },
            "prompt_prefix": "<|im_start|>user:\n",
            "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n"
        }
    },
    "extract": {
        "gpt-4.1": {
            "model_name": "gpt-4.1",
            "base_url": "{OpenAI's baseurl}",
            "api_key": "{Your API key}",
            "params": {
                "temperature": 0.0,
                "max_tokens": 16000
            }
        }
    }
}
```

The `//` comments above are annotations only. Standard JSON does not allow comments, so remove them if your parser rejects the file.

3. Run a vLLM server

```bash
vllm serve {modelname} \
    --host {ip} \
    --port {port} \
    --served-model-name {modelname} \
    --dtype auto --pipeline-parallel-size 1 --tensor-parallel-size 1 --trust-remote-code \
    --enable-chunked-prefill --max-model-len 131072 --max-num-batched-tokens 10240 \
    --max-num-seqs 256 --gpu-memory-utilization 0.85 --disable-custom-all-reduce \
    --enable-reasoning --reasoning-parser deepseek_r1
```

4. Evaluate your model

The following is a bash example; `model_key` must be one of the keys defined under `inference` in `config.json`.

```bash
sh evaluation/run.sh {input_file} {output_dir} {model_key}
# example
sh evaluation/run.sh evaluation/data/R-HORIZON-Math500/Math500-combined-n2.jsonl evaluation/result r1-distill-qwen7b
```

### Training with R-HORIZON datasets

1. Download composed training data

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meituan-longcat/R-HORIZON-training-data",
    repo_type="dataset",
    local_dir="./training/data",
)
```

2. Launch training

```bash
# Train with R-HORIZON using GRPO algorithm
bash ./training/scripts/train/skywork-or1-rlvr-math-training-7b-40k.sh
```

## Dataset

### Dataset Construction

Step 1: Filter Samples with Valid Integers

```bash
# Purpose: Retain samples containing valid integers in input text and pure integer targets, excluding ambiguous numeric expressions (e.g., floats, fractions, LaTeX commands).
python step1_filt_integer_samples.py
```

Step 2: Identify Key Variables

```bash
# Purpose: Select "key variables" (critical integers that significantly affect problem outcomes)
# Configure API credentials in the script (replace YOUR_API_KEY)
python step2_select_key_variable.py
```

Step 3: Combine into Chained Reasoning Problems

```bash
# Purpose: Generate multi-horizon chained problems where each step's key variable depends on the previous step's answer.
python step3_combine_problems.py
```

### Dataset on Hugging Face Hub

The R-HORIZON training datasets and evaluation benchmark are available on Hugging Face Hub:

| Dataset Type | Dataset Name            | Hugging Face Link                                                               |
|--------------|-------------------------|---------------------------------------------------------------------------------|
| Evaluation   | R-HORIZON-Math500       | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Math500)       |
| Evaluation   | R-HORIZON-AIME24        | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME24)        |
| Evaluation   | R-HORIZON-AIME25        | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME25)        |
| Evaluation   | R-HORIZON-AMC23         | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AMC23)         |
| Evaluation   | R-HORIZON-Websearch     | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Websearch)     |
| Training     | R-HORIZON-training-data | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data) |

### Dataset Structure

```json
{
    "input": "[1-N linked problems + solving instructions (with [variablek]/[answerk] placeholders)]",
    "instanceId": "[Unique ID for this instance]",
    "origin_instanceIds": "[List of original problem IDs]",
    "target": "[List of final answers, e.g., [answer1, answer2]]",
    "num_problems": "[Total problems, e.g., 2]",
    "selected_variables": [
        {
            "number": "[Key variable from problem]",
            "context": "[Context of the number]",
            "text": "[Text of the number]",
            "is_independent": "[true/false]",
            "is_in_math_env": "[true/false]"
        }
    ]
}
```

## Citation

If you find R-HORIZON helpful for your research, please cite our paper:

```bibtex
@misc{lu2025rhorizonfarlargereasoning,
      title={R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?},
      author={Yi Lu and Jianing Wang and Linsen Guo and Wei He and Hongyin Tang and Tao Gui and Xuanjing Huang and Xuezhi Cao and Wei Wang and Xunliang Cai},
      year={2025},
      eprint={2510.08189},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.08189},
}
```
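
As a supplement to the Dataset Construction and Dataset Structure sections, the following toy sketch shows what "each step's key variable depends on the previous step's answer" can look like in practice. It is not the logic of `step3_combine_problems.py`: the `Problem` class, the `compose` function, the additive linking rule, and the sample problems are all hypothetical, chosen only so that the `[variablek]`/`[answerk]` placeholders resolve to the original values when earlier answers are correct.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    text: str       # statement that contains the key integer
    key_value: int  # original value of the key variable
    answer: int     # integer answer of the standalone problem


def compose(problems: list[Problem]) -> dict:
    """Chain problems: from the second one on, the key variable is
    written in terms of the previous problem's answer."""
    parts = []
    for k, p in enumerate(problems, start=1):
        text = p.text
        if k > 1:
            prev_answer = problems[k - 2].answer
            # Hypothetical linking rule: pick the offset so the placeholder
            # equals the original key value when the previous answer is right.
            offset = p.key_value - prev_answer
            # Toy substitution: replaces the first textual match only.
            text = text.replace(str(p.key_value), f"[variable{k}]", 1)
            text += f" Here [variable{k}] = [answer{k - 1}] + ({offset})."
        parts.append(f"Problem {k}: {text}")
    return {
        "input": "\n".join(parts),
        "target": [p.answer for p in problems],
        "num_problems": len(problems),
    }


record = compose([
    Problem("What is 3 + 4?", key_value=3, answer=7),
    Problem("What is 5 * 6?", key_value=5, answer=30),
])
print(record["input"])
```

In this sketch the second problem can only be solved after the first, because `[variable2]` is defined as `[answer1] + (-2)`, which recovers the original value 5 when the first answer (7) is correct. The returned dictionary mirrors only the `input`, `target`, and `num_problems` fields of the record format.
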


Provider:

maas

Created:

2025-11-03
