R-HORIZON-Math500

Source: ModelScope community · Updated 2026-01-08 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/meituan-longcat/R-HORIZON-Math500
Description:
<div align="center">
  <h1>
    <img src="https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/problem-solving.png" alt="logo" width="60" style="vertical-align:middle; margin-right:10px;">
    R-HORIZON
  </h1>
  <div>
    How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
  </div>
</div>

<br>

<p align="center">
  📃 <a href="https://arxiv.org/abs/2510.08189" target="_blank">Paper</a> •
  🌐 <a href="https://reasoning-horizon.github.io/" target="_blank">Project Page</a> •
  🤗 <a href="https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data" target="_blank">Dataset</a>
</p>

R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/mainfig.png)

## 🔥 Releases

**[2025-10-09]**

- 🎉 **R-HORIZON Benchmark** is now available! Test your LRMs on complex multi-horizon reasoning tasks.
- 🤗 **Training and evaluation datasets** are available on Hugging Face: [R-HORIZON Dataset](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data)
- 📄 **Paper released** on arXiv: [R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?](https://arxiv.org/abs/2510.08189)

## 🌟 Overview

Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.

**Key challenges in current paradigms:**

- **Limited evaluation scope**: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- **Limited effective reasoning length**: Models struggle to maintain performance as reasoning chains grow longer
- **Poor thinking budget allocation**: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems

To address these limitations, we introduce **R-HORIZON**, which:

- Transforms isolated problems into **complex multi-step reasoning scenarios** through query composition
- Establishes the **R-HORIZON Benchmark** comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables **reinforcement learning with verified rewards (RLVR)** using long-horizon reasoning data

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/method_fig.png)

## 📖 Table of Contents

- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)

## 📊 R-HORIZON Benchmark

We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase:

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/result_fig.png)

**Key findings from our benchmark evaluation:**

- **Universal performance degradation**: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
- **Model size matters**: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
- **Task-dependent degradation**: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.

## 🚀 Training with R-HORIZON

Training with R-HORIZON composed data yields substantial improvements on both single and multi-horizon reasoning tasks:

![](https://raw.githubusercontent.com/meituan-longcat/R-HORIZON/main/assets/skywork_n1_n2_comparison.png)

**Training results highlights:**

- **Dual Performance Gains**: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).
- **Scalable Complexity**: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on Math500 (n=8).
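
The two gains quoted in the highlights can be reproduced from the AIME24 columns of the results table in this section. The sketch below assumes the gains are measured against the Baseline (n=1) row, which the README does not state explicitly; the dictionary simply copies the table values.

```python
# AIME24 accuracies (%) copied from the results table in this section.
# Assumption: the quoted gains compare R-HORIZON (n=2) with Baseline (n=1).
aime24 = {
    "Baseline (n=1)": {"origin": 57.9, "n=2": 16.7},
    "R-HORIZON (n=2)": {"origin": 65.4, "n=2": 34.1},
}


def gain(setting):
    """Percentage-point gain of R-HORIZON (n=2) over Baseline (n=1)."""
    trained = aime24["R-HORIZON (n=2)"][setting]
    baseline = aime24["Baseline (n=1)"][setting]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(trained - baseline, 1)


print(gain("n=2"))     # 17.4
print(gain("origin"))  # 7.5
```

Both values match the figures in the highlights, which supports reading them as differences against the Baseline (n=1) row.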

| Models          | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B      | 93.6             | 11.8          | 48.3            | 16.4         | 33.3            | 3.5          | 90.2           | 48.8        |
| Baseline (n=1)  | **95.6**         | 8.4           | 57.9            | 16.7         | 47.9            | 5.1          | **95.9**       | 55.0        |
| R-HORIZON (n=2) | 95.4             | 21.4          | **65.4**        | 34.1         | **49.6**        | **10.0**     | 94.1           | **80.6**    |
| R-HORIZON (n=4) | 94.6             | **50.6**      | 62.9            | **34.8**     | 45.4            | 8.1          | 91.9           | 79.1        |

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON

# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# Install additional dependencies
pip install -r requirements.txt
```

### Benchmark Evaluation

1. Download the R-HORIZON Benchmark

   ```bash
   # Download benchmark datasets
   python ./evaluation/data/download.py
   ```

2. Modify `config.json` under the `evaluation` directory

   ```json
   {
       "inference": {
           // model_key (e.g. r1-distill-qwen7b) is for run.sh
           "r1-distill-qwen7b": {
               // the ip and port used in vllm server
               "base_url": "http://{Your IP and Port}/v1/completions",
               "api_key": "EMPTY",
               // model_name is corresponding to the modelname in vllm server
               "model_name": "{vllm's modelname}",
               "params": {
                   "temperature": 1.0,
                   "top_p": 0.95,
                   "top_k": 10,
                   "max_tokens": 65536
               },
               "prompt_prefix": "<|im_start|>user:\n",
               "prompt_suffix": "\n<|im_end|>\n<|im_start|>assistant:\n"
           }
       },
       "extract": {
           "gpt-4.1": {
               "model_name": "gpt-4.1",
               "base_url": "{OpenAI's baseurl}",
               "api_key": "{Your API key}",
               "params": {
                   "temperature": 0.0,
                   "max_tokens": 16000
               }
           }
       }
   }
   ```

3. Run a vllm server

   ```bash
   vllm serve {modelname} \
       --host {ip} \
       --port {port} \
       --served-model-name {modelname} \
       --dtype auto --pipeline-parallel-size 1 --tensor-parallel-size 1 --trust-remote-code \
       --enable-chunked-prefill --max-model-len 131072 --max-num-batched-tokens 10240 \
       --max-num-seqs 256 --gpu-memory-utilization 0.85 --disable-custom-all-reduce \
       --enable-reasoning --reasoning-parser deepseek_r1
   ```

4. Evaluate your model

   Here is a bash example; `model_key` is the key defined in `config.json`.

   ```bash
   sh evaluation/run.sh {input_file} {output_dir} {model_key}
   # example
   sh evaluation/run.sh evaluation/data/R-HORIZON-Math500/Math500-combined-n2.jsonl evaluation/result r1-distill-qwen7b
   ```

### Training with R-HORIZON datasets

1. Download composed training data

   ```python
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="meituan-longcat/R-HORIZON-training-data",
       repo_type="dataset",
       local_dir="./training/data",
   )
   ```

2. Launch training

   ```bash
   # Train with R-HORIZON using GRPO algorithm
   bash ./training/scripts/train/skywork-or1-rlvr-math-training-7b-40k.sh
   ```

## Dataset

### Dataset Construction

Step 1: Filter Samples with Valid Integers

```bash
# Purpose: Retain samples containing valid integers in input text and pure integer targets, excluding ambiguous numeric expressions (e.g., floats, fractions, LaTeX commands).
python step1_filt_integer_samples.py
```

Step 2: Identify Key Variables

```bash
# Purpose: select "key variables" (critical integers that significantly affect problem outcomes)
# configure API credentials in the script (replace YOUR_API_KEY)
python step2_select_key_variable.py
```

Step 3: Combine into Chained Reasoning Problems

```bash
# Purpose: Generate multi-horizon chained problems where each step's key variable depends on the previous step's answer.
python step3_combine_problems.py
```

### Dataset on Hugging Face Hub

The R-HORIZON training datasets and evaluation benchmark are available on Hugging Face Hub:

| Dataset Type | Dataset Name            | Hugging Face Link                                                               |
|--------------|-------------------------|---------------------------------------------------------------------------------|
| Evaluation   | R-HORIZON-Math500       | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Math500)       |
| Evaluation   | R-HORIZON-AIME24        | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME24)        |
| Evaluation   | R-HORIZON-AIME25        | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AIME25)        |
| Evaluation   | R-HORIZON-AMC23         | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-AMC23)         |
| Evaluation   | R-HORIZON-Websearch     | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-Websearch)     |
| Training     | R-HORIZON-training-data | [link](https://huggingface.co/datasets/meituan-longcat/R-HORIZON-training-data) |

### Dataset Structure

```json
{
    "input": "[1-N linked problems + solving instructions (with [variablek]/[answerk] placeholders)]",
    "instanceId": "[Unique ID for this instance]",
    "origin_instanceIds": "[List of original problem IDs]",
    "target": "[List of final answers, e.g., [answer1, answer2]]",
    "num_problems": "[Total problems, e.g., 2]",
    "selected_variables": [
        {
            "number": "[Key variable from problem]",
            "context": "[Context of the number]",
            "text": "[Text of the number]",
            "is_independent": "[true/false]",
            "is_in_math_env": "[true/false]"
        }
    ]
}
```

## Citation

If you find R-HORIZON helpful for your research, please cite our paper:

```bibtex
@misc{lu2025rhorizonfarlargereasoning,
      title={R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?},
      author={Yi Lu and Jianing Wang and Linsen Guo and Wei He and Hongyin Tang and Tao Gui and Xuanjing Huang and Xuezhi Cao and Wei Wang and Xunliang Cai},
      year={2025},
      eprint={2510.08189},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.08189},
}
```
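
As a companion to the Dataset Structure section, here is a minimal sketch of a sanity check for benchmark files in JSON Lines form. It only verifies that the documented field names are present, because the README does not specify value types. The helper names `missing_keys` and `check_jsonl` and the sample records are hypothetical, made up for illustration rather than taken from the repository or the dataset.

```python
import json

# Field names taken from the "Dataset Structure" section. Value types are not
# specified there, so this sketch checks key presence only.
TOP_LEVEL_KEYS = {
    "input", "instanceId", "origin_instanceIds",
    "target", "num_problems", "selected_variables",
}
VARIABLE_KEYS = {"number", "context", "text", "is_independent", "is_in_math_env"}


def missing_keys(record):
    """Return the documented keys that are absent from one record."""
    missing = sorted(TOP_LEVEL_KEYS - set(record))
    for i, var in enumerate(record.get("selected_variables", [])):
        for key in sorted(VARIABLE_KEYS - set(var)):
            missing.append(f"selected_variables[{i}].{key}")
    return missing


def check_jsonl(lines):
    """Map 1-based line number -> missing keys, for lines with gaps only."""
    problems = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        gaps = missing_keys(json.loads(line))
        if gaps:
            problems[lineno] = gaps
    return problems


# Hypothetical records for illustration; real entries will differ.
complete = {
    "input": "Problem 1: ... Problem 2: ...",
    "instanceId": "demo-0",
    "origin_instanceIds": ["a", "b"],
    "target": ["3", "7"],
    "num_problems": 2,
    "selected_variables": [{
        "number": "5", "context": "...", "text": "5",
        "is_independent": True, "is_in_math_env": False,
    }],
}
incomplete = {"input": "...", "instanceId": "demo-1"}
sample = [json.dumps(complete), json.dumps(incomplete)]

print(check_jsonl(sample))
```

To check a downloaded file, an open file handle can be passed instead of the sample list, for example the `Math500-combined-n2.jsonl` file used in the evaluation example.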

Provider:
maas
Created:
2025-11-03