
Long-Horizon-Execution

ModelScope community. Updated 2026-01-02; indexed 2025-09-20.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Long-Horizon-Execution
Resource overview:
# Long Horizon Execution

This project contains the dataset accompanying the paper "[The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs](https://arxiv.org/abs/2509.09677)".

## Abstract

Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations -- curiously, we observe a self-conditioning effect -- models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
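
The compounding claim in the abstract can be made concrete with a simplified model. Assume, purely for illustration, that every step succeeds independently with a constant accuracy `p`. This is an idealization, since the abstract itself reports that per-step accuracy degrades as steps accumulate. Under that assumption the chance of finishing `H` steps without an error is `p ** H`, so the longest task completed with success rate at least `s` is `H = ln(s) / ln(p)`. The function name `horizon_length` and the default `s = 0.5` are hypothetical choices for this sketch.

```python
import math


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Longest task length H with step_accuracy ** H >= success_rate.

    Assumes independent steps with a constant per-step accuracy,
    which is a simplification made for this illustration only.
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    return math.log(success_rate) / math.log(step_accuracy)


# Each extra "nine" of per-step accuracy multiplies the horizon by about ten.
for p in (0.9, 0.99, 0.999):
    print(f"p = {p}: H is about {horizon_length(p):.1f} steps")
```

Under these assumptions, moving from 90% to 99% to 99.9% per-step accuracy stretches the horizon from roughly 7 to 69 to 693 steps, which is why small per-step gains near the top of the accuracy range matter so much for long tasks.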

**GitHub:** [https://github.com/long-horizon-execution/measuring-execution/](https://github.com/long-horizon-execution/measuring-execution/)

## Description

![Our task.](task_overview.png)

This dataset is a synthetic benchmark designed to measure the pure execution capability of LLMs over long horizons. The core task is **key-value dictionary addition**. A fixed, in-context dictionary mapping five-letter English words (keys) to integer values is provided in `dictionary.json`. The model's goal is to maintain a running sum. In each turn, it receives one or more keys (defined by the turn complexity, `K`), retrieves their corresponding values from the dictionary, adds them to the running sum, and outputs the new sum.

The primary metric for evaluation is the **task length**: the number of steps a model can execute before its accuracy drops below a certain threshold.

The dataset is designed to be programmatically generated and thus contamination-free. We only provide 100 samples here for ease of access, but more can be generated using the script [here](https://github.com/long-horizon-execution/measuring-execution/blob/main/generate_dataset_json.py).

## Using the dataset

`test.jsonl` contains the individual samples that can be used to prompt the LLM.

- _"input"_: contains the keys to be processed.
- _"values"_: contains the values mapped to the corresponding keys as described in `dictionary.json`.
- _"output"_: contains the expected running sum answers.

The provided dataset is configured with a turn complexity of `K=1` (one key per turn). To evaluate models on a higher turn complexity, such as `K=N`, you can post-process the data by grouping every `N` consecutive turns:

- _"input"_: Concatenate every `N` items into a single comma-separated string.
- _"output"_: The new running sum for the grouped turn is simply the **last** running sum from the original group of `N` turns.
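
The `K=N` post-processing described above can be sketched as follows. This is a minimal sketch, not the authors' script. It assumes each sample's `input` and `output` fields are equal-length lists with one entry per `K=1` turn and that the keys are strings. The helper name `group_turns` and the decision to drop an incomplete trailing group are assumptions, and the toy sample uses made-up keys and values, not rows from `test.jsonl`.

```python
def group_turns(sample: dict, n: int) -> dict:
    """Regroup a K=1 sample into turns of n keys each (turn complexity K=n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    keys, sums = sample["input"], sample["output"]
    if len(keys) != len(sums):
        raise ValueError("'input' and 'output' must have the same length")

    # Assumption: turns that do not fill a complete group of n are dropped.
    usable = len(keys) - len(keys) % n

    # Join every n keys into one comma-separated string.
    grouped_input = [",".join(keys[i:i + n]) for i in range(0, usable, n)]
    # The running sum after a grouped turn is the last sum of that group.
    grouped_output = [sums[i + n - 1] for i in range(0, usable, n)]
    return {"input": grouped_input, "output": grouped_output}


# Toy sample (hypothetical data) in the assumed K=1 layout.
toy = {
    "input": ["apple", "brick", "chair", "dream", "eagle"],
    "values": [3, -1, 4, 2, 5],
    "output": [3, 2, 6, 8, 13],
}
print(group_turns(toy, 2))
# {'input': ['apple,brick', 'chair,dream'], 'output': [2, 8]}
```

The `values` field is left out of the grouped result because the instructions above only specify how `input` and `output` change. If it is needed, it could be grouped the same way as `input`.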

## Sample Usage

To load and inspect the `test.jsonl` dataset:

```python
import json

file_path = "test.jsonl"
samples = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        samples.append(json.loads(line))

print(f"Loaded {len(samples)} samples.")
print("First sample:")
print(samples[0])
```

## Benchmark

![Benchmark of Frontier models.](benchmark.png)

## Citation

Find out more about our work at https://arxiv.org/abs/2509.09677. Consider citing us if you use our dataset.

```
@misc{sinha2025illusiondiminishingreturnsmeasuring,
      title={The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs},
      author={Akshit Sinha and Arvindh Arun and Shashwat Goel and Steffen Staab and Jonas Geiping},
      year={2025},
      eprint={2509.09677},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.09677},
}
```

Provider: maas
Created: 2025-09-15