AlienKevin/SWE-ZERO-1k-trajectories

Name: AlienKevin/SWE-ZERO-1k-trajectories
Creator: AlienKevin
Published: 2026-04-10 22:19:34
License: not stated on this listing (the dataset card below declares `apache-2.0`)

Source: Hugging Face
Updated: 2026-04-10
Indexed: 2026-04-12

Download link:

https://hf-mirror.com/datasets/AlienKevin/SWE-ZERO-1k-trajectories

Resource description:

---
language:
- en
- code
license: apache-2.0
size_categories:
- 1K<n<10K
task_categories:
- text-generation
tags:
- agentic
- code
- swe-bench
- multi-turn
- chat
- mini-swe-agent
dataset_info:
  features:
  - name: instance_id
    dtype: string
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: repo
    dtype: string
  - name: exit_status
    dtype: string
  - name: exit_status_detail
    dtype: string
  - name: n_turns
    dtype: int64
  - name: n_messages
    dtype: int64
  - name: prompt_tokens
    dtype: int64
  - name: completion_tokens
    dtype: int64
  - name: duration_sec
    dtype: float64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# SWE-ZERO-1k-trajectories

1,000 execution-free agentic rollouts from [`ricdomolm/mini-coder-1.7b`](https://huggingface.co/ricdomolm/mini-coder-1.7b) on 100 SWE-rebench V2 PRs (10 repos × 10 PRs × 10 rollouts each), generated as part of the SWE-ZERO MVP for Marin ([marin-community/marin#4561](https://github.com/marin-community/marin/issues/4561)).

Each row is one rollout. The `messages` column is in the standard chat format (list of `{role, content}`), so the HuggingFace dataset viewer renders each trajectory as a multi-turn conversation between the agent and the sandboxed bash environment.

## Schema

| field | type | description |
|---|---|---|
| `instance_id` | string | SWE-rebench V2 instance ID |
| `messages` | list[{role, content}] | the full chat trajectory (rendered by the HF viewer) |
| `repo` | string | `org/name` of the GitHub repo |
| `exit_status` | string | `Submitted` / `incomplete` / `errored` / `other` |
| `exit_status_detail` | string | full exit reason (truncated to 200 chars) |
| `n_turns` | int | assistant message count |
| `n_messages` | int | total message count (system + user + assistant) |
| `prompt_tokens` | int | total prompt tokens consumed across all turns |
| `completion_tokens` | int | total completion tokens generated |
| `duration_sec` | float | wall time of the rollout in seconds |

The `messages` column follows the mini-swe-agent v1 format: each assistant turn is `THOUGHT: ...` followed by exactly one `\`\`\`bash` block; each subsequent user turn is the bash subprocess output prefixed with `Observation:`. The agent runs in a PATH-whitelisted sandbox (no language interpreters, no network) against a real cloned checkout of the repo at the PR's `base_commit`.

## Generation details

- **Model**: `ricdomolm/mini-coder-1.7b` (Qwen3-1.7B fine-tuned on mini-swe-agent trajectories)
- **Inference**: vLLM-tpu on a single v6e-1 worker via the Marin Iris cluster
- **Sampling**: `temperature=1.0`, `max_total_tokens=8192`, `MAX_TURNS=30`
- **Sandbox**: per-rollout shallow git checkout + PATH whitelist (cat, grep, sed, find, awk, ls, git, ...) with bash returning `command not found` for anything else
- **Wall time**: 62 minutes for all 1,000 rollouts on v6e-1 with `concurrency=16`, `max_num_seqs=32`, `--enforce-eager`

## Statistics

- 154/1000 (15.4%) cleanly submitted via `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`
- 832/1000 (83.2%) ran out of token / turn budget mid-conversation
- 14/1000 (1.4%) hit a transient HTTP 400 from the per-turn token budget estimator (since fixed)
- Mean turns / rollout: 13.3
- Mean wall time / rollout: 26 s
- 87.4% unique trajectories at MinHash Jaccard < 0.5 (Code World Model near-deduplication metric)

## Source

Generated by `experiments/swe_zero/run_swe_zero_mvp.py` on the `kevin/swe-zero-mvp` branch of `marin-community/marin`. Raw input file: `gs://marin-us-central2/experiments/swe_zero_mvp/step6/rollouts.json`.
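The mini-swe-agent v1 message layout described under Schema (a `THOUGHT:` section followed by exactly one bash block, then a user turn prefixed with `Observation:`) can be split into its parts with a small parser. This is a minimal sketch based only on that description; the function names and the exact whitespace handling are assumptions, not code from the generation pipeline.

```python
import re
from typing import Optional, TypedDict

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Assumed layout of a command block: an opening fence tagged "bash", the command, a closing fence.
_BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)\n?" + FENCE, re.DOTALL)


class ParsedTurn(TypedDict):
    thought: str
    command: Optional[str]


def parse_assistant_turn(content: str) -> ParsedTurn:
    """Split one assistant message into its THOUGHT text and its single bash command."""
    blocks = _BASH_BLOCK.findall(content)
    # The format promises exactly one block; anything else is treated as malformed.
    command = blocks[0].strip() if len(blocks) == 1 else None
    thought = _BASH_BLOCK.sub("", content).strip()
    if thought.startswith("THOUGHT:"):
        thought = thought[len("THOUGHT:"):].strip()
    return {"thought": thought, "command": command}


def parse_observation(content: str) -> str:
    """Strip the 'Observation:' prefix from a user turn, if present."""
    prefix = "Observation:"
    if content.startswith(prefix):
        return content[len(prefix):].strip()
    return content
```

Returning `None` for turns without exactly one block makes it easy to count malformed turns, which may be common in rollouts that were cut off by the token or turn budget.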

Provided by:

AlienKevin

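Collected summary

Dataset introduction

Construction

In software engineering agent research, the SWE-ZERO-1k-trajectories dataset was built through a systematic rollout-generation pipeline. It draws on 100 pull requests from SWE-rebench V2, with 10 independent agent rollouts per pull request, for a total of 1,000 execution-free interaction records. The rollouts were produced by the fine-tuned model `ricdomolm/mini-coder-1.7b` running in a strictly isolated sandbox that exposes only a whitelisted set of command-line tools and operates on a clone of each repository at the PR's base commit. Generation ran on a single TPU v6e-1 worker in the Marin Iris cluster with `temperature=1.0` and `concurrency=16`; the dataset card reports 87.4% unique trajectories under its near-deduplication metric.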
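The PATH-whitelist idea behind the sandbox can be illustrated with a short sketch: put symlinks to the allowed tools in an otherwise empty directory and run bash with `PATH` pointing only at it, so anything else fails with `command not found`. The helper names are hypothetical, the whitelist shown is only the part the dataset card spells out (the card elides the rest), and nothing here is taken from the actual generation script.

```python
import os
import shutil
import subprocess
import tempfile

# Only the tools the dataset card names explicitly; the full whitelist is not published here.
WHITELIST = ["cat", "grep", "sed", "find", "awk", "ls", "git"]


def make_whitelist_dir(tools=WHITELIST) -> str:
    """Create a directory that contains symlinks to the allowed tools and nothing else."""
    bin_dir = tempfile.mkdtemp(prefix="sandbox-bin-")
    for name in tools:
        real = shutil.which(name)
        if real is not None:  # skip tools that are missing on this host
            os.symlink(real, os.path.join(bin_dir, name))
    return bin_dir


def run_sandboxed(command: str, bin_dir: str, cwd: str = ".") -> subprocess.CompletedProcess:
    """Run one command in bash with PATH restricted to the whitelist directory."""
    bash = shutil.which("bash")
    if bash is None:
        raise RuntimeError("bash is required for this sketch")
    return subprocess.run(
        [bash, "--noprofile", "--norc", "-c", command],
        env={"PATH": bin_dir},  # no other environment variables are inherited
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=30,
    )
```

A call such as `run_sandboxed("python3 --version", make_whitelist_dir())` exits with status 127, bash's code for a command it cannot find. A PATH restriction alone does not stop commands invoked by absolute path and does nothing about network access, so a real sandbox presumably needs further isolation; how the original pipeline handles that is not described in the card.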

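Features

The dataset's core features are its highly structured multi-turn trajectories and its rich metadata. Each trajectory is stored in the standard chat format and records the complete interaction between the agent and the sandbox: the agent's reasoning, the commands it issued, and the environment's feedback. The dataset also reports execution statistics such as submission rate, token consumption, and wall time, along with the near-deduplication metric from Code World Model, which indicates how diverse the trajectories are. These features make it a useful resource for studying agent decision-making, evaluating model generalization on code tasks, and analyzing multi-step reasoning.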
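The near-deduplication figure is stated as a MinHash Jaccard threshold of 0.5. The sketch below shows how such an estimate is commonly computed, in pure Python. The shingle size, the number of hash functions, the hash family, and the choice to compare only assistant text are all assumptions made for illustration; the dataset card does not specify them.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-token shingles of a text (k=5 is an assumed value)."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """Compute a MinHash signature using num_perm salted 64-bit hashes."""
    if not items:
        raise ValueError("cannot build a signature from an empty shingle set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode("utf-8"), digest_size=8).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def trajectory_text(messages: list) -> str:
    """Join the assistant turns of one trajectory into a single string."""
    return "\n".join(m["content"] for m in messages if m["role"] == "assistant")
```

Under these assumptions, two rollouts would count as near-duplicates when `estimated_jaccard` of their signatures is at least 0.5, and a trajectory would be "unique" when it stays below that threshold against the others.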

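Usage

The dataset is suited to training or evaluating conversational agents for software engineering tasks. Researchers can load it directly and use the multi-turn sequences in the `messages` field to replay an agent's step-by-step problem solving in a restricted environment. Metadata fields such as `exit_status` and `n_turns` can be used to analyze agent behavior patterns and performance bottlenecks. The dataset also supports trajectory-based comparative experiments, for example studying how sampling strategy affects task completion rate, or serving as a test set for validating new models on code modification and debugging tasks.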
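As a rough illustration of the loading and filtering workflow described above, the sketch below loads the `train` split with the Hugging Face `datasets` library and keeps only cleanly submitted rollouts. The helper names and the output layout of `to_sft_example` are hypothetical; the field names and the `Submitted` status value come from the dataset card.

```python
from typing import Iterable


def load_rows() -> list:
    """Load the train split as a list of dicts (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers below work offline

    ds = load_dataset("AlienKevin/SWE-ZERO-1k-trajectories", split="train")
    return list(ds)


def submitted_only(rows: Iterable[dict]) -> list:
    """Keep rollouts whose exit_status is 'Submitted'."""
    return [r for r in rows if r["exit_status"] == "Submitted"]


def to_sft_example(row: dict) -> dict:
    """Reduce a row to an id plus its chat messages (an assumed fine-tuning layout)."""
    return {
        "id": row["instance_id"],
        "messages": [
            {"role": m["role"], "content": m["content"]} for m in row["messages"]
        ],
    }
```

A typical use would be `examples = [to_sft_example(r) for r in submitted_only(load_rows())]`. Since the rollouts are execution-free, a `Submitted` status indicates only that the agent finished cleanly, not that its patch is correct.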

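Background and challenges

Background overview

In software engineering, automated code repair and task execution have become a frontier of AI research. The SWE-ZERO-1k-trajectories dataset was published in April 2026 as part of the SWE-ZERO minimum viable product for the Marin community, with the aim of exploring multi-turn interaction trajectories of LLM-based agents in real codebases. Its core research question is how well an agent model can work on real GitHub pull request problems inside a restricted sandbox. By recording the complete conversation between the model and the bash environment, it provides empirical data for code generation and agent behavior analysis.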

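Current challenges

The domain problems this dataset addresses concern code generation and task execution by agents: the difficulty of multi-step reasoning over complex code context, and the model's ability to complete real software repair tasks under a strict sandbox with no language interpreters and no network access. The construction challenges lie mainly in generating high-quality trajectories: balancing sampling temperature against the token budget to keep conversations coherent, handling the frequent cases where the model is cut off by the token or turn limit (83.2% of rollouts), and maintaining trajectory diversity to avoid repeated patterns. Together these factors are the main difficulties for building and analyzing the dataset.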

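Common scenarios

Typical use cases

In software engineering agent research, the SWE-ZERO-1k-trajectories dataset offers reference data for evaluating the interactive abilities of code generation and repair models. It simulates pull request scenarios from real GitHub repositories and records the agent's multi-turn trajectories in a restricted bash sandbox. It is typically used to train and validate autonomous coding agents on reasoning and decision-making in an execution-free setting. Researchers can use the trajectories to analyze how an agent comes to understand a problem, plans its actions, and produces a final solution.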

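Academic problems addressed

The dataset responds to the shortage of large-scale, structured interaction trajectories in coding agent research, and provides an empirical basis for studying model generalization and robustness on complex software maintenance tasks. Because it includes detailed exit status, token consumption, and timing measurements, it supports quantitative analysis of agent efficiency, failure modes, and resource use, which in turn helps with algorithm optimization and the design of evaluation standards.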
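The quantitative analysis mentioned above can be sketched with the standard library alone. The functions below compute the share of each `exit_status` and a per-repository summary. The function names and the summary layout are illustrative assumptions; the field names follow the dataset schema.

```python
from collections import Counter, defaultdict
from statistics import mean


def exit_status_shares(rows: list) -> dict:
    """Return the fraction of rollouts per exit_status (rows must be non-empty)."""
    counts = Counter(r["exit_status"] for r in rows)
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


def per_repo_summary(rows: list) -> dict:
    """Group rollouts by repo and report count, submit rate, and mean effort."""
    by_repo = defaultdict(list)
    for r in rows:
        by_repo[r["repo"]].append(r)
    summary = {}
    for repo, group in by_repo.items():
        submitted = sum(1 for r in group if r["exit_status"] == "Submitted")
        summary[repo] = {
            "rollouts": len(group),
            "submit_rate": submitted / len(group),
            "mean_turns": mean(r["n_turns"] for r in group),
            "mean_completion_tokens": mean(r["completion_tokens"] for r in group),
        }
    return summary
```

Applied to the full dataset, `exit_status_shares` should reproduce the proportions listed in the card's Statistics section, provided the `exit_status` column maps onto those three categories as the card implies.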

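Derived and related work

Work related to this dataset includes the mini-swe-agent framework, whose trajectory format it follows, and the near-duplicate trajectory deduplication technique from Code World Model, which it uses to measure diversity. These lines of work explore policy learning for agents in restricted environments, multi-task generalization, and trajectory efficiency, and offer reference points for extending benchmarks such as SWE-bench and for new agent architectures.

The content above was collected and summarized by 遇见数据集.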
