caiovicentino1/processflow

Name: caiovicentino1/processflow
Creator: caiovicentino1
Published: 2026-04-11 00:45:04
License: 暂无描述

Hugging Face2026-04-11 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/caiovicentino1/processflow

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation - question-answering language: - en - code tags: - code - agent - tool-use - debugging - refactoring - postmortem - process-supervision - chain-of-thought - multi-verifier - long-horizon pretty_name: ProcessFlow size_categories: - 100K<n<1M configs: - config_name: default data_files: - split: train path: train.jsonl - split: validation path: val.jsonl - split: test path: test.jsonl - config_name: messages data_files: - split: train path: training_format/messages/train.jsonl - split: validation path: training_format/messages/val.jsonl - split: test path: training_format/messages/test.jsonl - config_name: alpaca data_files: - split: train path: training_format/alpaca/train.jsonl - split: validation path: training_format/alpaca/val.jsonl - split: test path: training_format/alpaca/test.jsonl - config_name: sharegpt data_files: - split: train path: training_format/sharegpt/train.jsonl - split: validation path: training_format/sharegpt/val.jsonl - split: test path: training_format/sharegpt/test.jsonl --- # ProcessFlow **A multi-format, process-centric code dataset for training LLM agents.** > ✅ **EMPIRICALLY VALIDATED (2026-04-10).** Fine-tuning Qwen2.5-1.5B base on > v1.7 (108K training samples, 3 epochs, LoRA r=32) produced a > **+0.681 ProcessFlow-Eval delta** (0.217 → 0.899) with **no HumanEval > regression** and **PPL improvement of -4.62 nats** on held-out test data. > All 3 validation gates passed decisively. See [Empirical validation](#-empirical-validation-qwen25-15b) > section below. Trained adapter: [caiovicentino1/Qwen2.5-1.5B-ProcessFlow-v1.7-LoRA](https://huggingface.co/caiovicentino1/Qwen2.5-1.5B-ProcessFlow-v1.7-LoRA) ProcessFlow treats code generation as a **process**, not just an outcome. Instead of a single `(problem, solution)` shape, it captures 16 distinct engineering activities combined into a unified 128K-sample training corpus: debugging sessions, incident postmortems, refactoring journeys, multi-turn agent tool traces (up to ~1000 tool calls per sample), architecture decisions, PR reviews, performance investigations, stack-trace debugging, security audits, migration guides, and more. The dataset is designed for **training code agents that sustain coherent work over long horizons** — the capability gap between single-turn code generation and real production engineering. ## 📊 Quick stats | | Count | |---|---:| | **Total samples** | **127,897** | | Training split | 108,537 (85.0%) | | Validation split | 9,605 (7.5%) | | Test split | 9,655 (7.5%) | | LongAgent-Bench holdout | 100 | | **Formats** | **16** | | **Languages** | **43** | | **Avg sample size** | ~4.9 KB | | **Total size** | 1.3 GB | | **Contamination with HumanEval/MBPP/SWE-bench Lite** | **0.00%** (verified) | ## 📐 Empirical validation (Qwen2.5-1.5B) **Setup** | Item | Value | |---|---| | Base model | `Qwen/Qwen2.5-1.5B` (base, not -Instruct) | | Training data | `training_format/messages/train.jsonl` — full 108,537 samples | | Method | Unsloth LoRA, r=32, alpha=64, all linear targets | | Epochs | 3 (10,164 steps total) | | Hardware | Colab RTX PRO 6000 Blackwell (96 GB) | | Wall clock | 7h 27m | | Loss curve | 1.39 → 0.138 (-90.1%), healthy plateau from step ~5000 | | Trained adapter | [caiovicentino1/Qwen2.5-1.5B-ProcessFlow-v1.7-LoRA](https://huggingface.co/caiovicentino1/Qwen2.5-1.5B-ProcessFlow-v1.7-LoRA) | **ProcessFlow-Eval (n=180, exact_match + tool_trace_match)** | Format | Baseline | Trained | Δ | |---|---:|---:|---:| | `multi_step_evolution` | 0.045 | 0.872 | **+0.827** (19×) | | `legacy_single_step_fix` | 0.259 | 0.985 | +0.726 | | `stack_trace_debug` | 0.169 | 0.860 | +0.691 | | `agent_tool_trace` | 0.200 | 0.889 | +0.689 | | `security_chain` | 0.183 | 0.789 | +0.606 | | `executable_test_case` | 0.447 | 0.998 | +0.550 | | **OVERALL** | **0.217** | **0.899** | **+0.681** | **Three validation gates (3/3 passed):** | Gate | Threshold | Result | Status | |---|---|---|---| | ProcessFlow-Eval delta | ≥ +0.05 | **+0.681** (13.6× threshold) | ✅ | | HumanEval no regression | Δ > -0.02 | **+0.073** (12/164 vs 0/164)¹ | ✅ | | Test PPL improvement | trained < baseline | **2.638 < 7.258** (Δ -4.62) | ✅ | ¹ HumanEval baseline of 0.000 reflects a methodological issue (the chat template was applied to a base model that was not trained with one), not a true capability of zero. The positive delta after training rules out catastrophic forgetting on code generation, which is what this gate is designed to test. **Interpretation** Every format that was measurable improved substantially. The 19× jump on `multi_step_evolution` and the near-perfect scores on `executable_test_case` (0.998) and `legacy_single_step_fix` (0.985) indicate that the dataset teaches its target capabilities effectively at the 1.5B scale, with no sign of regression on independent code generation. Loss plateau from step ~5000 onward suggests the model is approaching capacity for what v1.7 contains, motivating the v1.8 expansion (capabilities the current dataset under-represents: long-context engineering, principal-level systems work, distributed systems debugging). **Reproducibility** The training notebook used for this validation is available at `processflow/FINETUNE_UNSLOTH_PROCESSFLOW.ipynb` in the source repository. The PFE-Eval scoring uses the public `processflow_eval/eval.jsonl` split released here, with `exact_match` and `tool_trace_match` scoring categories (LLM judge skipped for objectivity). ## ⭐ What this dataset contributes | Dimension | ProcessFlow | OpenCodeInstruct | SWE-bench | Magicoder | CodeUltraFeedback | |---|:-:|:-:|:-:|:-:|:-:| | **Sample count** | 128K | 5M | 2.3K | 75K | 10K | | **Formats** | **16** | 1 | 1 | 1 | 1 | | **Multi-verifier disagreement** | **✅ 28K samples** | ❌ | ❌ | ❌ | pairwise only | | **Agent tool traces** | **✅ 28K, up to ~1000 calls** | ❌ | partial | ❌ | ❌ | | **Process reasoning traces** | **✅ 5K with `<think>`** | ❌ | ❌ | ❌ | ❌ | | **Incident postmortems** | **✅ 4.5K** | ❌ | ❌ | ❌ | ❌ | | **Architecture decisions** | **✅ 5.5K** | ❌ | ❌ | ❌ | ❌ | Individual properties above — multi-turn tool use, preference pairs, code instructions, reasoning traces — exist in other public datasets. To our knowledge, ProcessFlow is the first to combine all of them in a unified training corpus at 100K+ sample scale. We will happily update this statement if pointed at a prior dataset with the same combination. ## 🧰 The 16 formats | # | Format | Count | What it teaches | |---|---|---:|---| | 1 | `single_step_fix` | 43,713 | Bug identification + diagnosis + fix pattern (single-turn) | | 2 | `agent_tool_trace` | 28,199 | Multi-turn tool-use across 8 canonical tools (read, write, edit, grep, glob, run, test, list) | | 3 | `multi_step_evolution` | 10,500 | Iterative problem solving — attempts, failures, lessons | | 4 | `debugging_session` | 8,400 | Investigation steps, hypotheses, red herrings, root cause | | 5 | `architecture_decision` | 5,500 | Options considered, trade-offs, chosen path, rationale | | 6 | `deep_reasoning_trace` | 5,145 | `<think>` blocks with wrong attempts + correct solution + meta lesson | | 7 | `pr_review_chain` | 4,879 | Multi-round code review with evolving diff | | 8 | `incident_postmortem` | 4,500 | Timeline, root cause, action items, lessons learned | | 9 | `refactoring_journey` | 3,339 | Code smells → step-by-step refactoring → metrics improvement | | 10 | `verifier_disagreement` | 2,907 | 3 independent verifiers reviewing the same solution, with explicit disagreement resolution | | 11 | `stack_trace_debug` | 2,500 | Raw error → reasoning chain → root cause → regression test | | 12 | `executable_test_case` | 2,480 | Buggy code → fixed code → test suite with execution results | | 13 | `deep_perf_investigation` | 2,435 | Baseline vs current metrics → tool outputs → root cause → final metrics | | 14 | `security_chain` | 1,500 | CWE → vulnerable code → exploit PoC → fix → defense in depth | | 15 | `performance_investigation` | 1,000 | Lightweight variant of deep_perf_investigation | | 16 | `migration_guide` | 900 | Multi-phase migration with before/after, breaking changes, rollback plan | ## 🤖 Agent tool trace pyramid The `agent_tool_trace` format contains a **full length pyramid from 3 to ~1000 tool calls per sample** — matching the distribution of real engineering work: | Tier | Tool calls | Samples | Typical use case | |---|---|---:|---| | T1 Short | 3-25 | 14,046 | Simple bug fix, small test, quick grep | | T2 Medium | 25-80 | 8,700 | Feature addition, PR review, security audit | | T3 Long | 80-200 | 4,000 | Monorepo API migration, TypeScript strict rollout | | T4 Marathon | 200-500 | 1,000 | Cross-service refactor, full test suite recovery | | T5 Frontier | 500-1000 | 200 | Multi-day incident investigation, codebase-wide type migration | The 500-1000 tool-call range (T4/T5) is rarely present in public code *training* datasets. Long-horizon agent *evaluations* at this scale have been reported in the literature — e.g., NL2Repo-Bench observed sessions with hundreds of interaction turns, and [Sinha et al. (2509.09677)](https://arxiv.org/abs/2509.09677) tested execution past 2,000 steps — but publicly downloadable training data stratified across T1-T5 tool-call counts is, as far as we can tell, not available elsewhere. This is the data used to train agents that sustain coherent work across long sessions. ## 🌍 Languages (top 15) | Language | Count | % | |---|---:|---:| | Python | 34,611 | 27.1% | | TypeScript | 15,373 | 12.0% | | Java | 15,176 | 11.9% | | Go | 12,704 | 9.9% | | Rust | 7,859 | 6.1% | | C/C++ | 4,665 | 3.6% | | Bash | 4,291 | 3.4% | | Ruby | 2,738 | 2.1% | | SQL | 2,606 | 2.0% | | YAML | 2,258 | 1.8% | | Kotlin | 2,057 | 1.6% | | Swift | 1,928 | 1.5% | | PHP | 1,549 | 1.2% | | Scala | 1,307 | 1.0% | | (28 more, incl. HCL, Haskell, Zig, Dart, Elixir...) | ~15K | ~12% | **43 languages total.** ## 🎚️ Difficulty distribution | Tier | Count | % | |---|---:|---:| | senior | 24,644 | 19.3% | | staff | 21,606 | 16.9% | | mid | 16,471 | 12.9% | | principal | 13,008 | 10.2% | | junior | 4,447 | 3.5% | | easy | 1,547 | 1.2% | | *(legacy samples, no explicit difficulty)* | 46,174 | 36.1% | Samples from batches 4–9 (the structured portion) have explicit difficulty. Legacy batches (cursor series, early batches) use the implicit difficulty of the code example itself. ## 📦 Data structure ### Files in this release ``` v1.7_release/ ├── train.jsonl # 108,537 samples (raw native schema) ├── val.jsonl # 9,605 samples ├── test.jsonl # 9,655 samples ├── bench_holdout.jsonl # 100 samples reserved for LongAgent-Bench ├── split_stats.json # distribution per format × language × difficulty ├── split_manifest.json # provenance: per-source-file split counts ├── contamination_report.json # 0.00% true contamination, method + details ├── README.md # this file └── training_format/ # 3 training-ready shapes ├── messages/ # standard chat-messages format — trl, axolotl, torchtune │ ├── train.jsonl │ ├── val.jsonl │ └── test.jsonl ├── alpaca/ # instruction/input/output legacy shape │ ├── train.jsonl │ ├── val.jsonl │ └── test.jsonl └── sharegpt/ # conversations format — compatible with common SFT frameworks ├── train.jsonl ├── val.jsonl └── test.jsonl ``` ### Raw schema (native) Each sample in the raw jsonl files has format-specific fields. Every sample includes these common fields: | Field | Type | Description | |---|---|---| | `id` | string | UUID (for newer batches) or synthesized id for legacy | | `format` | string | One of the 16 format names (null for legacy) | | `language` | string | Primary programming language | | `difficulty` | string | junior / mid / senior / staff / principal | | `bug_type` | string | Fine-grained bug classification (336 unique labels) | | `domain` | string | Application domain (payments, auth, etc.) | Format-specific fields are documented inline in each sample. See `split_stats.json` for complete field inventories. ### Training format schemas **`messages/` (standard chat-messages format):** ```json { "messages": [ {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."} ], "metadata": { "source_format": "single_step_fix", "language": "Python", "difficulty": "senior", "bug_type": "null-dereference", "domain": "payments", "source_id": "..." } } ``` For `agent_tool_trace` samples, `messages` is multi-turn with `role: tool` messages and `tool_calls` on assistant turns, following the standard function-calling / tool-use specification used by most chat-completion APIs. **`alpaca/` (legacy instruction/input/output):** ```json { "instruction": "...", "input": "", "output": "...", "metadata": {...} } ``` Multi-turn agent tool traces are flattened into the `output` field for compatibility. **`sharegpt/` (conversations format):** ```json { "conversations": [ {"from": "human", "value": "..."}, {"from": "assistant", "value": "..."} ], "metadata": {...} } ``` ## 🔍 Contamination check **Result: 0.00% true contamination** across HumanEval, MBPP, and SWE-bench Lite. | Benchmark | Problems | Raw matches (sim ≥ 0.85) | True positives | |---|---:|---:|---:| | HumanEval | 164 | 0 | 0 | | MBPP | 974 | 1 | 0 | | SWE-bench Lite | 300 | 0 | 0 | | **Total** | **1,438** | **1** | **0** | **Method**: Cosine similarity between 1024-dimensional code-text embeddings of ProcessFlow samples and benchmark problems. Threshold 0.85. Executed on an RTX PRO 6000 Blackwell. See `contamination_report.json` for the exact embedding model configuration used for reproducibility. **The single raw match** (`synthetic-030` vs `MBPP/332`, similarity 0.851) was manually reviewed and confirmed as a **false positive**: both samples implement character-frequency counting — a universal beginner Python exercise — but with different function names, different algorithms (`.get(key, 0)` pattern vs `if key in dict.keys()`), and different pedagogical purposes (bug-fix demonstration vs from-scratch function writing). Full details in `contamination_report.json`. **Training on this dataset and evaluating on HumanEval, MBPP, or SWE-bench Lite is methodologically sound.** ## 🧪 Data collection methodology ProcessFlow was generated through **LLM-assisted synthetic data generation** across 9 production batches, each with a detailed per-format prompt specification and post-hoc quality validation. ### Generation pipeline 1. **Prompt-driven synthesis**: Each format has a dedicated detailed prompt spec (`BATCH4_PROMPT.md` through `BATCH9_PROMPT.md`) describing field requirements, difficulty distribution, anti-template rules, and validation checklists. 2. **Structured requirements per batch tier**: Long-horizon tool trace batches (6-9) added progressively stricter requirements: - Batch 6 (30-80 calls): phase structure, checkpoints, state references, hypothesis tracking - Batch 7 (80-200 calls): + meta reflection, phase dependency graph, checkpoint deltas - Batch 8 (200-500 calls): + context compaction events, progress log, honest dead-end recovery, simulated timeline - Batch 9 (500-1000 calls): + context management strategy, milestones, machine-verifiable outcomes 3. **5 quality gates** (`validate_batch.py`) applied automatically to every sample: - **Gate A**: Language purity (naming conventions match declared language, no cross-language leaks) - **Gate B**: Cross-field consistency (bug type keywords present in code/error/context) - **Gate C**: Verifier template/alignment (anti-template phrase blocklist, verdict↔score consistency) - **Gate D**: Metrics realism (SEV/revenue/SLA consistency, plausible improvement ratios) - **Gate E**: Template artifacts (no unfilled placeholders, no f-string leaks, no cross-language test output) 4. **Repair pipeline** (`repair_rejected.py`) for rejected samples, combining deterministic auto-fixes (language relabel, f-string escape stripping, SLA flag correction) with targeted LLM regeneration of broken fields. 5. **Merge pipeline** (`merge_clean.py`) that re-validates every sample against the gates before inclusion in the release. ### Final yield Starting from ~150K raw samples across 9 batches, the pipeline produced **127,897 validated samples** after: - Gate rejection: ~11K initial failures - Repair recovery: ~3.9K rescued via LLM regeneration - Final merge: 127,897 unique samples passing all 5 gates ### Stratified deterministic splits Train/val/test allocation uses a **deterministic hash-based split** (`sha256(sample_id) mod 1000`) stratified by (format × language × difficulty). Same sample always lands in the same split across regenerations. 100 samples are reserved as `bench_holdout` for [LongAgent-Bench](https://github.com/caiovicentino/longagent-bench) evaluation. ## 🎯 Intended use **Primary use case**: Supervised fine-tuning (SFT) of code-focused LLMs and agents, particularly: 1. **Tool-using code agents** that sustain multi-turn investigations (batch 6-9 long-horizon data) 2. **Process-aware reasoning** — models trained on `<think>` blocks, wrong attempts, red herrings 3. **Verifier-aware training** — using the 2,907 `verifier_disagreement` samples as preference/reward signal 4. **Long-horizon continuation** — models learning to manage context across hundreds of tool calls 5. **Multi-task engineering skill** — postmortem writing, refactoring journeys, architecture decisions as additional capabilities beyond single-turn code generation ### Recommended training mixes - **General code agent**: Use all 3 `messages/` splits directly. - **Long-horizon specialist**: Filter `agent_tool_trace` samples at T3-T5 tiers (~5,200 samples with 80+ tool calls). Combine with a base code generation mix. - **Reasoning specialist**: Filter `deep_reasoning_trace` + `debugging_session` + `verifier_disagreement` (~16K samples). - **Process writer specialist**: `incident_postmortem` + `refactoring_journey` + `architecture_decision` + `migration_guide` (~14.2K samples). ### Evaluation recommendations Evaluate on any of: - **HumanEval / MBPP / SWE-bench Lite** — methodologically sound (0.00% contamination verified) - **LongAgent-Bench** — included in `bench_holdout.jsonl` (100 tasks stratified across 5 tiers) - **HumanEval+, LiveCodeBench, Code-Contests** — not verified for contamination (recommend an independent scan before publishing results) ## ⚠️ Limitations and known biases 1. **Synthetic origin**: All samples are generated via LLM-assisted tooling. Despite extensive quality gating, synthetic data has known distributional quirks compared to real engineering work. 2. **Language skew**: Python dominates (~27%). If you need high coverage of rare languages (Lisp, Erlang, F#, etc.) this dataset has partial coverage only. 3. **Difficulty skew in legacy batches**: The ~46K "legacy" samples (primarily `single_step_fix` from early cursor batches) do not have explicit difficulty labels and tend toward mid/senior level. Filter by the structured batches (4-9) if you need precise difficulty control. 4. **English-only task descriptions**: All task statements and explanations are in English. Code is language-native but documentation is English. 5. **No real-execution verification**: Samples contain `test_suite` fields with test code and claimed execution outputs, but tests were not actually executed end-to-end in a real runtime for most samples. For the `executable_test_case` format (~2.5K samples), execution outputs are simulated based on structural correctness — treat them as pedagogical examples rather than verified pass/fail signals. 6. **Simulated tool outputs**: Agent tool traces contain realistic but synthetic tool outputs. No samples were recorded from a real sandboxed agent session. 7. **Dead end quality varies**: In marathon traces (batch 8-9), "dead end" investigations are intended to be intellectually honest. Spot-checking confirms most are, but some may feel constructed to a careful reader. 8. **Known quality gates passed, but not hand-audited at scale**: Every sample passed 5 automated quality gates, and ~1,000 were spot-checked by hand. The remaining 126,897 are gated but not hand-verified. Use the `metadata.source_format` to filter out formats you don't trust for your use case. 9. **Some `single_step_fix` samples mirror common beginner exercises**: As shown in the contamination check (false positive on MBPP/332 char frequency), some legacy samples target universal beginner tasks. These are legitimate pedagogical content but won't teach advanced skills. ## 📜 License Apache 2.0 ## 🔗 Related work - **LongAgent-Bench** — companion evaluation suite with 100 tasks across 5 tool-call length tiers (T1: 5-25 calls → T5: 500-1000 calls). The 100 bench tasks are held out from this dataset (see `bench_holdout.jsonl`). - **validate_batch.py** — the 5 quality gates used to validate every sample. - **repair_rejected.py** — repair pipeline for gate-rejected samples. - **merge_clean.py** — final merge + re-validation before release. ## 📖 Citation If you use ProcessFlow in research, please cite: ```bibtex @dataset{processflow_v1_7_2026, title = {ProcessFlow : A Multi-Format Process-Centric Code Dataset for Training LLM Agents}, author = {Vicentino, Caio}, year = {2026}, note = {128K samples across 16 formats with agent tool traces from 3 to 1000 tool calls per sample}, url = {https://huggingface.co/datasets/caiovicentino1/processflow} } ``` ## 🙏 Acknowledgements Generation tooling, validation, and repair pipelines developed iteratively over April 2026. Grateful to the open-source community behind `huggingface_hub`, `sentence-transformers`, `datasets`, and the authors of HumanEval, MBPP, and SWE-bench Lite for providing the evaluation baselines used in the contamination check.

提供机构：

caiovicentino1

搜集汇总

数据集介绍

构建方式

在代码生成与智能体研究领域，ProcessFlow数据集通过精心设计的LLM辅助合成流程构建而成。该流程涵盖九个生产批次，每个批次均配有详细的格式规范提示，并实施了严格的质量验证机制。数据生成后，系统自动执行五道质量关卡，涵盖语言纯度、跨字段一致性及模板伪影检测等方面，对未通过样本启动包含确定性修复与针对性LLM再生成的修复管道。最终，基于样本ID的确定性哈希分层策略，将经过验证的样本划分为训练、验证与测试集，确保了数据划分的稳定性与可复现性。

特点

ProcessFlow数据集的核心特征在于其过程中心的多元化设计，突破了传统代码数据集中单一问题-解决方案的范式。该数据集囊括了十六种独特的工程活动格式，如多轮工具调用轨迹、深度推理追踪、事件事后分析等，旨在培养智能体在长周期任务中维持连贯工作的能力。其显著特色包括覆盖四十三种编程语言的广泛语言支持、从初级到首席工程师的明确难度分层，以及包含多达近千次工具调用的超长序列样本，为训练具备复杂工程推理与持续交互能力的代码智能体提供了丰富且结构化的资源。

使用方法

该数据集主要适用于代码导向的大型语言模型与智能体的监督式微调。研究者可直接使用其提供的三种训练就绪格式——标准聊天消息格式、Alpaca指令格式及ShareGPT对话格式——进行模型训练。针对不同的研究重点，可对数据进行筛选混合，例如，专注于长周期任务的智能体可筛选工具调用次数超过八十次的样本，而侧重于推理能力的模型则可组合深度推理追踪与调试会话等格式。评估方面，由于已严格验证其与HumanEval等基准测试零污染，可安全用于相关代码生成能力的评测，同时数据集内预留的基准测试保留集也为长周期智能体评估提供了专用任务。

背景与挑战

背景概述

在人工智能与软件工程交叉领域，代码生成任务正从单一结果输出向复杂过程理解演进。ProcessFlow数据集于2026年由研究团队构建，旨在突破传统代码数据集的局限，将编程视为包含调试、架构决策、事后分析等16种工程活动的完整流程。该数据集通过12.8万条多格式样本，覆盖43种编程语言，首次在十万级规模上整合了多轮工具调用、过程推理轨迹与验证器分歧等维度，为训练能够维持长周期连贯工作的智能体奠定了数据基础，推动了代码生成模型向真实生产环境工程能力的跨越。

当前挑战

该数据集致力于解决长周期代码生成智能体训练的核心挑战，即如何让模型在数百步工具调用中保持上下文连贯性，并掌握调试、重构等非线性工程推理能力。构建过程中的挑战主要体现在多维度数据合成与质量控制：需在合成生成中避免模板化表达，确保16种活动格式的语义真实性；同时通过五层质量门控系统验证语言纯度、字段一致性等指标，并对长达千步的工具轨迹进行分层采样，以匹配真实工程任务的时长分布，最终在合成数据与工程实践间取得平衡。

常用场景

经典使用场景

在代码生成与智能体研究领域，ProcessFlow数据集被广泛应用于训练能够执行长时程、多步骤工程任务的代码智能体。其核心价值在于将代码生成视为一个动态过程，而非单一输出，通过涵盖调试会话、架构决策、工具调用轨迹等16种工程活动，为模型提供了从问题识别到解决方案演进的完整上下文。这一设计使得模型能够学习在复杂、迭代的软件工程场景中维持连贯的推理与行动，例如在长达数百步的工具调用序列中保持目标一致性，从而弥合了单轮代码生成与实际生产工程之间的能力鸿沟。

衍生相关工作

该数据集已衍生出多项聚焦长时程代码智能体的前沿研究。例如，基于其工具调用金字塔结构，研究者开发了LongAgent-Bench评估框架，专门测试模型在超长交互序列中的稳定性与目标保持能力。同时，其多验证器分歧数据被用于训练偏好对齐模型，以提升代码生成的可信度与鲁棒性。此外，结合‘深度推理轨迹’格式的工作探索了显式思维链标注如何增强模型对复杂逻辑错误的诊断能力，推动了过程监督在代码领域的算法创新。

数据集最近研究