KRLabsOrg/tool-output-extraction-swebench

Name: KRLabsOrg/tool-output-extraction-swebench
Creator: KRLabsOrg
Published: 2026-04-12 14:30:02
License: apache-2.0

Source: Hugging Face; updated 2026-04-12; indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/KRLabsOrg/tool-output-extraction-swebench


Description:

---
language:
- en
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- token-classification
tags:
- tool-output
- code
- swe-bench
- distillation
- agent
- context-compression
- context-pruning
---

# Tool Output Extraction Dataset

[**Paper**](https://huggingface.co/papers/2604.04979) | [**Code**](https://github.com/KRLabsOrg/squeez)

Training data for [**squeez**](https://github.com/KRLabsOrg/squeez), a small model that prunes verbose coding agent tool output to only the evidence the agent needs next.

## Task

**Task-conditioned context pruning of a single tool observation for coding agents.**

Given a focused extraction query and one verbose tool output, return the smallest verbatim evidence block(s) the agent should read next.

The model copies lines from the tool output; it never rewrites, summarizes, or invents content. Every line in the target exists verbatim in the source.

## Dataset Summary

| | Train | Dev | Test | Total |
|---|---:|---:|---:|---:|
| **Samples** | 10,508 | 240 | 618 | 11,366 |

**Data sources:**

| Source | Samples | Description |
|--------|--------:|-------------|
| SWE-bench real data | 9,205 | Real tool output from commands executed on cloned Python repos |
| Synthetic multi-ecosystem | 1,697 | LLM-generated tool output for JS, Rust, Go, Docker, etc. |
| Synthetic negatives | 575 | Mismatched task+output pairs where nothing is relevant |

The per-source counts sum to 11,477, which is 111 more than the split total. That difference equals the number of samples excluded during test set curation (Step 8), which suggests the source counts predate that filtering.

## Why This Dataset Exists

LLM coding agents waste 80–95% of context tokens on irrelevant tool output. When an agent reads a 500-line file to find one function, or runs `pytest` to find a failing test, most of the output is noise. Squeez trains small models to compress this output before it enters the agent's context window.

### Why v3?

The previous dataset (v2) had three problems:

1. **Line-number leakage.** The synthetic data generation pipeline showed numbered output to the teacher LLM, and the teacher leaked those numbers into the target annotations.
   87% of synthetic response lines contained line-number prefixes (`2: npm WARN...`) that did not exist in the raw tool output. This trained the model to hallucinate formatting.
2. **No canonical truth format.** The dataset stored XML-wrapped ChatML prompts as the ground truth. This tangled the benchmark representation with one specific model's training format, making cross-model evaluation fragile.
3. **Task drift.** The labeling task drifted from "context pruning for an agent" toward "recover one arbitrary teacher-selected subset." Queries were either too vague (full issue descriptions) or too literal (grep-able string searches), neither matching the actual product use case.

v3 fixes all three by introducing a canonical span-based representation, regenerating all labels with a focus on verbatim extraction, and manually curating the test set.

## How The Data Was Created

### Step 1: Generate Tool Calls (SWE-bench)

Starting from 2,294 instances in the [SWE-bench](https://github.com/princeton-nlp/SWE-bench) test split, we simulated what a coding agent would do for each issue: read source files, grep for symbols, run tests, check git history, install packages. We generated 3–7 tool calls per instance, weighted toward common agent actions (read_file 28%, grep 18%, test_output 8%, etc.).

### Step 2: Execute on Real Repos

Every tool call was executed against the actual repository checked out at the base commit (before the fix). `git grep` ran real searches, `pytest` ran real tests, `pip install -e .` did real installs. This produces authentic tool output with real file paths, error messages, formatting, and noise, which are things an LLM cannot reliably generate from scratch.

### Step 3: Generate Focused Queries

For each (issue, tool_output) pair, a teacher LLM generated a focused extraction query: a short, concrete request for evidence rather than a full issue description or a literal grep pattern.

Good queries: *"Find the traceback block that explains the import failure"*, *"Find the diff hunk that changes CSV parsing"*

Bad queries: *"Fix the bug"*, *"Find all lines containing 'raise AttributeError'"*

### Step 4: Label Gold Spans

A teacher LLM selected gold spans, that is, contiguous blocks of lines in the raw tool output that answer the extraction query. The teacher saw numbered output as a reference interface, but the canonical target maps line numbers back to the raw unnumbered text. Every target line must exist verbatim in the source.

The canonical representation is a list of `{start_line, end_line}` spans over the raw tool output. XML wrappers, ChatML formatting, and line number prefixes only appear in derived training files, never in the ground truth.

### Step 5: Synthetic Multi-Ecosystem Data

SWE-bench only covers Python repositories, but real coding agents work across all ecosystems. We generated synthetic data for 15 tool types that SWE-bench cannot provide: `npm_install`, `npm_build`, `tsc`, `eslint`, `cargo_build`, `go_build`, `mvn_gradle`, `make_cmake`, `docker_build`, `docker_logs`, `terraform`, `kubectl`, `pip_install`, `mypy_pyright`, and `curl`.

**Two-pass generation:**

1. An LLM generates a realistic task description and tool output in XML markers. Each tool type has a config with scenarios (e.g., "peer dependency conflict", "missing native module") and seed examples guiding realistic formatting.
2. Given the numbered output, a second LLM call selects relevant line numbers as JSON. These are mapped back to raw unnumbered source lines, validated (must exist verbatim, must be non-empty, reasonable compression ratio), and stored as canonical spans.

### Step 6: Hard Negatives

To teach the model when to output nothing, 575 samples use intentionally mismatched task+output pairs, for example a React authentication task paired with a Rust borrow-checker error. The correct answer is an empty extraction.

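The mapping from teacher-selected line numbers to canonical spans (Steps 4 and 5) might look roughly like the sketch below. It is an illustration, not the authors' pipeline: the function names `lines_to_spans` and `validate_spans`, the sample output, and the `max_ratio` default (borrowed from the 60% review threshold mentioned in Step 7) are all assumptions.

```python
from typing import Dict, List


def lines_to_spans(selected: List[int]) -> List[Dict[str, int]]:
    """Merge 1-indexed line numbers into contiguous {start_line, end_line} spans."""
    spans: List[Dict[str, int]] = []
    for n in sorted(set(selected)):
        if spans and n == spans[-1]["end_line"] + 1:
            spans[-1]["end_line"] = n  # extend the current contiguous block
        else:
            spans.append({"start_line": n, "end_line": n})
    return spans


def validate_spans(tool_output: str, spans: List[Dict[str, int]],
                   max_ratio: float = 0.6) -> bool:
    """Hypothetical checks: spans in range, not blank, and not overly broad."""
    lines = tool_output.splitlines()
    total = len(lines)
    if total == 0:
        return False
    kept = 0
    for span in spans:
        start, end = span["start_line"], span["end_line"]
        if not (1 <= start <= end <= total):
            return False  # span points outside the raw output
        if not any(line.strip() for line in lines[start - 1:end]):
            return False  # span covers only blank lines
        kept += end - start + 1
    return kept / total <= max_ratio


# Made-up tool output, for illustration only.
raw = (
    "npm WARN deprecated foo\n"
    "npm ERR! code ERESOLVE\n"
    "npm ERR! peer dep conflict\n"
    "added 3 packages"
)
spans = lines_to_spans([3, 2])
print(spans)                       # [{'start_line': 2, 'end_line': 3}]
print(validate_spans(raw, spans))  # True
print(lines_to_spans([]))          # [] (a hard negative yields no spans)
```

Because spans index into the raw text rather than copying it, the extracted lines are verbatim by construction, and an empty span list represents the hard-negative case directly.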
### Step 7: Quality Filtering and Assembly

Samples with empty spans (where the teacher found nothing relevant) were capped at 10% per tool type. Overly broad annotations (>60% of lines selected) were reviewed. Train/dev/test splits were assembled:

- **SWE-bench**: split by repository (test: xarray, flask; dev: requests; train: all others)
- **Synthetic**: split per tool type (10% test, 5% dev, 85% train)
- **Negatives**: capped at ~10% of positives per tool type in test

### Step 8: Test Set Curation

The held-out test set was manually reviewed and 111 samples were excluded:

- **Near-duplicate `np.unicode_` errors (63)**: The xarray repo on NumPy 2.0 produces the same `AttributeError: np.unicode_ was removed` on every `import xarray`. These identical errors across 65 different xarray instances were deduplicated to 2 representative samples.
- **Trivial tiny outputs (39)**: Samples with 1–2 line output (e.g., lint "All checks passed!", "Python 3.12.9", single-line curl errors). With nothing to filter, these are not a meaningful benchmark.
- **Overly broad spans (5)**: Samples selecting >50% of a large output, or spanning the entire top half of a file.
- **Wrong annotations (4)**: Mislabeled tool types, spans pointing to wrong content, or vague queries without task context.

The exclusion list is tracked in `test_exclusions.json` with per-sample reasons.

## Formats

The dataset ships in three parallel formats, all derived from the same canonical spans:

### Canonical (`canonical_train/dev/test.jsonl`)

The source of truth. Model-agnostic, no XML, no formatting artifacts.

```json
{
  "instance_id": "django__django-11270",
  "source": "swe",
  "tool_type": "read_file",
  "query": "Find the code block that validates the referer in CsrfViewMiddleware",
  "background_task": "Fix CSRF validation bug when referer URL contains port number...",
  "tool_output": "raw output exactly as shown to the agent",
  "gold_spans": [
    {"start_line": 41, "end_line": 52}
  ],
  "is_irrelevant": false,
  "command": "django/middleware/csrf.py"
}
```

- `gold_spans` reference 1-indexed line numbers in `tool_output`
- `is_irrelevant: true` means no lines are relevant (hard negative)
- `query` is the focused extraction request; `background_task` is the full issue for provenance

### Generative / Qwen (`train/dev/test.jsonl`)

ChatML-formatted for SFT training with Qwen or similar models.

```json
{
  "prompt": "<|im_start|>system\nYou prune verbose tool output...<|im_end|>\n<|im_start|>user\n<query>\nFind the code block...\n</query>\n<tool_output>\n1: class CsrfViewMiddleware:\n2: def _check_referer(self, request):\n...\n</tool_output><|im_end|>\n<|im_start|>assistant\n",
  "response": "<relevant_lines>\n41: referer = request.META.get('HTTP_REFERER')\n42: if referer is None:\n...\n</relevant_lines>",
  "metadata": {
    "instance_id": "django__django-11270",
    "tool_type": "read_file",
    "source": "swe",
    "num_total_lines": 84,
    "num_relevant_lines": 12,
    "compression_ratio": 0.857
  }
}
```

### Encoder (`encoder_train/dev/test.jsonl`)

For token/line classification models (mmBERT, etc.).

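To show how the canonical spans relate to the derived formats, here is a minimal sketch that reads a canonical-style record, recovers the verbatim gold lines, and builds per-line binary labels. The helper names are hypothetical, the 84-line record is made up, and the per-line label layout is only an assumption about what an encoder target could look like, since the encoder file schema is not described here. The `compression_ratio` formula (`1 - relevant / total`) is inferred from the metadata example above, where 12 of 84 lines gives 0.857.

```python
from typing import Dict, List


def extract_gold_lines(record: Dict) -> List[str]:
    """Return the verbatim lines covered by gold_spans (1-indexed, inclusive)."""
    if record.get("is_irrelevant"):
        return []  # hard negative: nothing to extract
    lines = record["tool_output"].splitlines()
    out: List[str] = []
    for span in record["gold_spans"]:
        out.extend(lines[span["start_line"] - 1:span["end_line"]])
    return out


def line_labels(record: Dict) -> List[int]:
    """One 0/1 label per line; a hypothetical encoder-style target."""
    n = len(record["tool_output"].splitlines())
    labels = [0] * n
    if record.get("is_irrelevant"):
        return labels
    for span in record["gold_spans"]:
        for i in range(span["start_line"] - 1, min(span["end_line"], n)):
            labels[i] = 1
    return labels


def compression_ratio(num_total: int, num_relevant: int) -> float:
    """Fraction of lines pruned away (formula inferred from the metadata example)."""
    return round(1 - num_relevant / num_total, 3)


# Made-up 84-line output standing in for a real tool_output.
record = {
    "tool_output": "\n".join(f"line {i}" for i in range(1, 85)),
    "gold_spans": [{"start_line": 41, "end_line": 52}],
    "is_irrelevant": False,
}
gold = extract_gold_lines(record)
print(len(gold), gold[0], gold[-1])      # 12 line 41 line 52
print(sum(line_labels(record)))          # 12
print(compression_ratio(84, len(gold)))  # 0.857
```

Keeping spans as the single source of truth means any model-specific format can be regenerated from the same record without re-labeling.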
## Tool Types

27 tool types across multiple ecosystems:

| Ecosystem | Tool types | Source |
|-----------|-----------|-------|
| **Python** | read_file, grep, python, test_output, type_check, coverage, lint_output, build_output | SWE-bench |
| **Python** | pip_install, curl | SWE-bench + synthetic |
| **Git** | git_log, git_diff, git_blame, ls | SWE-bench |
| **JavaScript/TypeScript** | npm_install, npm_build, tsc, eslint | Synthetic |
| **Rust** | cargo_build | Synthetic |
| **Go** | go_build | Synthetic |
| **Java** | mvn_gradle | Synthetic |
| **C/C++** | make_cmake | Synthetic |
| **Infrastructure** | docker_build, docker_logs, terraform, kubectl | Synthetic |
| **Python (type checking)** | mypy_pyright | Synthetic |

## Splits

**SWE-bench data** is split by repository (zero instance overlap):

- **Test**: `pydata/xarray`, `pallets/flask`
- **Dev**: `psf/requests`
- **Train**: all others (django, sympy, scikit-learn, sphinx, matplotlib, pytest, astropy, pylint, seaborn)

**Synthetic data** is split per tool type: 10% test, 5% dev, 85% train.

Hard negatives are capped at ~10% per tool type in test.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("KRLabsOrg/tool-output-extraction-swebench")

# Generative training splits
print(ds)
# DatasetDict({
#     train: Dataset({features: ['prompt', 'response', 'metadata'], num_rows: 10508})
#     dev: Dataset({features: ['prompt', 'response', 'metadata'], num_rows: 240})
#     test: Dataset({features: ['prompt', 'response', 'metadata'], num_rows: 618})
# })
```

## Citation

```bibtex
@misc{kovács2026squeeztaskconditionedtooloutputpruning,
  title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Ádám Kovács},
  year={2026},
  eprint={2604.04979},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.04979},
}
```
