KRLabsOrg/tool-output-extraction-swebench-gliner
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KRLabsOrg/tool-output-extraction-swebench-gliner
Description:
---
license: apache-2.0
task_categories:
- token-classification
language:
- en
tags:
- information-extraction
- ner
- gliner
- extractive-qa
- coding-agents
- tool-output
- context-pruning
size_categories:
- 10K<n<100K
---
# Tool Output Extraction (extractive / GLiNER2 format)
Extractive variant of [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench), formatted for fine-tuning span-extraction models ([GLiNER2](https://github.com/fastino-ai/GLiNER2), BERT-for-QA, etc.).
Each tool observation from the parent dataset is chunked into ~400-token windows (preserving line boundaries) so it fits into encoder-style models with a 512-token context. The query is concatenated in front of each chunk, extractive-QA style, and gold evidence is mapped to verbatim spans within the tool-output portion. Chunks with no evidence become natural negative examples.
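
The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual pipeline: token counts are approximated by whitespace splitting (the real tokenizer is not stated here), and `chunk_by_lines` and `max_tokens` are hypothetical names.

```python
def chunk_by_lines(tool_output: str, max_tokens: int = 400) -> list[str]:
    """Greedily pack whole lines into chunks of at most ~max_tokens tokens.

    Token counts are approximated with whitespace splitting (an assumption).
    A single line longer than the budget becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for line in tool_output.splitlines(keepends=True):
        n_tokens = len(line.split())
        # Start a new chunk when adding this line would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n_tokens
    if current:
        chunks.append("".join(current))
    return chunks
```

Because lines are never split, concatenating the chunks reproduces the original tool output exactly, which keeps evidence spans verbatim.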
## Format
Each row is a GLiNER2 training record with the query concatenated into the input:
```json
{
  "input": "Query: Find the code block...\n\nTool output:\n193: ...\n194: ...\n...\n233: columns = []\n...",
  "output": {
    "entities": {"RELEVANT": ["233: columns = []\n234: for col in data.columns:\n..."]}
  },
  "meta": {
    "instance_id": "astropy__astropy-12544",
    "source": "swe",
    "tool_type": "read_file",
    "query": "Find the code block in read_table_fits...",
    "chunk_index": 7,
    "total_chunks": 15,
    "has_evidence": true,
    "chunk_start_line": 193,
    "chunk_end_line": 233
  }
}
```
Design choices:
- **Query concatenation.** The query is prepended as `Query: ...\n\nTool output:\n<chunk>` so the model conditions on it directly, like an extractive-QA model. This avoids relying on GLiNER2's per-type `entity_descriptions` for per-example queries.
- **Single entity type `RELEVANT`.** All examples share the same type; the task-specific signal comes from the query in the input.
- **Verbatim spans.** Every entity mention is a verbatim substring of `input`, validated with GLiNER2's `InputExample.validate()`.
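
The design choices above can be illustrated with a small sketch that assembles one record and keeps only spans found verbatim in the chunk. The helper name `build_record` is hypothetical. The dataset itself validates with GLiNER2's `InputExample.validate()`, so the plain substring check here is a stand-in, and representing a negative chunk as an empty `RELEVANT` list is an assumption.

```python
def build_record(query: str, chunk: str, evidence_spans: list[str]) -> dict:
    """Assemble one training record and drop spans that are not verbatim."""
    # Query is prepended extractive-QA style, as described in the card.
    text = f"Query: {query}\n\nTool output:\n{chunk}"
    # Only accept non-empty spans that occur verbatim in the tool-output portion.
    kept = [s for s in evidence_spans if s and s in chunk]
    return {
        "input": text,
        # Single entity type; an empty list marks a negative chunk (assumption).
        "output": {"entities": {"RELEVANT": kept}},
        "meta": {"query": query, "has_evidence": bool(kept)},
    }
```

Checking spans against `chunk` rather than the full `input` prevents a span from accidentally matching text inside the query.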
## Splits
| Split | Chunks | Positive | Negative | Source examples |
|-------|-------:|---------:|---------:|----------------:|
| train | 51,917 | 17,450 | 34,467 | 10,508 |
| dev | 2,579 | 422 | 2,157 | 240 |
| test | 9,595 | 1,090 | 8,505 | 618 |
Negatives in the train split are subsampled (30% kept) to limit class imbalance. Dev and test preserve the natural distribution.
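
One way to reproduce this kind of negative subsampling is sketched below. It is an assumption that each negative is kept independently with probability 0.3; the card does not say whether an exact fraction or a per-record draw was used. `subsample_negatives`, `keep_ratio`, and `seed` are hypothetical names.

```python
import random


def subsample_negatives(records: list[dict], keep_ratio: float = 0.3,
                        seed: int = 0) -> list[dict]:
    """Keep every positive record and a random share of negative records."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    kept = []
    for rec in records:
        # Positives are always kept; negatives survive with probability keep_ratio.
        if rec["meta"]["has_evidence"] or rng.random() < keep_ratio:
            kept.append(rec)
    return kept
```

Applying this only to the train split leaves dev and test at their natural positive/negative ratio, so evaluation reflects the class balance seen at inference.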
## Usage with GLiNER2
```python
import json

from gliner2.training.data import InputExample, TrainingDataset


def load_split(path):
    """Read a JSONL split and wrap each record in a GLiNER2 InputExample."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            d = json.loads(line)
            examples.append(InputExample(
                text=d["input"],  # query + tool-output chunk
                entities=d["output"]["entities"],  # {"RELEVANT": [spans...]}
            ))
    return TrainingDataset(examples=examples)


train_ds = load_split("gliner_train.jsonl")
dev_ds = load_split("gliner_dev.jsonl")
```
At inference, format new inputs the same way:
```python
query = "Find the failing test block"
with open("pytest_output.txt", encoding="utf-8") as f:
    chunk = f.read()
text = f"Query: {query}\n\nTool output:\n{chunk}"
# model.extract_entities(text, entity_types=["RELEVANT"]) -> list of verbatim spans
```
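
Because predicted spans are verbatim substrings, they can be mapped back to whole lines of the chunk, for example to prune a tool output down to its relevant lines. The sketch below is one possible post-processing step, not part of the dataset or of GLiNER2. It assumes the predictions have already been reduced to a plain list of strings, and `keep_relevant_lines` is a hypothetical helper.

```python
def keep_relevant_lines(chunk: str, spans: list[str]) -> str:
    """Return only the lines of `chunk` that overlap at least one span."""
    # Locate each span's character range; spans not found verbatim are skipped.
    # Only the first occurrence of each span is considered.
    ranges = []
    for span in spans:
        start = chunk.find(span)
        if span and start != -1:
            ranges.append((start, start + len(span)))

    kept, offset = [], 0
    for line in chunk.splitlines(keepends=True):
        line_start, line_end = offset, offset + len(line)
        # Half-open interval overlap test between the line and any span.
        if any(s < line_end and line_start < e for s, e in ranges):
            kept.append(line)
        offset = line_end
    return "".join(kept)
```

Run it on the chunk alone, not on the full `text`, so the `Query:` prefix never ends up in the pruned output.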
## Source
Generated from [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (11,477 examples, 27 tool types, derived from SWE-bench repositories and synthetic multi-ecosystem observations). See the [paper](https://arxiv.org/abs/2604.04979) for construction details.
## Citation
```bibtex
@misc{kovács2026squeeztaskconditionedtooloutputpruning,
  title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Ádám Kovács},
  year={2026},
  eprint={2604.04979},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.04979},
}
```
## License
Apache 2.0
Provider:
KRLabsOrg



