GeniusHTX/SWE-Skills-Bench
Source: Hugging Face | Updated: 2026-03-22 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/GeniusHTX/SWE-Skills-Bench
Resource description:
---
dataset_info:
  features:
  - name: skill_id
    dtype: string
  - name: name
    dtype: string
  - name: description
    dtype: string
  - name: type
    dtype: string
  - name: task_prompt
    dtype: string
  - name: skill_document
    dtype: string
  - name: test_code
    dtype: string
  - name: repo_url
    dtype: string
  - name: repo_commit
    dtype: string
  - name: docker_image
    dtype: string
  splits:
  - name: train
    num_examples: 49
configs:
- config_name: default
  data_files:
  - split: train
    path: swe_skills_bench.jsonl
language:
- en
license: mit
task_categories:
- text-generation
tags:
- code
- software-engineering
- benchmark
- agents
- skill-injection
pretty_name: SWE-Skills-Bench
size_categories:
- n<1K
---
**Dataset Summary**
SWE-Skills-Bench is a benchmark dataset for evaluating whether injected skill documents — structured packages of procedural knowledge — measurably improve LLM agent performance on real-world software engineering tasks.
The dataset contains 49 skills spanning 565 task instances across six software engineering domains (Deployment & DevOps, Analytics & Monitoring, API Development, Data Science & ML, Security & Testing, and Developer Tools). Each skill is grounded in an authentic GitHub repository at a fixed commit, paired with a curated skill document and a deterministic pytest test suite that encodes the task's acceptance criteria.
The dataset is designed to answer: *Does giving an agent a skill document actually help?* The primary evaluation metric is pytest pass rate, measured under two conditions — with and without skill injection — to compute a pass-rate delta (ΔP) per skill.
The dataset was released as part of [SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?](https://arxiv.org/abs/2603.15401).
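The pass-rate delta described in the summary can be illustrated with a short sketch. It is not the benchmark's official scoring code: the `RunResult` container, the pooling of test counts across runs, and the sign convention (with-skill minus without-skill) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one pytest run (hypothetical container, not a dataset field)."""
    passed: int  # number of tests that passed
    total: int   # number of tests collected


def pass_rate(runs: list[RunResult]) -> float:
    """Fraction of passing tests pooled over all runs; 0.0 if nothing was collected."""
    total = sum(r.total for r in runs)
    if total == 0:
        return 0.0
    return sum(r.passed for r in runs) / total


def pass_rate_delta(with_skill: list[RunResult], without_skill: list[RunResult]) -> float:
    """Assumed sign convention: delta = P(with skill) - P(without skill)."""
    return pass_rate(with_skill) - pass_rate(without_skill)


if __name__ == "__main__":
    with_skill = [RunResult(passed=9, total=10)]
    without_skill = [RunResult(passed=7, total=10)]
    print(f"delta_p = {pass_rate_delta(with_skill, without_skill):+.2f}")
```

Under this convention a positive value means the skill document helped on that skill, and a negative value means it hurt.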
**Dataset Structure**
Each SWE-Skills-Bench datum has the following fields:
```
skill_id: (str) - Unique skill identifier, e.g. "fix", "tdd-workflow".
name: (str) - Human-readable task name.
description: (str) - One-line description of the task.
type: (str) - Task category, e.g. "repair", "feature", "fix".
task_prompt: (str) - Full task prompt passed to the agent (Markdown).
skill_document: (str) - Curated skill document injected as agent context (Markdown).
test_code: (str) - Pytest test suite used to evaluate the agent's output.
repo_url: (str) - Target GitHub repository URL.
repo_commit: (str) - Fixed commit hash for reproducibility.
docker_image: (str) - Pre-configured Docker image for the evaluation environment.
```
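As a minimal sketch of how the fields above might be checked when reading the data, the following parses JSON Lines text and verifies that each record carries all ten fields as strings. The helper name `parse_records` is hypothetical, and the string-only check is an assumption drawn from the `dtype: string` entries in the card metadata; the actual file may differ.

```python
import json
from typing import Iterable, Iterator

# Field names taken from the listing above; every field is documented as a string.
FIELDS = (
    "skill_id", "name", "description", "type", "task_prompt",
    "skill_document", "test_code", "repo_url", "repo_commit", "docker_image",
)


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one validated record per non-empty JSONL line."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [f for f in FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        wrong = [f for f in FIELDS if not isinstance(record[f], str)]
        if wrong:
            raise ValueError(f"line {lineno}: non-string fields {wrong}")
        yield record
```

With a local copy of `swe_skills_bench.jsonl`, passing the open file handle to `parse_records` should yield 49 records if the file matches the `num_examples` value stated in the card metadata.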
**Supported Tasks**
SWE-Skills-Bench proposes a paired evaluation task: given a task prompt (with or without an injected skill document), an agent must complete a software engineering task on a real codebase. Correctness is verified by running the associated pytest test suite inside a Docker container.
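The two evaluation conditions can be sketched as a pair of prompts built from one record. The function name `build_paired_prompts` and the way the skill document is prepended under a heading are assumptions for illustration; the card does not specify the actual injection format.

```python
def build_paired_prompts(record: dict) -> dict:
    """Return the agent prompt for each evaluation condition.

    Assumption: the skill document is prepended to the task prompt under a
    heading. The real injection format is not described in the dataset card.
    """
    task = record["task_prompt"]
    skill = record["skill_document"]
    return {
        "without_skill": task,
        "with_skill": f"# Skill document\n\n{skill}\n\n# Task\n\n{task}",
    }
```

Each prompt would then be given to the agent in a fresh checkout of `repo_url` at `repo_commit`, and the resulting code scored with the record's `test_code` inside the `docker_image` environment.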
Provider:
GeniusHTX



