GeniusHTX/SWE-Skills-Bench

Name: GeniusHTX/SWE-Skills-Bench
Creator: GeniusHTX
Published: 2026-03-22 16:13:17
License: No description available

Hugging Face | Updated 2026-03-22 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/GeniusHTX/SWE-Skills-Bench

Resource overview:

---
dataset_info:
  features:
  - name: skill_id
    dtype: string
  - name: name
    dtype: string
  - name: description
    dtype: string
  - name: type
    dtype: string
  - name: task_prompt
    dtype: string
  - name: skill_document
    dtype: string
  - name: test_code
    dtype: string
  - name: repo_url
    dtype: string
  - name: repo_commit
    dtype: string
  - name: docker_image
    dtype: string
  splits:
  - name: train
    num_examples: 49
configs:
- config_name: default
  data_files:
  - split: train
    path: swe_skills_bench.jsonl
language:
- en
license: mit
task_categories:
- text-generation
tags:
- code
- software-engineering
- benchmark
- agents
- skill-injection
pretty_name: SWE-Skills-Bench
size_categories:
- n<1K
---

**Dataset Summary**

SWE-Skills-Bench is a benchmark dataset for evaluating whether injected skill documents (structured packages of procedural knowledge) measurably improve LLM agent performance on real-world software engineering tasks. The dataset contains 49 skills spanning 565 task instances across six software engineering domains (Deployment & DevOps, Analytics & Monitoring, API Development, Data Science & ML, Security & Testing, and Developer Tools). Each skill is grounded in an authentic GitHub repository at a fixed commit, paired with a curated skill document and a deterministic pytest test suite that encodes the task's acceptance criteria.

The dataset is designed to answer: *Does giving an agent a skill document actually help?* The primary evaluation metric is pytest pass rate, measured under two conditions, with and without skill injection, to compute a pass-rate delta (ΔP) per skill.

The dataset was released as part of [SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?](https://arxiv.org/abs/2603.15401)

**Dataset Structure**

An example of a SWE-Skills-Bench datum is as follows:

```
skill_id: (str)       - Unique skill identifier, e.g. "fix", "tdd-workflow".
name: (str)           - Human-readable task name.
description: (str)    - One-line description of the task.
type: (str)           - Task category, e.g. "repair", "feature", "fix".
task_prompt: (str)    - Full task prompt passed to the agent (Markdown).
skill_document: (str) - Curated skill document injected as agent context (Markdown).
test_code: (str)      - Pytest test suite used to evaluate the agent's output.
repo_url: (str)       - Target GitHub repository URL.
repo_commit: (str)    - Fixed commit hash for reproducibility.
docker_image: (str)   - Pre-configured Docker image for the evaluation environment.
```

**Supported Tasks**

SWE-Skills-Bench proposes a paired evaluation task: given a task prompt (with or without an injected skill document), an agent must complete a software engineering task on a real codebase. Correctness is verified by running the associated pytest test suite inside a Docker container.
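
The record schema and the paired metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own harness. The helper names `parse_records`, `pass_rate`, and `pass_rate_delta` are hypothetical, and the sketch assumes that a pass rate is the fraction of pytest tests that pass and that ΔP is the with-skill rate minus the without-skill rate. The card does not spell out either convention.

```python
import json

# Field names copied from the dataset card's feature list.
EXPECTED_FIELDS = {
    "skill_id", "name", "description", "type", "task_prompt",
    "skill_document", "test_code", "repo_url", "repo_commit", "docker_image",
}


def parse_records(jsonl_text):
    """Parse JSONL text into dicts and check each row has the documented fields."""
    records = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def pass_rate(passed, total):
    """Fraction of tests that passed (assumed definition of pass rate)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def pass_rate_delta(with_skill, without_skill):
    """Assumed ΔP: with-skill pass rate minus without-skill pass rate.

    Each argument is a (passed, total) pair of pytest test counts.
    """
    return pass_rate(*with_skill) - pass_rate(*without_skill)


if __name__ == "__main__":
    # Made-up counts for illustration only: 3 of 4 tests pass with the
    # skill document injected, 2 of 4 pass without it.
    delta = pass_rate_delta((3, 4), (2, 4))
    print(f"delta_p = {delta:+.2f}")  # prints "delta_p = +0.25"
```

Under these assumptions, a positive ΔP means the skill document helped on that skill, and a negative value means it hurt.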

Provided by:

GeniusHTX
