FEA-Bench
Source: ModelScope community (updated 2025-10-09, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/AI-Bench/FEA-Bench
# Dataset Card for FEA-Bench
A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation.
## Dataset Details
### Dataset Description
FEA-Bench is a benchmark whose test set contains 1,401 task instances from 83 GitHub repositories. It aims to evaluate repository-level incremental code development. The task instances are collected from GitHub pull requests whose purpose is to implement new features. Each task instance includes the repository, the base commit hash, the PR number, and the expected status of the unit tests.
- **Curated by:** the authors of the FEA-Bench paper: Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao and their collaborators.
- **Language(s) (NLP):** English
- **License:** Other; the licenses of all involved GitHub repositories are listed in the last section.
## Uses
This dataset is designed to evaluate the performance of LLMs on repository-level code development, a complex software engineering task.
- Repository-level incremental code development: FEA-Bench can be used to evaluate a model's capability for repository-level incremental code development. Success on this task is typically measured by achieving a high resolved ratio. The leaderboard will soon be published as a website.
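As a rough illustration of the resolved ratio, the sketch below assumes a SWE-bench-style criterion: an instance counts as resolved only when every test listed in `FAIL_TO_PASS` and `PASS_TO_PASS` passes after the model's patch is applied. The function names and the `results` mapping (instance ID to the names of passing tests) are hypothetical and are not part of the FEA-Bench scripts.

```python
from typing import Dict, Iterable, List


def is_resolved(instance: Dict, passed_tests: Iterable[str]) -> bool:
    """Assumed criterion: all FAIL_TO_PASS and PASS_TO_PASS tests pass."""
    passed = set(passed_tests)
    required = list(instance["FAIL_TO_PASS"]) + list(instance["PASS_TO_PASS"])
    return all(test in passed for test in required)


def resolved_ratio(instances: List[Dict], results: Dict[str, Iterable[str]]) -> float:
    """Fraction of instances resolved.

    `results` maps instance_id -> names of tests that passed (hypothetical format).
    Instances with no recorded result count as unresolved.
    """
    if not instances:
        return 0.0
    resolved = 0
    for inst in instances:
        passed = results.get(inst["instance_id"])
        if passed is not None and is_resolved(inst, passed):
            resolved += 1
    return resolved / len(instances)
```

Counting missing results as unresolved keeps the denominator fixed at the full set of instances, so partial runs cannot inflate the score.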
### Direct Use
Use the scripts from the FEA-Bench repo to gather information for the task instances and organize it into prompts, which can be used for LLM inference. Alternatively, you can gather the information yourself or use agents to directly solve the PRs with code changes.
### Out-of-Scope Use
This dataset is not intended for training LLMs. To avoid contamination, do not use FEA-Bench as a training dataset.
## Dataset Structure
An example:
```
{
"instance_id": "huggingface__accelerate-270",
"pull_number": 270,
"repo": "huggingface/accelerate",
"version": null,
"base_commit": "515fcca9ed2b36c274c595dbdff75f1c2da635de",
"environment_setup_commit": "08101b9dde2b1a9658c2e363e3e9f5663ba06073",
"FAIL_TO_PASS": [
"tests/test_state_checkpointing.py::CheckpointTest::test_can_resume_training",
"tests/test_state_checkpointing.py::CheckpointTest::test_invalid_registration",
"tests/test_state_checkpointing.py::CheckpointTest::test_with_scheduler"
],
"PASS_TO_PASS": []
}
```
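A minimal sketch of how such a record might be consumed, using only the fields shown above. It assumes the test identifiers are pytest node IDs run via `python -m pytest`, and that `instance_id` follows the pattern visible in this single example (the repo name with `/` replaced by `__`, then `-` and the pull number). The helper names are hypothetical and do not come from the FEA-Bench scripts.

```python
import json
from typing import Dict, List

# The example record from the dataset card, embedded as a JSON string.
EXAMPLE = """
{
  "instance_id": "huggingface__accelerate-270",
  "pull_number": 270,
  "repo": "huggingface/accelerate",
  "version": null,
  "base_commit": "515fcca9ed2b36c274c595dbdff75f1c2da635de",
  "environment_setup_commit": "08101b9dde2b1a9658c2e363e3e9f5663ba06073",
  "FAIL_TO_PASS": [
    "tests/test_state_checkpointing.py::CheckpointTest::test_can_resume_training",
    "tests/test_state_checkpointing.py::CheckpointTest::test_invalid_registration",
    "tests/test_state_checkpointing.py::CheckpointTest::test_with_scheduler"
  ],
  "PASS_TO_PASS": []
}
"""


def expected_instance_id(repo: str, pull_number: int) -> str:
    """Rebuild the ID pattern inferred from the example (an assumption)."""
    return f"{repo.replace('/', '__')}-{pull_number}"


def checkout_command(instance: Dict) -> List[str]:
    """Git command that moves a local clone to the instance's base commit."""
    return ["git", "checkout", instance["base_commit"]]


def pytest_command(instance: Dict) -> List[str]:
    """Assumed test invocation: run every listed test ID with pytest."""
    tests = list(instance["FAIL_TO_PASS"]) + list(instance["PASS_TO_PASS"])
    return ["python", "-m", "pytest", *tests]


instance = json.loads(EXAMPLE)
print(expected_instance_id(instance["repo"], instance["pull_number"]))
print(" ".join(checkout_command(instance)))
print(" ".join(pytest_command(instance)))
```

The commands are returned as argument lists so that they can be passed to `subprocess.run` without shell quoting.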
## Dataset Creation
### Curation Rationale
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories.
### Source Data
#### Data Collection and Processing
We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified.
#### Who are the source data producers?
The authors of the 83 GitHub repositories listed in the last section.
#### Personal and Sensitive Information
The dataset does not include any personal or sensitive information.
## Bias, Risks, and Limitations
- High-quality data suitable for repository-level incremental development is limited: usable pull requests for new feature development are relatively scarce. Many repository-level changes that implemented new functionality were committed during the early stages of a repository, without the rigorous code review typical of the open-source community, so their quality is too low to be usable.
- Furthermore, early-stage development of the software might not have been conducted on GitHub at all, which poses a challenge for data collection and utilization.
- Repository-level incremental code development may include more than new feature implementation tasks.
- Only Python repositories are involved in FEA-Bench.
- The inference results of the task instances from the benchmark may contain code that is harmful to computer systems.
### Recommendations
Evaluation with Docker is recommended, as with SWE-bench. We will also publish a patch for SWE-bench to make it compatible with the evaluation of our tasks.
**BibTeX:**
To appear after the arXiv paper is published.
**APA:**
To appear after the arXiv paper is published.
## Dataset Card Contact
For further information or questions, please contact Xin Zhang (xinzhang3@microsoft.com).
## All GitHub repositories involved in FEA-Bench
| Repo Name | License | Topic |
|-----------------------------------|-------------------|--------------------------------------------|
| astropy/astropy | BSD-3-Clause | Scientific/Engineering::Astronomy |
| django/django | BSD-3-Clause | Internet::WWW/HTTP |
| matplotlib/matplotlib | Other | Scientific/Engineering::Visualization |
| mwaskom/seaborn | BSD-3-Clause | Scientific/Engineering::Visualization |
| pallets/flask | BSD-3-Clause | Internet::WWW/HTTP |
| pvlib/pvlib-python | BSD-3-Clause | Scientific/Engineering::Physics |
| pydata/xarray | Apache-2.0 | Scientific/Engineering::Information Analysis |
| pydicom/pydicom | Other | Scientific/Engineering::Medical Science Apps. |
| pylint-dev/astroid | LGPL-2.1 | Software Development::Libraries |
| pylint-dev/pylint | GPL-2.0 | Software Development::Quality Assurance |
| pyvista/pyvista | MIT | Scientific/Engineering::Information Analysis |
| scikit-learn/scikit-learn | BSD-3-Clause | Scientific/Engineering::Artificial Intelligence |
| sphinx-doc/sphinx | BSD-2-Clause | Text Processing::Markup |
| sqlfluff/sqlfluff | MIT | Software Development::Quality Assurance |
| sympy/sympy | Other | Scientific/Engineering::Mathematics |
| Aider-AI/aider | Apache-2.0 | Software Development::Code Generators |
| Cog-Creators/Red-DiscordBot | GPL-3.0 | Communications::Chat |
| DLR-RM/stable-baselines3 | MIT | Scientific/Engineering::Artificial Intelligence |
| EleutherAI/lm-evaluation-harness | MIT | Scientific/Engineering::Artificial Intelligence |
| Project-MONAI/MONAI | Apache-2.0 | Scientific/Engineering::Medical Science Apps. |
| PyThaiNLP/pythainlp | Apache-2.0 | Text Processing::Linguistic |
| RDFLib/rdflib | BSD-3-Clause | Software Development::Libraries |
| Textualize/rich | MIT | Software Development::Libraries |
| Textualize/textual | MIT | Software Development::User Interfaces |
| TileDB-Inc/TileDB-Py | MIT | Software Development::Libraries |
| astronomer/astronomer-cosmos | Apache-2.0 | Software Development::Build Tools |
| atlassian-api/atlassian-python-api| Apache-2.0 | Internet::WWW/HTTP |
| aws-cloudformation/cfn-lint | MIT-0 | Software Development::Quality Assurance |
| aws-powertools/powertools-lambda-python | MIT-0 | Software Development::Libraries |
| aws/sagemaker-python-sdk | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| biopragmatics/bioregistry | MIT | Scientific/Engineering::Bio-Informatics |
| boto/boto3 | Apache-2.0 | Software Development::Libraries |
| boto/botocore | Apache-2.0 | Software Development::Libraries |
| cocotb/cocotb | BSD-3-Clause | Scientific/Engineering::Electronic Design Automation (EDA) |
| conan-io/conan | MIT | Software Development::Build Tools |
| deepset-ai/haystack | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| docker/docker-py | Apache-2.0 | Software Development::Libraries |
| dpkp/kafka-python | Apache-2.0 | Software Development::Libraries |
| embeddings-benchmark/mteb | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| facebookresearch/hydra | MIT | Software Development::Libraries |
| fairlearn/fairlearn | MIT | Scientific/Engineering::Artificial Intelligence |
| falconry/falcon | Apache-2.0 | Internet::WWW/HTTP |
| google-deepmind/optax | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| googleapis/python-aiplatform | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| googleapis/python-bigquery | Apache-2.0 | Internet::WWW/HTTP |
| gradio-app/gradio | Apache-2.0 | Scientific/Engineering::Human Machine Interfaces |
| graphql-python/graphene | MIT | Software Development::Libraries |
| huggingface/accelerate | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| huggingface/datasets | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| huggingface/huggingface_hub | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| huggingface/pytorch-image-models | Apache-2.0 | Software Development::Libraries |
| huggingface/trl | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| joblib/joblib | BSD-3-Clause | Software Development::Libraries |
| joke2k/faker | MIT | Software Development::Testing |
| lark-parser/lark | MIT | Text Processing::Linguistic |
| minio/minio-py | Apache-2.0 | Software Development::Libraries |
| open-mmlab/mmengine | Apache-2.0 | Utilities |
| openvinotoolkit/datumaro | MIT | Scientific/Engineering::Image Processing |
| pgmpy/pgmpy | MIT | Scientific/Engineering::Artificial Intelligence |
| pre-commit/pre-commit | MIT | Software Development::Quality Assurance |
| prometheus/client_python | Apache-2.0 | System::Monitoring |
| prompt-toolkit/python-prompt-toolkit | BSD-3-Clause | Software Development::User Interfaces |
| pygments/pygments | BSD-2-Clause | Software Development::Documentation |
| pyocd/pyOCD | Apache-2.0 | Software Development::Debuggers |
| pypa/hatch | MIT | Software Development::Build Tools |
| pyro-ppl/pyro | Apache-2.0 | Scientific/Engineering::Artificial Intelligence |
| python-hyper/h2 | MIT | Internet::WWW/HTTP |
| roboflow/supervision | MIT | Scientific/Engineering::Image Processing |
| rytilahti/python-miio | GPL-3.0 | Home Automation |
| saleweaver/python-amazon-sp-api | MIT | Internet::WWW/HTTP |
| scrapy/scrapy | BSD-3-Clause | Software Development::Libraries |
| scverse/scanpy | BSD-3-Clause | Scientific/Engineering::Bio-Informatics |
| slackapi/bolt-python | MIT | Communications::Chat |
| slackapi/python-slack-sdk | MIT | Communications::Chat |
| snowflakedb/snowflake-connector-python | Apache-2.0 | Software Development::Libraries |
| softlayer/softlayer-python | MIT | Software Development::Libraries |
| spec-first/connexion | Apache-2.0 | Internet::WWW/HTTP |
| statsmodels/statsmodels | BSD-3-Clause | Scientific/Engineering::Information Analysis |
| tfranzel/drf-spectacular | BSD-3-Clause | Software Development::Documentation |
| tobymao/sqlglot | MIT | Database::Database Engines/Servers |
| tornadoweb/tornado | Apache-2.0 | Internet::WWW/HTTP |
| tortoise/tortoise-orm | Apache-2.0 | Database::Front-Ends |
| wagtail/wagtail | BSD-3-Clause | Internet::WWW/HTTP |
Provider: maas
Created: 2025-03-19



