RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Name: RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
Creator: figshare
Published: 2025-12-02 09:02:32
License: No description available

DataCite Commons | Updated 2025-12-02 | Indexed 2026-04-25

Download link:

https://figshare.com/articles/dataset/RealBench_A_Repo-Level_Code_Generation_Benchmark_Aligned_with_Real-World_Software_Development_Practices/28596638/3


Resource description:

<pre>We greatly appreciate your thoughtful feedback on our work!
The supplementary materials for Review #2570A are in "Response to Review #2570A.md".
The supplementary materials for Review #2570B are in "Response to Review #2570B.md".
The supplementary materials for Review #2570C are in "Response to Review #2570C.md".</pre>

This repository contains the implementation of **RealBench**, a comprehensive benchmark and evaluation framework for repository-level code generation aligned with real-world software development practices.

## 📝 Abstract

> Repository-level code generation seeks to automate software development by generating complete codebases from specifications. While existing benchmarks like HumanEval evaluate LLMs on generating code from natural language requirements alone, they fail to reflect real-world industry practices where developers primarily work from structured system designs like UML diagrams. This gap means current benchmark scores may not accurately measure LLMs' practical utility. To address this, we propose RealBench, a novel repository-level benchmark aligned with real-world software development. RealBench provides both natural language requirements and UML diagrams (package and class diagrams) for 61 diverse, real-world Python repositories, along with comprehensive test suites. We evaluate six advanced LLMs (including GPT-4o, Claude, Gemini, DeepSeek, and Qwen) using three generation strategies: holistic, incremental, and retrieval-augmented generation (RAG). Our systematic evaluation, using five metrics across repository-level and class-level granularities, reveals key insights: LLMs perform poorly on this task (best Pass@1 is only 19.39%), struggle significantly with larger repositories, and while good at identifying modules from diagrams, often generate code with grammar and logic errors. The results show that holistic generation works best for small repositories, while incremental generation is superior for complex ones, and that detailed UML designs are crucial for performance. RealBench provides a standardized platform for evaluating and advancing LLMs for practical, design-driven code generation.

## 🔧 Features

- 🎯 **Comprehensive Benchmark**: Curated dataset of Python projects with UML diagrams, descriptions, and structure trees.
- 🤖 **Multi-Model Evaluation**: Support for GPT-4o, Claude, DeepSeek, Gemini, Qwen, and other state-of-the-art LLMs.
- 📊 **Multiple Approaches**: Evaluation of holistic, incremental, and RAG-based code generation methods.
- 🔍 **UML Analysis**: Advanced UML diagram parsing and analysis using Understand API.
- 📐 **Structure Preservation**: Maintains project structure and file organization during code generation.
- ⚙️ **Automated Evaluation**: Comprehensive evaluation pipeline with automated testing and metrics computation.

---

## 📄 Algorithm Supplement (Pseudocode)

![algos](./supplementary/fig_real_example.pdf)

## 📁 Project Structure

```text
.
├── datasets/                        # Contains the benchmark datasets used for evaluation (Download from https://drive.google.com/drive/folders/16ZT8y-K-I6HN95wtv76xy5HGHNsaznQE?usp=sharing).
│   ├── all_proj/                    # Complete project dataset with all files and dependencies.
│   └── projs_filtered_uml/          # Filtered dataset containing projects with UML diagrams.
│       ├── level1/                  # Level 1 projects (basic complexity).
│       ├── level2/                  # Level 2 projects (intermediate complexity).
│       ├── level3/                  # Level 3 projects (advanced complexity).
│       └── level4/                  # Level 4 projects (expert complexity).
├── src/                             # Main source code for the RealBench framework.
│   ├── crawler/                     # GitHub repository crawling and filtering.
│   │   ├── download.py              # Downloads repositories from GitHub.
│   │   ├── filter.py                # Filters repositories based on criteria.
│   │   └── github.py                # GitHub API interaction utilities.
│   ├── uml_gen/                     # UML diagram generation and analysis.
│   │   ├── file_uml.py              # UML diagram generation using Understand API.
│   │   ├── tree.py                  # Project structure tree generation.
│   │   └── uml_dag.py               # UML dependency graph analysis.
│   ├── model_gen/                   # Code generation using various LLMs.
│   │   ├── generate/                # LLM inference implementations.
│   │   │   ├── gpt_inference.py       # GPT model inference.
│   │   │   ├── claude_inference.py    # Claude model inference.
│   │   │   ├── deepseek_inference.py  # DeepSeek model inference.
│   │   │   ├── gemini_inference.py    # Gemini model inference.
│   │   │   └── qwen_inference.py      # Qwen model inference.
│   │   ├── make_prompts_UML/        # UML-based prompt generation.
│   │   │   ├── instruction_v3.txt     # Instruction template for UML generation.
│   │   │   └── sys_gen_prom_con.py    # System prompt construction.
│   │   ├── make_prompts_package/    # Package-level prompt generation.
│   │   ├── make_prompts_rag/        # RAG-based prompt generation.
│   │   └── results/                 # Generated code results and evaluations (Download from https://drive.google.com/drive/folders/16ZT8y-K-I6HN95wtv76xy5HGHNsaznQE?usp=sharing).
│   │       ├── uml/                 # UML-based generation results.
│   │       │   ├── holistic/        # Holistic generation results.
│   │       │   └── incremental/     # Incremental generation results.
│   │       └── rag/                 # RAG-based generation results.
│   └── cope_scripts/                # Data collection and processing scripts.
│       ├── collect_des.py           # Collects project descriptions.
│       ├── parser_project.py        # Parses project structure and dependencies.
│       └── generator_des.py         # Generates project descriptions.
├── environment.yml                  # Conda environment configuration.
└── README.md                        # This file.
```

---

## ⚙️ Environment

* 🐧 Windows/Linux system (Windows 10/11 or Ubuntu 20.04+ recommended)
* 🐍 Python 3.9
* 📦 Conda environment management
* 🔧 Understand API (for UML analysis)

---
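The abstract reports results as Pass@1. As a rough illustration of how such a score can be computed, the sketch below uses the unbiased pass@k estimator popularized by the HumanEval benchmark. Whether RealBench's own pipeline uses this exact estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical, not part of the repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: number of generated samples, c: samples passing all tests,
    k: sample budget being scored. (Hypothetical helper, not from RealBench.)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, each given as an (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per repository (n = 1, k = 1), the score reduces to the fraction of repositories whose generated code passes its whole test suite.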
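The project tree lists `uml_dag.py` for UML dependency graph analysis, and the abstract describes an incremental generation strategy. One plausible way to connect the two is to generate classes in dependency order, so that each class is produced after the classes it relies on. The sketch below illustrates that idea with the standard-library `graphlib` module (available from Python 3.9). It is an assumption about how such an ordering could work, not the repository's actual implementation, and `generation_order` and the class names are hypothetical.

```python
from graphlib import TopologicalSorter


def generation_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return class names ordered so that dependencies come first.

    depends_on maps each class to the set of classes it depends on,
    as might be read off a UML class diagram (hypothetical input format).
    Raises graphlib.CycleError if the dependencies are cyclic.
    """
    sorter = TopologicalSorter(depends_on)
    return list(sorter.static_order())


# Illustrative dependency graph: Service uses Repo and Model, Repo uses Model.
example = {
    "app.Service": {"app.Repo", "app.Model"},
    "app.Repo": {"app.Model"},
    "app.Model": set(),
}
order = generation_order(example)
```

In an incremental setting, each class in `order` would be sent to the model together with the already generated code of its dependencies. Cyclic dependencies would need separate handling, for example by generating the classes in a cycle together.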

Provider:

figshare

Created:

2025-11-28
