LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM

Figshare · Updated 2025-10-03 · Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/LAURA_Enhancing_Code_Review_Generation_with_Context-Enriched_Retrieval-Augmented_LLM/27367194


Resource description:

Introduction

LAURA is an LLM-based, retrieval-augmented, context-aware framework for code review generation. It integrates context augmentation, review exemplar retrieval, and prompt tuning to improve the performance of LLMs (in our study, ChatGPT-4o and DeepSeek v3) in generating code review comments.

The experiments show that LAURA outperforms the direct application of ChatGPT-4o and DeepSeek v3 for code review generation and significantly surpasses the pre-trained model CodeReviewer.

Since our experiments are based on ChatGPT-4o and DeepSeek v3, we have released the data processing code and the dataset used in our research. The code section includes the Python scripts we used for data collection, cleaning, merging, and retrieval. The dataset section contains 301k entries from 1,807 high-quality projects sourced from GitHub, covering four programming languages: C, C++, Java, and Python. We also provide the time-split dataset used as the retrieval database (which is also used for fine-tuning CodeReviewer) and the human-annotated evaluation dataset.

File Structure

- codes: Data collection, filtering, and post-processing code used in our study
  - data_collection_and_filtering.py: Code for collecting data via the GitHub GraphQL API and filtering it with rule-based and LLM-based methods
  - data_embedding.py: Code for data embedding
  - data_merging.py: Code for data merging, used to merge review comments that share the same target diff
  - data_retrieval.py: Code for data retrieval
  - diff_extension.py: Code for extending the code diffs by integrating the full code context into the diffs
- datasets: Datasets built and used in our study
  - database_for_retrieve.csv: The dataset we built for retrieval-augmented generation, containing 298,494 entries dated before December 26, 2024
  - evaluation_data.csv: The evaluation dataset we manually annotated, containing 384 entries dated after December 26, 2024
  - full_dataset.csv: The full dataset we collected, containing 301,256 entries
- prompts: The prompts used in data filtering, generation, and evaluation
  - direct_generation.txt: The prompt we used for direct generation as the baseline
  - LAURA_generation.txt: The prompt we used for LAURA generation
  - LLM_evaluation.txt: The prompt we used for LLM evaluation
  - LLM_filtering.txt: The prompt we used for LLM filtering in the data filtering process
- README.md: Description of our submission
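The merging and time-split steps described above can be illustrated with a minimal sketch. It is not the released data_merging.py. The field names `diff`, `comment`, and `created` are hypothetical, since the description does not give the CSV schema, and the choice to date a merged entry by its earliest comment is an assumption. Only the December 26, 2024 cutoff comes from the description.

```python
from datetime import date

# Cutoff stated in the dataset description; everything else here is illustrative.
SPLIT_DATE = date(2024, 12, 26)


def merge_by_diff(rows):
    """Combine review comments that target the same diff into one entry.

    Each row is a dict with hypothetical keys: "diff", "comment", "created".
    """
    merged = {}
    for row in rows:
        entry = merged.setdefault(
            row["diff"],
            {"diff": row["diff"], "created": row["created"], "comments": []},
        )
        entry["comments"].append(row["comment"])
        # Assumption: date the merged entry by its earliest comment.
        entry["created"] = min(entry["created"], row["created"])
    return list(merged.values())


def time_split(entries, cutoff=SPLIT_DATE):
    """Split entries into (retrieval, evaluation) sets around the cutoff date.

    Mirrors the description's wording: strictly before the cutoff goes to the
    retrieval database, strictly after goes to evaluation.
    """
    retrieval = [e for e in entries if e["created"] < cutoff]
    evaluation = [e for e in entries if e["created"] > cutoff]
    return retrieval, evaluation


if __name__ == "__main__":
    sample = [
        {"diff": "-a\n+b", "comment": "Rename this.", "created": date(2024, 11, 1)},
        {"diff": "-a\n+b", "comment": "Add a test.", "created": date(2024, 11, 3)},
        {"diff": "-x\n+y", "comment": "Check for null.", "created": date(2025, 1, 10)},
    ]
    retrieval, evaluation = time_split(merge_by_diff(sample))
    print(len(retrieval), len(evaluation))
```

Splitting by date rather than at random keeps every retrieval exemplar older than the evaluation entries, so retrieved reviews cannot leak information from the period being evaluated.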

Created: 2025-10-03
