CodeCriticBench

Name: CodeCriticBench
Creator: maas
Published: 2025-12-04 16:48:04
License: No description available

ModelScope community, updated 2025-12-04, indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/m-a-p/CodeCriticBench

Resource description:

# CodeCriticBench: A Holistic Benchmark for Code Critique in LLMs

## 💥 Introduction

**CodeCriticBench** is a comprehensive benchmark designed to systematically evaluate the critique capabilities of large language models (LLMs) in both code generation and code-question answering tasks. Beyond focusing on code generation, this benchmark extends to code-related questions, offering multidimensional and fine-grained evaluation criteria to rigorously assess LLMs' reasoning and code comprehension abilities.

## ✨ Key Features

- **Multitask Coverage**
  - **Code Generation**: Includes algorithmic problems from common platforms (e.g., CodeForces, MBPP, LiveCodeBench), alongside a specialized Debug subset to evaluate the model's ability to detect specific programming errors.
  - **Code Question Answering (Code QA)**: Based on real-world programming scenarios, combining StackOverflow responses and diverse question generation from Qwen2.5-72B to assess performance in realistic situations.
- **Fine-grained Evaluation Mechanism**
  Each sample is accompanied by a series of meticulously designed evaluation checklists covering 10 distinct criteria. In addition to basic evaluations, advanced assessment protocols ensure a multi-angle, layered assessment of the model's output quality.
- **Difficulty Stratification**
  Using 12 state-of-the-art LLMs, each sample is categorized by difficulty into three levels: Easy (1,517 samples), Medium (1,084 samples), and Hard (1,699 samples). This ensures a balanced distribution across difficulty levels.
- **Automated and Manual Labeling**
  - **Automated Evaluation**: Code generation tasks are paired with test cases to automatically validate code correctness within a sandbox environment.
  - **Manual Evaluation**: Code QA tasks involve 20 volunteers with programming experience who independently assess answers, with final labels determined via majority voting.
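The majority-voting step described under Manual Evaluation can be illustrated with a minimal sketch. The benchmark description does not specify the label set or how ties are resolved, so the function name `majority_label`, the example labels, and the choice to return `None` on a tie are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label among annotator votes.

    Returns None when the top two labels are tied (an assumed policy;
    the source does not say how ties are handled).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # most_common() sorts labels by descending vote count.
    (top_label, top_count), *rest = counts.most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie between the two leading labels
    return top_label


# Hypothetical example: 20 annotators judging one Code QA answer.
votes = ["correct"] * 12 + ["incorrect"] * 8
final_label = majority_label(votes)
```

With a binary label set and 20 voters, a 10-10 split is possible, which is why the sketch makes the tie case explicit rather than picking a side silently.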
<div align="center">
<img src="https://raw.githubusercontent.com/multimodal-art-projection/CodeCriticBench/main/source/image1.png" width="900" height="270">
<img src="https://raw.githubusercontent.com/multimodal-art-projection/CodeCriticBench/main/source/image2.png" width="550" height="450">
</div>

## 🌸 Framework Overview

<div align="center">
<img src="https://raw.githubusercontent.com/multimodal-art-projection/CodeCriticBench/main/source/image3.png" >
</div>

## 🌸 Usage

To get started with **CodeCriticBench**, clone the repository and follow these steps:

```bash
git clone https://github.com/multimodal-art-projection/CodeCriticBench.git
cd CodeCriticBench
```

## 💻 Run Evaluation Script

Use the provided evaluation scripts for automated and manual assessment of model outputs. For example:

- **Model Inference**: Run inference on your model:

  ```bash
  python src/infer_qwen.py \
    --model_name 'Qwen2.5-Coder-32B-Instruct' \
    --model_path='./Qwen2.5-Coder-32B-Instruct' \
    --input_data_path='./data/CodeCriticBench.jsonl' \
    --output_data_path='./data/output/'
  ```

- **Score Evaluation**: Score the model outputs:

  ```bash
  python src/evaluate.py
  ```

## 📰 Evaluation Results

Evaluation results will be displayed as follows:

<div align="center">
<img src="https://raw.githubusercontent.com/multimodal-art-projection/CodeCriticBench/main/source/image4.png" >
<img src="https://raw.githubusercontent.com/multimodal-art-projection/CodeCriticBench/main/source/image5.png" >
</div>

## 🔥 Contributing

We welcome contributions to CodeCriticBench! Whether it's expanding the dataset, improving evaluation metrics, or optimizing code, your input is highly valued.
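The inference command under Run Evaluation Script reads its samples from `./data/CodeCriticBench.jsonl`. Below is a rough sketch for inspecting that file before running inference, assuming it is standard JSON Lines (one JSON object per line, as the extension suggests). The helpers `load_jsonl` and `count_by` are illustrative names and not part of the repository. The record schema is not documented here, so the field to tally is left as a parameter.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by(records, key):
    """Count how often each value of `key` occurs.

    Records that lack `key` are grouped under None.
    """
    return Counter(record.get(key) for record in records)
```

For instance, `count_by(load_jsonl("./data/CodeCriticBench.jsonl"), "difficulty")` would tally samples per difficulty level, if the records carry a field by that name; `"difficulty"` is a hypothetical key, so check the actual keys with `records[0].keys()` first.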
## 📜 Citation

If you use CodeCriticBench in your research, please cite the following:

```bibtex
@article{zhang2025codecriticbench,
  title={Codecriticbench: A holistic code critique benchmark for large language models},
  author={Zhang, Alexander and Dong, Marcus and Liu, Jiaheng and Zhang, Wei and Wang, Yejie and Yang, Jian and Zhang, Ge and Liu, Tianyu and Peng, Zhongyuan and Tan, Yingshui and others},
  journal={arXiv preprint arXiv:2502.16614},
  year={2025}
}
```

## Contact

If you have any questions or suggestions, feel free to reach out via the [issues page](https://github.com/multimodal-art-projection/CodeCriticBench/issues).

---

CodeCriticBench is dedicated to advancing the field of code understanding and critique within LLMs. We look forward to your usage and feedback!


Provider: maas

Created: 2025-08-28
