AutoCodeBenchmark

Name: AutoCodeBenchmark
Creator: maas
Published: 2026-05-02 21:44:13
License: 暂无描述

魔搭社区2026-05-02 更新2025-10-11 收录

下载链接：

https://modelscope.cn/datasets/tencent-community/AutoCodeBenchmark

下载链接

链接失效反馈

官方服务：

资源简介：

<div align="center"> **AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators** **Hunyuan Team, Tencent** </div> <p align="center"> <a href="https://arxiv.org/abs/2508.09101">📖 Paper</a> • <a href="https://autocodebench.github.io/">🏠 Home Page</a> • <a href="https://github.com/Tencent-Hunyuan/AutoCodeBenchmark">💻 Code </a> • <a href="https://autocodebench.github.io/leaderboard.html">🏆 Leaderboard</a> • <a href="#citation">📜 Citation</a> </p> ## Introduction Existing code generation benchmarks typically rely on manual annotations, which are not only time-consuming but also challenging to scale across diverse programming languages and varying problem complexities. Furthermore, most existing benchmarks predominantly focus on Python, while the limited number of multilingual benchmarks suffer from insufficient difficulty levels and imbalanced language distribution. To address these limitations, we propose the following comprehensive solution: **AutoCodeGen**: An innovative automated workflow leveraging LLM-Sandbox Interaction, where *LLMs dynamically generate test inputs and obtain corresponding test outputs through the sandbox environment*, enabling the creation of high-quality, scalable code generation datasets. **AutoCodeBench**: A comprehensive, large-scale code generation benchmark comprising 3,920 carefully curated problems, featuring balanced distribution across 20 programming languages. This benchmark is characterized by its high difficulty levels, practical relevance, and linguistic diversity. **AutoCodeBench-Lite**: Derived from extensive evaluation of over 30 open-source and closed-source models on AutoCodeBench, this refined subset contains 1,586 problems that demonstrate consistent solvability, having been successfully addressed by at least two different models. **AutoCodeBench-Complete**: Constructed from 1,000 selected problems from AutoCodeBench-Lite, this benchmark employs 3-shot prompting to create a completion-style code generation assessment framework specifically designed to evaluate the performance capabilities of base models. **MultiLanguageSandbox**: A robust, secure, and high-performance multi-language code execution sandbox service that provides comprehensive support for compilation and execution across more than 30 programming languages. ## AutoCodeGen <div align="center"> <img src="figures/autocodegen.png" width="85%"> </div> ## AutoCodeBench <div align="center"> <img src="figures/bench_comp.png" width="85%"> </div> Previous benchmarks mainly focused on Python, with multilingual benchmarks like Fullstackbench and McEval suffering from imbalanced language and category distributions, and overly simple difficulty. In contrast, AutoCodeBench is a high-difficulty multilingual benchmark with balanced language and category distributions to better assess models' multilingual capabilities. <div align="center"> <img src="figures/distribution_comp.png" width="85%"> </div> ## Experimental Results <div align="center"> <img src="figures/exp_acb.png" width="85%"> </div> <div align="center"> <img src="figures/exp_acb-lite.png" width="85%"> </div> <div align="center"> <img src="figures/exp_acb-comp.png" width="85%"> </div> ## Data Description <div align="center"> <img src="figures/acb.png" width="85%"> </div> Field Descriptions: - question: The programming problem. - canonical_solution: The code solution. - demo_test_func: Public test function containing a few basic test cases. - full_test_func: Private test function containing a large number of comprehensive test cases. - language: The programming language used. - difficulty: easy/medium/hard **System Prompt**: `You are an expert programmer. Your task is to provide a code solution within a single Markdown code block for the given programming problem. Do not include any direct execution commands, test cases, or usage examples within the code block.` ## Evaluation ### 1. Prepare a file `model_output.jsonl` You can use your model to perform inference based on the "question" field in the `autocodebench.jsonl` file and the system prompt, and save the model's output in the "output" field. An example of using VLLM for infernece can be found in the file `run_vllm.sh`. ### 2. Pull the sandbox image ```bash docker pull hunyuansandbox/multi-language-sandbox:v1 ``` ### 3. Start the sandbox service ```bash cd MultiLanguageSandbox ``` ```bash docker run -d \ --name sandbox-service \ -p 8080:8080 \ --cap-add=NET_ADMIN \ hunyuansandbox/multi-language-sandbox:v1 ``` ### 4. Verify the service ```bash # Check container status docker ps | grep sandbox ``` ```bash # Test service health status. If the response contains `"exec_outcome": "PASSED"` in the JSON, it indicates the service is running normally. curl -X POST http://localhost:8080/submit \ -H "Content-Type: application/json" \ -d '{"src_uid": "test-001", "lang": "python", "source_code": "print(\"Hello World\")"}' ``` ```bash # Verify canonical_solution, expected result pass@1=100% python3 call_sandbox.py \ --input_file AutoCodeBench/autocodebench.jsonl \ --output autocodebench.exec.jsonl \ --server_ip localhost \ --server_port 8080 \ --concurrency 32 \ --solution_key canonical_solution ``` ### 5. Calculate pass@1 ```python python3 call_sandbox.py \ --input_file model_output.jsonl \ --output model_output.exec.jsonl \ --server_ip localhost \ --server_port 8080 \ --concurrency 32 \ --solution_key output ``` ## Citation If you find our project helpful, please cite: ```bibtex @misc{chou2025autocodebenchlargelanguagemodels, title={AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators}, author={Jason Chou and Ao Liu and Yuchi Deng and Zhiying Zeng and Tao Zhang and Haotian Zhu and Jianwei Cai and Yue Mao and Chenchen Zhang and Lingyun Tan and Ziyan Xu and Bohui Zhai and Hengyi Liu and Speed Zhu and Wiggin Zhou and Fengzong Lian}, year={2025}, eprint={2508.09101}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2508.09101}, } ``` ## License This repository is licensed under the terms of the [LICENSE](LICENSE) file.

<div align="center"> **AutoCodeBench：大语言模型是自动化代码基准测试生成器** **腾讯混元团队** </div> <p align="center"> <a href="https://arxiv.org/abs/2508.09101">📖 论文</a> • <a href="https://autocodebench.github.io/">🏠 项目主页</a> • <a href="https://github.com/Tencent-Hunyuan/AutoCodeBenchmark">💻 代码仓库</a> • <a href="https://autocodebench.github.io/leaderboard.html">🏆 排行榜</a> • <a href="#citation">📜 引用信息</a> </p> ## 简介现有代码生成基准测试通常依赖人工标注，不仅耗时耗力，且难以在多样化编程语言与不同问题复杂度下进行规模化扩展。此外，绝大多数现有基准测试主要聚焦于Python语言，而少数多语言基准测试则存在难度层级不足、语言分布不均衡的问题。为解决上述局限，我们提出了一套完整的解决方案： **AutoCodeGen**：一种创新性的自动化工作流，依托大语言模型（Large Language Model，LLM）沙箱交互机制，即*大语言模型可通过沙箱环境动态生成测试输入并获取对应的测试输出*，从而能够构建高质量、可规模化的代码生成数据集。 **AutoCodeBench**：一个全面的大规模代码生成基准测试集，包含3920个精心筛选的问题，在20种编程语言中实现了均衡分布。该基准测试集具备高难度、高实用性与语言多样性的特点。 **AutoCodeBench-Lite**：基于在AutoCodeBench上对30余种开源与闭源模型进行的大规模评估所得出的精炼子集，包含1586个具备一致可求解性的问题——这些问题已被至少两种不同的模型成功解决。 **AutoCodeBench-Complete**：从AutoCodeBench-Lite中选取1000个问题构建而成，该基准测试集采用少样本（Few-shot）提示方式，构建了补全式代码生成评估框架，专门用于评估基础大语言模型的性能能力。 **MultiLanguageSandbox**：一款高性能、安全且鲁棒的多语言代码执行沙箱服务，可为超过30种编程语言提供完整的编译与执行支持。 ## AutoCodeGen <div align="center"> <img src="figures/autocodegen.png" width="85%"> </div> ## AutoCodeBench <div align="center"> <img src="figures/bench_comp.png" width="85%"> </div> 此前的基准测试大多仅聚焦于Python语言，诸如Fullstackbench与McEval等多语言基准测试则存在语言与类别分布不均衡、难度过低的问题。与之相对，AutoCodeBench是一款高难度多语言基准测试集，实现了语言与类别分布的均衡，能够更精准地评估模型的多语言编程能力。 <div align="center"> <img src="figures/distribution_comp.png" width="85%"> </div> ## 实验结果 <div align="center"> <img src="figures/exp_acb.png" width="85%"> </div> <div align="center"> <img src="figures/exp_acb-lite.png" width="85%"> </div> <div align="center"> <img src="figures/exp_acb-comp.png" width="85%"> </div> ## 数据集说明 <div align="center"> <img src="figures/acb.png" width="85%"> </div> 字段说明： - question：编程问题描述 - canonical_solution：标准代码解决方案 - demo_test_func：公开测试函数，包含若干基础测试用例 - full_test_func：私有测试函数，包含大量全面的测试用例 - language：所用编程语言 - difficulty：难度等级（easy/medium/hard，即简单/中等/困难） **系统提示词**：`"You are an expert programmer. Your task is to provide a code solution within a single Markdown code block for the given programming problem. Do not include any direct execution commands, test cases, or usage examples within the code block."` ## 评估流程 ### 1. 准备`model_output.jsonl`文件你可以基于`autocodebench.jsonl`文件中的`question`字段与系统提示词，使用你的模型进行推理，并将模型输出保存至`output`字段中。使用VLLM进行推理的示例可参考`run_vllm.sh`文件。 ### 2. 拉取沙箱镜像 bash docker pull hunyuansandbox/multi-language-sandbox:v1 ### 3. 启动沙箱服务 bash cd MultiLanguageSandbox bash docker run -d --name sandbox-service -p 8080:8080 --cap-add=NET_ADMIN hunyuansandbox/multi-language-sandbox:v1 ### 4. 验证服务状态 bash # 检查容器运行状态 docker ps | grep sandbox bash # 测试服务健康状态。若返回的JSON中包含`"exec_outcome": "PASSED"`，则表示服务运行正常。 curl -X POST http://localhost:8080/submit -H "Content-Type: application/json" -d '{"src_uid": "test-001", "lang": "python", "source_code": "print("Hello World")"}' bash # 验证标准解决方案，预期结果为pass@1=100% python3 call_sandbox.py --input_file AutoCodeBench/autocodebench.jsonl --output autocodebench.exec.jsonl --server_ip localhost --server_port 8080 --concurrency 32 --solution_key canonical_solution ### 5. 计算pass@1指标 python python3 call_sandbox.py --input_file model_output.jsonl --output model_output.exec.jsonl --server_ip localhost --server_port 8080 --concurrency 32 --solution_key output ## 引用信息若您认为本项目对您的工作有所帮助，请引用以下文献： bibtex @misc{chou2025autocodebenchlargelanguagemodels, title={AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators}, author={Jason Chou and Ao Liu and Yuchi Deng and Zhiying Zeng and Tao Zhang and Haotian Zhu and Jianwei Cai and Yue Mao and Chenchen Zhang and Lingyun Tan and Ziyan Xu and Bohui Zhai and Hengyi Liu and Speed Zhu and Wiggin Zhou and Fengzong Lian}, year={2025}, eprint={2508.09101}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2508.09101}, } ## 开源许可本代码仓库遵循[LICENSE](LICENSE)文件中所规定的许可条款。

提供机构：

maas

创建时间：

2025-09-26

5,000+

优质数据集

54 个

任务类型

进入经典数据集