FullStackBench
ModelScope community · updated 2026-01-09 · indexed 2025-01-18
Download link: https://modelscope.cn/datasets/ByteDance/FullStackBench
Description:
<h1 style="text-align: center;">FullStack Bench: Evaluating LLMs as Full Stack Coders </h1>
<div align="center" style="margin: 2px;">
<a href="https://www.python.org/">
<img alt="Build" src="https://img.shields.io/badge/Python-3.8+-1f425f.svg?color=purple" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="" style="margin: 2px;">
<img alt="Code License" src="https://img.shields.io/badge/Code_License-Apache 2.0 license-f5de53%3F?color=green" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="" style="margin: 2px;">
<img alt="Data License" src="https://img.shields.io/badge/Data_License-CC--BY--SA--4.0-f5de53%3F?color=blue" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div style="text-align: center;">
Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders"
</div>
<p align="center">
<a href="https://github.com/bytedance/FullStackBench">🏠 FullStack Bench Code </a> •
<a href="https://huggingface.co/datasets/ByteDance/FullStackBench">📊 Benchmark Data </a> •
<a href="https://github.com/bytedance/SandboxFusion">📚 SandboxFusion </a>
</p>
## Table of contents
- [FullStack Bench: Evaluating LLMs as Full Stack Coders](#Introduction)
- [📌 Introduction](#introduction)
- [📚 SandboxFusion](#sandboxfusion)
- [📊 Data](#data)
- [💻 Usage](#usage)
- [📖 Citation](#citation)
## 📌Introduction
**FullStack Bench** is a multilingual benchmark for full-stack programming. It covers a wide range of application domains and **16** programming languages with **3K** test samples, and it is designed to probe the code-related abilities of code LLMs in real-world development scenarios.
<p align="center">
<img src="https://github.com/bytedance/FullStackBench/blob/main/assets/intro.png?raw=true" width="80%" alt="FullStack Bench" />
</p>
### Task Examples
**FullStack Bench** covers more mainstream application domains than existing code evaluation benchmarks. The example below, taken from FullStack Bench, asks the model to solve a desktop and web development problem using HTML.
<p align="center">
<img src="https://github.com/bytedance/FullStackBench/blob/main/assets/bench_cases.jpg?raw=true" width="80%" alt="FullStack Bench" />
</p>
Refer to our paper or dataset for more details.
### Results
<p align="center">
<img src="https://github.com/bytedance/FullStackBench/blob/main/assets/result.png?raw=true" width="100%" alt="results" />
</p>
Refer to our paper for more results.
## 📚SandboxFusion
**SandboxFusion** is an effective code sandbox execution tool for evaluating programming tasks across different languages. It incorporates over 10 coding-related evaluation datasets, uses a standardized data format, and is accessible through a uniform HTTP API.
<p align="center">
<img src="https://github.com/bytedance/FullStackBench/blob/main/assets/sandbox.png?raw=true" width="80%" alt="FullStack Bench" />
</p>
Refer to our paper and the <a href="https://bytedance.github.io/SandboxFusion/">📚 Tutorial </a> for more details.
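As a rough illustration of the uniform HTTP API, the sketch below builds a JSON request and posts it to a locally running sandbox. The `/run_code` path, the `code` and `language` fields, and port 8080 are assumptions based on the SandboxFusion tutorial and may differ between versions; the helper names `build_payload` and `run_in_sandbox` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint of a locally running sandbox; the path and port may differ.
SANDBOX_URL = "http://localhost:8080/run_code"


def build_payload(code: str, language: str = "python") -> bytes:
    """Serialize a code snippet into the JSON body the sandbox is assumed to accept."""
    return json.dumps({"code": code, "language": language}).encode("utf-8")


def run_in_sandbox(code: str, language: str = "python", timeout: float = 30.0) -> dict:
    """POST the snippet to the sandbox and return the decoded JSON response."""
    request = urllib.request.Request(
        SANDBOX_URL,
        data=build_payload(code, language),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the sandbox server from the Usage section running, `run_in_sandbox('print("Hello, world!")')` should return the server's JSON result as a dictionary; the exact response fields are not assumed here.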
## 📊Data
<div align="center">
| **Dataset** | **Download** |
| :------------: | :------------: |
| FullStack Bench Dataset | [🤗 HuggingFace](https://huggingface.co/datasets/ByteDance/FullStackBench) |
</div>
## 💻Usage
Start the [sandbox server](https://bytedance.github.io/SandboxFusion/):
```bash
docker run -d --rm -p 8080:8080 volcengine/sandbox-fusion:server-20241204
```
For users in mainland China, the following mirror is provided:
```bash
docker run -d --rm -p 8080:8080 vemlp-cn-beijing.cr.volces.com/preset-images/code-sandbox:server-20241204
```
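Because `docker run -d` returns before the server inside the container is ready, it can help to wait until the published port accepts connections. The following is a minimal sketch, assuming the sandbox is published on `localhost:8080` as in the commands above; `wait_for_port` is a hypothetical helper, and an open TCP port only suggests, rather than guarantees, that the service is fully initialized.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8080,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` right after starting the container blocks for up to a minute and reports whether the port became reachable.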
Then, run the benchmark:
```bash
git clone https://github.com/bytedance/FullStackBench.git
cd FullStackBench
pip install -r requirements.txt
# modify the model configs in src/main.py
python src/main.py
```
## 📖Citation
If you find our work helpful, please cite it as follows.
```bibtex
@misc{liu2024fullstackbenchevaluatingllms,
title={FullStack Bench: Evaluating LLMs as Full Stack Coders},
author={Siyao Liu and He Zhu and Jerry Liu and Shulin Xin and Aoyan Li and Rui Long and Li Chen and Jack Yang and Jinxiang Xia and Z. Y. Peng and Shukai Liu and Zhaoxiang Zhang and Ge Zhang and Wenhao Huang and Kai Shen and Liang Xiang},
year={2024},
eprint={2412.00535},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.00535},
}
```
Provider: maas
Created: 2025-01-13



