Download link:

https://modelscope.cn/datasets/Alibaba-DT/SKYLENAGE-GameCodeGym

Resource overview:

[![Platform](https://img.shields.io/badge/Platform-SKYLENAGE-blue.svg)](https://skylenage.alibaba-inc.com/sla/home) [![Home](https://img.shields.io/badge/Homepage-🏠-blue.svg)](https://v-gamegym.github.io/index.html) [![Leaderboard](https://img.shields.io/badge/Leaderboard-🏆-brightgreen.svg)](https://v-gamegym.github.io/leaderboard.html) [![Paper](https://img.shields.io/badge/Paper-📄-b31b1b.svg)](https://arxiv.org/abs/2509.20136) [![Code](https://img.shields.io/badge/Code-💻-black.svg)](https://github.com/alibaba/SKYLENAGE-GameCodeGym/)

# I. Benchmark Introduction

**SKYLENAGE-GameCodeGym (V-GameGym)** is a comprehensive benchmark for code large language models (LLMs) that addresses the lack of evaluation for visual game development. It contains **2,219 samples** across **100 clusters**, selected with a clustering-based curation method to ensure diversity and completeness.

# II. Benchmark Features

1. **Game-specific metrics**: Covers three dimensions: playability, aesthetics, and user engagement.
2. **Multimodal evaluation**: Evaluates LLM-driven visual code synthesis in a UI sandbox environment.
3. **Validated effectiveness**: Narrows the gap between code-generation accuracy and real-world development workflows.
4. **Quantifiable results**: Provides measurable indicators for visual programming tasks.

# III. Leaderboard

| Rank | Model Name | Company | Total | Code | Screenshot | Video | Release Date |
|------|------------|---------|-------|------|------------|-------|--------------|
| 🥇 1 | GPT-5-20250807 | OpenAI | 45.0 | 96.6 | 17.6 | 20.7 | 2025-08-07 |
| 🥈 2 | GPT-o3 | OpenAI | 44.8 | 92.3 | 20.2 | 21.9 | 2025-04-16 |
| 🥉 3 | Gemini-2.5-pro | Google | 43.5 | 89.1 | 19.1 | 22.2 | 2025-06-17 |
| 4 | GPT-5-mini | OpenAI | 43.5 | 96.7 | 15.7 | 18.0 | 2025-08-07 |
| 5 | GPT-oss-120b | OpenAI | 43.4 | 90.1 | 19.7 | 20.3 | 2025-08-21 |
| 6 | GPT-o4-mini (high) | OpenAI | 43.0 | 87.8 | 19.8 | 21.4 | 2025-04-16 |
| 7 | Qwen3-235B-A22B-2507 (Thinking) | Alibaba | 42.3 | 84.5 | 20.0 | 22.4 | 2025-07-25 |
| 8 | Grok-4-0709 | xAI | 42.0 | 83.9 | 19.8 | 22.4 | 2025-07-09 |
| 9 | Gemini-2.5-flash | Google | 42.0 | 92.8 | 16.5 | 16.7 | 2025-06-17 |
| 10 | Qwen3-Coder-480B-A35B-Instruct | Alibaba | 41.4 | 85.3 | 18.3 | 20.5 | 2025-07-23 |
| 11 | DeepSeek-V3-0324 | DeepSeek | 41.2 | 83.7 | 19.3 | 20.5 | 2025-03-24 |
| 12 | Qwen3-235B-A22B-Instruct-2507 | Alibaba | 41.1 | 85.3 | 18.2 | 19.7 | 2025-07-21 |
| 13 | DeepSeek-V3.1 | DeepSeek | 40.9 | 83.1 | 19.3 | 20.2 | 2025-08-21 |
| 14 | Claude-Sonnet-4-20250514-Thinking | Anthropic | 40.5 | 90.3 | 14.4 | 16.9 | 2025-05-14 |
| 15 | Seed-OSS-36B-Instruct | ByteDance | 40.3 | 88.3 | 16.4 | 16.2 | 2025-08-21 |
| 16 | GLM-4.5 | Zhipu AI | 40.0 | 84.7 | 17.0 | 18.3 | 2025-07-28 |

---

# IV. Contact Us

For more details, please visit the **SKYLENAGE Platform**: https://skylenage.alibaba-inc.com/sla/home

Contact us: **skylenage@service.alibaba.com**
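The leaderboard reports a Total alongside the Code, Screenshot, and Video scores but does not state how Total is aggregated. Every row is consistent with Total being the unweighted mean of the three sub-scores rounded to one decimal place; for GPT-5, for example, (96.6 + 17.6 + 20.7) / 3 ≈ 44.97, reported as 45.0. The Python sketch below checks that reading against the published numbers. The aggregation rule and the helper names `mean_score` and `mismatches` are assumptions for illustration, not the benchmark's official scoring code.

```python
"""Check whether the leaderboard Total equals the mean of the three sub-scores.

Assumption (not stated by the benchmark authors): Total = (Code + Screenshot + Video) / 3,
reported to one decimal place.
"""

# (model, total, code, screenshot, video), copied from the leaderboard
LEADERBOARD = [
    ("GPT-5-20250807", 45.0, 96.6, 17.6, 20.7),
    ("GPT-o3", 44.8, 92.3, 20.2, 21.9),
    ("Gemini-2.5-pro", 43.5, 89.1, 19.1, 22.2),
    ("GPT-5-mini", 43.5, 96.7, 15.7, 18.0),
    ("GPT-oss-120b", 43.4, 90.1, 19.7, 20.3),
    ("GPT-o4-mini (high)", 43.0, 87.8, 19.8, 21.4),
    ("Qwen3-235B-A22B-2507 (Thinking)", 42.3, 84.5, 20.0, 22.4),
    ("Grok-4-0709", 42.0, 83.9, 19.8, 22.4),
    ("Gemini-2.5-flash", 42.0, 92.8, 16.5, 16.7),
    ("Qwen3-Coder-480B-A35B-Instruct", 41.4, 85.3, 18.3, 20.5),
    ("DeepSeek-V3-0324", 41.2, 83.7, 19.3, 20.5),
    ("Qwen3-235B-A22B-Instruct-2507", 41.1, 85.3, 18.2, 19.7),
    ("DeepSeek-V3.1", 40.9, 83.1, 19.3, 20.2),
    ("Claude-Sonnet-4-20250514-Thinking", 40.5, 90.3, 14.4, 16.9),
    ("Seed-OSS-36B-Instruct", 40.3, 88.3, 16.4, 16.2),
    ("GLM-4.5", 40.0, 84.7, 17.0, 18.3),
]


def mean_score(code: float, screenshot: float, video: float) -> float:
    """Unweighted mean of the three sub-scores (hypothetical aggregation rule)."""
    return (code + screenshot + video) / 3


def mismatches(rows, tol: float = 0.05 + 1e-9) -> list:
    """Return the models whose reported Total differs from the recomputed mean
    by more than one-decimal rounding error allows."""
    bad = []
    for model, total, code, shot, video in rows:
        if abs(mean_score(code, shot, video) - total) > tol:
            bad.append(model)
    return bad


if __name__ == "__main__":
    for model, total, code, shot, video in LEADERBOARD:
        recomputed = mean_score(code, shot, video)
        print(f"{model:<36} reported={total:5.1f} recomputed={recomputed:6.2f}")
    print("mismatches:", mismatches(LEADERBOARD))
```

With the published numbers, `mismatches(LEADERBOARD)` returns an empty list. Note that ties in Total (ranks 3 and 4, and ranks 8 and 9) are ordered without a stated tie-break rule, so this check says nothing about how ties are resolved.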

Application scenarios: