coding-eval

Name: coding-eval
Creator: maas
Published: 2026-01-09 02:29:54
License: no description available

ModelScope community; updated 2026-01-09; indexed 2025-11-22

Download link:

https://modelscope.cn/datasets/nex-agi/coding-eval


Resource description:

# Model Evaluation Repository

## Overview

To evaluate our model's performance, we have constructed a comprehensive evaluation dataset covering diverse practical scenarios. This dataset combines:

1. **Tasks from [CC-Bench](https://huggingface.co/datasets/zai-org/CC-Bench-trajectories)**: We selected tasks where we could match input source files to the projects and where the queries were clear and well-defined.
2. **Internal Testing Data**: We supplemented the dataset with additional tasks to increase data diversity, primarily including complex project generation based on requirement documents, mini-app development, code language exchange, and more.

## Dataset Composition

Our evaluation dataset spans **13 distinct categories**. The distribution across categories is as follows:

| Category | Description | Test Cases |
|----------|-------------|------------|
| **frontend** | Frontend development tasks including React, Vue, and UI components | 8 |
| **data_analysis** | Data analysis and visualization tasks with various datasets | 5 |
| **exchange** | Code migration and framework conversion tasks | 4 |
| **fullstack** | Full-stack application development scenarios | 4 |
| **html** | HTML/CSS static page development | 4 |
| **ma** | Mini-app development | 4 |
| **svg** | SVG graphics and visualization generation | 3 |
| **test** | Test case generation and testing framework tasks | 3 |
| **crawler** | Web scraping and data collection tasks | 2 |
| **prd** | Product requirements document processing and analysis | 2 |
| **machinelearning** | Machine learning model training and inference | 1 |
| **backend** | Backend service development and API creation | 1 |
| **game** | Game development and interactive application tasks | 1 |

## Model Performance

Our model (Nex-N1) demonstrates competitive performance across all evaluation scenarios, showing particularly strong results in practical coding tasks:

<div align="center">
<img src="evaluation_result.png" width="70%">
</div>

## Repository Structure

### Data Files

- **`evaluation_traces.jsonl`**: Complete inference traces for all evaluated models
- **`query_file_map.json`**: Index mapping task IDs to required input files
- **`vibecoding-test-files`**: Processed trace data for various evaluation scenarios

### Evaluation Workflow

Each evaluation task is identified by a unique ID in the format `{category}-{number}` (e.g., `frontend-001`, `data_analysis-003`). The evaluation process follows these steps:

1. **Task Identification**: Read task details from `traces_all_vb_eval.jsonl` using the task ID
2. **Input File Resolution**: Use `query_file_map.json` to identify required input files for the task (if required)
3. **Workspace Setup**: Copy the corresponding input files into the evaluation workspace
4. **Model Execution**: Run the model with the task query and input files
5. **Result Evaluation**: Compare model output against expected behavior and success criteria

#### Example Index Structure (`query_file_map.json`):

```json
{
  "exchange-001": "Homepage-main",
  "data_analysis-001": "titanic.csv",
  "frontend-001": "react-redux-realworld-example-app",
  "fullstack-001": "vueBlog",
  "test-001": "react-redux-realworld-example-app",
  ...
}
```
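
Steps 2 and 3 of the workflow (input file resolution and workspace setup) could be sketched roughly as follows. This is a minimal illustration, not the repository's own tooling: the helper names `parse_task_id`, `load_file_map`, and `prepare_workspace`, the `source_root` directory that is assumed to hold the input files, and the assumption that each map value is a single file or directory name are all hypothetical. The record schema of the trace JSONL is not documented here, so reading task details is left out.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def parse_task_id(task_id: str) -> tuple[str, int]:
    """Split an ID such as 'data_analysis-003' into ('data_analysis', 3)."""
    # Split on the last hyphen; category names may contain underscores.
    category, sep, number = task_id.rpartition("-")
    if not sep or not category or not number.isdigit():
        raise ValueError(f"unexpected task ID format: {task_id!r}")
    return category, int(number)


def load_file_map(path: Path) -> dict[str, str]:
    """Load the task-ID-to-input index (assumed to be a flat JSON object)."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def prepare_workspace(
    task_id: str,
    file_map: dict[str, str],
    source_root: Path,  # hypothetical directory containing the input files
    workspace: Path,
) -> Optional[Path]:
    """Copy the task's input file or project into the workspace.

    Returns the copied path, or None when the task needs no input files.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    name = file_map.get(task_id)
    if name is None:
        return None
    src = source_root / name
    dst = workspace / name
    if src.is_dir():
        # Project inputs (e.g. a whole repository) are copied recursively.
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)
    return dst
```

Under these assumptions, a call such as `prepare_workspace("data_analysis-001", file_map, Path("inputs"), Path("workspace"))` would copy `titanic.csv` into the workspace, while a task ID that is absent from the map would leave the workspace empty and return `None`.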


Provider:

maas

Created:

2025-11-19

Collected summary

Dataset introduction