coding-eval
Source: ModelScope community. Updated 2026-01-09. Indexed 2025-11-22.
Download link:
https://modelscope.cn/datasets/nex-agi/coding-eval
Resource description:
# Model Evaluation Repository
## Overview
To evaluate our model's performance, we have constructed a comprehensive evaluation dataset covering diverse practical scenarios. This dataset combines:
1. **Tasks from [CC-Bench](https://huggingface.co/datasets/zai-org/CC-Bench-trajectories)**: We selected tasks where we could match input source files to the projects and where the queries were clear and well-defined.
2. **Internal Testing Data**: We supplemented the dataset with additional tasks to increase diversity, primarily complex project generation from requirement documents, mini-app development, code migration between languages and frameworks, and more.
## Dataset Composition
Our evaluation dataset spans **13 distinct categories** with 42 test cases in total. The distribution across categories is as follows:
| Category | Description | Test Cases |
|----------|-------------|------------|
| **frontend** | Frontend development tasks including React, Vue, and UI components | 8 |
| **data_analysis** | Data analysis and visualization tasks with various datasets | 5 |
| **exchange** | Code migration and framework conversion tasks | 4 |
| **fullstack** | Full-stack application development scenarios | 4 |
| **html** | HTML/CSS static page development | 4 |
| **ma** | Mini-app development | 4 |
| **svg** | SVG graphics and visualization generation | 3 |
| **test** | Test case generation and testing framework tasks | 3 |
| **crawler** | Web scraping and data collection tasks | 2 |
| **prd** | Product requirements document processing and analysis | 2 |
| **machinelearning** | Machine learning model training and inference | 1 |
| **backend** | Backend service development and API creation | 1 |
| **game** | Game development and interactive application tasks | 1 |
## Model Performance
Our model (Nex-N1) demonstrates competitive performance across all evaluation scenarios, showing particularly strong results in practical coding tasks:
<div align="center">
<img src="evaluation_result.png" width="70%">
</div>
## Repository Structure
### Data Files
- **`evaluation_traces.jsonl`**: Complete inference traces for all evaluated models
- **`query_file_map.json`**: Index mapping task IDs to required input files
- **`vibecoding-test-files`**: Processed trace data for various evaluation scenarios
### Evaluation Workflow
Each evaluation task is identified by a unique ID in the format `{category}-{number}` (e.g., `frontend-001`, `data_analysis-003`).
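As a minimal sketch of working with this ID format, the helpers below split an ID into its category and number and rebuild it. The names `TaskId`, `parse_task_id`, and `format_task_id` are hypothetical and not part of the repository, and the three-digit zero padding is inferred only from the two examples above.

```python
from typing import NamedTuple


class TaskId(NamedTuple):
    category: str
    number: int


def parse_task_id(task_id: str) -> TaskId:
    """Split a `{category}-{number}` ID into its two parts."""
    # Split on the last hyphen only, so that a category such as
    # `data_analysis` is kept whole.
    category, sep, number = task_id.rpartition("-")
    if not sep or not category or not number.isdecimal():
        raise ValueError(f"not a valid task ID: {task_id!r}")
    return TaskId(category, int(number))


def format_task_id(category: str, number: int) -> str:
    """Rebuild an ID; three-digit padding is an assumption from the examples."""
    return f"{category}-{number:03d}"


print(parse_task_id("data_analysis-003"))
print(format_task_id("frontend", 1))
```

Splitting on the last hyphen rather than the first keeps the parser correct even if a category name were ever to contain a hyphen itself.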
The evaluation process follows these steps:
1. **Task Identification**: Read task details from `traces_all_vb_eval.jsonl` using the task ID
2. **Input File Resolution**: Look up the task ID in `query_file_map.json` to find the input files the task needs, if any
3. **Workspace Setup**: Copy the corresponding input files into the evaluation workspace
4. **Model Execution**: Run the model with the task query and input files
5. **Result Evaluation**: Compare model output against expected behavior and success criteria
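Steps 1 to 3 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual harness: the function names are hypothetical, the JSONL field holding the task ID (`id_field`, defaulting to `"id"`) is a guess, and the location of the input files (`inputs_root`) is left to the caller. File paths are passed in as arguments rather than hard-coded. Steps 4 and 5 depend on the model runner and success criteria, which are not described here, so they are omitted.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def load_task(traces_path: Path, task_id: str, id_field: str = "id") -> dict:
    """Step 1: scan a JSONL file for the record whose ID matches.

    `id_field` is an assumed field name; adjust it to the real schema.
    """
    with traces_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get(id_field) == task_id:
                return record
    raise KeyError(task_id)


def resolve_input(map_path: Path, task_id: str) -> Optional[str]:
    """Step 2: return the mapped input name, or None if the task has none."""
    mapping = json.loads(map_path.read_text(encoding="utf-8"))
    return mapping.get(task_id)


def setup_workspace(inputs_root: Path, input_name: Optional[str],
                    workspace: Path) -> Path:
    """Step 3: copy the input file or project directory into the workspace."""
    workspace.mkdir(parents=True, exist_ok=True)
    if input_name is None:
        return workspace  # nothing to copy for this task
    source = inputs_root / input_name
    target = workspace / input_name
    if source.is_dir():
        # Inputs such as whole projects are directories.
        shutil.copytree(source, target, dirs_exist_ok=True)
    else:
        # Inputs such as a single CSV are plain files.
        shutil.copy2(source, target)
    return workspace
```

Handling both files and directories reflects the index example, where some tasks map to a single file (`titanic.csv`) and others to what appear to be project folders.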
#### Example Index Structure (`query_file_map.json`):
```json
{
"exchange-001": "Homepage-main",
"data_analysis-001": "titanic.csv",
"frontend-001": "react-redux-realworld-example-app",
"fullstack-001": "vueBlog",
"test-001": "react-redux-realworld-example-app",
...
}
```
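Because several tasks can point at the same input, it can be handy to invert the index and group task IDs by input. The sketch below uses only the five entries shown above (the elided entries are not reproduced), and `tasks_by_input` is a hypothetical helper name.

```python
from collections import defaultdict

# Only the entries visible in the example; the real file contains more.
query_file_map = {
    "exchange-001": "Homepage-main",
    "data_analysis-001": "titanic.csv",
    "frontend-001": "react-redux-realworld-example-app",
    "fullstack-001": "vueBlog",
    "test-001": "react-redux-realworld-example-app",
}


def tasks_by_input(mapping: dict) -> dict:
    """Group task IDs by the input file or project they map to."""
    grouped = defaultdict(list)
    for task_id, input_name in mapping.items():
        grouped[input_name].append(task_id)
    return dict(grouped)


# Keep only inputs that more than one task relies on.
shared = {
    name: ids
    for name, ids in tasks_by_input(query_file_map).items()
    if len(ids) > 1
}
print(shared)
# {'react-redux-realworld-example-app': ['frontend-001', 'test-001']}
```

A grouping like this shows which inputs need a fresh copy per task, since a shared project modified by one task should not leak into another task's workspace.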
Provider:
maas
Created:
2025-11-19



