
coding-eval

ModelScope community · Updated 2026-01-09 · Indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/nex-agi/coding-eval
Resource overview:
# Model Evaluation Repository

## Overview

To evaluate our model's performance, we have constructed a comprehensive evaluation dataset covering diverse practical scenarios. This dataset combines:

1. **Tasks from [CC-Bench](https://huggingface.co/datasets/zai-org/CC-Bench-trajectories)**: We selected tasks where we could match input source files to the projects and where the queries were clear and well-defined.
2. **Internal Testing Data**: We supplemented the dataset with additional tasks to increase data diversity, primarily including complex project generation based on requirement documents, mini-app development, code language exchange, and more.

## Dataset Composition

Our evaluation dataset spans **13 distinct categories**. The distribution across categories is as follows:

| Category | Description | Test Cases |
|----------|-------------|------------|
| **frontend** | Frontend development tasks including React, Vue, and UI components | 8 |
| **data_analysis** | Data analysis and visualization tasks with various datasets | 5 |
| **exchange** | Code migration and framework conversion tasks | 4 |
| **fullstack** | Full-stack application development scenarios | 4 |
| **html** | HTML/CSS static page development | 4 |
| **ma** | Mini-app development | 4 |
| **svg** | SVG graphics and visualization generation | 3 |
| **test** | Test case generation and testing framework tasks | 3 |
| **crawler** | Web scraping and data collection tasks | 2 |
| **prd** | Product requirements document processing and analysis | 2 |
| **machinelearning** | Machine learning model training and inference | 1 |
| **backend** | Backend service development and API creation | 1 |
| **game** | Game development and interactive application tasks | 1 |

## Model Performance

Our model (Nex-N1) demonstrates competitive performance across all evaluation scenarios, showing particularly strong results in practical coding tasks:

<div align="center">
<img src="evaluation_result.png" width="70%">
</div>

## Repository Structure

### Data Files

- **`evaluation_traces.jsonl`**: Complete inference traces for all evaluated models
- **`query_file_map.json`**: Index mapping task IDs to required input files
- **`vibecoding-test-files`**: Processed trace data for various evaluation scenarios

### Evaluation Workflow

Each evaluation task is identified by a unique ID in the format `{category}-{number}` (e.g., `frontend-001`, `data_analysis-003`).

The evaluation process follows these steps:

1. **Task Identification**: Read task details from `traces_all_vb_eval.jsonl` using the task ID
2. **Input File Resolution**: Use `query_file_map.json` to identify required input files for the task (if required)
3. **Workspace Setup**: Copy the corresponding input files into the evaluation workspace
4. **Model Execution**: Run the model with the task query and input files
5. **Result Evaluation**: Compare model output against expected behavior and success criteria

#### Example Index Structure (`query_file_map.json`):

```json
{
  "exchange-001": "Homepage-main",
  "data_analysis-001": "titanic.csv",
  "frontend-001": "react-redux-realworld-example-app",
  "fullstack-001": "vueBlog",
  "test-001": "react-redux-realworld-example-app",
  ...
}
```
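
Steps 1 to 3 of the workflow above (task lookup, input file resolution, workspace setup) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the repository's own tooling: the record key `task_id`, the `input_root` directory that holds the input files, and all function names are hypothetical, so check them against the actual trace schema and layout before use.

```python
import json
import re
import shutil
from pathlib import Path
from typing import Optional, Tuple

# IDs follow the documented `{category}-{number}` format, e.g. `data_analysis-003`.
TASK_ID_RE = re.compile(r"^(?P<category>[a-z_]+)-(?P<number>\d+)$")


def parse_task_id(task_id: str) -> Tuple[str, int]:
    """Split a task ID into its category and numeric part."""
    match = TASK_ID_RE.match(task_id)
    if match is None:
        raise ValueError(f"unexpected task ID format: {task_id!r}")
    return match["category"], int(match["number"])


def find_task(traces_path: Path, task_id: str,
              id_field: str = "task_id") -> Optional[dict]:
    """Step 1: return the first JSONL record whose ID matches, else None.

    `id_field` is an assumed key name; the real schema may differ.
    """
    with traces_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get(id_field) == task_id:
                return record
    return None


def resolve_input(map_path: Path, task_id: str) -> Optional[str]:
    """Step 2: look up the input file or project name; None if no input is needed."""
    with map_path.open(encoding="utf-8") as handle:
        mapping = json.load(handle)
    return mapping.get(task_id)


def setup_workspace(input_root: Path, input_name: Optional[str],
                    workspace: Path) -> Path:
    """Step 3: copy the input file or project directory into the workspace.

    `input_root` is an assumed local directory containing the mapped inputs.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    if input_name is None:
        return workspace
    source = input_root / input_name
    target = workspace / input_name
    if source.is_dir():
        shutil.copytree(source, target, dirs_exist_ok=True)
    else:
        shutil.copy2(source, target)
    return workspace
```

Map values may name either a single file (`titanic.csv`) or a whole project (`vueBlog`), which is why the copy step branches on `is_dir()`. Tasks absent from the map are treated as needing no input, matching the "(if required)" note in step 2.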

Provider: maas
Created: 2025-11-19