nex-agi/coding-eval

Name: nex-agi/coding-eval
Creator: nex-agi
Published: 2025-11-19 03:49:23
License: No description provided

Source: Hugging Face. Updated 2025-11-19. Indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/nex-agi/coding-eval


Resource description:

---
license: apache-2.0
---

# Model Evaluation Repository

## Overview

To evaluate our model's performance, we have constructed a comprehensive evaluation dataset covering diverse practical scenarios. This dataset combines:

1. **Tasks from [CC-Bench](https://huggingface.co/datasets/zai-org/CC-Bench-trajectories)**: We selected tasks where we could match input source files to the projects and where the queries were clear and well-defined.
2. **Internal Testing Data**: We supplemented the dataset with additional tasks to increase data diversity. These mainly cover complex project generation based on requirement documents, mini-app development, and code language exchange.

## Dataset Composition

Our evaluation dataset spans **13 distinct categories**. The distribution across categories is as follows:

| Category | Description | Test Cases |
|----------|-------------|------------|
| **frontend** | Frontend development tasks including React, Vue, and UI components | 8 |
| **data_analysis** | Data analysis and visualization tasks with various datasets | 5 |
| **exchange** | Code migration and framework conversion tasks | 4 |
| **fullstack** | Full-stack application development scenarios | 4 |
| **html** | HTML/CSS static page development | 4 |
| **ma** | Mini-app development | 4 |
| **svg** | SVG graphics and visualization generation | 3 |
| **test** | Test case generation and testing framework tasks | 3 |
| **crawler** | Web scraping and data collection tasks | 2 |
| **prd** | Product requirements document processing and analysis | 2 |
| **machinelearning** | Machine learning model training and inference | 1 |
| **backend** | Backend service development and API creation | 1 |
| **game** | Game development and interactive application tasks | 1 |

## Model Performance

Our model (Nex-N1) demonstrates competitive performance across all evaluation scenarios, showing particularly strong results in practical coding tasks:

<div align="center">
<img src="./assets/evaluation_result.png" width="70%">
</div>

## Repository Structure

### Data Files

- **`vibecoding_evaluation/evaluation_traces.jsonl`**: Complete inference traces for all evaluated models
- **`vibecoding_evaluation/query_file_map.json`**: Index mapping task IDs to required input files
- **`vibecoding_evaluation/vibecoding-test-files`**: Processed trace data for various evaluation scenarios

### Evaluation Workflow

Each evaluation task is identified by a unique ID in the format `{category}-{number}` (e.g., `frontend-001`, `data_analysis-003`). The evaluation process follows these steps:

1. **Task Identification**: Read task details from `traces_all_vb_eval.jsonl` using the task ID
2. **Input File Resolution**: Use `query_file_map.json` to identify the input files the task needs, if any
3. **Workspace Setup**: Copy the corresponding input files into the evaluation workspace
4. **Model Execution**: Run the model with the task query and input files
5. **Result Evaluation**: Compare model output against expected behavior and success criteria

#### Example Index Structure (`query_file_map.json`):

```json
{
  "exchange-001": "Homepage-main",
  "data_analysis-001": "titanic.csv",
  "frontend-001": "react-redux-realworld-example-app",
  "fullstack-001": "vueBlog",
  "test-001": "react-redux-realworld-example-app",
  ...
}
```
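Steps 2 and 3 of the workflow (input file resolution and workspace setup) could be sketched roughly as follows. This is a minimal illustration, not the repository's own tooling: the helper names `split_task_id`, `resolve_input`, and `setup_workspace` are hypothetical, and the sketch assumes that each value in `query_file_map.json` names a file or directory directly under a local input root such as `vibecoding_evaluation/vibecoding-test-files`. Step 1 is left out because the record schema of the traces file is not described here.

```python
import shutil
from pathlib import Path


def split_task_id(task_id: str) -> tuple[str, int]:
    """Split an ID such as 'data_analysis-003' into ('data_analysis', 3)."""
    # Categories may contain underscores, so split on the last hyphen only.
    category, _, number = task_id.rpartition("-")
    if not category or not number.isdigit():
        raise ValueError(f"unexpected task ID format: {task_id!r}")
    return category, int(number)


def resolve_input(task_id, file_map, files_root):
    """Return the input path for a task, or None if the index has no entry.

    Assumption: index values are names relative to files_root.
    """
    name = file_map.get(task_id)
    return None if name is None else Path(files_root) / name


def setup_workspace(task_id, file_map, files_root, workspace):
    """Create the workspace and copy the task's input file or project into it."""
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    source = resolve_input(task_id, file_map, files_root)
    if source is None:
        # Tasks without an index entry start from an empty workspace.
        return workspace
    target = workspace / source.name
    if source.is_dir():
        shutil.copytree(source, target, dirs_exist_ok=True)
    else:
        shutil.copy2(source, target)
    return workspace
```

To use it, load the index with `json.load` and call `setup_workspace("fullstack-001", file_map, files_root, workspace)`. With the example index above, that call would copy the `vueBlog` project into the workspace, provided the assumed directory layout holds.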

Provider:

nex-agi
