nex-agi/coding-eval

Source: Hugging Face | Updated: 2025-11-19 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/nex-agi/coding-eval
Description:
---
license: apache-2.0
---

# Model Evaluation Repository

## Overview

To evaluate our model's performance, we have constructed a comprehensive evaluation dataset covering diverse practical scenarios. This dataset combines:

1. **Tasks from [CC-Bench](https://huggingface.co/datasets/zai-org/CC-Bench-trajectories)**: We selected tasks where we could match input source files to the projects and where the queries were clear and well-defined.
2. **Internal Testing Data**: We supplemented the dataset with additional tasks to increase data diversity, primarily including complex project generation based on requirement documents, mini-app development, code language exchange, and more.

## Dataset Composition

Our evaluation dataset spans **13 distinct categories**. The distribution across categories is as follows:

| Category | Description | Test Cases |
|----------|-------------|------------|
| **frontend** | Frontend development tasks including React, Vue, and UI components | 8 |
| **data_analysis** | Data analysis and visualization tasks with various datasets | 5 |
| **exchange** | Code migration and framework conversion tasks | 4 |
| **fullstack** | Full-stack application development scenarios | 4 |
| **html** | HTML/CSS static page development | 4 |
| **ma** | Mini-app development | 4 |
| **svg** | SVG graphics and visualization generation | 3 |
| **test** | Test case generation and testing framework tasks | 3 |
| **crawler** | Web scraping and data collection tasks | 2 |
| **prd** | Product requirements document processing and analysis | 2 |
| **machinelearning** | Machine learning model training and inference | 1 |
| **backend** | Backend service development and API creation | 1 |
| **game** | Game development and interactive application tasks | 1 |

## Model Performance

Our model (Nex-N1) demonstrates competitive performance across all evaluation scenarios, showing particularly strong results in practical coding tasks:

<div align="center">
  <img src="./assets/evaluation_result.png" width="70%">
</div>

## Repository Structure

### Data Files

- **`vibecoding_evaluation/evaluation_traces.jsonl`**: Complete inference traces for all evaluated models
- **`vibecoding_evaluation/query_file_map.json`**: Index mapping task IDs to required input files
- **`vibecoding_evaluation/vibecoding-test-files`**: Processed trace data for various evaluation scenarios

### Evaluation Workflow

Each evaluation task is identified by a unique ID in the format `{category}-{number}` (e.g., `frontend-001`, `data_analysis-003`). The evaluation process follows these steps:

1. **Task Identification**: Read task details from `traces_all_vb_eval.jsonl` using the task ID
2. **Input File Resolution**: Use `query_file_map.json` to identify required input files for the task (if required)
3. **Workspace Setup**: Copy the corresponding input files into the evaluation workspace
4. **Model Execution**: Run the model with the task query and input files
5. **Result Evaluation**: Compare model output against expected behavior and success criteria

#### Example Index Structure (`query_file_map.json`):

```json
{
  "exchange-001": "Homepage-main",
  "data_analysis-001": "titanic.csv",
  "frontend-001": "react-redux-realworld-example-app",
  "fullstack-001": "vueBlog",
  "test-001": "react-redux-realworld-example-app",
  ...
}
```
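
Steps 2 and 3 of the workflow above (input file resolution and workspace setup) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's own tooling: the helper names `parse_task_id` and `setup_workspace`, the task-ID regular expression, and the idea that every mapped name is a single file or directory sitting directly under the test-files directory are all hypothetical.

```python
import json
import re
import shutil
from pathlib import Path

# Assumed ID shape, taken from the examples `frontend-001` and `data_analysis-003`.
TASK_ID_RE = re.compile(r"^(?P<category>[a-z_]+)-(?P<number>\d+)$")


def parse_task_id(task_id):
    """Split a `{category}-{number}` task ID into (category, number)."""
    match = TASK_ID_RE.match(task_id)
    if match is None:
        raise ValueError(f"unexpected task ID format: {task_id!r}")
    return match.group("category"), int(match.group("number"))


def load_file_map(path):
    """Load the task-ID -> input-name index (e.g. query_file_map.json)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def setup_workspace(task_id, file_map, files_root, workspace):
    """Copy the input mapped to `task_id` into `workspace`.

    Returns the copied path, or None when the task needs no input files.
    Assumption: the mapped name lives directly under `files_root`.
    """
    parse_task_id(task_id)  # validate the ID before touching the filesystem
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)

    name = file_map.get(task_id)
    if name is None:
        return None  # step 2 says input files are only resolved "if required"

    source = Path(files_root) / name
    target = workspace / name
    if source.is_dir():
        # Project inputs such as `vueBlog` are copied as whole directories.
        shutil.copytree(source, target, dirs_exist_ok=True)
    else:
        # Single-file inputs such as `titanic.csv`.
        shutil.copy2(source, target)
    return target
```

A call such as `setup_workspace("data_analysis-001", file_map, "vibecoding_evaluation/vibecoding-test-files", "workspace")` would then place `titanic.csv` in the workspace, provided the directory layout matches the assumption above.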
Provider:
nex-agi