
Qwen2.5-1.5B-Instruct_eval_5554

ModelScope community · Updated 2025-11-24 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/Qwen2.5-1.5B-Instruct_eval_5554
Description:
# mlfoundations-dev/Qwen2.5-1.5B-Instruct_eval_5554

Precomputed model outputs for evaluation.

## Evaluation Results

### Summary

| Metric | AIME24 | AMC23 | MATH500 | MMLUPro | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | HLE | HMMT | AIME25 | LiveCodeBenchv5 |
|--------|------|-----|-------|-------|--------|-----------|-------------|-------|----------|---|----|------|---------------|
| Accuracy (%) | 3.0 | 30.8 | 50.2 | 32.5 | 16.4 | 24.7 | 5.5 | 0.8 | 2.2 | 15.3 | 0.0 | 0.7 | 5.1 |

### AIME24

- **Average Accuracy**: 3.00% ± 0.88%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 3.33% | 1 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 6.67% | 2 | 30 |
| 5 | 6.67% | 2 | 30 |
| 6 | 3.33% | 1 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 3.33% | 1 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 6.67% | 2 | 30 |

### AMC23

- **Average Accuracy**: 30.75% ± 1.54%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 32.50% | 13 | 40 |
| 2 | 27.50% | 11 | 40 |
| 3 | 27.50% | 11 | 40 |
| 4 | 30.00% | 12 | 40 |
| 5 | 32.50% | 13 | 40 |
| 6 | 37.50% | 15 | 40 |
| 7 | 22.50% | 9 | 40 |
| 8 | 30.00% | 12 | 40 |
| 9 | 27.50% | 11 | 40 |
| 10 | 40.00% | 16 | 40 |

### MATH500

- **Accuracy**: 50.20%

| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 50.20% | 251 | 500 |

### MMLUPro

- **Accuracy**: 32.50%

| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 32.50% | N/A | N/A |

### JEEBench

- **Average Accuracy**: 16.36% ± 0.97%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 18.59% | 95.75 | 515 |
| 2 | 15.97% | 82.25 | 515 |
| 3 | 14.51% | 74.75 | 515 |

### GPQADiamond

- **Average Accuracy**: 24.75% ± 4.21%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 15.15% | 30 | 198 |
| 2 | 26.26% | 52 | 198 |
| 3 | 32.83% | 65 | 198 |

### LiveCodeBench

- **Average Accuracy**: 5.54% ± 2.48%
- **Number of Runs**: 6

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 11.35% | 58 | 511 |
| 2 | 10.57% | 54 | 511 |
| 3 | 11.35% | 58 | 511 |
| 4 | 0.00% | 0 | 511 |
| 5 | 0.00% | 0 | 511 |
| 6 | 0.00% | 0 | 511 |

### CodeElo

- **Average Accuracy**: 0.77% ± 0.15%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 1.02% | 4 | 391 |
| 2 | 0.51% | 2 | 391 |
| 3 | 0.77% | 3 | 391 |

### CodeForces

- **Average Accuracy**: 2.21% ± 0.22%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 1.77% | 8 | 453 |
| 2 | 2.43% | 11 | 453 |
| 3 | 2.43% | 11 | 453 |

### HLE

- **Average Accuracy**: 15.33% ± 0.87%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 17.15% | 88 | 513 |
| 2 | 13.45% | 69 | 513 |
| 3 | 15.40% | 79 | 513 |

### HMMT

- **Average Accuracy**: 0.00% ± 0.00%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 0.00% | 0 | 30 |

### AIME25

- **Average Accuracy**: 0.67% ± 0.63%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 6.67% | 2 | 30 |

### LiveCodeBenchv5

- **Average Accuracy**: 5.06% ± 1.01%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 4.34% | 16 | 369 |
| 2 | 3.79% | 14 | 369 |
| 3 | 7.05% | 26 | 369 |
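
The per-run tables above contain enough information to recompute each benchmark's headline figure. The following is a minimal sketch for AIME24 and GPQADiamond. The function name `summarize_runs` is hypothetical, and reading the ± value as a standard error computed from the population standard deviation (divisor n rather than n - 1) is an assumption inferred from the reported numbers, not something the dataset card states.

```python
import math


def summarize_runs(solved, total):
    """Return (mean accuracy in %, spread in %) across evaluation runs.

    `solved` lists the questions solved in each run; `total` is the
    number of questions per run.
    """
    accuracies = [100.0 * s / total for s in solved]
    n = len(accuracies)
    mean = sum(accuracies) / n
    # Assumption: population variance (divide by n), then standard error.
    variance = sum((a - mean) ** 2 for a in accuracies) / n
    spread = math.sqrt(variance) / math.sqrt(n)
    return mean, spread


# Per-run "Questions Solved" values copied from the tables.
aime24_solved = [1, 0, 0, 2, 2, 1, 0, 1, 0, 2]
gpqa_solved = [30, 52, 65]

aime24_mean, aime24_spread = summarize_runs(aime24_solved, 30)
gpqa_mean, gpqa_spread = summarize_runs(gpqa_solved, 198)

print(f"AIME24: {aime24_mean:.2f}% ± {aime24_spread:.2f}%")
print(f"GPQADiamond: {gpqa_mean:.2f}% ± {gpqa_spread:.2f}%")
```

Under this assumption the sketch reproduces the reported 3.00% ± 0.88% for AIME24 and 24.75% ± 4.21% for GPQADiamond. With so few runs per benchmark, the spread is only a rough indication of run-to-run variability.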

Provider:
maas
Created:
2025-10-03