Qwen2.5-1.5B-Instruct_eval_5554
ModelScope community | Updated: 2025-11-24 | Indexed: 2025-11-03
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/Qwen2.5-1.5B-Instruct_eval_5554
Resource description:
# mlfoundations-dev/Qwen2.5-1.5B-Instruct_eval_5554
Precomputed model outputs for evaluation.
## Evaluation Results
### Summary
| Metric | AIME24 | AMC23 | MATH500 | MMLUPro | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | HLE | HMMT | AIME25 | LiveCodeBenchv5 |
|--------|------|-----|-------|-------|--------|-----------|-------------|-------|----------|---|----|------|---------------|
| Accuracy (%) | 3.0 | 30.8 | 50.2 | 32.5 | 16.4 | 24.7 | 5.5 | 0.8 | 2.2 | 15.3 | 0.0 | 0.7 | 5.1 |
### AIME24
- **Average Accuracy**: 3.00% ± 0.88%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 3.33% | 1 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 6.67% | 2 | 30 |
| 5 | 6.67% | 2 | 30 |
| 6 | 3.33% | 1 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 3.33% | 1 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 6.67% | 2 | 30 |
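
The card does not say how the ± value is derived. The reported figures appear consistent with the standard error of the mean computed from the population standard deviation (dividing by the number of runs rather than by runs minus one); this is an inference from the numbers, not a documented method. A minimal sketch under that assumption, where `mean_and_stderr` is a hypothetical helper name:

```python
import math

def mean_and_stderr(accuracies):
    """Return (mean, standard error), using the population std (ddof=0).

    Assumption: this matches how the card's "±" value was produced;
    the card itself does not document the formula.
    """
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / n
    return mean, math.sqrt(variance) / math.sqrt(n)

# Questions solved per AIME24 run, taken from the table above (30 questions each).
aime24_solved = [1, 0, 0, 2, 2, 1, 0, 1, 0, 2]
aime24 = [100 * s / 30 for s in aime24_solved]

mean, stderr = mean_and_stderr(aime24)
print(f"{mean:.2f}% ± {stderr:.2f}%")  # 3.00% ± 0.88%
```

The same helper applied to the AMC23 per-run accuracies gives 30.75% ± 1.54%, which agrees with the value reported in that section.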
### AMC23
- **Average Accuracy**: 30.75% ± 1.54%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 32.50% | 13 | 40 |
| 2 | 27.50% | 11 | 40 |
| 3 | 27.50% | 11 | 40 |
| 4 | 30.00% | 12 | 40 |
| 5 | 32.50% | 13 | 40 |
| 6 | 37.50% | 15 | 40 |
| 7 | 22.50% | 9 | 40 |
| 8 | 30.00% | 12 | 40 |
| 9 | 27.50% | 11 | 40 |
| 10 | 40.00% | 16 | 40 |
### MATH500
- **Accuracy**: 50.20%
| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 50.20% | 251 | 500 |
### MMLUPro
- **Accuracy**: 32.50%
| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 32.50% | N/A | N/A |
### JEEBench
- **Average Accuracy**: 16.36% ± 0.97%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 18.59% | 95.75 | 515 |
| 2 | 15.97% | 82.25 | 515 |
| 3 | 14.51% | 74.75 | 515 |
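
The "Questions Solved" values for JEEBench are fractional, which suggests the benchmark awards partial credit on some questions; the card does not state this, so treat it as an assumption. Either way, the per-run accuracy is simply the score divided by the total, as this short check shows:

```python
# (score, total) per JEEBench run, copied from the table above.
# Assumption: fractional scores come from partial-credit marking.
jeebench_runs = [(95.75, 515), (82.25, 515), (74.75, 515)]

accuracies = [round(100 * score / total, 2) for score, total in jeebench_runs]
print(accuracies)  # [18.59, 15.97, 14.51]
```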
### GPQADiamond
- **Average Accuracy**: 24.75% ± 4.21%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 15.15% | 30 | 198 |
| 2 | 26.26% | 52 | 198 |
| 3 | 32.83% | 65 | 198 |
### LiveCodeBench
- **Average Accuracy**: 5.54% ± 2.48%
- **Number of Runs**: 6
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 11.35% | 58 | 511 |
| 2 | 10.57% | 54 | 511 |
| 3 | 11.35% | 58 | 511 |
| 4 | 0.00% | 0 | 511 |
| 5 | 0.00% | 0 | 511 |
| 6 | 0.00% | 0 | 511 |
### CodeElo
- **Average Accuracy**: 0.77% ± 0.15%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 1.02% | 4 | 391 |
| 2 | 0.51% | 2 | 391 |
| 3 | 0.77% | 3 | 391 |
### CodeForces
- **Average Accuracy**: 2.21% ± 0.22%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 1.77% | 8 | 453 |
| 2 | 2.43% | 11 | 453 |
| 3 | 2.43% | 11 | 453 |
### HLE
- **Average Accuracy**: 15.33% ± 0.87%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 17.15% | 88 | 513 |
| 2 | 13.45% | 69 | 513 |
| 3 | 15.40% | 79 | 513 |
### HMMT
- **Average Accuracy**: 0.00% ± 0.00%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 0.00% | 0 | 30 |
### AIME25
- **Average Accuracy**: 0.67% ± 0.63%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 6.67% | 2 | 30 |
### LiveCodeBenchv5
- **Average Accuracy**: 5.06% ± 1.01%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 4.34% | 16 | 369 |
| 2 | 3.79% | 14 | 369 |
| 3 | 7.05% | 26 | 369 |
Provider:
maas
Created:
2025-10-03



