AReaL-boba-RL-7B_eval_5554
Source: ModelScope community (魔搭社区). Updated 2025-11-19; indexed 2025-11-22.
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/AReaL-boba-RL-7B_eval_5554
Description:
# mlfoundations-dev/AReaL-boba-RL-7B_eval_5554
Precomputed model outputs for evaluation.
## Evaluation Results
### Summary
| Metric | AIME24 | AMC23 | MATH500 | MMLUPro | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | HLE | HMMT | AIME25 | LiveCodeBenchv5 |
|--------|------|-----|-------|-------|--------|-----------|-------------|-------|----------|---|----|------|---------------|
| Accuracy | 61.0 | 92.8 | 89.2 | 41.6 | 53.3 | 52.7 | 46.1 | 21.5 | 20.7 | 12.1 | 30.0 | 44.3 | 30.4 |
### AIME24
- **Average Accuracy**: 61.00% ± 2.36%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 60.00% | 18 | 30 |
| 2 | 63.33% | 19 | 30 |
| 3 | 56.67% | 17 | 30 |
| 4 | 56.67% | 17 | 30 |
| 5 | 50.00% | 15 | 30 |
| 6 | 60.00% | 18 | 30 |
| 7 | 53.33% | 16 | 30 |
| 8 | 63.33% | 19 | 30 |
| 9 | 76.67% | 23 | 30 |
| 10 | 70.00% | 21 | 30 |
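
The page does not say how the ± figure is computed. As a sanity check, the sketch below assumes it is the standard error of the mean, using the population standard deviation across runs (an assumption, not something the source states). Under that assumption, the AIME24 per-run counts above reproduce the reported 61.00% ± 2.36%. The function name `mean_and_stderr` is illustrative only.

```python
import math
import statistics


def mean_and_stderr(solved, total):
    """Return (mean accuracy %, standard error %) across runs."""
    accs = [100.0 * s / total for s in solved]
    mean = statistics.fmean(accs)
    # Assumption: population std (ddof=0) divided by sqrt(number of runs).
    stderr = statistics.pstdev(accs) / math.sqrt(len(accs))
    return mean, stderr


# "Questions Solved" column of the AIME24 table, runs 1 to 10.
aime24_solved = [18, 19, 17, 17, 15, 18, 16, 19, 23, 21]
mean, stderr = mean_and_stderr(aime24_solved, total=30)
print(f"{mean:.2f}% ± {stderr:.2f}%")
```

Using the sample standard deviation (`statistics.stdev`) instead would give a slightly larger ± value, so the choice matters when comparing against other reports.
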
### AMC23
- **Average Accuracy**: 92.75% ± 0.66%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 87.50% | 35 | 40 |
| 2 | 95.00% | 38 | 40 |
| 3 | 92.50% | 37 | 40 |
| 4 | 95.00% | 38 | 40 |
| 5 | 95.00% | 38 | 40 |
| 6 | 92.50% | 37 | 40 |
| 7 | 92.50% | 37 | 40 |
| 8 | 92.50% | 37 | 40 |
| 9 | 92.50% | 37 | 40 |
| 10 | 92.50% | 37 | 40 |
### MATH500
- **Accuracy**: 89.20%
| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 89.20% | 446 | 500 |
### MMLUPro
- **Accuracy**: 41.61%
| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 41.61% | N/A | N/A |
### JEEBench
- **Average Accuracy**: 53.27% ± 0.28%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 53.59% | 276.0 | 515 |
| 2 | 53.64% | 276.25 | 515 |
| 3 | 52.57% | 270.75 | 515 |
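
JEEBench is the only benchmark here with fractional "Questions Solved" values. The source does not explain them; partial-credit scoring is one plausible reading, but that is an assumption. Whatever their origin, the reported accuracies follow from solved / total, as this small sketch (helper name hypothetical) shows:

```python
def accuracy_pct(solved, total):
    """Accuracy as a percentage; `solved` may be fractional."""
    return 100.0 * solved / total


# (Questions Solved, Total Questions) for JEEBench runs 1 to 3.
jeebench_runs = [(276.0, 515), (276.25, 515), (270.75, 515)]
per_run = [accuracy_pct(s, t) for s, t in jeebench_runs]

for i, acc in enumerate(per_run, start=1):
    print(f"run {i}: {acc:.2f}%")
print(f"average: {sum(per_run) / len(per_run):.2f}%")
```
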
### GPQADiamond
- **Average Accuracy**: 52.69% ± 0.27%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 52.02% | 103 | 198 |
| 2 | 53.03% | 105 | 198 |
| 3 | 53.03% | 105 | 198 |
### LiveCodeBench
- **Average Accuracy**: 46.12% ± 0.33%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 45.79% | 234 | 511 |
| 2 | 45.79% | 234 | 511 |
| 3 | 46.77% | 239 | 511 |
### CodeElo
- **Average Accuracy**: 21.48% ± 0.15%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 21.74% | 85 | 391 |
| 2 | 21.48% | 84 | 391 |
| 3 | 21.23% | 83 | 391 |
### CodeForces
- **Average Accuracy**: 20.68% ± 0.27%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 20.31% | 92 | 453 |
| 2 | 20.53% | 93 | 453 |
| 3 | 21.19% | 96 | 453 |
### HLE
- **Average Accuracy**: 12.09% ± 0.88%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 11.89% | 61 | 513 |
| 2 | 10.33% | 53 | 513 |
| 3 | 14.04% | 72 | 513 |
### HMMT
- **Average Accuracy**: 30.00% ± 1.33%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 36.67% | 11 | 30 |
| 2 | 30.00% | 9 | 30 |
| 3 | 30.00% | 9 | 30 |
| 4 | 30.00% | 9 | 30 |
| 5 | 33.33% | 10 | 30 |
| 6 | 33.33% | 10 | 30 |
| 7 | 23.33% | 7 | 30 |
| 8 | 23.33% | 7 | 30 |
| 9 | 26.67% | 8 | 30 |
| 10 | 33.33% | 10 | 30 |
### AIME25
- **Average Accuracy**: 44.33% ± 1.57%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 40.00% | 12 | 30 |
| 2 | 43.33% | 13 | 30 |
| 3 | 36.67% | 11 | 30 |
| 4 | 43.33% | 13 | 30 |
| 5 | 50.00% | 15 | 30 |
| 6 | 40.00% | 12 | 30 |
| 7 | 50.00% | 15 | 30 |
| 8 | 50.00% | 15 | 30 |
| 9 | 40.00% | 12 | 30 |
| 10 | 50.00% | 15 | 30 |
### LiveCodeBenchv5
- **Average Accuracy**: 30.44% ± 1.04%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 29.27% | 108 | 369 |
| 2 | 32.52% | 120 | 369 |
| 3 | 29.54% | 109 | 369 |
Provider:
maas
Created:
2025-10-04



