nandansarkar/base_model_on_log_odds_ranked_samples_eval_c693
Source: Hugging Face | Updated: 2025-12-13 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/nandansarkar/base_model_on_log_odds_ranked_samples_eval_c693
Resource description:
# nandansarkar/base_model_on_log_odds_ranked_samples_eval_c693
Precomputed model outputs for evaluation.
## Evaluation Results
### Summary
| Metric | AIME24 | AIME25 | GPQADiamond | JEEBench |
|--------|--------|--------|-------------|----------|
| Accuracy (%) | 9.8 | 5.0 | 34.2 | 31.0 |
### AIME24
- **Average Accuracy**: 9.83% ± 1.01%
- **Number of Runs**: 20
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 6.67% | 2 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 6.67% | 2 | 30 |
| 4 | 13.33% | 4 | 30 |
| 5 | 6.67% | 2 | 30 |
| 6 | 16.67% | 5 | 30 |
| 7 | 10.00% | 3 | 30 |
| 8 | 6.67% | 2 | 30 |
| 9 | 10.00% | 3 | 30 |
| 10 | 13.33% | 4 | 30 |
| 11 | 10.00% | 3 | 30 |
| 12 | 6.67% | 2 | 30 |
| 13 | 16.67% | 5 | 30 |
| 14 | 20.00% | 6 | 30 |
| 15 | 13.33% | 4 | 30 |
| 16 | 6.67% | 2 | 30 |
| 17 | 6.67% | 2 | 30 |
| 18 | 3.33% | 1 | 30 |
| 19 | 13.33% | 4 | 30 |
| 20 | 6.67% | 2 | 30 |
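
The card does not say how the "±" value is derived. As a minimal sketch, the reported figure is numerically consistent with a standard error of the mean computed from the population standard deviation (dividing by the number of runs, not by one less). That interpretation is an assumption inferred from the numbers, and the helper name `mean_and_stderr` is hypothetical. The per-run counts are copied from the AIME24 table above.

```python
import math

# Questions solved per run, copied from the AIME24 table (30 questions per run).
solved = [2, 1, 2, 4, 2, 5, 3, 2, 3, 4, 3, 2, 5, 6, 4, 2, 2, 1, 4, 2]
total_questions = 30


def mean_and_stderr(solved_counts, total):
    """Return (mean accuracy in %, standard error in %) across runs."""
    acc = [100.0 * s / total for s in solved_counts]
    n = len(acc)
    mean = sum(acc) / n
    # Population variance (divide by n). This choice is an assumption that
    # happens to reproduce the reported figure; the card does not state it.
    var = sum((a - mean) ** 2 for a in acc) / n
    return mean, math.sqrt(var / n)


mean, stderr = mean_and_stderr(solved, total_questions)
print(f"{mean:.2f}% ± {stderr:.2f}%")  # prints "9.83% ± 1.01%"
```

Using the sample standard deviation (dividing by n - 1) would give roughly 1.04% instead, which is why the population form is assumed here.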
### AIME25
- **Average Accuracy**: 5.00% ± 0.65%
- **Number of Runs**: 20
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 6.67% | 2 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 6.67% | 2 | 30 |
| 4 | 6.67% | 2 | 30 |
| 5 | 3.33% | 1 | 30 |
| 6 | 3.33% | 1 | 30 |
| 7 | 6.67% | 2 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 6.67% | 2 | 30 |
| 10 | 6.67% | 2 | 30 |
| 11 | 3.33% | 1 | 30 |
| 12 | 10.00% | 3 | 30 |
| 13 | 3.33% | 1 | 30 |
| 14 | 0.00% | 0 | 30 |
| 15 | 6.67% | 2 | 30 |
| 16 | 3.33% | 1 | 30 |
| 17 | 6.67% | 2 | 30 |
| 18 | 10.00% | 3 | 30 |
| 19 | 3.33% | 1 | 30 |
| 20 | 6.67% | 2 | 30 |
### GPQADiamond
- **Average Accuracy**: 34.19% ± 0.67%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 35.86% | 71 | 198 |
| 2 | 34.34% | 68 | 198 |
| 3 | 34.34% | 68 | 198 |
| 4 | 33.33% | 66 | 198 |
| 5 | 33.33% | 66 | 198 |
| 6 | 31.31% | 62 | 198 |
| 7 | 30.81% | 61 | 198 |
| 8 | 33.84% | 67 | 198 |
| 9 | 36.87% | 73 | 198 |
| 10 | 37.88% | 75 | 198 |
### JEEBench
- **Average Accuracy**: 31.01% ± 0.32%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 31.55% | 162.5 | 515 |
| 2 | 31.41% | 161.75 | 515 |
| 3 | 31.84% | 164.0 | 515 |
| 4 | 28.69% | 147.75 | 515 |
| 5 | 30.58% | 157.5 | 515 |
| 6 | 30.44% | 156.75 | 515 |
| 7 | 30.68% | 158.0 | 515 |
| 8 | 30.68% | 158.0 | 515 |
| 9 | 32.52% | 167.5 | 515 |
| 10 | 31.75% | 163.5 | 515 |
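
The fractional "Questions Solved" values suggest that JEEBench answers can earn partial credit, although the card does not say so. The sketch below only checks that the accuracy column is consistent with score divided by 515; the scores are copied from the table above, and the helper name `accuracy_percent` is hypothetical.

```python
# Per-run JEEBench scores copied from the table above. Fractional values
# presumably reflect partial credit (an assumption; the card does not say).
scores = [162.5, 161.75, 164.0, 147.75, 157.5, 156.75, 158.0, 158.0, 167.5, 163.5]
TOTAL_QUESTIONS = 515


def accuracy_percent(score, total=TOTAL_QUESTIONS):
    """Convert a (possibly fractional) score into an accuracy percentage."""
    return 100.0 * score / total


per_run = [round(accuracy_percent(s), 2) for s in scores]
average = sum(accuracy_percent(s) for s in scores) / len(scores)

print(per_run)
print(f"average: {average:.2f}%")  # prints "average: 31.01%"
```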
Provider:
nandansarkar



