
nandansarkar/base_model_on_log_odds_ranked_samples_eval_c693

Source: Hugging Face. Updated 2025-12-13; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/nandansarkar/base_model_on_log_odds_ranked_samples_eval_c693
Description:
# nandansarkar/base_model_on_log_odds_ranked_samples_eval_c693

Precomputed model outputs for evaluation.

## Evaluation Results

### Summary

| Metric | AIME24 | AIME25 | GPQADiamond | JEEBench |
|--------|--------|--------|-------------|----------|
| Accuracy (%) | 9.8 | 5.0 | 34.2 | 31.0 |

### AIME24

- **Average Accuracy**: 9.83% ± 1.01%
- **Number of Runs**: 20

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 6.67% | 2 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 6.67% | 2 | 30 |
| 4 | 13.33% | 4 | 30 |
| 5 | 6.67% | 2 | 30 |
| 6 | 16.67% | 5 | 30 |
| 7 | 10.00% | 3 | 30 |
| 8 | 6.67% | 2 | 30 |
| 9 | 10.00% | 3 | 30 |
| 10 | 13.33% | 4 | 30 |
| 11 | 10.00% | 3 | 30 |
| 12 | 6.67% | 2 | 30 |
| 13 | 16.67% | 5 | 30 |
| 14 | 20.00% | 6 | 30 |
| 15 | 13.33% | 4 | 30 |
| 16 | 6.67% | 2 | 30 |
| 17 | 6.67% | 2 | 30 |
| 18 | 3.33% | 1 | 30 |
| 19 | 13.33% | 4 | 30 |
| 20 | 6.67% | 2 | 30 |

### AIME25

- **Average Accuracy**: 5.00% ± 0.65%
- **Number of Runs**: 20

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 6.67% | 2 | 30 |
| 2 | 0.00% | 0 | 30 |
| 3 | 6.67% | 2 | 30 |
| 4 | 6.67% | 2 | 30 |
| 5 | 3.33% | 1 | 30 |
| 6 | 3.33% | 1 | 30 |
| 7 | 6.67% | 2 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 6.67% | 2 | 30 |
| 10 | 6.67% | 2 | 30 |
| 11 | 3.33% | 1 | 30 |
| 12 | 10.00% | 3 | 30 |
| 13 | 3.33% | 1 | 30 |
| 14 | 0.00% | 0 | 30 |
| 15 | 6.67% | 2 | 30 |
| 16 | 3.33% | 1 | 30 |
| 17 | 6.67% | 2 | 30 |
| 18 | 10.00% | 3 | 30 |
| 19 | 3.33% | 1 | 30 |
| 20 | 6.67% | 2 | 30 |

### GPQADiamond

- **Average Accuracy**: 34.19% ± 0.67%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 35.86% | 71 | 198 |
| 2 | 34.34% | 68 | 198 |
| 3 | 34.34% | 68 | 198 |
| 4 | 33.33% | 66 | 198 |
| 5 | 33.33% | 66 | 198 |
| 6 | 31.31% | 62 | 198 |
| 7 | 30.81% | 61 | 198 |
| 8 | 33.84% | 67 | 198 |
| 9 | 36.87% | 73 | 198 |
| 10 | 37.88% | 75 | 198 |

### JEEBench

- **Average Accuracy**: 31.01% ± 0.32%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 31.55% | 162.5 | 515 |
| 2 | 31.41% | 161.75 | 515 |
| 3 | 31.84% | 164.0 | 515 |
| 4 | 28.69% | 147.75 | 515 |
| 5 | 30.58% | 157.5 | 515 |
| 6 | 30.44% | 156.75 | 515 |
| 7 | 30.68% | 158.0 | 515 |
| 8 | 30.68% | 158.0 | 515 |
| 9 | 32.52% | 167.5 | 515 |
| 10 | 31.75% | 163.5 | 515 |
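The card does not say how the ± figures are computed. The sketch below assumes they are the standard error of the mean across runs, using the population standard deviation (no Bessel correction); under that assumption the per-run rows appear to reproduce the reported averages and ± values. The `summarize` helper is a hypothetical name introduced here, not part of the dataset or any evaluation harness.

```python
from math import sqrt
from statistics import mean, pstdev


def summarize(solved, total):
    """Return (mean accuracy %, standard error %) over evaluation runs.

    Assumption: the error bar is pstdev / sqrt(n_runs), i.e. the
    population standard deviation (ddof=0) divided by sqrt(n).
    """
    accuracies = [100.0 * s / total for s in solved]
    return mean(accuracies), pstdev(accuracies) / sqrt(len(accuracies))


# "Questions Solved" column of the AIME24 table, runs 1 to 20.
aime24_solved = [2, 1, 2, 4, 2, 5, 3, 2, 3, 4, 3, 2, 5, 6, 4, 2, 2, 1, 4, 2]

avg, err = summarize(aime24_solved, total=30)
print(f"AIME24: {avg:.2f}% ± {err:.2f}%")  # prints "AIME24: 9.83% ± 1.01%"
```

The same call works for the other benchmarks by passing their "Questions Solved" columns and totals (30, 198, 515). Using the sample standard deviation (`statistics.stdev`) instead gives slightly larger error bars that do not match the reported ones.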
Provider:
nandansarkar