
AReaL-boba-RL-7B_eval_5554

Source: ModelScope community. Updated 2025-11-19. Indexed 2025-11-22.
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/AReaL-boba-RL-7B_eval_5554
Resource overview:
# mlfoundations-dev/AReaL-boba-RL-7B_eval_5554

Precomputed model outputs for evaluation.

## Evaluation Results

### Summary

| Metric | AIME24 | AMC23 | MATH500 | MMLUPro | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | HLE | HMMT | AIME25 | LiveCodeBenchv5 |
|--------|--------|-------|---------|---------|----------|-------------|---------------|---------|------------|-----|------|--------|-----------------|
| Accuracy (%) | 61.0 | 92.8 | 89.2 | 41.6 | 53.3 | 52.7 | 46.1 | 21.5 | 20.7 | 12.1 | 30.0 | 44.3 | 30.4 |

### AIME24

- **Average Accuracy**: 61.00% ± 2.36%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 60.00% | 18 | 30 |
| 2 | 63.33% | 19 | 30 |
| 3 | 56.67% | 17 | 30 |
| 4 | 56.67% | 17 | 30 |
| 5 | 50.00% | 15 | 30 |
| 6 | 60.00% | 18 | 30 |
| 7 | 53.33% | 16 | 30 |
| 8 | 63.33% | 19 | 30 |
| 9 | 76.67% | 23 | 30 |
| 10 | 70.00% | 21 | 30 |

### AMC23

- **Average Accuracy**: 92.75% ± 0.66%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 87.50% | 35 | 40 |
| 2 | 95.00% | 38 | 40 |
| 3 | 92.50% | 37 | 40 |
| 4 | 95.00% | 38 | 40 |
| 5 | 95.00% | 38 | 40 |
| 6 | 92.50% | 37 | 40 |
| 7 | 92.50% | 37 | 40 |
| 8 | 92.50% | 37 | 40 |
| 9 | 92.50% | 37 | 40 |
| 10 | 92.50% | 37 | 40 |

### MATH500

- **Accuracy**: 89.20%

| Accuracy | Questions Solved | Total Questions |
|----------|------------------|-----------------|
| 89.20% | 446 | 500 |

### MMLUPro

- **Accuracy**: 41.61%

| Accuracy | Questions Solved | Total Questions |
|----------|------------------|-----------------|
| 41.61% | N/A | N/A |

### JEEBench

- **Average Accuracy**: 53.27% ± 0.28%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 53.59% | 276.0 | 515 |
| 2 | 53.64% | 276.25 | 515 |
| 3 | 52.57% | 270.75 | 515 |

### GPQADiamond

- **Average Accuracy**: 52.69% ± 0.27%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 52.02% | 103 | 198 |
| 2 | 53.03% | 105 | 198 |
| 3 | 53.03% | 105 | 198 |

### LiveCodeBench

- **Average Accuracy**: 46.12% ± 0.33%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 45.79% | 234 | 511 |
| 2 | 45.79% | 234 | 511 |
| 3 | 46.77% | 239 | 511 |

### CodeElo

- **Average Accuracy**: 21.48% ± 0.15%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 21.74% | 85 | 391 |
| 2 | 21.48% | 84 | 391 |
| 3 | 21.23% | 83 | 391 |

### CodeForces

- **Average Accuracy**: 20.68% ± 0.27%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 20.31% | 92 | 453 |
| 2 | 20.53% | 93 | 453 |
| 3 | 21.19% | 96 | 453 |

### HLE

- **Average Accuracy**: 12.09% ± 0.88%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 11.89% | 61 | 513 |
| 2 | 10.33% | 53 | 513 |
| 3 | 14.04% | 72 | 513 |

### HMMT

- **Average Accuracy**: 30.00% ± 1.33%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 36.67% | 11 | 30 |
| 2 | 30.00% | 9 | 30 |
| 3 | 30.00% | 9 | 30 |
| 4 | 30.00% | 9 | 30 |
| 5 | 33.33% | 10 | 30 |
| 6 | 33.33% | 10 | 30 |
| 7 | 23.33% | 7 | 30 |
| 8 | 23.33% | 7 | 30 |
| 9 | 26.67% | 8 | 30 |
| 10 | 33.33% | 10 | 30 |

### AIME25

- **Average Accuracy**: 44.33% ± 1.57%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 40.00% | 12 | 30 |
| 2 | 43.33% | 13 | 30 |
| 3 | 36.67% | 11 | 30 |
| 4 | 43.33% | 13 | 30 |
| 5 | 50.00% | 15 | 30 |
| 6 | 40.00% | 12 | 30 |
| 7 | 50.00% | 15 | 30 |
| 8 | 50.00% | 15 | 30 |
| 9 | 40.00% | 12 | 30 |
| 10 | 50.00% | 15 | 30 |

### LiveCodeBenchv5

- **Average Accuracy**: 30.44% ± 1.04%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|------------------|-----------------|
| 1 | 29.27% | 108 | 369 |
| 2 | 32.52% | 120 | 369 |
| 3 | 29.54% | 109 | 369 |
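
The source does not define the ± figures reported alongside each average accuracy. The following is a minimal sketch, assuming they are the standard error of the mean computed from the population standard deviation of the per-run accuracies. That assumption reproduces the reported AIME24 and AMC23 values, but it is an inference from the numbers, not something the dataset card states. The function name `summarize_runs` is hypothetical.

```python
import math
import statistics


def summarize_runs(solved, total):
    """Return (mean accuracy %, standard error %) for per-run solved counts.

    Assumption: the standard error uses the population standard deviation
    (ddof=0) divided by sqrt(number of runs). The source does not state this.
    """
    # Convert each run's solved count into a percentage accuracy.
    accuracies = [100.0 * s / total for s in solved]
    mean = statistics.fmean(accuracies)
    std_err = statistics.pstdev(accuracies) / math.sqrt(len(accuracies))
    return mean, std_err


# Questions solved per run, copied from the AIME24 table (30 questions each).
aime24_solved = [18, 19, 17, 17, 15, 18, 16, 19, 23, 21]
mean, std_err = summarize_runs(aime24_solved, total=30)
print(f"{mean:.2f}% ± {std_err:.2f}%")  # prints "61.00% ± 2.36%"
```

Swapping in the sample standard deviation (`statistics.stdev`) would give a slightly wider interval for AIME24 (about 2.49%), which does not match the reported 2.36%.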

Provider: maas
Created: 2025-10-04