
Qwen2.5-7B-Instruct_openthoughts3_300k_annotated_Qwen3-32B_eval_8179

ModelScope (魔搭社区) | Updated 2025-11-09 | Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/Qwen2.5-7B-Instruct_openthoughts3_300k_annotated_Qwen3-32B_eval_8179
Resource description:
# mlfoundations-dev/Qwen2.5-7B-Instruct_openthoughts3_300k_annotated_Qwen3-32B_eval_8179

Precomputed model outputs for evaluation.

## Evaluation Results

### Summary

| Metric | AIME24 | AMC23 | MATH500 | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | AIME25 | HLE | LiveCodeBenchv5 | HMMT |
|--------|------|-----|-------|--------|-----------|-------------|-------|----------|------|---|---------------|----|
| Accuracy (%) | 62.3 | 90.5 | 88.2 | 57.9 | 53.2 | 51.6 | 25.9 | 26.0 | 46.7 | 12.1 | 39.3 | 33.3 |

### AIME24

- **Average Accuracy**: 62.33% ± 1.06%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 56.67% | 17 | 30 |
| 2 | 66.67% | 20 | 30 |
| 3 | 66.67% | 20 | 30 |
| 4 | 63.33% | 19 | 30 |
| 5 | 56.67% | 17 | 30 |
| 6 | 63.33% | 19 | 30 |
| 7 | 63.33% | 19 | 30 |
| 8 | 60.00% | 18 | 30 |
| 9 | 63.33% | 19 | 30 |
| 10 | 63.33% | 19 | 30 |

### AMC23

- **Average Accuracy**: 90.50% ± 0.77%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 90.00% | 36 | 40 |
| 2 | 92.50% | 37 | 40 |
| 3 | 95.00% | 38 | 40 |
| 4 | 87.50% | 35 | 40 |
| 5 | 87.50% | 35 | 40 |
| 6 | 92.50% | 37 | 40 |
| 7 | 90.00% | 36 | 40 |
| 8 | 92.50% | 37 | 40 |
| 9 | 87.50% | 35 | 40 |
| 10 | 90.00% | 36 | 40 |

### MATH500

- **Accuracy**: 88.20%

| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 88.20% | 441 | 500 |

### JEEBench

- **Average Accuracy**: 57.91% ± 0.88%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 56.07% | 288.75 | 515 |
| 2 | 57.86% | 298.0 | 515 |
| 3 | 59.81% | 308.0 | 515 |

### GPQADiamond

- **Average Accuracy**: 53.20% ± 0.84%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 51.52% | 102 | 198 |
| 2 | 55.05% | 109 | 198 |
| 3 | 53.03% | 105 | 198 |

### LiveCodeBench

- **Average Accuracy**: 51.60% ± 0.83%
- **Number of Runs**: 6

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 52.64% | 269 | 511 |
| 2 | 53.03% | 271 | 511 |
| 3 | 50.10% | 256 | 511 |
| 4 | 48.14% | 246 | 511 |
| 5 | 52.84% | 270 | 511 |
| 6 | 52.84% | 270 | 511 |

### CodeElo

- **Average Accuracy**: 25.92% ± 0.23%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 26.34% | 103 | 391 |
| 2 | 25.58% | 100 | 391 |
| 3 | 25.83% | 101 | 391 |

### CodeForces

- **Average Accuracy**: 26.05% ± 0.96%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 25.83% | 117 | 453 |
| 2 | 27.81% | 126 | 453 |
| 3 | 24.50% | 111 | 453 |

### AIME25

- **Average Accuracy**: 46.67% ± 1.49%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 50.00% | 15 | 30 |
| 2 | 40.00% | 12 | 30 |
| 3 | 53.33% | 16 | 30 |
| 4 | 46.67% | 14 | 30 |
| 5 | 40.00% | 12 | 30 |
| 6 | 50.00% | 15 | 30 |
| 7 | 43.33% | 13 | 30 |
| 8 | 53.33% | 16 | 30 |
| 9 | 43.33% | 13 | 30 |
| 10 | 46.67% | 14 | 30 |

### HLE

- **Average Accuracy**: 12.09% ± 0.24%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 12.67% | 65 | 513 |
| 2 | 11.70% | 60 | 513 |
| 3 | 11.89% | 61 | 513 |

### LiveCodeBenchv5

- **Average Accuracy**: 39.30% ± 0.87%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 40.65% | 150 | 369 |
| 2 | 39.57% | 146 | 369 |
| 3 | 37.67% | 139 | 369 |

### HMMT

- **Average Accuracy**: 33.33% ± 1.56%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 33.33% | 10 | 30 |
| 2 | 30.00% | 9 | 30 |
| 3 | 36.67% | 11 | 30 |
| 4 | 40.00% | 12 | 30 |
| 5 | 30.00% | 9 | 30 |
| 6 | 33.33% | 10 | 30 |
| 7 | 43.33% | 13 | 30 |
| 8 | 26.67% | 8 | 30 |
| 9 | 30.00% | 9 | 30 |
| 10 | 30.00% | 9 | 30 |
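
The card does not say how the "Average Accuracy ± x%" figures are derived. The sketch below is one reading that reproduces the reported AIME24 and GPQADiamond values: the mean of per-run accuracies, with the spread taken as the population standard deviation divided by the square root of the number of runs. This formula is an assumption inferred from the numbers, not a documented method, and the function name `mean_and_spread` is hypothetical.

```python
import math
import statistics


def mean_and_spread(solved_per_run, total_questions):
    """Return (mean accuracy in %, spread in %) across evaluation runs.

    Assumption: spread = population std (ddof=0) / sqrt(number of runs).
    This is inferred from the reported figures, not stated in the card.
    """
    accuracies = [100.0 * solved / total_questions for solved in solved_per_run]
    mean = statistics.fmean(accuracies)
    spread = statistics.pstdev(accuracies) / math.sqrt(len(accuracies))
    return mean, spread


# "Questions Solved" per run, copied from the AIME24 table (30 questions each).
aime24_solved = [17, 20, 20, 19, 17, 19, 19, 18, 19, 19]
mean, spread = mean_and_spread(aime24_solved, 30)
print(f"AIME24: {mean:.2f}% ± {spread:.2f}%")  # AIME24: 62.33% ± 1.06%
```

The same function accepts fractional counts such as the JEEBench value 288.75, since it only divides by the total. Using the sample standard deviation (`statistics.stdev`) instead would give a slightly larger spread than the card reports.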

Provider:
maas
Created:
2025-10-03
Dataset introduction
Background overview
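This dataset contains precomputed evaluation outputs for the Qwen2.5-7B-Instruct model on the openthoughts3 300k annotated data, evaluated by the Qwen3-32B model and covering 8179 entries. It reports detailed accuracy results on multiple benchmarks (such as AIME24, AMC23, and MATH500), is released under the Apache License 2.0, and is published by mlfoundations-dev.
The above content was collected and summarized by 遇见数据集.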