
cogito-v1-preview-llama-3B_eval_2e29

ModelScope community · Updated 2025-10-14 · Indexed 2025-10-11
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/cogito-v1-preview-llama-3B_eval_2e29
Resource description:
# mlfoundations-dev/cogito-v1-preview-llama-3B_eval_2e29

Precomputed model outputs for evaluation.

## Evaluation Results

### Summary

| Metric | AIME24 | AMC23 | MATH500 | MMLUPro | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | AIME25 | HLE | LiveCodeBenchv5 |
|--------|------|-----|-------|-------|--------|-----------|-------------|-------|----------|------|---|---------------|
| Accuracy (%) | 0.3 | 17.8 | 35.8 | 16.8 | 14.5 | 30.3 | 8.3 | 1.6 | 4.3 | 0.3 | 11.5 | 5.6 |

### AIME24

- **Average Accuracy**: 0.33% ± 0.32%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 0.00% | 0 | 30 |

### AMC23

- **Average Accuracy**: 17.75% ± 1.03%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 12.50% | 5 | 40 |
| 2 | 20.00% | 8 | 40 |
| 3 | 20.00% | 8 | 40 |
| 4 | 17.50% | 7 | 40 |
| 5 | 20.00% | 8 | 40 |
| 6 | 20.00% | 8 | 40 |
| 7 | 12.50% | 5 | 40 |
| 8 | 15.00% | 6 | 40 |
| 9 | 17.50% | 7 | 40 |
| 10 | 22.50% | 9 | 40 |

### MATH500

- **Accuracy**: 35.80%

| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 35.80% | 179 | 500 |

### MMLUPro

- **Average Accuracy**: 16.80% ± 0.00%
- **Number of Runs**: 1

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 16.80% | 84 | 500 |

### JEEBench

- **Average Accuracy**: 14.47% ± 1.05%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 14.56% | 75.0 | 515 |
| 2 | 16.65% | 85.75 | 515 |
| 3 | 12.18% | 62.75 | 515 |

### GPQADiamond

- **Average Accuracy**: 30.30% ± 0.41%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 30.81% | 61 | 198 |
| 2 | 30.81% | 61 | 198 |
| 3 | 29.29% | 58 | 198 |

### LiveCodeBench

- **Average Accuracy**: 8.28% ± 1.05%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 8.81% | 45 | 511 |
| 2 | 6.26% | 32 | 511 |
| 3 | 9.78% | 50 | 511 |

### CodeElo

- **Average Accuracy**: 1.62% ± 0.23%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 1.28% | 5 | 391 |
| 2 | 1.53% | 6 | 391 |
| 3 | 2.05% | 8 | 391 |

### CodeForces

- **Average Accuracy**: 4.34% ± 0.39%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 3.75% | 17 | 453 |
| 2 | 5.08% | 23 | 453 |
| 3 | 4.19% | 19 | 453 |

### AIME25

- **Average Accuracy**: 0.33% ± 0.32%
- **Number of Runs**: 10

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 0.00% | 0 | 30 |

### HLE

- **Average Accuracy**: 11.46% ± 0.84%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 13.48% | 69 | 512 |
| 2 | 10.74% | 55 | 512 |
| 3 | 10.16% | 52 | 512 |

### LiveCodeBenchv5

- **Average Accuracy**: 5.60% ± 0.33%
- **Number of Runs**: 3

| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 6.23% | 23 | 369 |
| 2 | 5.15% | 19 | 369 |
| 3 | 5.42% | 20 | 369 |
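
The card does not say how the "Average Accuracy" and "±" figures are derived. The sketch below shows one reading that appears consistent with the reported numbers: per-run accuracy is questions solved divided by total questions, and the "±" term is the population standard deviation of the per-run accuracies divided by the square root of the number of runs. Both the formula and the helper name `summarize_runs` are assumptions for illustration, not something stated by the dataset authors.

```python
import math


def summarize_runs(solved, total):
    """Return (mean accuracy in %, spread in %) over a list of runs.

    Assumption: the spread is the population standard deviation
    (variance divided by n, not n - 1) over sqrt(n). This was inferred
    from the reported figures and is not documented in the card.
    """
    accuracies = [100.0 * s / total for s in solved]
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / n
    spread = math.sqrt(variance) / math.sqrt(n)
    return mean, spread


# "Questions Solved" per run, copied from the AMC23 table (40 questions each).
amc23_solved = [5, 8, 8, 7, 8, 8, 5, 6, 7, 9]
mean, spread = summarize_runs(amc23_solved, 40)
print(f"{mean:.2f}% ± {spread:.2f}%")  # prints "17.75% ± 1.03%"
```

The same function accepts the fractional "Questions Solved" values listed for JEEBench, for example `summarize_runs([75.0, 85.75, 62.75], 515)`.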

Provider: maas
Created: 2025-10-04