cogito-v1-preview-llama-3B_eval_2e29
Source: ModelScope community (魔搭社区) · Updated 2025-10-14 · Indexed 2025-10-11
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/cogito-v1-preview-llama-3B_eval_2e29
Description:
# mlfoundations-dev/cogito-v1-preview-llama-3B_eval_2e29
Precomputed model outputs for evaluation.
## Evaluation Results
### Summary
| Metric | AIME24 | AMC23 | MATH500 | MMLUPro | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | AIME25 | HLE | LiveCodeBenchv5 |
|--------|------|-----|-------|-------|--------|-----------|-------------|-------|----------|------|---|---------------|
| Accuracy (%) | 0.3 | 17.8 | 35.8 | 16.8 | 14.5 | 30.3 | 8.3 | 1.6 | 4.3 | 0.3 | 11.5 | 5.6 |
### AIME24
- **Average Accuracy**: 0.33% ± 0.32%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 0.00% | 0 | 30 |
### AMC23
- **Average Accuracy**: 17.75% ± 1.03%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 12.50% | 5 | 40 |
| 2 | 20.00% | 8 | 40 |
| 3 | 20.00% | 8 | 40 |
| 4 | 17.50% | 7 | 40 |
| 5 | 20.00% | 8 | 40 |
| 6 | 20.00% | 8 | 40 |
| 7 | 12.50% | 5 | 40 |
| 8 | 15.00% | 6 | 40 |
| 9 | 17.50% | 7 | 40 |
| 10 | 22.50% | 9 | 40 |
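The card does not say how the ± figure is computed. The reported values appear consistent with the standard error of the mean across runs, using the population standard deviation (divide by n, not n − 1). The sketch below checks that reading against the AMC23 runs above; the function name `mean_and_stderr` is hypothetical and the formula is an inference from the numbers, not a documented method.

```python
import math

def mean_and_stderr(accuracies):
    """Return (mean, standard error) of per-run accuracies.

    Assumption: the standard error uses the population standard
    deviation (ddof=0), which is what reproduces the card's figures.
    """
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / n  # ddof=0
    return mean, math.sqrt(variance) / math.sqrt(n)

# Per-run AMC23 accuracies (percent), copied from the table above.
amc23_runs = [12.5, 20.0, 20.0, 17.5, 20.0, 20.0, 12.5, 15.0, 17.5, 22.5]

mean, stderr = mean_and_stderr(amc23_runs)
print(f"{mean:.2f}% ± {stderr:.2f}%")  # prints "17.75% ± 1.03%"
```

The same calculation applied to the AIME24 runs (nine runs at 0% and one at 1/30) gives 0.33% ± 0.32%, matching the figure reported in that section.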
### MATH500
- **Accuracy**: 35.80%
| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 35.80% | 179 | 500 |
### MMLUPro
- **Average Accuracy**: 16.80% ± 0.00%
- **Number of Runs**: 1
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 16.80% | 84 | 500 |
### JEEBench
- **Average Accuracy**: 14.47% ± 1.05%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 14.56% | 75.0 | 515 |
| 2 | 16.65% | 85.75 | 515 |
| 3 | 12.18% | 62.75 | 515 |
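The "Questions Solved" values for JEEBench are fractional, which suggests that some questions are scored with partial credit; the card does not state this, so treat it as an assumption. Under that reading, each run's accuracy is the summed score divided by the question count, as the short check below illustrates (the helper `accuracy_percent` is a hypothetical name).

```python
def accuracy_percent(score, total_questions):
    """Accuracy as a percentage; score may be fractional (partial credit)."""
    return 100.0 * score / total_questions

# Per-run summed scores from the JEEBench table above.
jeebench_scores = [75.0, 85.75, 62.75]
TOTAL_QUESTIONS = 515

for run, score in enumerate(jeebench_scores, start=1):
    print(f"Run {run}: {accuracy_percent(score, TOTAL_QUESTIONS):.2f}%")
# prints 14.56%, 16.65%, 12.18% for runs 1 to 3
```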
### GPQADiamond
- **Average Accuracy**: 30.30% ± 0.41%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 30.81% | 61 | 198 |
| 2 | 30.81% | 61 | 198 |
| 3 | 29.29% | 58 | 198 |
### LiveCodeBench
- **Average Accuracy**: 8.28% ± 1.05%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 8.81% | 45 | 511 |
| 2 | 6.26% | 32 | 511 |
| 3 | 9.78% | 50 | 511 |
### CodeElo
- **Average Accuracy**: 1.62% ± 0.23%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 1.28% | 5 | 391 |
| 2 | 1.53% | 6 | 391 |
| 3 | 2.05% | 8 | 391 |
### CodeForces
- **Average Accuracy**: 4.34% ± 0.39%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 3.75% | 17 | 453 |
| 2 | 5.08% | 23 | 453 |
| 3 | 4.19% | 19 | 453 |
### AIME25
- **Average Accuracy**: 0.33% ± 0.32%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 0.00% | 0 | 30 |
| 2 | 3.33% | 1 | 30 |
| 3 | 0.00% | 0 | 30 |
| 4 | 0.00% | 0 | 30 |
| 5 | 0.00% | 0 | 30 |
| 6 | 0.00% | 0 | 30 |
| 7 | 0.00% | 0 | 30 |
| 8 | 0.00% | 0 | 30 |
| 9 | 0.00% | 0 | 30 |
| 10 | 0.00% | 0 | 30 |
### HLE
- **Average Accuracy**: 11.46% ± 0.84%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 13.48% | 69 | 512 |
| 2 | 10.74% | 55 | 512 |
| 3 | 10.16% | 52 | 512 |
### LiveCodeBenchv5
- **Average Accuracy**: 5.60% ± 0.33%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 6.23% | 23 | 369 |
| 2 | 5.15% | 19 | 369 |
| 3 | 5.42% | 20 | 369 |
Provider:
maas
Created:
2025-10-04



