Qwen2.5-7B-Instruct_openthoughts3_300k_annotated_Qwen3-32B_eval_8179
Source: ModelScope community (魔搭社区). Updated 2025-11-09; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/mlfoundations-dev/Qwen2.5-7B-Instruct_openthoughts3_300k_annotated_Qwen3-32B_eval_8179
Resource description:
# mlfoundations-dev/Qwen2.5-7B-Instruct_openthoughts3_300k_annotated_Qwen3-32B_eval_8179
Precomputed model outputs for evaluation.
## Evaluation Results
### Summary
| Metric | AIME24 | AMC23 | MATH500 | JEEBench | GPQADiamond | LiveCodeBench | CodeElo | CodeForces | AIME25 | HLE | LiveCodeBenchv5 | HMMT |
|--------|------|-----|-------|--------|-----------|-------------|-------|----------|------|---|---------------|----|
| Accuracy | 62.3 | 90.5 | 88.2 | 57.9 | 53.2 | 51.6 | 25.9 | 26.0 | 46.7 | 12.1 | 39.3 | 33.3 |
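The summary row can be cross-checked against the per-run counts reported in the sections that follow. The sketch below assumes each summary figure is the mean per-run accuracy rounded to one decimal place; the source does not state this, and the names `runs` and `mean_accuracy` are illustrative only.

```python
# (solved per run, total questions), copied from the per-benchmark tables.
runs = {
    "AIME24": ([17, 20, 20, 19, 17, 19, 19, 18, 19, 19], 30),
    "AMC23": ([36, 37, 38, 35, 35, 37, 36, 37, 35, 36], 40),
    "MATH500": ([441], 500),
    "JEEBench": ([288.75, 298.0, 308.0], 515),
    "GPQADiamond": ([102, 109, 105], 198),
    "LiveCodeBench": ([269, 271, 256, 246, 270, 270], 511),
    "CodeElo": ([103, 100, 101], 391),
    "CodeForces": ([117, 126, 111], 453),
    "AIME25": ([15, 12, 16, 14, 12, 15, 13, 16, 13, 14], 30),
    "HLE": ([65, 60, 61], 513),
    "LiveCodeBenchv5": ([150, 146, 139], 369),
    "HMMT": ([10, 9, 11, 12, 9, 10, 13, 8, 9, 9], 30),
}


def mean_accuracy(solved, total):
    """Mean accuracy in percent across runs that share the same question count."""
    return 100.0 * sum(solved) / (len(solved) * total)


# Assumption: the summary row rounds each mean to one decimal place.
summary = {name: round(mean_accuracy(s, t), 1) for name, (s, t) in runs.items()}
for name, value in summary.items():
    print(f"{name:16s}{value:5.1f}")
```

Because every run of a benchmark uses the same number of questions, averaging the per-run percentages and pooling the counts give the same result.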
### AIME24
- **Average Accuracy**: 62.33% ± 1.06%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 56.67% | 17 | 30 |
| 2 | 66.67% | 20 | 30 |
| 3 | 66.67% | 20 | 30 |
| 4 | 63.33% | 19 | 30 |
| 5 | 56.67% | 17 | 30 |
| 6 | 63.33% | 19 | 30 |
| 7 | 63.33% | 19 | 30 |
| 8 | 60.00% | 18 | 30 |
| 9 | 63.33% | 19 | 30 |
| 10 | 63.33% | 19 | 30 |
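The source does not define the ± figure. One reading, offered here as an assumption rather than the authors' documented method, is the standard error of the mean across runs. For this table, the population form of the standard deviation (dividing by n) reproduces the reported value; the helper name `summarize_runs` is hypothetical.

```python
import math
import statistics


def summarize_runs(solved, total, sample=False):
    """Return (mean accuracy %, standard error %) over repeated runs.

    sample=False uses the population standard deviation (divide by n);
    sample=True uses the sample standard deviation (divide by n - 1).
    """
    accuracies = [100.0 * s / total for s in solved]
    mean = statistics.fmean(accuracies)
    spread = statistics.stdev(accuracies) if sample else statistics.pstdev(accuracies)
    return mean, spread / math.sqrt(len(accuracies))


# Questions solved in each of the 10 AIME24 runs, from the table above.
aime24_solved = [17, 20, 20, 19, 17, 19, 19, 18, 19, 19]
mean, stderr = summarize_runs(aime24_solved, total=30)
print(f"{mean:.2f}% ± {stderr:.2f}%")  # prints "62.33% ± 1.06%"
```

Other benchmarks in this card may follow a different convention, so it is worth trying both settings of `sample` before relying on either.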
### AMC23
- **Average Accuracy**: 90.50% ± 0.77%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 90.00% | 36 | 40 |
| 2 | 92.50% | 37 | 40 |
| 3 | 95.00% | 38 | 40 |
| 4 | 87.50% | 35 | 40 |
| 5 | 87.50% | 35 | 40 |
| 6 | 92.50% | 37 | 40 |
| 7 | 90.00% | 36 | 40 |
| 8 | 92.50% | 37 | 40 |
| 9 | 87.50% | 35 | 40 |
| 10 | 90.00% | 36 | 40 |
### MATH500
- **Accuracy**: 88.20%
| Accuracy | Questions Solved | Total Questions |
|----------|-----------------|----------------|
| 88.20% | 441 | 500 |
### JEEBench
- **Average Accuracy**: 57.91% ± 0.88%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 56.07% | 288.75 | 515 |
| 2 | 57.86% | 298.0 | 515 |
| 3 | 59.81% | 308.0 | 515 |
### GPQADiamond
- **Average Accuracy**: 53.20% ± 0.84%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 51.52% | 102 | 198 |
| 2 | 55.05% | 109 | 198 |
| 3 | 53.03% | 105 | 198 |
### LiveCodeBench
- **Average Accuracy**: 51.60% ± 0.83%
- **Number of Runs**: 6
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 52.64% | 269 | 511 |
| 2 | 53.03% | 271 | 511 |
| 3 | 50.10% | 256 | 511 |
| 4 | 48.14% | 246 | 511 |
| 5 | 52.84% | 270 | 511 |
| 6 | 52.84% | 270 | 511 |
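The ± figure for this benchmark is not defined in the source. As a hedged check, assuming it is a standard error of the mean over the six runs, the sketch below computes it with both the population (divide by n) and sample (divide by n - 1) standard deviation; for this table the sample form is the one that lands on the reported 0.83%. Variable names are illustrative.

```python
import math
import statistics

# Questions solved in each LiveCodeBench run, from the table above.
solved = [269, 271, 256, 246, 270, 270]
total = 511

accuracies = [100.0 * s / total for s in solved]
n = len(accuracies)

mean_acc = statistics.fmean(accuracies)
se_population = statistics.pstdev(accuracies) / math.sqrt(n)  # divide by n
se_sample = statistics.stdev(accuracies) / math.sqrt(n)       # divide by n - 1

print(f"mean={mean_acc:.2f}%  population SE={se_population:.2f}%  sample SE={se_sample:.2f}%")
# prints "mean=51.60%  population SE=0.75%  sample SE=0.83%"
```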
### CodeElo
- **Average Accuracy**: 25.92% ± 0.23%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 26.34% | 103 | 391 |
| 2 | 25.58% | 100 | 391 |
| 3 | 25.83% | 101 | 391 |
### CodeForces
- **Average Accuracy**: 26.05% ± 0.96%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 25.83% | 117 | 453 |
| 2 | 27.81% | 126 | 453 |
| 3 | 24.50% | 111 | 453 |
### AIME25
- **Average Accuracy**: 46.67% ± 1.49%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 50.00% | 15 | 30 |
| 2 | 40.00% | 12 | 30 |
| 3 | 53.33% | 16 | 30 |
| 4 | 46.67% | 14 | 30 |
| 5 | 40.00% | 12 | 30 |
| 6 | 50.00% | 15 | 30 |
| 7 | 43.33% | 13 | 30 |
| 8 | 53.33% | 16 | 30 |
| 9 | 43.33% | 13 | 30 |
| 10 | 46.67% | 14 | 30 |
### HLE
- **Average Accuracy**: 12.09% ± 0.24%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 12.67% | 65 | 513 |
| 2 | 11.70% | 60 | 513 |
| 3 | 11.89% | 61 | 513 |
### LiveCodeBenchv5
- **Average Accuracy**: 39.30% ± 0.87%
- **Number of Runs**: 3
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 40.65% | 150 | 369 |
| 2 | 39.57% | 146 | 369 |
| 3 | 37.67% | 139 | 369 |
### HMMT
- **Average Accuracy**: 33.33% ± 1.56%
- **Number of Runs**: 10
| Run | Accuracy | Questions Solved | Total Questions |
|-----|----------|-----------------|----------------|
| 1 | 33.33% | 10 | 30 |
| 2 | 30.00% | 9 | 30 |
| 3 | 36.67% | 11 | 30 |
| 4 | 40.00% | 12 | 30 |
| 5 | 30.00% | 9 | 30 |
| 6 | 33.33% | 10 | 30 |
| 7 | 43.33% | 13 | 30 |
| 8 | 26.67% | 8 | 30 |
| 9 | 30.00% | 9 | 30 |
| 10 | 30.00% | 9 | 30 |
Provider:
maas
Created:
2025-10-03
Collected summary
Dataset introduction

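Background and challenges
Background overview
This dataset contains precomputed evaluation outputs of the Qwen2.5-7B-Instruct model on the openthoughts3 300k annotated data, evaluated by the Qwen3-32B model and covering 8,179 entries. It provides detailed accuracy results on multiple benchmarks (such as AIME24, AMC23, and MATH500), is licensed under the Apache License 2.0, and is published by mlfoundations-dev.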
The above summary was collected and generated by 遇见数据集.



