Qwen3-235B-A22B-reasoning
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/NousResearch/Qwen3-235B-A22B-reasoning
Resource description:
# qwen-235b-a22-thinking Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.782 | math_pass@1:64_samples | 64 | 0.5% |
| aime25 | 0.718 | math_pass@1:64_samples | 64 | 0.1% |
| arenahard | 0.939 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.884 | extractive_match | 1 | 0.0% |
| creative-writing-v3 | 0.775 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.903 | drop_acc | 1 | 0.0% |
| eqbench3 | 0.800 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.697 | gpqa_pass@1:8_samples | 8 | 0.1% |
| ifeval | 0.914 | inst_level_loose_acc | 1 | 0.0% |
| lcb-v6-aug2024+ | 0.651 | eval/pass_1 | 1 | 0.2% |
| math_500 | 0.975 | math_pass@1:4_samples | 4 | 0.1% |
| mmlu_generative | 0.893 | extractive_match | 1 | 0.0% |
| mmlu_pro | 0.831 | pass@1:1_samples | 1 | 0.0% |
| musr_generative | 0.672 | extractive_match | 1 | 0.0% |
| obqa_generative | 0.960 | extractive_match | 1 | 0.0% |
| rewardbench | 0.742 | eval/percent_correct | 1 | 0.0% |
| simpleqa_nous | 0.104 | fuzzy_match | 1 | 0.0% |
Overall overlong rate: 41 of 64,523 samples (0.1%) are missing the closing `</think>` tag.
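The overlong figure counts generations that never emit a closing `</think>` tag. A minimal sketch of how such a rate could be computed follows. The function name `overlong_rate` and the sample strings are hypothetical, and the actual harness may detect overlong samples differently.

```python
def overlong_rate(completions):
    """Return (overlong_count, total, rate) for a list of raw completions.

    A completion is treated as overlong when it lacks a closing </think> tag.
    This criterion is an assumption based on the summary line above.
    """
    total = len(completions)
    overlong = sum(1 for text in completions if "</think>" not in text)
    rate = overlong / total if total else 0.0
    return overlong, total, rate


# Hypothetical completions, for illustration only.
samples = [
    "<think>2 + 2 = 4</think>The answer is 4.",
    "<think>Let me keep reasoning about this",  # truncated: no closing tag
    "<think>Short thought.</think>Done.",
]
count, total, rate = overlong_rate(samples)
print(f"{count} / {total} ({rate:.1%})")

# The aggregate figure reported above: 41 of 64,523 samples.
print(f"{41 / 64523:.1%}")
```

The per-benchmark overlong counts in the detailed sections sum to 41, and the per-benchmark totals sum to 64,523, which agrees with the aggregate line.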
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.733 | 0.082 |
| math_pass@1:4_samples | 0.767 | 0.059 |
| math_pass@1:8_samples | 0.787 | 0.059 |
| math_pass@1:16_samples | 0.785 | 0.057 |
| math_pass@1:32_samples | 0.790 | 0.058 |
| math_pass@1:64_samples | 0.782 | 0.058 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:54:33
**Temperature:** 0.6
**Overlong samples:** 0.5% (9 / 1920)
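The `math_pass@1:k_samples` metrics are read here as follows: draw k samples per problem, score each as correct or incorrect, take the per-problem fraction correct, and average over problems. The 1,920 total samples are consistent with 64 samples for each of 30 problems. The sketch below works under that reading. The function name `pass_at_1`, the toy data, and the standard-error convention (sample variance with an n - 1 denominator across problems) are assumptions, not the harness's confirmed implementation.

```python
import math
import random


def pass_at_1(per_problem_correct, k):
    """Mean and standard error of pass@1 estimated from the first k samples.

    per_problem_correct: list of lists of 0/1 flags, one inner list per problem.
    """
    scores = [sum(flags[:k]) / k for flags in per_problem_correct]
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance across problems (n - 1 denominator); assumed convention.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)


# Toy data: 30 problems x 64 samples with made-up per-problem solve rates.
rng = random.Random(0)
solve_rates = [rng.random() for _ in range(30)]
results = [[1 if rng.random() < p else 0 for _ in range(64)] for p in solve_rates]

for k in (1, 4, 8, 16, 32, 64):
    mean, stderr = pass_at_1(results, k)
    print(f"math_pass@1:{k}_samples  {mean:.3f}  +/- {stderr:.3f}")
```

Under this convention, 22 of 30 problems solved with a single sample gives 0.733 with a standard error of 0.082, which is consistent with the first row of the table above.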
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.700 | 0.085 |
| math_pass@1:4_samples | 0.700 | 0.071 |
| math_pass@1:8_samples | 0.717 | 0.064 |
| math_pass@1:16_samples | 0.706 | 0.064 |
| math_pass@1:32_samples | 0.723 | 0.062 |
| math_pass@1:64_samples | 0.718 | 0.060 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:47:11
**Temperature:** 0.6
**Overlong samples:** 0.1% (1 / 1920)
### arenahard
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.939 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 451.000 | 0.000 |
| eval/tie_count | 36.000 | 0.000 |
| eval/loss_count | 13.000 | 0.000 |
| eval/win_rate | 0.902 | 0.000 |
| eval/tie_rate | 0.072 | 0.000 |
| eval/loss_rate | 0.026 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.939 | 0.000 |
**Model:** qwen-235b-think-arena
**Evaluation Time (hh:mm:ss):** 00:15:56
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
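The win, tie, and loss rates in this table follow directly from the counts. A short arithmetic check, using only the numbers reported above:

```python
# Counts copied from the arenahard table above.
total, wins, ties, losses = 500, 451, 36, 13

# The three outcomes should account for every sample.
assert wins + ties + losses == total

win_rate = wins / total
tie_rate = ties / total
loss_rate = losses / total

print(f"eval/win_rate   {win_rate:.3f}")
print(f"eval/tie_rate   {tie_rate:.3f}")
print(f"eval/loss_rate  {loss_rate:.3f}")
```

The `eval/overall_winrate` of 0.939 is not re-derived here. Counting each tie as half a win would give (451 + 18) / 500 = 0.938, so the harness presumably aggregates slightly differently.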
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.884 | 0.014 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:34:52
**Temperature:** 0.6
**Overlong samples:** 0.0% (2 / 5511)
### creative-writing-v3
| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.775 | 0.000 |
| num_samples | 96.000 | 0.000 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.903 | 0.003 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:39:26
**Temperature:** 0.6
**Overlong samples:** 0.0% (2 / 9536)
### eqbench3
| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.800 | 0.000 |
| num_samples | 135.000 | 0.000 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.667 | 0.034 |
| gpqa_pass@1:4_samples | 0.698 | 0.029 |
| gpqa_pass@1:8_samples | 0.697 | 0.028 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:29:12
**Temperature:** 0.6
**Overlong samples:** 0.1% (2 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.834 | 0.016 |
| inst_level_strict_acc | 0.888 | 0.000 |
| prompt_level_loose_acc | 0.871 | 0.014 |
| inst_level_loose_acc | 0.914 | 0.000 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:05:23
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 541)
### lcb-v6-aug2024+
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.651 | 0.000 |
| eval/easy_pass_1 | 0.985 | 0.000 |
| eval/medium_pass_1 | 0.803 | 0.000 |
| eval/hard_pass_1 | 0.364 | 0.000 |
| eval/completion_length | 46846.020 | 0.000 |
**Model:** qwen-235ba22-reasoning
**Evaluation Time (hh:mm:ss):** 10:18:18
**Temperature:** N/A
**Overlong samples:** 0.2% (12 / 7264)
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.974 | 0.007 |
| math_pass@1:4_samples | 0.975 | 0.005 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:23:38
**Temperature:** 0.6
**Overlong samples:** 0.1% (3 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.893 | 0.003 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 01:07:22
**Temperature:** 0.6
**Overlong samples:** 0.0% (6 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.831 | 0.003 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 01:04:20
**Temperature:** 0.6
**Overlong samples:** 0.0% (4 / 12032)
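For a single-sample accuracy metric such as this one, the reported standard error is consistent with the usual binomial formula sqrt(p(1 - p) / n). The sketch below assumes that all 12,032 samples are scored independently. The helper name `binomial_stderr` is hypothetical, and the harness may compute the value differently.

```python
import math


def binomial_stderr(accuracy, n):
    """Standard error of a proportion estimated from n independent 0/1 outcomes."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Values copied from the mmlu_pro section above.
accuracy, n = 0.831, 12032
print(f"{binomial_stderr(accuracy, n):.3f}")
```

This formula does not reproduce every row in the report. The `bbh_generative` and `musr_generative` standard errors are larger than it predicts, possibly because those scores are aggregated over subtasks.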
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.672 | 0.028 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:05:49
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.960 | 0.009 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:02:45
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
### rewardbench
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.742 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1384.000 | 0.000 |
| eval/format_compliance_rate | 0.995 | 0.000 |
| eval/avg_response_length | 5038.399 | 0.000 |
| eval/response_length_std | 4245.968 | 0.000 |
| eval/judgment_entropy | 1.411 | 0.000 |
| eval/most_common_judgment_freq | 0.261 | 0.000 |
| eval/format_error_rate | 0.005 | 0.000 |
| eval/avg_ties_rating | 3.663 | 0.000 |
| eval/ties_error_rate | 0.015 | 0.000 |
| eval/percent_correct_Factuality | 0.665 | 0.000 |
| eval/percent_correct_Precise IF | 0.425 | 0.000 |
| eval/percent_correct_Math | 0.869 | 0.000 |
| eval/percent_correct_Safety | 0.680 | 0.000 |
| eval/percent_correct_Focus | 0.877 | 0.000 |
| eval/percent_correct_Ties | 0.990 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 0.995 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.281 | 0.000 |
| eval/wrong_answer_total_count | 480.000 | 0.000 |
| eval/wrong_answer_a_count | 135.000 | 0.000 |
**Model:** qwen-235b-think-reward-redo
**Evaluation Time (hh:mm:ss):** 00:32:51
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 1865)
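Several rewardbench rows are ratios of the count rows in the same table. The following check uses only the reported numbers. Reading `wrong_answer_a_bias_rate` as `wrong_answer_a_count / wrong_answer_total_count` is an assumption that happens to match the reported value.

```python
# Counts copied from the rewardbench table above.
total_samples = 1865
correct_samples = 1384
choice_samples = 1763
ties_samples = 102
wrong_answer_total = 480
wrong_answer_a = 135

# Choice and ties subsets together make up the full sample set.
assert choice_samples + ties_samples == total_samples

percent_correct = correct_samples / total_samples
a_bias_rate = wrong_answer_a / wrong_answer_total  # assumed definition

print(f"eval/percent_correct          {percent_correct:.3f}")
print(f"eval/wrong_answer_a_bias_rate {a_bias_rate:.3f}")
```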
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.077 | 0.004 |
| fuzzy_match | 0.104 | 0.005 |
**Model:** qwen-235b-a22-thinking
**Evaluation Time (hh:mm:ss):** 00:18:19
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 4321)
Provider: maas
Created: 2025-08-27



