eval-Hermes-4-14B-reasoning
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/NousResearch/eval-Hermes-4-14B-reasoning
Description:
# h4-14b-more-stage1-reasoning Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.554 | math_pass@1:64_samples | 64 | 0.1% |
| aime25 | 0.468 | math_pass@1:64_samples | 64 | 0.1% |
| arenahard | 0.830 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.844 | extractive_match | 1 | 0.0% |
| creative-writing-v3 | 0.616 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.845 | drop_acc | 1 | 0.0% |
| eqbench3 | 0.772 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.602 | gpqa_pass@1:8_samples | 8 | 0.2% |
| ifeval | 0.748 | inst_level_loose_acc | 1 | 0.4% |
| lcb-v6-aug2024+ | 0.425 | eval/pass_1 | 1 | 0.1% |
| math_500 | 0.911 | math_pass@1:4_samples | 4 | 0.1% |
| mmlu_generative | 0.841 | extractive_match | 1 | 0.0% |
| mmlu_pro | 0.743 | pass@1:1_samples | 1 | 0.0% |
| musr_generative | 0.668 | extractive_match | 1 | 0.0% |
| obqa_generative | 0.946 | extractive_match | 1 | 0.2% |
| rewardbench | 0.635 | eval/percent_correct | 1 | 0.1% |
| simpleqa_nous | 0.055 | fuzzy_match | 1 | 0.2% |
Overall overlong rate: 32 / 64,523 samples (about 0.05%, which shows as 0.0% at one decimal place) were missing the closing `</think>` tag.
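
The overall figure can be reproduced from the per-benchmark "Overlong samples" lines in the detailed results. A minimal Python sketch, with the counts copied from this card; the dictionary and variable names are illustrative and are not part of the evaluation harness:

```python
# (overlong, total) pairs copied from the "Overlong samples" lines of this card.
overlong_by_benchmark = {
    "aime24": (1, 1920),
    "aime25": (1, 1920),
    "arenahard": (0, 500),
    "bbh_generative": (0, 5511),
    "creative-writing-v3": (0, 96),
    "drop_generative_nous": (0, 9536),
    "eqbench3": (0, 135),
    "gpqa_diamond": (3, 1584),
    "ifeval": (2, 541),
    "lcb-v6-aug2024+": (9, 7264),
    "math_500": (2, 2000),
    "mmlu_generative": (0, 14042),
    "mmlu_pro": (2, 12032),
    "musr_generative": (0, 756),
    "obqa_generative": (1, 500),
    "rewardbench": (1, 1865),
    "simpleqa_nous": (10, 4321),
}

# Sum the per-benchmark counts, then express the ratio as a percentage.
total_overlong = sum(overlong for overlong, _ in overlong_by_benchmark.values())
total_samples = sum(total for _, total in overlong_by_benchmark.values())
overall_rate_pct = 100 * total_overlong / total_samples

print(f"{total_overlong} / {total_samples} = {overall_rate_pct:.2f}%")
# prints "32 / 64523 = 0.05%"
```

At one decimal place the same ratio rounds down to 0.0%, which is why the summary shows a zero rate despite 32 overlong samples.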
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.600 | 0.091 |
| math_pass@1:4_samples | 0.583 | 0.072 |
| math_pass@1:8_samples | 0.562 | 0.062 |
| math_pass@1:16_samples | 0.583 | 0.061 |
| math_pass@1:32_samples | 0.560 | 0.059 |
| math_pass@1:64_samples | 0.554 | 0.059 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:29:31
**Temperature:** 0.6
**Overlong samples:** 0.1% (1 / 1920)
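
The card does not define how the Std Error column is computed. As a rough consistency check, the sketch below assumes 30 problems (1920 samples / 64 per problem), binary per-problem scores, and the usual standard error of the mean with Bessel's correction. The count of 18 solved problems is inferred from 0.600 × 30, and the function name is hypothetical:

```python
import math

def mean_and_stderr(num_correct, num_problems):
    """Mean and standard error of 0/1 scores, using the sample variance (ddof=1)."""
    p = num_correct / num_problems
    # For binary scores the sample variance is n * p * (1 - p) / (n - 1).
    sample_var = p * (1 - p) * num_problems / (num_problems - 1)
    return p, math.sqrt(sample_var / num_problems)

# Assumption: 30 problems, 18 solved in the single-sample run.
score, stderr = mean_and_stderr(18, 30)
print(f"{score:.3f} {stderr:.3f}")
# prints "0.600 0.091"
```

This matches the `math_pass@1:1_samples` row above. That agreement is consistent with the assumed formula, but it does not confirm what the harness actually does.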
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.467 | 0.093 |
| math_pass@1:4_samples | 0.442 | 0.069 |
| math_pass@1:8_samples | 0.458 | 0.069 |
| math_pass@1:16_samples | 0.463 | 0.071 |
| math_pass@1:32_samples | 0.471 | 0.067 |
| math_pass@1:64_samples | 0.468 | 0.066 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:28:19
**Temperature:** 0.6
**Overlong samples:** 0.1% (1 / 1920)
### arenahard
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.830 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 386.000 | 0.000 |
| eval/tie_count | 57.000 | 0.000 |
| eval/loss_count | 57.000 | 0.000 |
| eval/win_rate | 0.772 | 0.000 |
| eval/tie_rate | 0.114 | 0.000 |
| eval/loss_rate | 0.114 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.830 | 0.000 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:03:02
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
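
The rate rows in this table follow directly from the count rows. A small Python sketch using the counts above; the half-tie weighting at the end is an assumption for illustration only, not a documented formula of the harness:

```python
# Counts copied from the arenahard table.
win_count, tie_count, loss_count = 386, 57, 57
total = win_count + tie_count + loss_count

win_rate = win_count / total
tie_rate = tie_count / total
loss_rate = loss_count / total

# Assumption for illustration only: a tie counts as half a win.
half_tie_winrate = (win_count + 0.5 * tie_count) / total

print(f"{win_rate:.3f} {tie_rate:.3f} {loss_rate:.3f} {half_tie_winrate:.3f}")
# prints "0.772 0.114 0.114 0.829"
```

The half-tie figure of 0.829 is close to, but not equal to, the reported `eval/overall_winrate` of 0.830, so the harness may weight outcomes differently; the card does not say.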
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.844 | 0.018 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:22:58
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 5511)
### creative-writing-v3
| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.616 | 0.000 |
| num_samples | 96.000 | 0.000 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.845 | 0.004 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:37:58
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 9536)
### eqbench3
| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.772 | 0.000 |
| num_samples | 135.000 | 0.000 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.581 | 0.035 |
| gpqa_pass@1:4_samples | 0.604 | 0.027 |
| gpqa_pass@1:8_samples | 0.602 | 0.026 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:14:01
**Temperature:** 0.6
**Overlong samples:** 0.2% (3 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.597 | 0.021 |
| inst_level_strict_acc | 0.700 | 0.001 |
| prompt_level_loose_acc | 0.651 | 0.021 |
| inst_level_loose_acc | 0.748 | 0.000 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:10:14
**Temperature:** 0.6
**Overlong samples:** 0.4% (2 / 541)
### lcb-v6-aug2024+
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.425 | 0.000 |
| eval/easy_pass_1 | 0.913 | 0.000 |
| eval/medium_pass_1 | 0.491 | 0.000 |
| eval/hard_pass_1 | 0.114 | 0.000 |
| eval/completion_length | 37443.071 | 0.000 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 01:11:41
**Temperature:** N/A
**Overlong samples:** 0.1% (9 / 7264)
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.920 | 0.012 |
| math_pass@1:4_samples | 0.911 | 0.009 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:10:45
**Temperature:** 0.6
**Overlong samples:** 0.1% (2 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.841 | 0.003 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:56:10
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.743 | 0.004 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:49:28
**Temperature:** 0.6
**Overlong samples:** 0.0% (2 / 12032)
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.668 | 0.029 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:03:41
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.946 | 0.010 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:02:22
**Temperature:** 0.6
**Overlong samples:** 0.2% (1 / 500)
### rewardbench
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.635 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1184.000 | 0.000 |
| eval/format_compliance_rate | 0.992 | 0.000 |
| eval/avg_response_length | 5515.739 | 0.000 |
| eval/response_length_std | 5197.323 | 0.000 |
| eval/judgment_entropy | 1.406 | 0.000 |
| eval/most_common_judgment_freq | 0.323 | 0.000 |
| eval/format_error_rate | 0.008 | 0.000 |
| eval/avg_ties_rating | 4.328 | 0.000 |
| eval/ties_error_rate | 0.377 | 0.000 |
| eval/percent_correct_Factuality | 0.512 | 0.000 |
| eval/percent_correct_Precise IF | 0.419 | 0.000 |
| eval/percent_correct_Math | 0.792 | 0.000 |
| eval/percent_correct_Safety | 0.573 | 0.000 |
| eval/percent_correct_Focus | 0.802 | 0.000 |
| eval/percent_correct_Ties | 0.725 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 0.992 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.384 | 0.000 |
| eval/wrong_answer_total_count | 653.000 | 0.000 |
| eval/wrong_answer_a_count | 251.000 | 0.000 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:04:49
**Temperature:** 0.6
**Overlong samples:** 0.1% (1 / 1865)
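
Several of the rewardbench rows can be cross-checked against each other. The Python sketch below uses only numbers from the table; the reading that `eval/wrong_answer_total_count` covers only the choice samples is an assumption, and the variable names are illustrative:

```python
# Values copied from the rewardbench table.
total_samples, correct_samples = 1865, 1184
choice_samples, ties_samples = 1763, 102
wrong_total, wrong_a = 653, 251

percent_correct = correct_samples / total_samples   # eval/percent_correct
a_bias_rate = wrong_a / wrong_total                 # eval/wrong_answer_a_bias_rate

# Assumption: wrong_answer_total_count counts wrong answers on choice samples only.
choice_correct = choice_samples - wrong_total
ties_correct = correct_samples - choice_correct
ties_accuracy = ties_correct / ties_samples

print(f"{percent_correct:.3f} {a_bias_rate:.3f} {ties_correct} {ties_accuracy:.3f}")
# prints "0.635 0.384 74 0.725"
```

Under that assumption the derived ties accuracy equals the reported `eval/percent_correct_Ties` of 0.725, which supports the reading without proving it.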
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.037 | 0.003 |
| fuzzy_match | 0.055 | 0.003 |
**Model:** h4-14b-more-stage1-reasoning
**Evaluation Time (hh:mm:ss):** 00:17:31
**Temperature:** 0.6
**Overlong samples:** 0.2% (10 / 4321)
Provider: maas
Created: 2025-09-07



