Hermes-4-405B-reasoning
Source: ModelScope community. Updated: 2025-11-12. Indexed: 2025-08-30.
Download link:
https://modelscope.cn/datasets/NousResearch/Hermes-4-405B-reasoning
Resource description:
# 405b-e3-40k-reasoning Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.819 | math_pass@1:64_samples | 64 | 5.6% |
| aime25 | 0.781 | math_pass@1:64_samples | 64 | 5.3% |
| arenahard | 0.937 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.863 | extractive_match | 1 | 4.7% |
| creative-writing-v3 | 0.793 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.835 | drop_acc | 1 | 1.6% |
| eqbench3 | 0.855 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.706 | gpqa_pass@1:8_samples | 8 | 1.0% |
| ifeval | 0.815 | inst_level_loose_acc | 1 | 1.8% |
| lcb-v6-aug2024+ | 0.614 | eval/pass_1 | 1 | 5.6% |
| math_500 | 0.963 | math_pass@1:4_samples | 4 | 0.2% |
| mmlu_generative | 0.872 | extractive_match | 1 | 1.0% |
| mmlu_pro | 0.806 | pass@1:1_samples | 1 | 2.7% |
| musr_generative | 0.661 | extractive_match | 1 | 3.0% |
| obqa_generative | 0.942 | extractive_match | 1 | 1.6% |
| rewardbench | 0.730 | eval/percent_correct | 1 | 0.6% |
| simpleqa_nous | 0.258 | fuzzy_match | 1 | 1.0% |
Overall overlong rate: 1,619 of 64,523 samples (2.5%) were missing the closing `</think>` tag.
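
The aggregate figure can be reproduced from the per-benchmark overlong counts reported in the detailed results. The sketch below is not taken from the evaluation harness: `is_overlong` is a hypothetical helper that assumes every completion opens a reasoning block and counts as overlong when the closing `</think>` tag never appears, and `COUNTS` is copied by hand from this page.

```python
# (overlong, total) sample counts per benchmark, copied from the detailed results.
COUNTS = {
    "aime24": (107, 1920),
    "aime25": (101, 1920),
    "arenahard": (0, 500),
    "bbh_generative": (261, 5511),
    "creative-writing-v3": (0, 96),
    "drop_generative_nous": (153, 9536),
    "eqbench3": (0, 135),
    "gpqa_diamond": (16, 1584),
    "ifeval": (10, 541),
    "lcb-v6-aug2024+": (407, 7264),
    "math_500": (5, 2000),
    "mmlu_generative": (144, 14042),
    "mmlu_pro": (329, 12032),
    "musr_generative": (23, 756),
    "obqa_generative": (8, 500),
    "rewardbench": (12, 1865),
    "simpleqa_nous": (43, 4321),
}


def is_overlong(completion: str) -> bool:
    """Hypothetical check: the reasoning block was never closed."""
    return "</think>" not in completion


def overlong_rate(counts):
    """Sum overlong and total samples across benchmarks and return the ratio."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


overlong, total, rate = overlong_rate(COUNTS)
print(f"{overlong:,} / {total:,} ({rate:.1%})")  # 1,619 / 64,523 (2.5%)
```
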
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.667 | 0.088 |
| math_pass@1:4_samples | 0.808 | 0.057 |
| math_pass@1:8_samples | 0.812 | 0.055 |
| math_pass@1:16_samples | 0.804 | 0.057 |
| math_pass@1:32_samples | 0.816 | 0.054 |
| math_pass@1:64_samples | 0.819 | 0.054 |
**Model:** 405b-e3-40k-reasoning
**Evaluation Time (hh:mm:ss):** 01:32:48
**Temperature:** 0.6
**Overlong samples:** 5.6% (107 / 1920)
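
The metric names suggest that `math_pass@1:k_samples` averages correctness over k sampled completions per problem and then over problems. The sketch below makes that assumption explicit; it is not the harness's code. It also assumes 30 problems (inferred from 1920 / 64) and a standard error computed as the sample standard deviation of per-problem scores divided by the square root of the problem count. Under those assumptions, 20 of 30 problems solved with one sample each gives the reported 0.667 ± 0.088.

```python
import math
import statistics


def pass_at_1(per_problem_correct):
    """Estimate pass@1 from k samples per problem.

    per_problem_correct: one inner list of 0/1 outcomes per problem.
    Returns (mean pass@1, standard error across problems).
    """
    per_problem = [sum(s) / len(s) for s in per_problem_correct]
    mean = statistics.fmean(per_problem)
    # Assumed definition: sample std (n - 1) over problems / sqrt(n).
    stderr = statistics.stdev(per_problem) / math.sqrt(len(per_problem))
    return mean, stderr


# Illustrative data only: 30 problems, one sample each, 20 solved.
single_sample = [[1]] * 20 + [[0]] * 10
mean, stderr = pass_at_1(single_sample)
print(round(mean, 3), round(stderr, 3))  # 0.667 0.088
```

Averaging over more samples per problem reduces the noise in each per-problem score, which is consistent with the standard error shrinking from 0.088 at 1 sample to 0.054 at 64 samples.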
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.767 | 0.079 |
| math_pass@1:4_samples | 0.817 | 0.066 |
| math_pass@1:8_samples | 0.783 | 0.063 |
| math_pass@1:16_samples | 0.783 | 0.061 |
| math_pass@1:32_samples | 0.781 | 0.061 |
| math_pass@1:64_samples | 0.781 | 0.060 |
**Model:** 405b-e3-40k-reasoning
**Evaluation Time (hh:mm:ss):** 01:44:09
**Temperature:** 0.6
**Overlong samples:** 5.3% (101 / 1920)
### arenahard
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.937 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 455.000 | 0.000 |
| eval/tie_count | 27.000 | 0.000 |
| eval/loss_count | 18.000 | 0.000 |
| eval/win_rate | 0.910 | 0.000 |
| eval/tie_rate | 0.054 | 0.000 |
| eval/loss_rate | 0.036 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.937 | 0.000 |
**Model:** h4-405b-thinking-arena
**Evaluation Time (hh:mm:ss):** 00:16:23
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
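
The reported rates are consistent with counting each tie as half a win in `eval/overall_winrate`. This is an inference from the numbers above, not a documented formula; the function name is hypothetical.

```python
def arena_rates(wins: int, ties: int, losses: int) -> dict:
    """Derive win/tie/loss rates and a tie-adjusted overall win rate."""
    total = wins + ties + losses
    return {
        "win_rate": wins / total,
        "tie_rate": ties / total,
        "loss_rate": losses / total,
        # Assumption: a tie contributes half a win.
        "overall_winrate": (wins + 0.5 * ties) / total,
    }


rates = arena_rates(wins=455, ties=27, losses=18)
print({k: round(v, 3) for k, v in rates.items()})
```
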
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.863 | 0.018 |
**Model:** h4-405b-e3-mmlu-bbh-obqa
**Evaluation Time (hh:mm:ss):** 01:10:53
**Temperature:** 0.6
**Overlong samples:** 4.7% (261 / 5511)
### creative-writing-v3
| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.793 | 0.000 |
| num_samples | 96.000 | 0.000 |
**Model:** h4-405b-e3-think-cwlr
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.835 | 0.004 |
**Model:** 405b-e3-bbh
**Evaluation Time (hh:mm:ss):** 02:17:30
**Temperature:** 0.6
**Overlong samples:** 1.6% (153 / 9536)
### eqbench3
| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.855 | 0.000 |
| num_samples | 135.000 | 0.000 |
**Model:** h4-405b-thinking-arena
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.717 | 0.032 |
| gpqa_pass@1:4_samples | 0.707 | 0.027 |
| gpqa_pass@1:8_samples | 0.706 | 0.026 |
**Model:** h4-405b-e3-reasoning
**Evaluation Time (hh:mm:ss):** 01:54:42
**Temperature:** 0.6
**Overlong samples:** 1.0% (16 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.649 | 0.021 |
| inst_level_strict_acc | 0.751 | 0.000 |
| prompt_level_loose_acc | 0.736 | 0.019 |
| inst_level_loose_acc | 0.815 | 0.000 |
**Model:** 405b-e3-bbh
**Evaluation Time (hh:mm:ss):** 00:23:43
**Temperature:** 0.6
**Overlong samples:** 1.8% (10 / 541)
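
The gap between the prompt-level and instruction-level scores follows from how IFEval aggregates: a prompt counts as correct only if every instruction in it is followed, while instruction-level accuracy scores each instruction separately. A minimal sketch with made-up data (the `toy` results are illustrative, not from this run):

```python
def ifeval_accuracy(results):
    """results: one list of booleans per prompt, one boolean per instruction."""
    # Prompt-level: all instructions in the prompt must be followed.
    prompt_level = sum(all(r) for r in results) / len(results)
    # Instruction-level: each instruction is scored on its own.
    flat = [ok for r in results for ok in r]
    inst_level = sum(flat) / len(flat)
    return prompt_level, inst_level


toy = [[True, True], [True, False], [True], [False, True, True]]
prompt_acc, inst_acc = ifeval_accuracy(toy)
print(prompt_acc, inst_acc)  # 0.5 0.75
```

Because one missed instruction fails the whole prompt, the prompt-level score is typically the lower of the two, as it is in the table above.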
### lcb-v6-aug2024+
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.614 | 0.000 |
| eval/easy_pass_1 | 0.963 | 0.000 |
| eval/medium_pass_1 | 0.766 | 0.000 |
| eval/hard_pass_1 | 0.318 | 0.000 |
| eval/completion_length | 52292.525 | 0.000 |
**Model:** h4-405b-e3
**Evaluation Time (hh:mm:ss):** 06:07:26
**Temperature:** N/A
**Overlong samples:** 5.6% (407 / 7264)
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.956 | 0.009 |
| math_pass@1:4_samples | 0.963 | 0.006 |
**Model:** h4-405b-e3-reasoning
**Evaluation Time (hh:mm:ss):** 01:29:42
**Temperature:** 0.6
**Overlong samples:** 0.2% (5 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.872 | 0.003 |
**Model:** h4-405b-e3-mmlu-bbh-obqa
**Evaluation Time (hh:mm:ss):** 01:48:55
**Temperature:** 0.6
**Overlong samples:** 1.0% (144 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.806 | 0.004 |
**Model:** h4-405b-8nodes
**Evaluation Time (hh:mm:ss):** 22:34:26
**Temperature:** 0.6
**Overlong samples:** 2.7% (329 / 12032)
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.661 | 0.029 |
**Model:** h4-405b-e3-reasoning
**Evaluation Time (hh:mm:ss):** 00:56:15
**Temperature:** 0.6
**Overlong samples:** 3.0% (23 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.942 | 0.010 |
**Model:** h4-405b-e3-mmlu-bbh-obqa
**Evaluation Time (hh:mm:ss):** 00:16:35
**Temperature:** 0.6
**Overlong samples:** 1.6% (8 / 500)
### rewardbench
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.730 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1362.000 | 0.000 |
| eval/format_compliance_rate | 0.995 | 0.000 |
| eval/avg_response_length | 4527.372 | 0.000 |
| eval/response_length_std | 5674.062 | 0.000 |
| eval/judgment_entropy | 1.399 | 0.000 |
| eval/most_common_judgment_freq | 0.314 | 0.000 |
| eval/format_error_rate | 0.005 | 0.000 |
| eval/avg_ties_rating | 3.840 | 0.000 |
| eval/ties_error_rate | 0.353 | 0.000 |
| eval/percent_correct_Factuality | 0.663 | 0.000 |
| eval/percent_correct_Precise IF | 0.519 | 0.000 |
| eval/percent_correct_Math | 0.869 | 0.000 |
| eval/percent_correct_Safety | 0.636 | 0.000 |
| eval/percent_correct_Focus | 0.889 | 0.000 |
| eval/percent_correct_Ties | 0.775 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 0.995 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.398 | 0.000 |
| eval/wrong_answer_total_count | 480.000 | 0.000 |
| eval/wrong_answer_a_count | 191.000 | 0.000 |
**Model:** h4-405b-thinking-reward-redo
**Evaluation Time (hh:mm:ss):** 00:25:11
**Temperature:** 0.6
**Overlong samples:** 0.6% (12 / 1865)
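
Several of these figures can be cross-checked against one another. The sketch below recomputes the headline accuracy and the position-bias rate from the raw counts. The reading that `eval/wrong_answer_total_count` covers only the choice samples (with the remaining errors falling in the ties subset) is an inference from the arithmetic, not something the results state.

```python
def rewardbench_checks(total, correct, choice, ties, wrong_choice, wrong_a):
    """Recompute derived RewardBench figures from the reported counts."""
    wrong_total = total - correct
    # Assumption: errors not counted in wrong_choice belong to the ties subset.
    wrong_ties = wrong_total - wrong_choice
    return {
        "percent_correct": correct / total,
        "subsets_sum_to_total": choice + ties == total,
        "a_bias_rate": wrong_a / wrong_choice,
        "ties_accuracy": (ties - wrong_ties) / ties,
    }


checks = rewardbench_checks(
    total=1865, correct=1362, choice=1763, ties=102,
    wrong_choice=480, wrong_a=191,
)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in checks.items()})
```

Under that reading, the recomputed values round to the reported 0.730 accuracy, 0.398 A-bias rate, and 0.775 ties accuracy.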
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.200 | 0.006 |
| fuzzy_match | 0.258 | 0.007 |
**Model:** 405b-e3-bbh
**Evaluation Time (hh:mm:ss):** 01:00:59
**Temperature:** 0.6
**Overlong samples:** 1.0% (43 / 4321)
Provider:
maas
Created:
2025-08-27



