
Hermes-4-70B-reasoning

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/NousResearch/Hermes-4-70B-reasoning
Resource description:
# hermes-4-70b-reasoning-40k Evaluation Results

## Summary

| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.735 | math_pass@1:64_samples | 64 | 8.4% |
| aime25 | 0.674 | math_pass@1:64_samples | 64 | 9.6% |
| arenahard | 0.901 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.878 | extractive_match | 1 | 4.8% |
| creative-writing-v3 | 0.775 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.850 | drop_acc | 1 | 1.4% |
| eqbench3 | 0.847 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.661 | gpqa_pass@1:8_samples | 8 | 3.0% |
| ifeval | 0.787 | inst_level_loose_acc | 1 | 6.5% |
| lcb-v6-aug2024+ | 0.505 | eval/pass_1 | 1 | 16.6% |
| math_500 | 0.956 | math_pass@1:4_samples | 4 | 0.9% |
| mmlu_generative | 0.884 | extractive_match | 1 | 0.3% |
| mmlu_pro | 0.807 | pass@1:1_samples | 1 | 0.7% |
| musr_generative | 0.704 | extractive_match | 1 | 0.5% |
| obqa_generative | 0.948 | extractive_match | 1 | 1.6% |
| rewardbench | 0.649 | eval/percent_correct | 1 | 0.4% |
| simpleqa_nous | 0.179 | fuzzy_match | 1 | 2.5% |

Overlong rate: 2,311 / 64,523 samples (3.6%) missing closing `</think>` tag

## Detailed Results

### aime24

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.700 | 0.085 |
| math_pass@1:4_samples | 0.725 | 0.065 |
| math_pass@1:8_samples | 0.738 | 0.065 |
| math_pass@1:16_samples | 0.727 | 0.064 |
| math_pass@1:32_samples | 0.736 | 0.060 |
| math_pass@1:64_samples | 0.735 | 0.060 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:50:14
**Temperature:** 0.6
**Overlong samples:** 8.4% (161 / 1920)

### aime25

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.700 | 0.085 |
| math_pass@1:4_samples | 0.658 | 0.071 |
| math_pass@1:8_samples | 0.667 | 0.068 |
| math_pass@1:16_samples | 0.669 | 0.066 |
| math_pass@1:32_samples | 0.674 | 0.065 |
| math_pass@1:64_samples | 0.674 | 0.064 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:51:49
**Temperature:** 0.6
**Overlong samples:** 9.6% (185 / 1920)

### arenahard

| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.901 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 426.000 | 0.000 |
| eval/tie_count | 50.000 | 0.000 |
| eval/loss_count | 24.000 | 0.000 |
| eval/win_rate | 0.852 | 0.000 |
| eval/tie_rate | 0.100 | 0.000 |
| eval/loss_rate | 0.048 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.901 | 0.000 |

**Model:** h4-70b-thinking-arena
**Evaluation Time (hh:mm:ss):** 00:07:16
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)

### bbh_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.878 | 0.016 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:40:44
**Temperature:** 0.6
**Overlong samples:** 4.8% (266 / 5511)

### creative-writing-v3

| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.775 | 0.000 |
| num_samples | 96.000 | 0.000 |

**Model:** h4-70b-reasoning-cwlf
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)

### drop_generative_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.850 | 0.004 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:54:42
**Temperature:** 0.6
**Overlong samples:** 1.4% (132 / 9536)

### eqbench3

| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.847 | 0.000 |
| num_samples | 135.000 | 0.000 |

**Model:** h4-70b-thinking-arena
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)

### gpqa_diamond

| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.667 | 0.034 |
| gpqa_pass@1:4_samples | 0.659 | 0.028 |
| gpqa_pass@1:8_samples | 0.661 | 0.026 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:24:41
**Temperature:** 0.6
**Overlong samples:** 3.0% (48 / 1584)

### ifeval

| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.641 | 0.021 |
| inst_level_strict_acc | 0.743 | 0.000 |
| prompt_level_loose_acc | 0.699 | 0.020 |
| inst_level_loose_acc | 0.787 | 0.000 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:22:55
**Temperature:** 0.6
**Overlong samples:** 6.5% (35 / 541)

### lcb-v6-aug2024+

| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.505 | 0.000 |
| eval/easy_pass_1 | 0.890 | 0.000 |
| eval/medium_pass_1 | 0.605 | 0.000 |
| eval/hard_pass_1 | 0.226 | 0.000 |
| eval/completion_length | 61285.118 | 0.000 |

**Model:** h4-70b-reasoning
**Evaluation Time (hh:mm:ss):** 06:48:33
**Temperature:** N/A
**Overlong samples:** 16.6% (1208 / 7264)

### math_500

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.960 | 0.009 |
| math_pass@1:4_samples | 0.956 | 0.007 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:17:36
**Temperature:** 0.6
**Overlong samples:** 0.9% (19 / 2000)

### mmlu_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.884 | 0.003 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 01:10:41
**Temperature:** 0.6
**Overlong samples:** 0.3% (43 / 14042)

### mmlu_pro

| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.807 | 0.004 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 01:19:44
**Temperature:** 0.6
**Overlong samples:** 0.7% (89 / 12032)

### musr_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.704 | 0.028 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:14:29
**Temperature:** 0.6
**Overlong samples:** 0.5% (4 / 756)

### obqa_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.948 | 0.010 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:14:45
**Temperature:** 0.6
**Overlong samples:** 1.6% (8 / 500)

### rewardbench

| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.649 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1210.000 | 0.000 |
| eval/format_compliance_rate | 0.945 | 0.000 |
| eval/avg_response_length | 9135.364 | 0.000 |
| eval/response_length_std | 8536.122 | 0.000 |
| eval/judgment_entropy | 1.525 | 0.000 |
| eval/most_common_judgment_freq | 0.264 | 0.000 |
| eval/format_error_rate | 0.058 | 0.000 |
| eval/avg_ties_rating | 4.861 | 0.000 |
| eval/ties_error_rate | 0.130 | 0.000 |
| eval/percent_correct_Factuality | 0.581 | 0.000 |
| eval/percent_correct_Precise IF | 0.475 | 0.000 |
| eval/percent_correct_Math | 0.869 | 0.000 |
| eval/percent_correct_Safety | 0.549 | 0.000 |
| eval/percent_correct_Focus | 0.719 | 0.000 |
| eval/percent_correct_Ties | 0.941 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 0.942 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.284 | 0.000 |
| eval/wrong_answer_total_count | 649.000 | 0.000 |
| eval/wrong_answer_a_count | 184.000 | 0.000 |

**Model:** h4-70b-thinking-arena
**Evaluation Time (hh:mm:ss):** 00:16:07
**Temperature:** 0.6
**Overlong samples:** 0.4% (7 / 1865)

### simpleqa_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.130 | 0.005 |
| fuzzy_match | 0.179 | 0.006 |

**Model:** hermes-4-70b-reasoning-40k
**Evaluation Time (hh:mm:ss):** 00:34:35
**Temperature:** 0.6
**Overlong samples:** 2.5% (106 / 4321)
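
The overall overlong rate quoted in the summary (2,311 / 64,523 samples, 3.6%) can be cross-checked against the per-benchmark "Overlong samples" figures. The following is a minimal sketch, not the evaluation harness's own code: the counts are copied from the detailed results, while `OVERLONG_COUNTS`, `is_overlong`, and `overlong_rate` are hypothetical names introduced for illustration. The `is_overlong` helper assumes, based on the summary's wording, that a sample counts as overlong when its completion never emits a closing `</think>` tag.

```python
# (overlong, total) sample counts per benchmark, copied from the detailed results.
OVERLONG_COUNTS = {
    "aime24": (161, 1920),
    "aime25": (185, 1920),
    "arenahard": (0, 500),
    "bbh_generative": (266, 5511),
    "creative-writing-v3": (0, 96),
    "drop_generative_nous": (132, 9536),
    "eqbench3": (0, 135),
    "gpqa_diamond": (48, 1584),
    "ifeval": (35, 541),
    "lcb-v6-aug2024+": (1208, 7264),
    "math_500": (19, 2000),
    "mmlu_generative": (43, 14042),
    "mmlu_pro": (89, 12032),
    "musr_generative": (4, 756),
    "obqa_generative": (8, 500),
    "rewardbench": (7, 1865),
    "simpleqa_nous": (106, 4321),
}


def is_overlong(completion: str) -> bool:
    """Hypothetical per-sample check: the reasoning block was never closed."""
    return "</think>" not in completion


def overlong_rate(counts: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Sum overlong and total samples across benchmarks and return the pooled rate."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


overlong, total, rate = overlong_rate(OVERLONG_COUNTS)
print(f"{overlong:,} / {total:,} samples ({rate:.1%})")
# prints "2,311 / 64,523 samples (3.6%)"
```

Note that the pooled rate weights each benchmark by its sample count, so large suites such as mmlu_generative and mmlu_pro pull the overall figure well below the 16.6% seen on lcb-v6-aug2024+.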
Provider:
maas
Created:
2025-08-27