
NousResearch/eval-Hermes-4.3-36B

Source: Hugging Face. Updated 2025-11-25; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/NousResearch/eval-Hermes-4.3-36B
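
The dataset card's `configs` section follows a regular naming pattern: each benchmark has a `<benchmark>_groups` config backed by `<benchmark>/details.parquet` and a `<benchmark>_samples` config backed by `<benchmark>/conversations.parquet`, each with a single `latest` split. The sketch below encodes that pattern and shows one way to load a config with the Hugging Face `datasets` library. The helper names (`config_name`, `parquet_path`, `load_config`) are illustrative and not part of the dataset, and the repository id is taken from the path of the download link above.

```python
# Repository id taken from the download link; helper names are hypothetical.
REPO_ID = "NousResearch/eval-Hermes-4.3-36B"

# Benchmarks listed in the dataset card's `configs` section.
BENCHMARKS = [
    "aime24", "aime25", "bbh_generative", "drop_generative_nous",
    "gpqa_diamond", "ifeval", "math_500", "mmlu_generative",
    "mmlu_pro", "musr_generative", "obqa_generative", "simpleqa_nous",
]

# Each config kind maps to one parquet file inside the benchmark's folder.
_FILENAMES = {"groups": "details.parquet", "samples": "conversations.parquet"}


def config_name(benchmark: str, kind: str) -> str:
    """Return the config name, e.g. 'aime24_samples'."""
    if kind not in _FILENAMES:
        raise ValueError(f"kind must be one of {sorted(_FILENAMES)}")
    return f"{benchmark}_{kind}"


def parquet_path(benchmark: str, kind: str) -> str:
    """Return the repo-relative parquet path behind a config."""
    if kind not in _FILENAMES:
        raise ValueError(f"kind must be one of {sorted(_FILENAMES)}")
    return f"{benchmark}/{_FILENAMES[kind]}"


def load_config(benchmark: str, kind: str):
    """Load one config's `latest` split (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(REPO_ID, config_name(benchmark, kind), split="latest")
```

For example, `load_config("aime24", "samples")` would request the `aime24_samples` config. The column layout of the returned data is not documented in the card, so inspect `.features` before relying on specific fields.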
Description:
---
dataset_info:
  features:
  - name: benchmark_results
    dtype: string
configs:
- config_name: aime24_groups
  data_files:
  - split: latest
    path: "aime24/details.parquet"
- config_name: aime24_samples
  data_files:
  - split: latest
    path: "aime24/conversations.parquet"
- config_name: aime25_groups
  data_files:
  - split: latest
    path: "aime25/details.parquet"
- config_name: aime25_samples
  data_files:
  - split: latest
    path: "aime25/conversations.parquet"
- config_name: bbh_generative_groups
  data_files:
  - split: latest
    path: "bbh_generative/details.parquet"
- config_name: bbh_generative_samples
  data_files:
  - split: latest
    path: "bbh_generative/conversations.parquet"
- config_name: drop_generative_nous_groups
  data_files:
  - split: latest
    path: "drop_generative_nous/details.parquet"
- config_name: drop_generative_nous_samples
  data_files:
  - split: latest
    path: "drop_generative_nous/conversations.parquet"
- config_name: gpqa_diamond_groups
  data_files:
  - split: latest
    path: "gpqa_diamond/details.parquet"
- config_name: gpqa_diamond_samples
  data_files:
  - split: latest
    path: "gpqa_diamond/conversations.parquet"
- config_name: ifeval_groups
  data_files:
  - split: latest
    path: "ifeval/details.parquet"
- config_name: ifeval_samples
  data_files:
  - split: latest
    path: "ifeval/conversations.parquet"
- config_name: math_500_groups
  data_files:
  - split: latest
    path: "math_500/details.parquet"
- config_name: math_500_samples
  data_files:
  - split: latest
    path: "math_500/conversations.parquet"
- config_name: mmlu_generative_groups
  data_files:
  - split: latest
    path: "mmlu_generative/details.parquet"
- config_name: mmlu_generative_samples
  data_files:
  - split: latest
    path: "mmlu_generative/conversations.parquet"
- config_name: mmlu_pro_groups
  data_files:
  - split: latest
    path: "mmlu_pro/details.parquet"
- config_name: mmlu_pro_samples
  data_files:
  - split: latest
    path: "mmlu_pro/conversations.parquet"
- config_name: musr_generative_groups
  data_files:
  - split: latest
    path: "musr_generative/details.parquet"
- config_name: musr_generative_samples
  data_files:
  - split: latest
    path: "musr_generative/conversations.parquet"
- config_name: obqa_generative_groups
  data_files:
  - split: latest
    path: "obqa_generative/details.parquet"
- config_name: obqa_generative_samples
  data_files:
  - split: latest
    path: "obqa_generative/conversations.parquet"
- config_name: simpleqa_nous_groups
  data_files:
  - split: latest
    path: "simpleqa_nous/details.parquet"
- config_name: simpleqa_nous_samples
  data_files:
  - split: latest
    path: "simpleqa_nous/conversations.parquet"
language:
- en
size_categories:
- 1K<n<10K
tags:
- evaluation
- benchmarks
---

# 36bpsychev2 Evaluation Results

## Summary

| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.719 | math_pass@1:64_samples | 64 | 17.6% |
| aime25 | 0.693 | math_pass@1:64_samples | 64 | 18.8% |
| bbh_generative | 0.864 | extractive_match | 1 | 4.8% |
| drop_generative_nous | 0.835 | drop_acc | 1 | 2.7% |
| gpqa_diamond | 0.655 | gpqa_pass@1:8_samples | 8 | 2.2% |
| ifeval | 0.779 | inst_level_loose_acc | 1 | 7.8% |
| math_500 | 0.938 | math_pass@1:4_samples | 4 | 1.8% |
| mmlu_generative | 0.877 | extractive_match | 1 | 0.1% |
| mmlu_pro | 0.807 | pass@1:1_samples | 1 | 1.1% |
| musr_generative | 0.697 | extractive_match | 1 | 0.4% |
| obqa_generative | 0.966 | extractive_match | 1 | 0.2% |
| simpleqa_nous | 0.060 | fuzzy_match | 1 | 8.6% |

Overlong rate: 1,846 / 54,663 samples (3.4%) missing closing `</think>` tag

## Detailed Results

### aime24

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.767 | 0.079 |
| math_pass@1:4_samples | 0.700 | 0.068 |
| math_pass@1:8_samples | 0.713 | 0.064 |
| math_pass@1:16_samples | 0.719 | 0.061 |
| math_pass@1:32_samples | 0.731 | 0.061 |
| math_pass@1:64_samples | 0.719 | 0.062 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 02:57:47

**Temperature:** 0.6

**Overlong samples:** 17.6% (337 / 1920)

### aime25

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.633 | 0.089 |
| math_pass@1:4_samples | 0.692 | 0.062 |
| math_pass@1:8_samples | 0.662 | 0.063 |
| math_pass@1:16_samples | 0.694 | 0.061 |
| math_pass@1:32_samples | 0.693 | 0.062 |
| math_pass@1:64_samples | 0.693 | 0.063 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 03:06:24

**Temperature:** 0.6

**Overlong samples:** 18.8% (360 / 1920)

### bbh_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.864 | 0.017 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:27:42

**Temperature:** 0.6

**Overlong samples:** 4.8% (262 / 5511)

### drop_generative_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.835 | 0.004 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:22:41

**Temperature:** 0.6

**Overlong samples:** 2.7% (255 / 9536)

### gpqa_diamond

| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.641 | 0.034 |
| gpqa_pass@1:4_samples | 0.663 | 0.027 |
| gpqa_pass@1:8_samples | 0.655 | 0.026 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:18:07

**Temperature:** 0.6

**Overlong samples:** 2.2% (35 / 1584)

### ifeval

| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.654 | 0.020 |
| inst_level_strict_acc | 0.740 | 0.000 |
| prompt_level_loose_acc | 0.702 | 0.020 |
| inst_level_loose_acc | 0.779 | 0.000 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:25:46

**Temperature:** 0.6

**Overlong samples:** 7.8% (42 / 541)

### math_500

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.942 | 0.010 |
| math_pass@1:4_samples | 0.938 | 0.008 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:33:46

**Temperature:** 0.6

**Overlong samples:** 1.8% (35 / 2000)

### mmlu_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.877 | 0.003 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:46:37

**Temperature:** 0.6

**Overlong samples:** 0.1% (16 / 14042)

### mmlu_pro

| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.807 | 0.004 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 04:43:51

**Temperature:** 0.6

**Overlong samples:** 1.1% (128 / 12032)

### musr_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.697 | 0.028 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:23:45

**Temperature:** 0.6

**Overlong samples:** 0.4% (3 / 756)

### obqa_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.966 | 0.008 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:17:21

**Temperature:** 0.6

**Overlong samples:** 0.2% (1 / 500)

### simpleqa_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.039 | 0.003 |
| fuzzy_match | 0.060 | 0.004 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:27:06

**Temperature:** 0.6

**Overlong samples:** 8.6% (372 / 4321)
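
The aggregate overlong rate in the summary (1,846 / 54,663 samples, 3.4%) can be reproduced by pooling the per-benchmark "Overlong samples" counts reported above. The following is a minimal sketch of that arithmetic; the counts are copied from the card, while the names `OVERLONG_COUNTS` and `overlong_rate` are illustrative only.

```python
# (overlong, total) per benchmark, copied from the "Overlong samples" lines.
OVERLONG_COUNTS = {
    "aime24": (337, 1920),
    "aime25": (360, 1920),
    "bbh_generative": (262, 5511),
    "drop_generative_nous": (255, 9536),
    "gpqa_diamond": (35, 1584),
    "ifeval": (42, 541),
    "math_500": (35, 2000),
    "mmlu_generative": (16, 14042),
    "mmlu_pro": (128, 12032),
    "musr_generative": (3, 756),
    "obqa_generative": (1, 500),
    "simpleqa_nous": (372, 4321),
}


def overlong_rate(counts):
    """Pool counts across benchmarks and return (overlong, total, rate)."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


if __name__ == "__main__":
    overlong, total, rate = overlong_rate(OVERLONG_COUNTS)
    print(f"{overlong:,} / {total:,} samples ({rate:.1%})")
    # prints "1,846 / 54,663 samples (3.4%)"
```

Because the rate is pooled over samples rather than averaged over benchmarks, large benchmarks such as mmlu_generative and mmlu_pro pull it well below the 17.6% and 18.8% seen on aime24 and aime25.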
Provider:
NousResearch