
NousResearch/eval-Hermes-4.3-36B-centralized

Hugging Face, updated 2025-11-26, indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/NousResearch/eval-Hermes-4.3-36B-centralized
Resource description:
---
dataset_info:
  features:
  - name: benchmark_results
    dtype: string
configs:
- config_name: aime24_groups
  data_files:
  - split: latest
    path: "aime24/details.parquet"
- config_name: aime24_samples
  data_files:
  - split: latest
    path: "aime24/conversations.parquet"
- config_name: aime25_groups
  data_files:
  - split: latest
    path: "aime25/details.parquet"
- config_name: aime25_samples
  data_files:
  - split: latest
    path: "aime25/conversations.parquet"
- config_name: bbh_generative_groups
  data_files:
  - split: latest
    path: "bbh_generative/details.parquet"
- config_name: bbh_generative_samples
  data_files:
  - split: latest
    path: "bbh_generative/conversations.parquet"
- config_name: drop_generative_nous_groups
  data_files:
  - split: latest
    path: "drop_generative_nous/details.parquet"
- config_name: drop_generative_nous_samples
  data_files:
  - split: latest
    path: "drop_generative_nous/conversations.parquet"
- config_name: gpqa_diamond_groups
  data_files:
  - split: latest
    path: "gpqa_diamond/details.parquet"
- config_name: gpqa_diamond_samples
  data_files:
  - split: latest
    path: "gpqa_diamond/conversations.parquet"
- config_name: ifeval_groups
  data_files:
  - split: latest
    path: "ifeval/details.parquet"
- config_name: ifeval_samples
  data_files:
  - split: latest
    path: "ifeval/conversations.parquet"
- config_name: math_500_groups
  data_files:
  - split: latest
    path: "math_500/details.parquet"
- config_name: math_500_samples
  data_files:
  - split: latest
    path: "math_500/conversations.parquet"
- config_name: mmlu_generative_groups
  data_files:
  - split: latest
    path: "mmlu_generative/details.parquet"
- config_name: mmlu_generative_samples
  data_files:
  - split: latest
    path: "mmlu_generative/conversations.parquet"
- config_name: mmlu_pro_groups
  data_files:
  - split: latest
    path: "mmlu_pro/details.parquet"
- config_name: mmlu_pro_samples
  data_files:
  - split: latest
    path: "mmlu_pro/conversations.parquet"
- config_name: musr_generative_groups
  data_files:
  - split: latest
    path: "musr_generative/details.parquet"
- config_name: musr_generative_samples
  data_files:
  - split: latest
    path: "musr_generative/conversations.parquet"
- config_name: obqa_generative_groups
  data_files:
  - split: latest
    path: "obqa_generative/details.parquet"
- config_name: obqa_generative_samples
  data_files:
  - split: latest
    path: "obqa_generative/conversations.parquet"
- config_name: simpleqa_nous_groups
  data_files:
  - split: latest
    path: "simpleqa_nous/details.parquet"
- config_name: simpleqa_nous_samples
  data_files:
  - split: latest
    path: "simpleqa_nous/conversations.parquet"
language:
- en
size_categories:
- 1K<n<10K
tags:
- evaluation
- benchmarks
---

# 36btorchtitan Evaluation Results

## Summary

| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.706 | math_pass@1:64_samples | 64 | 24.6% |
| aime25 | 0.669 | math_pass@1:64_samples | 64 | 26.8% |
| bbh_generative | 0.847 | extractive_match | 1 | 9.3% |
| drop_generative_nous | 0.817 | drop_acc | 1 | 6.9% |
| gpqa_diamond | 0.649 | gpqa_pass@1:8_samples | 8 | 4.2% |
| ifeval | 0.740 | inst_level_loose_acc | 1 | 13.1% |
| math_500 | 0.923 | math_pass@1:4_samples | 4 | 3.1% |
| mmlu_generative | 0.866 | extractive_match | 1 | 1.4% |
| mmlu_pro | 0.797 | pass@1:1_samples | 1 | 2.9% |
| musr_generative | 0.647 | extractive_match | 1 | 17.2% |
| obqa_generative | 0.918 | extractive_match | 1 | 5.0% |
| simpleqa_nous | 0.056 | fuzzy_match | 1 | 22.8% |

Overlong rate: 4,041 / 54,663 samples (7.4%) missing closing `</think>` tag

## Detailed Results

### aime24

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.767 | 0.079 |
| math_pass@1:4_samples | 0.733 | 0.070 |
| math_pass@1:8_samples | 0.700 | 0.064 |
| math_pass@1:16_samples | 0.690 | 0.064 |
| math_pass@1:32_samples | 0.702 | 0.066 |
| math_pass@1:64_samples | 0.706 | 0.066 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 03:10:32
**Temperature:** 0.6
**Overlong samples:** 24.6% (473 / 1920)

### aime25

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.633 | 0.089 |
| math_pass@1:4_samples | 0.683 | 0.077 |
| math_pass@1:8_samples | 0.646 | 0.072 |
| math_pass@1:16_samples | 0.650 | 0.069 |
| math_pass@1:32_samples | 0.662 | 0.068 |
| math_pass@1:64_samples | 0.669 | 0.067 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 03:16:06
**Temperature:** 0.6
**Overlong samples:** 26.8% (515 / 1920)

### bbh_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.847 | 0.019 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 02:05:45
**Temperature:** 0.6
**Overlong samples:** 9.3% (513 / 5511)

### drop_generative_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.817 | 0.004 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 02:08:58
**Temperature:** 0.6
**Overlong samples:** 6.9% (656 / 9536)

### gpqa_diamond

| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.677 | 0.033 |
| gpqa_pass@1:4_samples | 0.657 | 0.028 |
| gpqa_pass@1:8_samples | 0.649 | 0.027 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 01:27:00
**Temperature:** 0.6
**Overlong samples:** 4.2% (67 / 1584)

### ifeval

| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.595 | 0.021 |
| inst_level_strict_acc | 0.697 | 0.001 |
| prompt_level_loose_acc | 0.654 | 0.020 |
| inst_level_loose_acc | 0.740 | 0.001 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 00:32:45
**Temperature:** 0.6
**Overlong samples:** 13.1% (71 / 541)

### math_500

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.924 | 0.012 |
| math_pass@1:4_samples | 0.923 | 0.008 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 00:40:12
**Temperature:** 0.6
**Overlong samples:** 3.1% (63 / 2000)

### mmlu_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.866 | 0.003 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 02:32:05
**Temperature:** 0.6
**Overlong samples:** 1.4% (200 / 14042)

### mmlu_pro

| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.797 | 0.004 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 05:52:54
**Temperature:** 0.6
**Overlong samples:** 2.9% (343 / 12032)

### musr_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.647 | 0.029 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 00:41:25
**Temperature:** 0.6
**Overlong samples:** 17.2% (130 / 756)

### obqa_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.918 | 0.012 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 00:23:38
**Temperature:** 0.6
**Overlong samples:** 5.0% (25 / 500)

### simpleqa_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.038 | 0.003 |
| fuzzy_match | 0.056 | 0.003 |

**Model:** 36btorchtitan
**Evaluation Time (hh:mm:ss):** 02:52:46
**Temperature:** 0.6
**Overlong samples:** 22.8% (985 / 4321)
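
The overall overlong rate in the summary can be cross-checked against the per-benchmark "Overlong samples" counts. The sketch below is illustrative only and is not part of the dataset or its tooling: the names `OVERLONG`, `overlong_rate`, and `aggregate` are hypothetical, and the counts are copied by hand from the benchmark sections above.

```python
# Hypothetical helper for cross-checking the reported overlong rates.
# Each entry is (overlong samples, total samples), copied from the
# "Overlong samples" line of the corresponding benchmark section.
OVERLONG = {
    "aime24": (473, 1920),
    "aime25": (515, 1920),
    "bbh_generative": (513, 5511),
    "drop_generative_nous": (656, 9536),
    "gpqa_diamond": (67, 1584),
    "ifeval": (71, 541),
    "math_500": (63, 2000),
    "mmlu_generative": (200, 14042),
    "mmlu_pro": (343, 12032),
    "musr_generative": (130, 756),
    "obqa_generative": (25, 500),
    "simpleqa_nous": (985, 4321),
}


def overlong_rate(overlong: int, total: int) -> float:
    """Return the overlong rate as a percentage of total samples."""
    return 100.0 * overlong / total


def aggregate(counts: dict) -> tuple:
    """Sum overlong and total samples across benchmarks; return both and the rate."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong_rate(overlong, total)


if __name__ == "__main__":
    for name, (o, t) in OVERLONG.items():
        print(f"{name}: {o} / {t} = {overlong_rate(o, t):.1f}%")
    o, t, rate = aggregate(OVERLONG)
    print(f"overall: {o} / {t} = {rate:.1f}%")
```

Summing the twelve rows gives 4,041 overlong samples out of 54,663, which matches the 7.4% figure reported in the summary.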
Provided by:
NousResearch