NousResearch/eval-Hermes-4.3-36B-centralized
Source: Hugging Face. Updated 2025-11-26; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/NousResearch/eval-Hermes-4.3-36B-centralized
Resource description:
---
dataset_info:
  features:
  - name: benchmark_results
    dtype: string
configs:
- config_name: aime24_groups
  data_files:
  - split: latest
    path: "aime24/details.parquet"
- config_name: aime24_samples
  data_files:
  - split: latest
    path: "aime24/conversations.parquet"
- config_name: aime25_groups
  data_files:
  - split: latest
    path: "aime25/details.parquet"
- config_name: aime25_samples
  data_files:
  - split: latest
    path: "aime25/conversations.parquet"
- config_name: bbh_generative_groups
  data_files:
  - split: latest
    path: "bbh_generative/details.parquet"
- config_name: bbh_generative_samples
  data_files:
  - split: latest
    path: "bbh_generative/conversations.parquet"
- config_name: drop_generative_nous_groups
  data_files:
  - split: latest
    path: "drop_generative_nous/details.parquet"
- config_name: drop_generative_nous_samples
  data_files:
  - split: latest
    path: "drop_generative_nous/conversations.parquet"
- config_name: gpqa_diamond_groups
  data_files:
  - split: latest
    path: "gpqa_diamond/details.parquet"
- config_name: gpqa_diamond_samples
  data_files:
  - split: latest
    path: "gpqa_diamond/conversations.parquet"
- config_name: ifeval_groups
  data_files:
  - split: latest
    path: "ifeval/details.parquet"
- config_name: ifeval_samples
  data_files:
  - split: latest
    path: "ifeval/conversations.parquet"
- config_name: math_500_groups
  data_files:
  - split: latest
    path: "math_500/details.parquet"
- config_name: math_500_samples
  data_files:
  - split: latest
    path: "math_500/conversations.parquet"
- config_name: mmlu_generative_groups
  data_files:
  - split: latest
    path: "mmlu_generative/details.parquet"
- config_name: mmlu_generative_samples
  data_files:
  - split: latest
    path: "mmlu_generative/conversations.parquet"
- config_name: mmlu_pro_groups
  data_files:
  - split: latest
    path: "mmlu_pro/details.parquet"
- config_name: mmlu_pro_samples
  data_files:
  - split: latest
    path: "mmlu_pro/conversations.parquet"
- config_name: musr_generative_groups
  data_files:
  - split: latest
    path: "musr_generative/details.parquet"
- config_name: musr_generative_samples
  data_files:
  - split: latest
    path: "musr_generative/conversations.parquet"
- config_name: obqa_generative_groups
  data_files:
  - split: latest
    path: "obqa_generative/details.parquet"
- config_name: obqa_generative_samples
  data_files:
  - split: latest
    path: "obqa_generative/conversations.parquet"
- config_name: simpleqa_nous_groups
  data_files:
  - split: latest
    path: "simpleqa_nous/details.parquet"
- config_name: simpleqa_nous_samples
  data_files:
  - split: latest
    path: "simpleqa_nous/conversations.parquet"
language:
- en
size_categories:
- 1K<n<10K
tags:
- evaluation
- benchmarks
---
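
The front matter above declares two configs per benchmark, `<benchmark>_groups` (backed by `details.parquet`) and `<benchmark>_samples` (backed by `conversations.parquet`), each with a single `latest` split. The following is a minimal sketch of how one of them might be loaded, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The helper names `config_names` and `load_latest` are hypothetical, and the column layout of the parquet files is not described in this card, so inspect the returned object before relying on any field.

```python
REPO_ID = "NousResearch/eval-Hermes-4.3-36B-centralized"

# Benchmark names exactly as they appear in the front matter paths.
BENCHMARKS = [
    "aime24", "aime25", "bbh_generative", "drop_generative_nous",
    "gpqa_diamond", "ifeval", "math_500", "mmlu_generative",
    "mmlu_pro", "musr_generative", "obqa_generative", "simpleqa_nous",
]


def config_names(benchmark: str) -> dict[str, str]:
    """Return the two config names the front matter declares for a benchmark."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    return {
        "groups": f"{benchmark}_groups",    # details.parquet
        "samples": f"{benchmark}_samples",  # conversations.parquet
    }


def load_latest(benchmark: str, kind: str = "samples"):
    """Load the `latest` split of one config (hypothetical helper; needs network)."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_names(benchmark)[kind], split="latest")


# Example (downloads data, so it is left commented out):
# ds = load_latest("aime24", "samples")
# print(ds.column_names)
```

Keeping the name mapping separate from the download step makes it possible to check config names offline before fetching anything.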
# 36btorchtitan Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples per problem | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.706 | math_pass@1:64_samples | 64 | 24.6% |
| aime25 | 0.669 | math_pass@1:64_samples | 64 | 26.8% |
| bbh_generative | 0.847 | extractive_match | 1 | 9.3% |
| drop_generative_nous | 0.817 | drop_acc | 1 | 6.9% |
| gpqa_diamond | 0.649 | gpqa_pass@1:8_samples | 8 | 4.2% |
| ifeval | 0.740 | inst_level_loose_acc | 1 | 13.1% |
| math_500 | 0.923 | math_pass@1:4_samples | 4 | 3.1% |
| mmlu_generative | 0.866 | extractive_match | 1 | 1.4% |
| mmlu_pro | 0.797 | pass@1:1_samples | 1 | 2.9% |
| musr_generative | 0.647 | extractive_match | 1 | 17.2% |
| obqa_generative | 0.918 | extractive_match | 1 | 5.0% |
| simpleqa_nous | 0.056 | fuzzy_match | 1 | 22.8% |

Overall overlong rate: 4,041 / 54,663 samples (7.4%) are missing the closing `</think>` tag.
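
The overall figure can be reproduced from the per-benchmark "Overlong samples" counts listed under Detailed Results. The sketch below simply sums those counts; the dictionary and function names are illustrative and are not part of the evaluation tooling.

```python
# (overlong samples, total samples) per benchmark, copied from the
# "Overlong samples" lines in the Detailed Results sections.
OVERLONG_COUNTS = {
    "aime24": (473, 1920),
    "aime25": (515, 1920),
    "bbh_generative": (513, 5511),
    "drop_generative_nous": (656, 9536),
    "gpqa_diamond": (67, 1584),
    "ifeval": (71, 541),
    "math_500": (63, 2000),
    "mmlu_generative": (200, 14042),
    "mmlu_pro": (343, 12032),
    "musr_generative": (130, 756),
    "obqa_generative": (25, 500),
    "simpleqa_nous": (985, 4321),
}


def overlong_rate(counts):
    """Return (overlong, total, overlong / total) pooled across benchmarks."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


overlong, total, rate = overlong_rate(OVERLONG_COUNTS)
print(f"{overlong:,} / {total:,} samples ({rate:.1%})")
# prints: 4,041 / 54,663 samples (7.4%)
```

Note that this is a pooled rate weighted by sample count, so large benchmarks such as mmlu_generative dominate it; it is not the mean of the twelve per-benchmark percentages.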
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.767 | 0.079 |
| math_pass@1:4_samples | 0.733 | 0.070 |
| math_pass@1:8_samples | 0.700 | 0.064 |
| math_pass@1:16_samples | 0.690 | 0.064 |
| math_pass@1:32_samples | 0.702 | 0.066 |
| math_pass@1:64_samples | 0.706 | 0.066 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 03:10:32

**Temperature:** 0.6

**Overlong samples:** 24.6% (473 / 1920)
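
The card does not state how the standard errors are computed. The reported values are, however, consistent with the sample standard deviation of per-problem scores divided by the square root of the number of problems. The sketch below checks this for `math_pass@1:1_samples` under two assumptions that are inferred rather than stated: the benchmark has 30 problems (1920 samples / 64 samples per problem), and 23 of them were solved (0.767 × 30). For metrics with more than one sample per problem, each per-problem score would presumably be the fraction of correct samples, which is also an assumption.

```python
import math
import statistics


def mean_and_stderr(per_problem_scores):
    """Mean score and standard error of the mean across problems.

    Uses the sample standard deviation (n - 1 denominator), which is an
    assumption about the evaluation harness, not a documented fact.
    """
    n = len(per_problem_scores)
    mean = statistics.fmean(per_problem_scores)
    stderr = statistics.stdev(per_problem_scores) / math.sqrt(n)
    return mean, stderr


# Assumed outcome vector: 23 problems solved, 7 unsolved (30 problems total).
outcomes = [1.0] * 23 + [0.0] * 7
mean, stderr = mean_and_stderr(outcomes)
print(f"{mean:.3f} +/- {stderr:.3f}")
# prints: 0.767 +/- 0.079
```

With only 30 problems, one additional solved problem shifts the score by about 0.033, which is why the AIME standard errors are much larger than those of the bigger benchmarks.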
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.633 | 0.089 |
| math_pass@1:4_samples | 0.683 | 0.077 |
| math_pass@1:8_samples | 0.646 | 0.072 |
| math_pass@1:16_samples | 0.650 | 0.069 |
| math_pass@1:32_samples | 0.662 | 0.068 |
| math_pass@1:64_samples | 0.669 | 0.067 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 03:16:06

**Temperature:** 0.6

**Overlong samples:** 26.8% (515 / 1920)
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.847 | 0.019 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 02:05:45

**Temperature:** 0.6

**Overlong samples:** 9.3% (513 / 5511)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.817 | 0.004 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 02:08:58

**Temperature:** 0.6

**Overlong samples:** 6.9% (656 / 9536)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.677 | 0.033 |
| gpqa_pass@1:4_samples | 0.657 | 0.028 |
| gpqa_pass@1:8_samples | 0.649 | 0.027 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 01:27:00

**Temperature:** 0.6

**Overlong samples:** 4.2% (67 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.595 | 0.021 |
| inst_level_strict_acc | 0.697 | 0.001 |
| prompt_level_loose_acc | 0.654 | 0.020 |
| inst_level_loose_acc | 0.740 | 0.001 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 00:32:45

**Temperature:** 0.6

**Overlong samples:** 13.1% (71 / 541)
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.924 | 0.012 |
| math_pass@1:4_samples | 0.923 | 0.008 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 00:40:12

**Temperature:** 0.6

**Overlong samples:** 3.1% (63 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.866 | 0.003 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 02:32:05

**Temperature:** 0.6

**Overlong samples:** 1.4% (200 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.797 | 0.004 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 05:52:54

**Temperature:** 0.6

**Overlong samples:** 2.9% (343 / 12032)
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.647 | 0.029 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 00:41:25

**Temperature:** 0.6

**Overlong samples:** 17.2% (130 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.918 | 0.012 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 00:23:38

**Temperature:** 0.6

**Overlong samples:** 5.0% (25 / 500)
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.038 | 0.003 |
| fuzzy_match | 0.056 | 0.003 |

**Model:** 36btorchtitan

**Evaluation Time (hh:mm:ss):** 02:52:46

**Temperature:** 0.6

**Overlong samples:** 22.8% (985 / 4321)

Provider:
NousResearch



