NousResearch/eval-Hermes-4.3-36B
Source: Hugging Face | Updated: 2025-11-25 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/NousResearch/eval-Hermes-4.3-36B
Resource description:
---
dataset_info:
  features:
  - name: benchmark_results
    dtype: string
configs:
- config_name: aime24_groups
  data_files:
  - split: latest
    path: "aime24/details.parquet"
- config_name: aime24_samples
  data_files:
  - split: latest
    path: "aime24/conversations.parquet"
- config_name: aime25_groups
  data_files:
  - split: latest
    path: "aime25/details.parquet"
- config_name: aime25_samples
  data_files:
  - split: latest
    path: "aime25/conversations.parquet"
- config_name: bbh_generative_groups
  data_files:
  - split: latest
    path: "bbh_generative/details.parquet"
- config_name: bbh_generative_samples
  data_files:
  - split: latest
    path: "bbh_generative/conversations.parquet"
- config_name: drop_generative_nous_groups
  data_files:
  - split: latest
    path: "drop_generative_nous/details.parquet"
- config_name: drop_generative_nous_samples
  data_files:
  - split: latest
    path: "drop_generative_nous/conversations.parquet"
- config_name: gpqa_diamond_groups
  data_files:
  - split: latest
    path: "gpqa_diamond/details.parquet"
- config_name: gpqa_diamond_samples
  data_files:
  - split: latest
    path: "gpqa_diamond/conversations.parquet"
- config_name: ifeval_groups
  data_files:
  - split: latest
    path: "ifeval/details.parquet"
- config_name: ifeval_samples
  data_files:
  - split: latest
    path: "ifeval/conversations.parquet"
- config_name: math_500_groups
  data_files:
  - split: latest
    path: "math_500/details.parquet"
- config_name: math_500_samples
  data_files:
  - split: latest
    path: "math_500/conversations.parquet"
- config_name: mmlu_generative_groups
  data_files:
  - split: latest
    path: "mmlu_generative/details.parquet"
- config_name: mmlu_generative_samples
  data_files:
  - split: latest
    path: "mmlu_generative/conversations.parquet"
- config_name: mmlu_pro_groups
  data_files:
  - split: latest
    path: "mmlu_pro/details.parquet"
- config_name: mmlu_pro_samples
  data_files:
  - split: latest
    path: "mmlu_pro/conversations.parquet"
- config_name: musr_generative_groups
  data_files:
  - split: latest
    path: "musr_generative/details.parquet"
- config_name: musr_generative_samples
  data_files:
  - split: latest
    path: "musr_generative/conversations.parquet"
- config_name: obqa_generative_groups
  data_files:
  - split: latest
    path: "obqa_generative/details.parquet"
- config_name: obqa_generative_samples
  data_files:
  - split: latest
    path: "obqa_generative/conversations.parquet"
- config_name: simpleqa_nous_groups
  data_files:
  - split: latest
    path: "simpleqa_nous/details.parquet"
- config_name: simpleqa_nous_samples
  data_files:
  - split: latest
    path: "simpleqa_nous/conversations.parquet"
language:
- en
size_categories:
- 1K<n<10K
tags:
- evaluation
- benchmarks
---
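
The `configs` block above declares two configurations per benchmark: `<benchmark>_groups`, backed by `details.parquet`, and `<benchmark>_samples`, backed by `conversations.parquet`, each with a single `latest` split. A minimal sketch of loading one of them with the Hugging Face `datasets` library might look like the following. The helper names `config_names` and `load_latest` are hypothetical, and the column layout of the Parquet files is not described in this card, so the sketch makes no assumptions about it.

```python
REPO_ID = "NousResearch/eval-Hermes-4.3-36B"


def config_names(benchmark: str) -> tuple[str, str]:
    """Return the (groups, samples) config names for one benchmark directory."""
    return f"{benchmark}_groups", f"{benchmark}_samples"


def load_latest(benchmark: str, kind: str = "samples"):
    """Load the `latest` split of one config (needs network access)."""
    # Imported lazily so the naming helper works without `datasets` installed.
    from datasets import load_dataset

    groups, samples = config_names(benchmark)
    name = samples if kind == "samples" else groups
    return load_dataset(REPO_ID, name, split="latest")
```

Calling `load_latest("aime24")` should return a `Dataset` whose `column_names` can be inspected to discover the actual schema before relying on any particular field.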
# 36bpsychev2 Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples per problem | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.719 | math_pass@1:64_samples | 64 | 17.6% |
| aime25 | 0.693 | math_pass@1:64_samples | 64 | 18.8% |
| bbh_generative | 0.864 | extractive_match | 1 | 4.8% |
| drop_generative_nous | 0.835 | drop_acc | 1 | 2.7% |
| gpqa_diamond | 0.655 | gpqa_pass@1:8_samples | 8 | 2.2% |
| ifeval | 0.779 | inst_level_loose_acc | 1 | 7.8% |
| math_500 | 0.938 | math_pass@1:4_samples | 4 | 1.8% |
| mmlu_generative | 0.877 | extractive_match | 1 | 0.1% |
| mmlu_pro | 0.807 | pass@1:1_samples | 1 | 1.1% |
| musr_generative | 0.697 | extractive_match | 1 | 0.4% |
| obqa_generative | 0.966 | extractive_match | 1 | 0.2% |
| simpleqa_nous | 0.060 | fuzzy_match | 1 | 8.6% |

Overall overlong rate: 1,846 of 54,663 samples (3.4%) were missing the closing `</think>` tag.
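
The overlong criterion above can be sketched as a simple string check. This is only an illustration: it assumes each response is available as a plain string and that the absence of `</think>` is the sole criterion. The function name `is_overlong` and the two example responses are made up.

```python
def is_overlong(response: str, close_tag: str = "</think>") -> bool:
    """Flag a response whose reasoning block was never closed."""
    return close_tag not in response


# Made-up responses: the second one is cut off before the closing tag.
responses = [
    "<think>2 + 2 = 4</think>The answer is 4.",
    "<think>Let me keep checking the remaining cases and",
]
flags = [is_overlong(r) for r in responses]
print(f"{sum(flags)} / {len(flags)} overlong")  # 1 / 2 overlong
```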
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.767 | 0.079 |
| math_pass@1:4_samples | 0.700 | 0.068 |
| math_pass@1:8_samples | 0.713 | 0.064 |
| math_pass@1:16_samples | 0.719 | 0.061 |
| math_pass@1:32_samples | 0.731 | 0.061 |
| math_pass@1:64_samples | 0.719 | 0.062 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 02:57:47

**Temperature:** 0.6

**Overlong samples:** 17.6% (337 / 1920)
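
Metric names of the form `math_pass@1:k_samples` suggest that pass@1 is estimated from k sampled completions per problem. The exact estimator used by the evaluation harness is not stated in this card, so the following is a sketch under an assumption: each problem's score is the fraction of its samples graded correct, the benchmark score is the mean over problems, and the standard error is the sample standard deviation of the per-problem scores divided by the square root of the number of problems. The function name `pass_at_1` and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def pass_at_1(correct: list[list[bool]]) -> tuple[float, float]:
    """Estimate pass@1 and its standard error from per-problem sample outcomes.

    `correct[i][j]` is True when sample j for problem i was graded correct.
    """
    # Per-problem estimate: fraction of that problem's samples that pass.
    per_problem = [sum(samples) / len(samples) for samples in correct]
    score = mean(per_problem)
    # Standard error of the mean across problems (needs at least 2 problems).
    stderr = stdev(per_problem) / math.sqrt(len(per_problem))
    return score, stderr


# Toy data (made up): 3 problems, 4 samples each.
toy = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, False],
]
score, stderr = pass_at_1(toy)
print(f"pass@1 = {score:.3f} ± {stderr:.3f}")  # pass@1 = 0.500 ± 0.289
```

Under this assumption, the `math_pass@1:1_samples` row above is self-consistent: 1,920 samples at 64 per problem implies 30 problems, a score of 23/30 ≈ 0.767, and a standard error of sqrt(0.767 × 0.233 / 29) ≈ 0.079.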
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.633 | 0.089 |
| math_pass@1:4_samples | 0.692 | 0.062 |
| math_pass@1:8_samples | 0.662 | 0.063 |
| math_pass@1:16_samples | 0.694 | 0.061 |
| math_pass@1:32_samples | 0.693 | 0.062 |
| math_pass@1:64_samples | 0.693 | 0.063 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 03:06:24

**Temperature:** 0.6

**Overlong samples:** 18.8% (360 / 1920)
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.864 | 0.017 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:27:42

**Temperature:** 0.6

**Overlong samples:** 4.8% (262 / 5511)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.835 | 0.004 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:22:41

**Temperature:** 0.6

**Overlong samples:** 2.7% (255 / 9536)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.641 | 0.034 |
| gpqa_pass@1:4_samples | 0.663 | 0.027 |
| gpqa_pass@1:8_samples | 0.655 | 0.026 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:18:07

**Temperature:** 0.6

**Overlong samples:** 2.2% (35 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.654 | 0.020 |
| inst_level_strict_acc | 0.740 | 0.000 |
| prompt_level_loose_acc | 0.702 | 0.020 |
| inst_level_loose_acc | 0.779 | 0.000 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:25:46

**Temperature:** 0.6

**Overlong samples:** 7.8% (42 / 541)
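
The ifeval table reports accuracy at two granularities. In IFEval, prompt-level accuracy counts a prompt as correct only when every instruction attached to it is followed, while instruction-level accuracy scores each instruction on its own. A minimal sketch of the difference, using made-up outcomes and a hypothetical function name `ifeval_accuracies`:

```python
def ifeval_accuracies(followed: list[list[bool]]) -> tuple[float, float]:
    """Return (prompt_level_acc, inst_level_acc).

    `followed[i][j]` is True when instruction j of prompt i was satisfied.
    """
    # A prompt counts only if all of its instructions were satisfied.
    prompt_hits = sum(all(instrs) for instrs in followed)
    # Every instruction counts individually, regardless of its prompt.
    inst_hits = sum(sum(instrs) for instrs in followed)
    inst_total = sum(len(instrs) for instrs in followed)
    return prompt_hits / len(followed), inst_hits / inst_total


# Toy data (made up): three prompts with 1, 2, and 3 instructions.
toy = [[True], [True, False], [True, True, True]]
prompt_acc, inst_acc = ifeval_accuracies(toy)
print(f"prompt-level {prompt_acc:.3f}, instruction-level {inst_acc:.3f}")
# prompt-level 0.667, instruction-level 0.833
```

A prompt with one missed instruction out of several still contributes partial credit at the instruction level, which is how the instruction-level figures in the table can exceed the prompt-level ones.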
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.942 | 0.010 |
| math_pass@1:4_samples | 0.938 | 0.008 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:33:46

**Temperature:** 0.6

**Overlong samples:** 1.8% (35 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.877 | 0.003 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:46:37

**Temperature:** 0.6

**Overlong samples:** 0.1% (16 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.807 | 0.004 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 04:43:51

**Temperature:** 0.6

**Overlong samples:** 1.1% (128 / 12032)
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.697 | 0.028 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:23:45

**Temperature:** 0.6

**Overlong samples:** 0.4% (3 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.966 | 0.008 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 00:17:21

**Temperature:** 0.6

**Overlong samples:** 0.2% (1 / 500)
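
For the single-sample benchmarks that consist of one flat task, the reported standard errors are consistent with a binomial standard error, sqrt(p(1 - p) / n). The sketch below checks this, assuming that n equals the denominator of each benchmark's overlong count. That is an assumption, as is the idea that the harness uses this formula. The function name `binomial_stderr` is hypothetical.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


# (score, assumed item count, reported std error), copied from this card.
reported = {
    "drop_generative_nous": (0.835, 9536, 0.004),
    "mmlu_generative": (0.877, 14042, 0.003),
    "mmlu_pro": (0.807, 12032, 0.004),
    "obqa_generative": (0.966, 500, 0.008),
}
for name, (p, n, card_se) in reported.items():
    print(f"{name}: computed {binomial_stderr(p, n):.3f}, reported {card_se:.3f}")
```

The same formula does not reproduce the bbh_generative and musr_generative figures (roughly 0.005 versus the reported 0.017, and 0.017 versus 0.028), possibly because those scores are aggregated over subtasks. The card does not say.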
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.039 | 0.003 |
| fuzzy_match | 0.060 | 0.004 |

**Model:** 36bpsychev2

**Evaluation Time (hh:mm:ss):** 01:27:06

**Temperature:** 0.6

**Overlong samples:** 8.6% (372 / 4321)
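
As a cross-check, the overall overlong figure in the Summary can be recomputed by pooling the per-benchmark counts listed in the sections above. The counts are copied from this card; the function name `pooled_overlong_rate` is hypothetical.

```python
# (overlong, total) per benchmark, copied from the "Overlong samples" lines.
OVERLONG_COUNTS = {
    "aime24": (337, 1920),
    "aime25": (360, 1920),
    "bbh_generative": (262, 5511),
    "drop_generative_nous": (255, 9536),
    "gpqa_diamond": (35, 1584),
    "ifeval": (42, 541),
    "math_500": (35, 2000),
    "mmlu_generative": (16, 14042),
    "mmlu_pro": (128, 12032),
    "musr_generative": (3, 756),
    "obqa_generative": (1, 500),
    "simpleqa_nous": (372, 4321),
}


def pooled_overlong_rate(counts: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Sum overlong and total counts, then divide (a sample-weighted rate)."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


overlong, total, rate = pooled_overlong_rate(OVERLONG_COUNTS)
print(f"{overlong:,} / {total:,} = {rate:.1%}")  # 1,846 / 54,663 = 3.4%
```

Because the rate is pooled over samples, large benchmarks with few overlong responses (such as mmlu_generative) pull it well below the per-benchmark rates seen on aime24 and aime25.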
Provider:
NousResearch



