eval-DeepSeek-R1-0528
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/NousResearch/eval-DeepSeek-R1-0528
Resource description:
# r1-0528 Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.865 | math_pass@1:64_samples | 64 | 0.0% |
| aime25 | 0.831 | math_pass@1:64_samples | 64 | 0.0% |
| arenahard | 0.951 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.894 | extractive_match | 1 | 0.0% |
| creative-writing-v3 | 0.803 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.865 | drop_acc | 1 | 0.0% |
| eqbench3 | 0.865 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.781 | gpqa_pass@1:8_samples | 8 | 0.1% |
| ifeval | 0.900 | inst_level_loose_acc | 1 | 0.0% |
| lcb-v6-aug2024+ | 0.718 | eval/pass_1 | 1 | 0.2% |
| math_500 | 0.975 | math_pass@1:4_samples | 4 | 0.7% |
| mmlu_generative | 0.904 | extractive_match | 1 | 0.0% |
| mmlu_pro | 0.843 | pass@1:1_samples | 1 | 0.0% |
| musr_generative | 0.726 | extractive_match | 1 | 0.0% |
| obqa_generative | 0.956 | extractive_match | 1 | 0.0% |
| rewardbench | 0.701 | eval/percent_correct | 1 | 0.1% |
| simpleqa_nous | 0.220 | fuzzy_match | 1 | 0.0% |
Overlong rate: 28 of 64,523 samples (0.04%, shown as 0.0% at one decimal place) are missing the closing `</think>` tag.
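The overall overlong count can be cross-checked against the per-benchmark "Overlong samples" lines in the Detailed Results. The sketch below assumes, as the line above states, that a sample counts as overlong when its completion lacks a closing `</think>` tag. The helper `is_overlong` and the dictionary layout are illustrative names, not part of the evaluation harness.

```python
# (overlong, total) pairs copied from the "Overlong samples" lines of this report.
OVERLONG_COUNTS = {
    "aime24": (0, 1920),
    "aime25": (0, 1920),
    "arenahard": (0, 500),
    "bbh_generative": (0, 5511),
    "creative-writing-v3": (0, 96),
    "drop_generative_nous": (0, 9536),
    "eqbench3": (0, 135),
    "gpqa_diamond": (1, 1584),
    "ifeval": (0, 541),
    "lcb-v6-aug2024+": (11, 7264),
    "math_500": (13, 2000),
    "mmlu_generative": (0, 14042),
    "mmlu_pro": (2, 12032),
    "musr_generative": (0, 756),
    "obqa_generative": (0, 500),
    "rewardbench": (1, 1865),
    "simpleqa_nous": (0, 4321),
}


def is_overlong(completion: str) -> bool:
    """Hypothetical check: the completion never closes its reasoning block."""
    return "</think>" not in completion


def overall_overlong(counts):
    """Sum the per-benchmark counts and return (overlong, total, rate)."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


overlong, total, rate = overall_overlong(OVERLONG_COUNTS)
print(f"{overlong} / {total:,} = {rate:.2%}")  # prints "28 / 64,523 = 0.04%"
```

The per-benchmark counts sum to the reported 28 of 64,523, so the summary line and the detailed sections agree.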
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.900 | 0.056 |
| math_pass@1:4_samples | 0.858 | 0.057 |
| math_pass@1:8_samples | 0.879 | 0.051 |
| math_pass@1:16_samples | 0.871 | 0.052 |
| math_pass@1:32_samples | 0.866 | 0.050 |
| math_pass@1:64_samples | 0.865 | 0.050 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 01:41:39
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 1920)
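The `math_pass@1:k_samples` rows are read here as follows: draw k completions per problem, score each as right or wrong, average within the problem, then average across problems. That reading, and the standard-error formula used below (sample standard deviation of the per-problem means divided by the square root of the problem count), are assumptions about the harness and are not stated in this report. The toy correctness matrix is invented for illustration.

```python
import math
from statistics import mean, stdev


def pass_at_1(correct, k):
    """Estimate pass@1 from the first k samples of each problem.

    correct: one list of 0/1 outcomes per problem.
    Returns (score, standard error across problems).
    """
    per_problem = [mean(row[:k]) for row in correct]
    score = mean(per_problem)
    if len(per_problem) > 1:
        stderr = stdev(per_problem) / math.sqrt(len(per_problem))
    else:
        stderr = 0.0
    return score, stderr


# Toy data: 3 problems, 4 samples each (not taken from the evaluation).
toy = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]

for k in (1, 4):
    score, stderr = pass_at_1(toy, k)
    print(f"pass@1 with {k} samples: {score:.3f} +/- {stderr:.3f}")
```

Under this reading, the 1920 samples reported for aime24 are consistent with 30 problems at 64 samples each. Larger k should not change the expected score, only reduce sampling noise.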
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.800 | 0.074 |
| math_pass@1:4_samples | 0.833 | 0.063 |
| math_pass@1:8_samples | 0.829 | 0.062 |
| math_pass@1:16_samples | 0.833 | 0.058 |
| math_pass@1:32_samples | 0.833 | 0.058 |
| math_pass@1:64_samples | 0.831 | 0.057 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 04:05:22
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 1920)
### arenahard
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.951 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 459.000 | 0.000 |
| eval/tie_count | 31.000 | 0.000 |
| eval/loss_count | 10.000 | 0.000 |
| eval/win_rate | 0.918 | 0.000 |
| eval/tie_rate | 0.062 | 0.000 |
| eval/loss_rate | 0.020 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.951 | 0.000 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:11:04
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
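The win, tie, and loss rates in the table follow directly from the counts. The sketch below recomputes them as a sanity check. The report does not state how `eval/overall_winrate` (0.951) is derived, so it is deliberately not recomputed here.

```python
# Counts copied from the arenahard table above.
counts = {"win": 459, "tie": 31, "loss": 10}
total = sum(counts.values())

# Each rate is the count divided by the total number of judged samples.
rates = {name: count / total for name, count in counts.items()}

for name, rate in rates.items():
    print(f"eval/{name}_rate = {rate:.3f}")
# prints 0.918, 0.062 and 0.020, matching the table
```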
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.894 | 0.014 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:25:54
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 5511)
### creative-writing-v3
| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.803 | 0.000 |
| num_samples | 96.000 | 0.000 |
**Model:** r1-0528-thinking
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.865 | 0.004 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:38:49
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 9536)
### eqbench3
| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.865 | 0.000 |
| num_samples | 135.000 | 0.000 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.788 | 0.029 |
| gpqa_pass@1:4_samples | 0.782 | 0.025 |
| gpqa_pass@1:8_samples | 0.781 | 0.025 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 01:34:37
**Temperature:** 0.6
**Overlong samples:** 0.1% (1 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.811 | 0.017 |
| inst_level_strict_acc | 0.871 | 0.000 |
| prompt_level_loose_acc | 0.848 | 0.015 |
| inst_level_loose_acc | 0.900 | 0.000 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:07:00
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 541)
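The four ifeval rows differ in granularity. As commonly defined for IFEval, prompt-level accuracy counts a prompt as correct only when every instruction in it is followed, while instruction-level accuracy is the fraction of individual instructions followed. The sketch below illustrates that distinction on invented data. It is an assumption that the harness used here follows these definitions, and the function name is hypothetical.

```python
def ifeval_accuracy(results):
    """Compute (prompt_level, inst_level) accuracy.

    results: one list of booleans per prompt, one boolean per instruction,
    True when the response followed that instruction.
    """
    # A prompt passes only if all of its instructions pass.
    prompt_level = sum(all(r) for r in results) / len(results)
    # Instruction-level pools every instruction across all prompts.
    flat = [ok for r in results for ok in r]
    inst_level = sum(flat) / len(flat)
    return prompt_level, inst_level


# Toy data: three prompts carrying 2, 2 and 1 instructions.
toy = [[True, True], [True, False], [True]]
prompt_level, inst_level = ifeval_accuracy(toy)
print(f"prompt-level: {prompt_level:.3f}, instruction-level: {inst_level:.3f}")
```

A prompt with one missed instruction fails entirely at prompt level but still earns partial credit at instruction level, which is one way the instruction-level scores in the table can sit above the prompt-level ones.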
### lcb-v6-aug2024+
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.718 | 0.000 |
| eval/easy_pass_1 | 0.983 | 0.000 |
| eval/medium_pass_1 | 0.843 | 0.000 |
| eval/hard_pass_1 | 0.487 | 0.000 |
| eval/completion_length | 66098.651 | 0.000 |
**Model:** r1-0528-final
**Evaluation Time (hh:mm:ss):** 21:25:02
**Temperature:** N/A
**Overlong samples:** 0.2% (11 / 7264)
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.974 | 0.007 |
| math_pass@1:4_samples | 0.975 | 0.006 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:24:51
**Temperature:** 0.6
**Overlong samples:** 0.7% (13 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.904 | 0.002 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 01:00:12
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.843 | 0.003 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 01:17:36
**Temperature:** 0.6
**Overlong samples:** 0.0% (2 / 12032)
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.726 | 0.027 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:05:36
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.956 | 0.009 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:03:47
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
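For a benchmark scored right or wrong on each of n items, the standard error of an accuracy p can be approximated by the binomial formula sqrt(p(1 - p) / n). Whether the harness uses exactly this estimator is an assumption. The sketch below only checks that the formula is consistent with the values reported for three single-sample benchmarks, using scores and sample counts copied from this report.

```python
import math


def binomial_stderr(p, n):
    """Standard error of a mean of n independent 0/1 outcomes with mean p."""
    return math.sqrt(p * (1.0 - p) / n)


# name: (score, number of items, reported Std Error), all copied from this report.
REPORTED = {
    "obqa_generative": (0.956, 500, 0.009),
    "mmlu_generative": (0.904, 14042, 0.002),
    "mmlu_pro": (0.843, 12032, 0.003),
}

for name, (p, n, reported) in REPORTED.items():
    estimate = binomial_stderr(p, n)
    print(f"{name}: estimated {estimate:.3f}, reported {reported:.3f}")
```

The estimate matches these three benchmarks to three decimal places. It does not match bbh_generative or musr_generative, whose reported errors are larger than the formula gives, so those are presumably aggregated differently (for example over subtasks).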
### rewardbench
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.701 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1307.000 | 0.000 |
| eval/format_compliance_rate | 0.999 | 0.000 |
| eval/avg_response_length | 4454.774 | 0.000 |
| eval/response_length_std | 1776.983 | 0.000 |
| eval/judgment_entropy | 1.393 | 0.000 |
| eval/most_common_judgment_freq | 0.255 | 0.000 |
| eval/format_error_rate | 0.001 | 0.000 |
| eval/avg_ties_rating | 3.526 | 0.000 |
| eval/ties_error_rate | 0.243 | 0.000 |
| eval/percent_correct_Factuality | 0.632 | 0.000 |
| eval/percent_correct_Precise IF | 0.438 | 0.000 |
| eval/percent_correct_Math | 0.820 | 0.000 |
| eval/percent_correct_Safety | 0.620 | 0.000 |
| eval/percent_correct_Focus | 0.848 | 0.000 |
| eval/percent_correct_Ties | 0.863 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 0.999 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.270 | 0.000 |
| eval/wrong_answer_total_count | 544.000 | 0.000 |
| eval/wrong_answer_a_count | 147.000 | 0.000 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:15:37
**Temperature:** 0.6
**Overlong samples:** 0.1% (1 / 1865)
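Several rewardbench rows are ratios of other rows in the same table. The sketch below recomputes three of them from the reported counts as a consistency check. The pairing of each ratio with its numerator and denominator is inferred from the metric names, not documented in this report.

```python
# Counts copied from the rewardbench table above.
total_samples = 1865
correct_samples = 1307
choice_samples = 1763
ties_samples = 102
wrong_answer_total = 544
wrong_answer_a = 147

# Inferred relationships between rows (assumptions based on metric names).
percent_correct = correct_samples / total_samples
a_bias_rate = wrong_answer_a / wrong_answer_total
subset_total = choice_samples + ties_samples

print(f"eval/percent_correct = {percent_correct:.3f}")       # 0.701
print(f"eval/wrong_answer_a_bias_rate = {a_bias_rate:.3f}")  # 0.270
print(f"choice + ties samples = {subset_total}")             # 1865
```

Note that `eval/wrong_answer_total_count` (544) is smaller than the 558 samples not counted as correct, so the two figures evidently cover different subsets. The report does not say which.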
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.167 | 0.006 |
| fuzzy_match | 0.220 | 0.006 |
**Model:** r1-0528
**Evaluation Time (hh:mm:ss):** 00:18:31
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 4321)
Provider:
maas
Created:
2025-08-29



