DeepSeek-V3-0324
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-30.
Download link:
https://modelscope.cn/datasets/NousResearch/DeepSeek-V3-0324
Resource description:
# dsv3 Evaluation Results
## Summary
| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.506 | math_pass@1:64_samples | 64 | 100.0% |
| aime25 | 0.422 | math_pass@1:64_samples | 64 | 100.0% |
| arenahard | 0.926 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.868 | extractive_match | 1 | 100.0% |
| creative-writing-v3 | 0.767 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.829 | drop_acc | 1 | 100.0% |
| eqbench3 | 0.831 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.680 | gpqa_pass@1:8_samples | 8 | 100.0% |
| ifeval | 0.904 | inst_level_loose_acc | 1 | 100.0% |
| lcb-v6-aug2024+ | 0.492 | eval/pass_1 | 1 | 100.0% |
| math_500 | 0.925 | math_pass@1:4_samples | 4 | 100.0% |
| mmlu_generative | 0.886 | extractive_match | 1 | 100.0% |
| mmlu_pro | 0.816 | pass@1:1_samples | 1 | 100.0% |
| musr_generative | 0.654 | extractive_match | 1 | 100.0% |
| obqa_generative | 0.956 | extractive_match | 1 | 100.0% |
| rewardbench | 0.681 | eval/percent_correct | 1 | 94.5% |
| simpleqa_nous | 0.186 | fuzzy_match | 1 | 100.0% |
Overlong rate: 63,690 / 64,523 samples (98.7%) missing closing `</think>` tag
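
The overall overlong figure can be rebuilt from the per-benchmark "Overlong samples" lines in the detailed results. The sketch below is illustrative only: the counts are copied from this report, while `is_overlong` and `overall_overlong_rate` are hypothetical helper names, and the substring check is an assumption about how a missing closing `</think>` tag might be detected, not the evaluation harness's actual logic.

```python
# (overlong, total) sample counts copied from the per-benchmark
# "Overlong samples" lines in this report.
OVERLONG_COUNTS = {
    "aime24": (1920, 1920),
    "aime25": (1920, 1920),
    "arenahard": (0, 500),
    "bbh_generative": (5511, 5511),
    "creative-writing-v3": (0, 96),
    "drop_generative_nous": (9536, 9536),
    "eqbench3": (0, 135),
    "gpqa_diamond": (1584, 1584),
    "ifeval": (541, 541),
    "lcb-v6-aug2024+": (7264, 7264),
    "math_500": (2000, 2000),
    "mmlu_generative": (14042, 14042),
    "mmlu_pro": (12032, 12032),
    "musr_generative": (756, 756),
    "obqa_generative": (500, 500),
    "rewardbench": (1763, 1865),
    "simpleqa_nous": (4321, 4321),
}


def is_overlong(completion: str) -> bool:
    """Hypothetical check: flag a completion that never closes its reasoning block."""
    return "</think>" not in completion


def overall_overlong_rate(counts):
    """Sum overlong and total counts across benchmarks and return the pooled rate."""
    overlong = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return overlong, total, overlong / total


overlong, total, rate = overall_overlong_rate(OVERLONG_COUNTS)
print(f"{overlong:,} / {total:,} samples ({rate:.1%})")
# prints: 63,690 / 64,523 samples (98.7%)
```

Note that the pooled rate weights each benchmark by its sample count, so it differs from a plain average of the per-benchmark percentages in the summary table.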
## Detailed Results
### aime24
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.533 | 0.093 |
| math_pass@1:4_samples | 0.517 | 0.077 |
| math_pass@1:8_samples | 0.529 | 0.076 |
| math_pass@1:16_samples | 0.517 | 0.074 |
| math_pass@1:32_samples | 0.515 | 0.073 |
| math_pass@1:64_samples | 0.506 | 0.073 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 03:19:18
**Temperature:** 0.3
**Overlong samples:** 100.0% (1920 / 1920)
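
The `math_pass@1:k_samples` rows can be read as pass@1 averaged over k sampled answers per problem, with the standard error taken across problems. The sketch below assumes that reading and assumes the sample standard deviation (n - 1) is used; neither is confirmed by this report. The function name `pass_at_1` and the input data are hypothetical. The 30-problem count is inferred from 1920 / 64.

```python
import math
from statistics import mean, stdev


def pass_at_1(per_problem_correct: list[list[bool]]) -> tuple[float, float]:
    """Return (score, standard error) for pass@1 averaged over k samples per problem."""
    # Per-problem accuracy: fraction of the k sampled answers that are correct.
    per_problem = [mean(map(float, samples)) for samples in per_problem_correct]
    score = mean(per_problem)
    # Standard error of the mean across problems (sample standard deviation).
    std_err = stdev(per_problem) / math.sqrt(len(per_problem))
    return score, std_err


# Illustrative data only: 30 problems, one sample each, 16 answered correctly.
one_sample = [[True]] * 16 + [[False]] * 14
score, std_err = pass_at_1(one_sample)
print(f"{score:.3f} ± {std_err:.3f}")
# prints: 0.533 ± 0.093
```

Under these assumptions, 16 correct answers out of 30 gives the same score and standard error as the `math_pass@1:1_samples` row, which is consistent with (but does not prove) this interpretation.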
### aime25
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.500 | 0.093 |
| math_pass@1:4_samples | 0.417 | 0.076 |
| math_pass@1:8_samples | 0.408 | 0.073 |
| math_pass@1:16_samples | 0.415 | 0.072 |
| math_pass@1:32_samples | 0.414 | 0.072 |
| math_pass@1:64_samples | 0.422 | 0.072 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 02:01:49
**Temperature:** 0.3
**Overlong samples:** 100.0% (1920 / 1920)
### arenahard
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.926 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 446.000 | 0.000 |
| eval/tie_count | 34.000 | 0.000 |
| eval/loss_count | 20.000 | 0.000 |
| eval/win_rate | 0.892 | 0.000 |
| eval/tie_rate | 0.068 | 0.000 |
| eval/loss_rate | 0.040 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.926 | 0.000 |
**Model:** dsv3-arena
**Evaluation Time (hh:mm:ss):** 00:04:19
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)
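
The arenahard counts and rates above are arithmetically consistent with an overall win rate that counts each tie as half a win. That scoring rule is inferred from the numbers, not stated in this report, and `arena_rates` is a hypothetical helper name. A minimal sketch:

```python
def arena_rates(wins: int, ties: int, losses: int) -> dict[str, float]:
    """Derive per-outcome rates and a tie-adjusted overall win rate from raw counts."""
    total = wins + ties + losses
    return {
        "win_rate": wins / total,
        "tie_rate": ties / total,
        "loss_rate": losses / total,
        # Assumption: a tie contributes half a win to the overall figure.
        "overall_winrate": (wins + 0.5 * ties) / total,
    }


rates = arena_rates(wins=446, ties=34, losses=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With the counts from the table, this yields 0.892, 0.068, 0.040, and 0.926, matching the reported `eval/win_rate`, `eval/tie_rate`, `eval/loss_rate`, and `eval/overall_winrate`.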
### bbh_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.868 | 0.015 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:21:52
**Temperature:** 0.3
**Overlong samples:** 100.0% (5511 / 5511)
### creative-writing-v3
| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.767 | 0.000 |
| num_samples | 96.000 | 0.000 |
**Model:** dsv3-nonthinking
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)
### drop_generative_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.829 | 0.004 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:37:29
**Temperature:** 0.3
**Overlong samples:** 100.0% (9536 / 9536)
### eqbench3
| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.831 | 0.000 |
| num_samples | 135.000 | 0.000 |
**Model:** dsv3-arena
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)
### gpqa_diamond
| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.682 | 0.033 |
| gpqa_pass@1:4_samples | 0.674 | 0.028 |
| gpqa_pass@1:8_samples | 0.680 | 0.027 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 01:10:50
**Temperature:** 0.3
**Overlong samples:** 100.0% (1584 / 1584)
### ifeval
| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.815 | 0.017 |
| inst_level_strict_acc | 0.871 | 0.000 |
| prompt_level_loose_acc | 0.858 | 0.015 |
| inst_level_loose_acc | 0.904 | 0.000 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:48:12
**Temperature:** 0.3
**Overlong samples:** 100.0% (541 / 541)
### lcb-v6-aug2024+
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.492 | 0.000 |
| eval/easy_pass_1 | 0.935 | 0.000 |
| eval/medium_pass_1 | 0.565 | 0.000 |
| eval/hard_pass_1 | 0.202 | 0.000 |
| eval/completion_length | 14047.031 | 0.000 |
**Model:** dsv3-temp0.3
**Evaluation Time (hh:mm:ss):** 07:35:20
**Temperature:** N/A
**Overlong samples:** 100.0% (7264 / 7264)
### math_500
| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.930 | 0.011 |
| math_pass@1:4_samples | 0.925 | 0.010 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:14:00
**Temperature:** 0.3
**Overlong samples:** 100.0% (2000 / 2000)
### mmlu_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.886 | 0.003 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:55:26
**Temperature:** 0.3
**Overlong samples:** 100.0% (14042 / 14042)
### mmlu_pro
| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.816 | 0.004 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 01:47:48
**Temperature:** 0.3
**Overlong samples:** 100.0% (12032 / 12032)
### musr_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.654 | 0.029 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:10:13
**Temperature:** 0.3
**Overlong samples:** 100.0% (756 / 756)
### obqa_generative
| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.956 | 0.009 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:02:02
**Temperature:** 0.3
**Overlong samples:** 100.0% (500 / 500)
### rewardbench
| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.681 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1270.000 | 0.000 |
| eval/format_compliance_rate | 0.999 | 0.000 |
| eval/avg_response_length | 1703.678 | 0.000 |
| eval/response_length_std | 1577.530 | 0.000 |
| eval/judgment_entropy | 1.380 | 0.000 |
| eval/most_common_judgment_freq | 0.318 | 0.000 |
| eval/format_error_rate | 0.001 | 0.000 |
| eval/avg_ties_rating | 4.123 | 0.000 |
| eval/ties_error_rate | 0.023 | 0.000 |
| eval/percent_correct_Factuality | 0.566 | 0.000 |
| eval/percent_correct_Precise IF | 0.369 | 0.000 |
| eval/percent_correct_Math | 0.628 | 0.000 |
| eval/percent_correct_Safety | 0.660 | 0.000 |
| eval/percent_correct_Focus | 0.875 | 0.000 |
| eval/percent_correct_Ties | 0.951 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 0.999 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.378 | 0.000 |
| eval/wrong_answer_total_count | 590.000 | 0.000 |
| eval/wrong_answer_a_count | 223.000 | 0.000 |
**Model:** dsv3-arena
**Evaluation Time (hh:mm:ss):** 00:08:58
**Temperature:** 0.6
**Overlong samples:** 94.5% (1763 / 1865)
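
Several rewardbench figures can be cross-checked against one another. The sketch below uses only numbers from the table. The interpretation that `eval/wrong_answer_total_count` covers choice samples only is an assumption, inferred from the fact that 1865 - 1270 = 595 wrong answers overall while the table reports 590; the variable names are illustrative.

```python
# Values copied from the rewardbench table.
total_samples = 1865
correct_samples = 1270
choice_samples = 1763
ties_samples = 102
wrong_total = 590      # eval/wrong_answer_total_count
wrong_a = 223          # eval/wrong_answer_a_count
ties_accuracy = 0.951  # eval/percent_correct_Ties

percent_correct = correct_samples / total_samples
a_bias_rate = wrong_a / wrong_total

# Assumption: the wrong-answer count covers choice samples only, so the
# overall correct count is choice-correct plus ties-correct.
ties_correct = round(ties_accuracy * ties_samples)
choice_correct = choice_samples - wrong_total
reconstructed_correct = choice_correct + ties_correct

print(f"percent_correct: {percent_correct:.3f}")
print(f"wrong_answer_a_bias_rate: {a_bias_rate:.3f}")
print(f"reconstructed correct samples: {reconstructed_correct}")
```

Under that assumption the reconstructed count equals the reported 1270 correct samples, and the two ratios round to the reported 0.681 and 0.378.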
### simpleqa_nous
| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.134 | 0.005 |
| fuzzy_match | 0.186 | 0.006 |
**Model:** dsv3
**Evaluation Time (hh:mm:ss):** 00:16:58
**Temperature:** 0.3
**Overlong samples:** 100.0% (4321 / 4321)
Provider: maas
Created: 2025-08-27



