eval-Cogito-v2-preview-70B-nonreasoning

Name: eval-Cogito-v2-preview-70B-nonreasoning
Creator: maas
Published: 2025-12-05 16:48:42
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/NousResearch/eval-Cogito-v2-preview-70B-nonreasoning

Description:

# cogito-70b-nonthinking Evaluation Results

## Summary

| Benchmark | Score | Metric | Samples | Overlong rate |
|-----------|-------|--------|---------|---------------|
| aime24 | 0.122 | math_pass@1:64_samples | 64 | 100.0% |
| aime25 | 0.060 | math_pass@1:64_samples | 64 | 100.0% |
| arenahard | 0.819 | eval/overall_winrate | 500 | 0.0% |
| bbh_generative | 0.876 | extractive_match | 1 | 100.0% |
| creative-writing-v3 | 0.655 | creative_writing_score | 96 | 0.0% |
| drop_generative_nous | 0.841 | drop_acc | 1 | 100.0% |
| eqbench3 | 0.681 | eqbench_score | 135 | 0.0% |
| gpqa_diamond | 0.528 | gpqa_pass@1:8_samples | 8 | 100.0% |
| ifeval | 0.927 | inst_level_loose_acc | 1 | 100.0% |
| lcb-v6-aug2024+ | 0.272 | eval/pass_1 | 1 | 100.0% |
| math_500 | 0.756 | math_pass@1:4_samples | 4 | 100.0% |
| mmlu_generative | 0.905 | extractive_match | 1 | 100.0% |
| mmlu_pro | 0.760 | pass@1:1_samples | 1 | 100.0% |
| musr_generative | 0.592 | extractive_match | 1 | 100.0% |
| obqa_generative | 0.942 | extractive_match | 1 | 100.0% |
| rewardbench | 0.627 | eval/percent_correct | 1 | 94.5% |
| simpleqa_nous | 0.227 | fuzzy_match | 1 | 100.0% |

Overlong rate: 63,690 / 64,523 samples (98.7%) missing closing `</think>` tag

## Detailed Results

### aime24

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.133 | 0.063 |
| math_pass@1:4_samples | 0.117 | 0.052 |
| math_pass@1:8_samples | 0.117 | 0.049 |
| math_pass@1:16_samples | 0.115 | 0.048 |
| math_pass@1:32_samples | 0.118 | 0.048 |
| math_pass@1:64_samples | 0.122 | 0.048 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:19:49
**Temperature:** 0.6
**Overlong samples:** 100.0% (1920 / 1920)

### aime25

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.067 | 0.046 |
| math_pass@1:4_samples | 0.058 | 0.037 |
| math_pass@1:8_samples | 0.058 | 0.031 |
| math_pass@1:16_samples | 0.058 | 0.029 |
| math_pass@1:32_samples | 0.067 | 0.030 |
| math_pass@1:64_samples | 0.060 | 0.029 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:17:57
**Temperature:** 0.6
**Overlong samples:** 100.0% (1920 / 1920)

### arenahard

| Metric | Score | Std Error |
|--------|-------|----------|
| eval/overall_winrate | 0.819 | 0.000 |
| eval/total_samples | 500.000 | 0.000 |
| eval/win_count | 372.000 | 0.000 |
| eval/tie_count | 74.000 | 0.000 |
| eval/loss_count | 54.000 | 0.000 |
| eval/win_rate | 0.744 | 0.000 |
| eval/tie_rate | 0.148 | 0.000 |
| eval/loss_rate | 0.108 | 0.000 |
| eval/winrate_arena-hard-v0.1 | 0.819 | 0.000 |

**Model:** cogito-70b-arena-nothink
**Evaluation Time (hh:mm:ss):** 00:02:07
**Temperature:** 0.6
**Overlong samples:** 0.0% (0 / 500)

### bbh_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.876 | 0.015 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:24:48
**Temperature:** 0.6
**Overlong samples:** 100.0% (5511 / 5511)

### creative-writing-v3

| Metric | Score | Std Error |
|--------|-------|----------|
| creative_writing_score | 0.655 | 0.000 |
| num_samples | 96.000 | 0.000 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 96)

### drop_generative_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| drop_acc | 0.841 | 0.004 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:37:27
**Temperature:** 0.6
**Overlong samples:** 100.0% (9536 / 9536)

### eqbench3

| Metric | Score | Std Error |
|--------|-------|----------|
| eqbench_score | 0.681 | 0.000 |
| num_samples | 135.000 | 0.000 |

**Model:** cogito-70b-arena-nothink
**Evaluation Time (hh:mm:ss):** N/A
**Temperature:** N/A
**Overlong samples:** 0.0% (0 / 135)

### gpqa_diamond

| Metric | Score | Std Error |
|--------|-------|----------|
| gpqa_pass@1:1_samples | 0.571 | 0.035 |
| gpqa_pass@1:4_samples | 0.529 | 0.027 |
| gpqa_pass@1:8_samples | 0.528 | 0.026 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:15:49
**Temperature:** 0.6
**Overlong samples:** 100.0% (1584 / 1584)

### ifeval

| Metric | Score | Std Error |
|--------|-------|----------|
| prompt_level_strict_acc | 0.865 | 0.015 |
| inst_level_strict_acc | 0.911 | 0.000 |
| prompt_level_loose_acc | 0.889 | 0.014 |
| inst_level_loose_acc | 0.927 | 0.000 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:02:18
**Temperature:** 0.6
**Overlong samples:** 100.0% (541 / 541)

### lcb-v6-aug2024+

| Metric | Score | Std Error |
|--------|-------|----------|
| eval/pass_1 | 0.272 | 0.000 |
| eval/easy_pass_1 | 0.751 | 0.000 |
| eval/medium_pass_1 | 0.212 | 0.000 |
| eval/hard_pass_1 | 0.055 | 0.000 |
| eval/completion_length | 1847.961 | 0.000 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:18:42
**Temperature:** N/A
**Overlong samples:** 100.0% (7264 / 7264)

### math_500

| Metric | Score | Std Error |
|--------|-------|----------|
| math_pass@1:1_samples | 0.750 | 0.019 |
| math_pass@1:4_samples | 0.756 | 0.016 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:12:54
**Temperature:** 0.6
**Overlong samples:** 100.0% (2000 / 2000)

### mmlu_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.905 | 0.002 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:55:09
**Temperature:** 0.6
**Overlong samples:** 100.0% (14042 / 14042)

### mmlu_pro

| Metric | Score | Std Error |
|--------|-------|----------|
| pass@1:1_samples | 0.760 | 0.004 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 01:01:21
**Temperature:** 0.6
**Overlong samples:** 100.0% (12032 / 12032)

### musr_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.592 | 0.031 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:03:01
**Temperature:** 0.6
**Overlong samples:** 100.0% (756 / 756)

### obqa_generative

| Metric | Score | Std Error |
|--------|-------|----------|
| extractive_match | 0.942 | 0.010 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:02:01
**Temperature:** 0.6
**Overlong samples:** 100.0% (500 / 500)

### rewardbench

| Metric | Score | Std Error |
|--------|-------|----------|
| eval/percent_correct | 0.627 | 0.000 |
| eval/total_samples | 1865.000 | 0.000 |
| eval/correct_samples | 1170.000 | 0.000 |
| eval/format_compliance_rate | 1.000 | 0.000 |
| eval/avg_response_length | 1283.707 | 0.000 |
| eval/response_length_std | 209.582 | 0.000 |
| eval/judgment_entropy | 1.367 | 0.000 |
| eval/most_common_judgment_freq | 0.330 | 0.000 |
| eval/format_error_rate | 0.000 | 0.000 |
| eval/avg_ties_rating | 3.614 | 0.000 |
| eval/ties_error_rate | 0.019 | 0.000 |
| eval/percent_correct_Factuality | 0.514 | 0.000 |
| eval/percent_correct_Precise IF | 0.362 | 0.000 |
| eval/percent_correct_Math | 0.497 | 0.000 |
| eval/percent_correct_Safety | 0.627 | 0.000 |
| eval/percent_correct_Focus | 0.804 | 0.000 |
| eval/percent_correct_Ties | 0.951 | 0.000 |
| eval/choice_samples | 1763.000 | 0.000 |
| eval/ties_samples | 102.000 | 0.000 |
| eval/choice_format_compliance_rate | 1.000 | 0.000 |
| eval/ties_format_compliance_rate | 1.000 | 0.000 |
| eval/wrong_answer_a_bias_rate | 0.371 | 0.000 |
| eval/wrong_answer_total_count | 690.000 | 0.000 |
| eval/wrong_answer_a_count | 256.000 | 0.000 |

**Model:** cogito-70b-arena-nothink
**Evaluation Time (hh:mm:ss):** 00:05:11
**Temperature:** 0.6
**Overlong samples:** 94.5% (1763 / 1865)

### simpleqa_nous

| Metric | Score | Std Error |
|--------|-------|----------|
| exact_match | 0.167 | 0.006 |
| fuzzy_match | 0.227 | 0.006 |

**Model:** cogito-70b-nonthinking
**Evaluation Time (hh:mm:ss):** 00:16:58
**Temperature:** 0.6
**Overlong samples:** 100.0% (4321 / 4321)
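The aggregate overlong figure in the summary can be cross-checked against the per-benchmark "Overlong samples" counts. The sketch below sums those counts, copied from the results above, and recomputes the overall rate. It also includes a helper for the standard error of a 0/1 metric using the sample (n - 1) variance. That formula is an assumption, not something the report states; it happens to reproduce several of the reported single-sample values (for example aime24 at 1 sample, taking 30 problems from 1920 / 64). The names `OVERLONG_COUNTS`, `overlong_rate`, and `binary_std_error` are illustrative and do not come from the evaluation harness.

```python
import math

# (benchmark, overlong_samples, total_samples), copied from the
# "Overlong samples" line of each benchmark section.
OVERLONG_COUNTS = [
    ("aime24", 1920, 1920),
    ("aime25", 1920, 1920),
    ("arenahard", 0, 500),
    ("bbh_generative", 5511, 5511),
    ("creative-writing-v3", 0, 96),
    ("drop_generative_nous", 9536, 9536),
    ("eqbench3", 0, 135),
    ("gpqa_diamond", 1584, 1584),
    ("ifeval", 541, 541),
    ("lcb-v6-aug2024+", 7264, 7264),
    ("math_500", 2000, 2000),
    ("mmlu_generative", 14042, 14042),
    ("mmlu_pro", 12032, 12032),
    ("musr_generative", 756, 756),
    ("obqa_generative", 500, 500),
    ("rewardbench", 1763, 1865),
    ("simpleqa_nous", 4321, 4321),
]


def overlong_rate(counts):
    """Return (overlong, total, overlong / total) summed over all benchmarks."""
    overlong = sum(o for _, o, _ in counts)
    total = sum(t for _, _, t in counts)
    return overlong, total, overlong / total


def binary_std_error(mean, n):
    """Standard error of the mean of n 0/1 scores, using the (n - 1) sample variance.

    Assumption: the report does not say how its Std Error column is computed.
    """
    return math.sqrt(mean * (1.0 - mean) / (n - 1))


if __name__ == "__main__":
    overlong, total, rate = overlong_rate(OVERLONG_COUNTS)
    print(f"Overlong: {overlong:,} / {total:,} ({rate:.1%})")

    # aime24 with 1 sample per problem: 4 of 30 problems solved gives 0.133.
    print(f"aime24 std error (assumed formula): {binary_std_error(4 / 30, 30):.3f}")
    # obqa_generative: extractive_match 0.942 over 500 questions.
    print(f"obqa std error (assumed formula): {binary_std_error(0.942, 500):.3f}")
```

The multi-sample `pass@1:k_samples` rows and the judge-based benchmarks, which report a standard error of 0.000, are not covered by this helper; their errors would need the per-problem averages, which this page does not include.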

Provider:

maas

Created:

2025-08-29
