
tytodd/qwen3.5-hard-only-r4-eval

Hugging Face | Updated 2026-03-26 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/tytodd/qwen3.5-hard-only-r4-eval
Resource description:
---
configs:
- config_name: arc_challenge
  data_files:
  - path: arc_challenge/ood-*
    split: ood
- config_name: argument_quality_ranking
  data_files:
  - path: argument_quality_ranking/ood-*
    split: ood
- config_name: bbeh
  data_files:
  - path: bbeh/ood-*
    split: ood
- config_name: bbh_causal_judgement
  data_files:
  - path: bbh_causal_judgement/ood-*
    split: ood
- config_name: bbh_disambiguation_qa
  data_files:
  - path: bbh_disambiguation_qa/ood-*
    split: ood
- config_name: bbh_geometric_shapes
  data_files:
  - path: bbh_geometric_shapes/ood-*
    split: ood
- config_name: bbh_movie_recommendation
  data_files:
  - path: bbh_movie_recommendation/ood-*
    split: ood
- config_name: bbh_reasoning_about_colored_objects
  data_files:
  - path: bbh_reasoning_about_colored_objects/ood-*
    split: ood
- config_name: bbh_ruin_names
  data_files:
  - path: bbh_ruin_names/ood-*
    split: ood
- config_name: bbh_salient_translation_error_detection
  data_files:
  - path: bbh_salient_translation_error_detection/ood-*
    split: ood
- config_name: bbh_snarks
  data_files:
  - path: bbh_snarks/ood-*
    split: ood
- config_name: bbh_sports_understanding
  data_files:
  - path: bbh_sports_understanding/ood-*
    split: ood
- config_name: bbh_tracking_shuffled_objects_five_objects
  data_files:
  - path: bbh_tracking_shuffled_objects_five_objects/ood-*
    split: ood
- config_name: bbh_web_of_lies
  data_files:
  - path: bbh_web_of_lies/ood-*
    split: ood
- config_name: gpqa_diamond
  data_files:
  - path: gpqa_diamond/ood-*
    split: ood
- config_name: halueval_summarization
  data_files:
  - path: halueval_summarization/ood-*
    split: ood
- config_name: judge_bench
  data_files:
  - path: judge_bench/ood-*
    split: ood
- config_name: mmlu
  data_files:
  - path: mmlu/ood-*
    split: ood
- config_name: mmlu_pro
  data_files:
  - path: mmlu_pro/ood-*
    split: ood
- config_name: musr_murder_mysteries
  data_files:
  - path: musr_murder_mysteries/ood-*
    split: ood
- config_name: musr_object_placements
  data_files:
  - path: musr_object_placements/ood-*
    split: ood
- config_name: musr_team_allocation
  data_files:
  - path: musr_team_allocation/ood-*
    split: ood
- config_name: or_bench_toxic
  data_files:
  - path: or_bench_toxic/ood-*
    split: ood
- config_name: rod101_essay_scoring
  data_files:
  - path: rod101_essay_scoring/ood-*
    split: ood
- config_name: seekbench_full_trace
  data_files:
  - path: seekbench_full_trace/val-*
    split: val
probe_version: b730cf39731757c4b2ee0818b776b8fd2e476da3
---

# Evaluation: tytodd/qwen3.5-hard-only-r4

Confidence scores from probe evaluation.

## Results

| Config | N | AUROC | Accuracy | ECE | MCE |
|--------|--:|------:|---------:|----:|----:|
| arc_challenge/ood | 1000 | 0.8737 | 0.8480 | 0.2830 | 0.6131 |
| argument_quality_ranking/ood | 1000 | 0.2602 | 0.4010 | 0.4127 | 0.8508 |
| bbeh/ood | 1000 | 0.5670 | 0.5770 | 0.1952 | 0.7311 |
| bbh_causal_judgement/ood | 149 | 0.5648 | 0.5839 | 0.1357 | 0.3102 |
| bbh_disambiguation_qa/ood | 200 | 0.5389 | 0.5250 | 0.1812 | 0.4011 |
| bbh_geometric_shapes/ood | 200 | 0.3123 | 0.3300 | 0.5416 | 0.9185 |
| bbh_movie_recommendation/ood | 199 | 0.6269 | 0.5025 | 0.2385 | 0.4356 |
| bbh_reasoning_about_colored_objects/ood | 200 | 0.7210 | 0.8550 | 0.2507 | 0.9703 |
| bbh_ruin_names/ood | 200 | 0.7060 | 0.6500 | 0.1598 | 0.2781 |
| bbh_salient_translation_error_detection/ood | 200 | 0.5909 | 0.5300 | 0.2624 | 0.6040 |
| bbh_snarks/ood | 142 | 0.5974 | 0.5282 | 0.3273 | 0.9189 |
| bbh_sports_understanding/ood | 200 | 0.6677 | 0.6500 | 0.1110 | 0.4697 |
| bbh_tracking_shuffled_objects_five_objects/ood | 200 | 0.8081 | 0.8950 | 0.3164 | 0.8361 |
| bbh_web_of_lies/ood | 200 | 1.0000 | 1.0000 | 0.1946 | 0.4533 |
| gpqa_diamond/ood | 198 | 0.5694 | 0.5101 | 0.2744 | 0.6390 |
| halueval_summarization/ood | 1000 | 0.5847 | 0.6260 | 0.1795 | 0.4749 |
| judge_bench/ood | 278 | 0.6985 | 0.6619 | 0.2782 | 0.3646 |
| mmlu/ood | 1000 | 0.7492 | 0.7050 | 0.2336 | 0.4760 |
| mmlu_pro/ood | 1000 | 0.7153 | 0.6900 | 0.2080 | 0.5085 |
| musr_murder_mysteries/ood | 250 | 0.6137 | 0.6240 | 0.1181 | 0.9544 |
| musr_object_placements/ood | 254 | 0.5611 | 0.5866 | 0.1809 | 0.4160 |
| musr_team_allocation/ood | 250 | 0.5323 | 0.6560 | 0.1213 | 0.9673 |
| or_bench_toxic/ood | 524 | 0.6724 | 0.4275 | 0.5496 | 0.6421 |
| rod101_essay_scoring/ood | 81 | 0.6482 | 0.7778 | 0.1777 | 0.6522 |
| seekbench_full_trace/val | 57 | 0.7651 | 0.7544 | 0.1219 | 0.5936 |
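
The card does not say how AUROC, ECE, and MCE were computed, so the following is only a minimal sketch of the standard definitions, not the author's evaluation code. It assumes each example carries a confidence score in [0, 1] and a binary label (1 = correct, 0 = incorrect), and that calibration error uses equal-width confidence bins. The function names `calibration_errors` and `auroc`, the bin count, and the toy inputs are all hypothetical.

```python
from typing import List, Sequence, Tuple


def calibration_errors(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> Tuple[float, float]:
    """Return (ECE, MCE) over equal-width confidence bins (assumed scheme)."""
    n = len(conf)
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(conf):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf[i] for i in members) / len(members)
        acc = sum(correct[i] for i in members) / len(members)
        gap = abs(avg_conf - acc)
        ece += len(members) / n * gap  # gap weighted by bin size
        mce = max(mce, gap)            # worst single-bin gap
    return ece, mce


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined when only one class is present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up inputs; these are not values from the dataset.
toy_conf = [0.9, 0.8, 0.3, 0.2]
toy_correct = [1, 1, 0, 1]
ece, mce = calibration_errors(toy_conf, toy_correct, n_bins=2)
print(f"ECE={ece:.4f} MCE={mce:.4f} AUROC={auroc(toy_conf, toy_correct):.4f}")
```

Under these definitions an AUROC below 0.5, as reported for `argument_quality_ranking/ood` and `bbh_geometric_shapes/ood`, means the confidence scores rank incorrect items above correct ones more often than not. MCE is the gap of a single bin, so on small configs it can be dominated by a bin holding only a few examples.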
Provider:
tytodd