tytodd/qwen3.5-hard-only-r4-eval
Source: Hugging Face. Updated 2026-03-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/tytodd/qwen3.5-hard-only-r4-eval
Resource description:
---
configs:
- config_name: arc_challenge
  data_files:
  - path: arc_challenge/ood-*
    split: ood
- config_name: argument_quality_ranking
  data_files:
  - path: argument_quality_ranking/ood-*
    split: ood
- config_name: bbeh
  data_files:
  - path: bbeh/ood-*
    split: ood
- config_name: bbh_causal_judgement
  data_files:
  - path: bbh_causal_judgement/ood-*
    split: ood
- config_name: bbh_disambiguation_qa
  data_files:
  - path: bbh_disambiguation_qa/ood-*
    split: ood
- config_name: bbh_geometric_shapes
  data_files:
  - path: bbh_geometric_shapes/ood-*
    split: ood
- config_name: bbh_movie_recommendation
  data_files:
  - path: bbh_movie_recommendation/ood-*
    split: ood
- config_name: bbh_reasoning_about_colored_objects
  data_files:
  - path: bbh_reasoning_about_colored_objects/ood-*
    split: ood
- config_name: bbh_ruin_names
  data_files:
  - path: bbh_ruin_names/ood-*
    split: ood
- config_name: bbh_salient_translation_error_detection
  data_files:
  - path: bbh_salient_translation_error_detection/ood-*
    split: ood
- config_name: bbh_snarks
  data_files:
  - path: bbh_snarks/ood-*
    split: ood
- config_name: bbh_sports_understanding
  data_files:
  - path: bbh_sports_understanding/ood-*
    split: ood
- config_name: bbh_tracking_shuffled_objects_five_objects
  data_files:
  - path: bbh_tracking_shuffled_objects_five_objects/ood-*
    split: ood
- config_name: bbh_web_of_lies
  data_files:
  - path: bbh_web_of_lies/ood-*
    split: ood
- config_name: gpqa_diamond
  data_files:
  - path: gpqa_diamond/ood-*
    split: ood
- config_name: halueval_summarization
  data_files:
  - path: halueval_summarization/ood-*
    split: ood
- config_name: judge_bench
  data_files:
  - path: judge_bench/ood-*
    split: ood
- config_name: mmlu
  data_files:
  - path: mmlu/ood-*
    split: ood
- config_name: mmlu_pro
  data_files:
  - path: mmlu_pro/ood-*
    split: ood
- config_name: musr_murder_mysteries
  data_files:
  - path: musr_murder_mysteries/ood-*
    split: ood
- config_name: musr_object_placements
  data_files:
  - path: musr_object_placements/ood-*
    split: ood
- config_name: musr_team_allocation
  data_files:
  - path: musr_team_allocation/ood-*
    split: ood
- config_name: or_bench_toxic
  data_files:
  - path: or_bench_toxic/ood-*
    split: ood
- config_name: rod101_essay_scoring
  data_files:
  - path: rod101_essay_scoring/ood-*
    split: ood
- config_name: seekbench_full_trace
  data_files:
  - path: seekbench_full_trace/val-*
    split: val
probe_version: b730cf39731757c4b2ee0818b776b8fd2e476da3
---
# Evaluation: tytodd/qwen3.5-hard-only-r4
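This dataset contains the confidence scores produced by the probe evaluation of `tytodd/qwen3.5-hard-only-r4`, with one config per benchmark. In the results table, N is the number of examples in the split, AUROC is the area under the ROC curve, ECE is the expected calibration error, and MCE is the maximum calibration error.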
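As a minimal sketch of how a single config might be loaded, the snippet below uses the Hugging Face `datasets` library with the repository id and the split names listed in the front matter. The helper names `split_for` and `load_config` are hypothetical, and the card does not document the column names, so the example only prints whatever schema it finds.

```python
REPO_ID = "tytodd/qwen3.5-hard-only-r4-eval"


def split_for(config_name: str) -> str:
    """Return the split name declared for a config in the card's front matter.

    Every config exposes an `ood` split except `seekbench_full_trace`,
    which exposes `val`.
    """
    return "val" if config_name == "seekbench_full_trace" else "ood"


def load_config(config_name: str):
    """Load one config of the dataset (hypothetical helper, needs network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split_for(config_name))


if __name__ == "__main__":
    ds = load_config("mmlu")
    # Column names are not documented in the card, so inspect them first.
    print(ds.column_names)
    print(ds[0])
```

Loading by config name keeps each benchmark separate, which matches the per-config rows of the results table.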
## Results
| Config | N | AUROC | Accuracy | ECE | MCE |
|--------|--:|------:|---------:|----:|----:|
| arc_challenge/ood | 1000 | 0.8737 | 0.8480 | 0.2830 | 0.6131 |
| argument_quality_ranking/ood | 1000 | 0.2602 | 0.4010 | 0.4127 | 0.8508 |
| bbeh/ood | 1000 | 0.5670 | 0.5770 | 0.1952 | 0.7311 |
| bbh_causal_judgement/ood | 149 | 0.5648 | 0.5839 | 0.1357 | 0.3102 |
| bbh_disambiguation_qa/ood | 200 | 0.5389 | 0.5250 | 0.1812 | 0.4011 |
| bbh_geometric_shapes/ood | 200 | 0.3123 | 0.3300 | 0.5416 | 0.9185 |
| bbh_movie_recommendation/ood | 199 | 0.6269 | 0.5025 | 0.2385 | 0.4356 |
| bbh_reasoning_about_colored_objects/ood | 200 | 0.7210 | 0.8550 | 0.2507 | 0.9703 |
| bbh_ruin_names/ood | 200 | 0.7060 | 0.6500 | 0.1598 | 0.2781 |
| bbh_salient_translation_error_detection/ood | 200 | 0.5909 | 0.5300 | 0.2624 | 0.6040 |
| bbh_snarks/ood | 142 | 0.5974 | 0.5282 | 0.3273 | 0.9189 |
| bbh_sports_understanding/ood | 200 | 0.6677 | 0.6500 | 0.1110 | 0.4697 |
| bbh_tracking_shuffled_objects_five_objects/ood | 200 | 0.8081 | 0.8950 | 0.3164 | 0.8361 |
| bbh_web_of_lies/ood | 200 | 1.0000 | 1.0000 | 0.1946 | 0.4533 |
| gpqa_diamond/ood | 198 | 0.5694 | 0.5101 | 0.2744 | 0.6390 |
| halueval_summarization/ood | 1000 | 0.5847 | 0.6260 | 0.1795 | 0.4749 |
| judge_bench/ood | 278 | 0.6985 | 0.6619 | 0.2782 | 0.3646 |
| mmlu/ood | 1000 | 0.7492 | 0.7050 | 0.2336 | 0.4760 |
| mmlu_pro/ood | 1000 | 0.7153 | 0.6900 | 0.2080 | 0.5085 |
| musr_murder_mysteries/ood | 250 | 0.6137 | 0.6240 | 0.1181 | 0.9544 |
| musr_object_placements/ood | 254 | 0.5611 | 0.5866 | 0.1809 | 0.4160 |
| musr_team_allocation/ood | 250 | 0.5323 | 0.6560 | 0.1213 | 0.9673 |
| or_bench_toxic/ood | 524 | 0.6724 | 0.4275 | 0.5496 | 0.6421 |
| rod101_essay_scoring/ood | 81 | 0.6482 | 0.7778 | 0.1777 | 0.6522 |
| seekbench_full_trace/val | 57 | 0.7651 | 0.7544 | 0.1219 | 0.5936 |
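
The card does not state how the metrics were computed. The following is a sketch of the standard definitions of AUROC, ECE, and MCE for a list of confidence scores and binary labels, not the author's evaluation code. The function names, the choice of 10 equal-width bins, and the meaning of the binary label are all assumptions.

```python
from typing import List, Sequence, Tuple


def auroc(conf: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive scores above a random negative.

    Ties count as 0.5. Returns NaN when only one class is present.
    """
    pos = [c for c, y in zip(conf, labels) if y]
    neg = [c for c, y in zip(conf, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_errors(
    conf: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence bins (assumed scheme)."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, labels):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    total = len(conf)
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        frac_pos = sum(1 for _, y in b if y) / len(b)
        gap = abs(avg_conf - frac_pos)
        ece += len(b) / total * gap  # gap weighted by bin size
        mce = max(mce, gap)          # worst single-bin gap
    return ece, mce


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.2]
    targets = [1, 1, 0, 0]
    print(auroc(scores, targets))
    print(calibration_errors(scores, targets))
```

Because MCE is the worst gap in any single bin, it can be large on small splits where a bin holds only a few examples, so rows with low N in the table deserve cautious reading.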
Provider:
tytodd



