arubique/disco-model-outputs

Source: Hugging Face. Updated: 2026-04-06. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/arubique/disco-model-outputs
Resource description:
--- language: - en license: other task_categories: - other pretty_name: DISCO Model Outputs (Open LLM Leaderboard) tags: - disco - leaderboard - mmlu - hellaswag - winogrande - arc - model-evaluation dataset_info: - config_name: arc_challenge features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 - name: logit_4 dtype: float64 splits: - name: train num_bytes: 31878400 num_examples: 498100 download_size: 16330950 dataset_size: 31878400 - config_name: hellaswag features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 238999600 num_examples: 4267850 download_size: 136887441 dataset_size: 238999600 - config_name: manifest features: - name: format_version dtype: int64 - name: model_split_name dtype: string - name: task_split_name dtype: string - name: original_data_key dtype: string - name: prediction_width dtype: int64 splits: - name: train num_bytes: 5825 num_examples: 61 download_size: 4450 dataset_size: 5825 - config_name: mmlu_abstract_algebra features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1412734 dataset_size: 2380000 - config_name: mmlu_anatomy features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3213000 num_examples: 
57375 download_size: 1923208 dataset_size: 3213000 - config_name: mmlu_astronomy features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3617600 num_examples: 64600 download_size: 2169321 dataset_size: 3617600 - config_name: mmlu_business_ethics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1427584 dataset_size: 2380000 - config_name: mmlu_clinical_knowledge features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 6307000 num_examples: 112625 download_size: 3778231 dataset_size: 6307000 - config_name: mmlu_college_biology features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3427200 num_examples: 61200 download_size: 2056264 dataset_size: 3427200 - config_name: mmlu_college_chemistry features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1418641 dataset_size: 2380000 - config_name: mmlu_college_computer_science features: - name: sample_idx 
dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1420806 dataset_size: 2380000 - config_name: mmlu_college_mathematics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1413541 dataset_size: 2380000 - config_name: mmlu_college_medicine features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4117400 num_examples: 73525 download_size: 2469001 dataset_size: 4117400 - config_name: mmlu_college_physics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2427600 num_examples: 43350 download_size: 1447730 dataset_size: 2427600 - config_name: mmlu_computer_security features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1428622 dataset_size: 2380000 - config_name: mmlu_conceptual_physics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: 
logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5593000 num_examples: 99875 download_size: 3345346 dataset_size: 5593000 - config_name: mmlu_econometrics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2713200 num_examples: 48450 download_size: 1621991 dataset_size: 2713200 - config_name: mmlu_electrical_engineering features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3451000 num_examples: 61625 download_size: 2063011 dataset_size: 3451000 - config_name: mmlu_elementary_mathematics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 8996400 num_examples: 160650 download_size: 5366853 dataset_size: 8996400 - config_name: mmlu_formal_logic features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2998800 num_examples: 53550 download_size: 1788904 dataset_size: 2998800 - config_name: mmlu_global_facts features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 
2380000 num_examples: 42500 download_size: 1418484 dataset_size: 2380000 - config_name: mmlu_high_school_biology features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 7378000 num_examples: 131750 download_size: 4426535 dataset_size: 7378000 - config_name: mmlu_high_school_chemistry features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4831400 num_examples: 86275 download_size: 2886551 dataset_size: 4831400 - config_name: mmlu_high_school_computer_science features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1427919 dataset_size: 2380000 - config_name: mmlu_high_school_european_history features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3927000 num_examples: 70125 download_size: 2360586 dataset_size: 3927000 - config_name: mmlu_high_school_geography features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4712400 num_examples: 84150 download_size: 2825192 dataset_size: 4712400 - 
config_name: mmlu_high_school_government_and_politics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4593400 num_examples: 82025 download_size: 2752593 dataset_size: 4593400 - config_name: mmlu_high_school_macroeconomics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 9282000 num_examples: 165750 download_size: 5557886 dataset_size: 9282000 - config_name: mmlu_high_school_mathematics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 6426000 num_examples: 114750 download_size: 3808662 dataset_size: 6426000 - config_name: mmlu_high_school_microeconomics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5664400 num_examples: 101150 download_size: 3396626 dataset_size: 5664400 - config_name: mmlu_high_school_physics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3593800 num_examples: 64175 download_size: 2141131 dataset_size: 3593800 - config_name: mmlu_high_school_psychology features: - name: sample_idx 
dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 12971000 num_examples: 231625 download_size: 7774585 dataset_size: 12971000 - config_name: mmlu_high_school_statistics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5140800 num_examples: 91800 download_size: 3068728 dataset_size: 5140800 - config_name: mmlu_high_school_us_history features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4855200 num_examples: 86700 download_size: 2916112 dataset_size: 4855200 - config_name: mmlu_high_school_world_history features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5640600 num_examples: 100725 download_size: 3387962 dataset_size: 5640600 - config_name: mmlu_human_aging features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5307400 num_examples: 94775 download_size: 3182651 dataset_size: 5307400 - config_name: mmlu_human_sexuality features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: 
float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3117800 num_examples: 55675 download_size: 1870905 dataset_size: 3117800 - config_name: mmlu_international_law features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2879800 num_examples: 51425 download_size: 1729487 dataset_size: 2879800 - config_name: mmlu_jurisprudence features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2570400 num_examples: 45900 download_size: 1539580 dataset_size: 2570400 - config_name: mmlu_logical_fallacies features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3879400 num_examples: 69275 download_size: 2329103 dataset_size: 3879400 - config_name: mmlu_machine_learning features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2665600 num_examples: 47600 download_size: 1592247 dataset_size: 2665600 - config_name: mmlu_management features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train 
num_bytes: 2451400 num_examples: 43775 download_size: 1470893 dataset_size: 2451400 - config_name: mmlu_marketing features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5569200 num_examples: 99450 download_size: 3342579 dataset_size: 5569200 - config_name: mmlu_medical_genetics features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2380000 num_examples: 42500 download_size: 1429051 dataset_size: 2380000 - config_name: mmlu_miscellaneous features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 18635400 num_examples: 332775 download_size: 11163996 dataset_size: 18635400 - config_name: mmlu_moral_disputes features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 8234800 num_examples: 147050 download_size: 4935460 dataset_size: 8234800 - config_name: mmlu_moral_scenarios features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 21301000 num_examples: 380375 download_size: 12671392 dataset_size: 21301000 - config_name: mmlu_nutrition features: - 
name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 7282800 num_examples: 130050 download_size: 4363477 dataset_size: 7282800 - config_name: mmlu_philosophy features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 7401800 num_examples: 132175 download_size: 4438968 dataset_size: 7401800 - config_name: mmlu_prehistory features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 7711200 num_examples: 137700 download_size: 4625616 dataset_size: 7711200 - config_name: mmlu_professional_accounting features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 6711600 num_examples: 119850 download_size: 4001714 dataset_size: 6711600 - config_name: mmlu_professional_law features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 36509200 num_examples: 651950 download_size: 21828379 dataset_size: 36509200 - config_name: mmlu_professional_medicine features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 
dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 6473600 num_examples: 115600 download_size: 3884673 dataset_size: 6473600 - config_name: mmlu_professional_psychology features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 14565600 num_examples: 260100 download_size: 8738906 dataset_size: 14565600 - config_name: mmlu_public_relations features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 2618000 num_examples: 46750 download_size: 1568670 dataset_size: 2618000 - config_name: mmlu_security_studies features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 5831000 num_examples: 104125 download_size: 3501017 dataset_size: 5831000 - config_name: mmlu_sociology features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4783800 num_examples: 85425 download_size: 2871854 dataset_size: 4783800 - config_name: mmlu_us_foreign_policy features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - 
name: train num_bytes: 2380000 num_examples: 42500 download_size: 1429944 dataset_size: 2380000 - config_name: mmlu_virology features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 3950800 num_examples: 70550 download_size: 2366657 dataset_size: 3950800 - config_name: mmlu_world_religions features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 splits: - name: train num_bytes: 4069800 num_examples: 72675 download_size: 2437317 dataset_size: 4069800 - config_name: models features: - name: model_idx dtype: int64 - name: model_name dtype: string splits: - name: train num_bytes: 31075 num_examples: 425 download_size: 14058 dataset_size: 31075 - config_name: truthfulqa_mc_0 features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 - name: logit_2 dtype: float64 - name: logit_3 dtype: float64 - name: logit_4 dtype: float64 - name: logit_5 dtype: float64 - name: logit_6 dtype: float64 - name: logit_7 dtype: float64 - name: logit_8 dtype: float64 - name: logit_9 dtype: float64 - name: logit_10 dtype: float64 - name: logit_11 dtype: float64 - name: logit_12 dtype: float64 - name: logit_13 dtype: float64 - name: logit_14 dtype: float64 - name: logit_15 dtype: float64 - name: logit_16 dtype: float64 - name: logit_17 dtype: float64 - name: logit_18 dtype: float64 - name: logit_19 dtype: float64 - name: logit_20 dtype: float64 - name: logit_21 dtype: float64 - name: logit_22 dtype: float64 - name: logit_23 dtype: float64 - name: logit_24 dtype: float64 - name: logit_25 dtype: float64 - name: 
logit_26 dtype: float64 - name: logit_27 dtype: float64 - name: logit_28 dtype: float64 - name: logit_29 dtype: float64 - name: logit_30 dtype: float64 splits: - name: train num_bytes: 94445200 num_examples: 347225 download_size: 36543082 dataset_size: 94445200 - config_name: winogrande features: - name: sample_idx dtype: int64 - name: model_idx dtype: int64 - name: correctness dtype: float64 - name: logit_0 dtype: float64 - name: logit_1 dtype: float64 splits: - name: train num_bytes: 21539000 num_examples: 538475 download_size: 9700053 dataset_size: 21539000 configs: - config_name: arc_challenge data_files: - split: train path: arc_challenge/train-* - config_name: default data_files: - split: manifest path: data/manifest-* - split: models path: data/models-* - config_name: hellaswag data_files: - split: train path: hellaswag/train-* - config_name: manifest data_files: - split: train path: manifest/train-* - config_name: mmlu_abstract_algebra data_files: - split: train path: mmlu_abstract_algebra/train-* - config_name: mmlu_anatomy data_files: - split: train path: mmlu_anatomy/train-* - config_name: mmlu_astronomy data_files: - split: train path: mmlu_astronomy/train-* - config_name: mmlu_business_ethics data_files: - split: train path: mmlu_business_ethics/train-* - config_name: mmlu_clinical_knowledge data_files: - split: train path: mmlu_clinical_knowledge/train-* - config_name: mmlu_college_biology data_files: - split: train path: mmlu_college_biology/train-* - config_name: mmlu_college_chemistry data_files: - split: train path: mmlu_college_chemistry/train-* - config_name: mmlu_college_computer_science data_files: - split: train path: mmlu_college_computer_science/train-* - config_name: mmlu_college_mathematics data_files: - split: train path: mmlu_college_mathematics/train-* - config_name: mmlu_college_medicine data_files: - split: train path: mmlu_college_medicine/train-* - config_name: mmlu_college_physics data_files: - split: train path: 
mmlu_college_physics/train-* - config_name: mmlu_computer_security data_files: - split: train path: mmlu_computer_security/train-* - config_name: mmlu_conceptual_physics data_files: - split: train path: mmlu_conceptual_physics/train-* - config_name: mmlu_econometrics data_files: - split: train path: mmlu_econometrics/train-* - config_name: mmlu_electrical_engineering data_files: - split: train path: mmlu_electrical_engineering/train-* - config_name: mmlu_elementary_mathematics data_files: - split: train path: mmlu_elementary_mathematics/train-* - config_name: mmlu_formal_logic data_files: - split: train path: mmlu_formal_logic/train-* - config_name: mmlu_global_facts data_files: - split: train path: mmlu_global_facts/train-* - config_name: mmlu_high_school_biology data_files: - split: train path: mmlu_high_school_biology/train-* - config_name: mmlu_high_school_chemistry data_files: - split: train path: mmlu_high_school_chemistry/train-* - config_name: mmlu_high_school_computer_science data_files: - split: train path: mmlu_high_school_computer_science/train-* - config_name: mmlu_high_school_european_history data_files: - split: train path: mmlu_high_school_european_history/train-* - config_name: mmlu_high_school_geography data_files: - split: train path: mmlu_high_school_geography/train-* - config_name: mmlu_high_school_government_and_politics data_files: - split: train path: mmlu_high_school_government_and_politics/train-* - config_name: mmlu_high_school_macroeconomics data_files: - split: train path: mmlu_high_school_macroeconomics/train-* - config_name: mmlu_high_school_mathematics data_files: - split: train path: mmlu_high_school_mathematics/train-* - config_name: mmlu_high_school_microeconomics data_files: - split: train path: mmlu_high_school_microeconomics/train-* - config_name: mmlu_high_school_physics data_files: - split: train path: mmlu_high_school_physics/train-* - config_name: mmlu_high_school_psychology data_files: - split: train path: 
mmlu_high_school_psychology/train-* - config_name: mmlu_high_school_statistics data_files: - split: train path: mmlu_high_school_statistics/train-* - config_name: mmlu_high_school_us_history data_files: - split: train path: mmlu_high_school_us_history/train-* - config_name: mmlu_high_school_world_history data_files: - split: train path: mmlu_high_school_world_history/train-* - config_name: mmlu_human_aging data_files: - split: train path: mmlu_human_aging/train-* - config_name: mmlu_human_sexuality data_files: - split: train path: mmlu_human_sexuality/train-* - config_name: mmlu_international_law data_files: - split: train path: mmlu_international_law/train-* - config_name: mmlu_jurisprudence data_files: - split: train path: mmlu_jurisprudence/train-* - config_name: mmlu_logical_fallacies data_files: - split: train path: mmlu_logical_fallacies/train-* - config_name: mmlu_machine_learning data_files: - split: train path: mmlu_machine_learning/train-* - config_name: mmlu_management data_files: - split: train path: mmlu_management/train-* - config_name: mmlu_marketing data_files: - split: train path: mmlu_marketing/train-* - config_name: mmlu_medical_genetics data_files: - split: train path: mmlu_medical_genetics/train-* - config_name: mmlu_miscellaneous data_files: - split: train path: mmlu_miscellaneous/train-* - config_name: mmlu_moral_disputes data_files: - split: train path: mmlu_moral_disputes/train-* - config_name: mmlu_moral_scenarios data_files: - split: train path: mmlu_moral_scenarios/train-* - config_name: mmlu_nutrition data_files: - split: train path: mmlu_nutrition/train-* - config_name: mmlu_philosophy data_files: - split: train path: mmlu_philosophy/train-* - config_name: mmlu_prehistory data_files: - split: train path: mmlu_prehistory/train-* - config_name: mmlu_professional_accounting data_files: - split: train path: mmlu_professional_accounting/train-* - config_name: mmlu_professional_law data_files: - split: train path: 
mmlu_professional_law/train-* - config_name: mmlu_professional_medicine data_files: - split: train path: mmlu_professional_medicine/train-* - config_name: mmlu_professional_psychology data_files: - split: train path: mmlu_professional_psychology/train-* - config_name: mmlu_public_relations data_files: - split: train path: mmlu_public_relations/train-* - config_name: mmlu_security_studies data_files: - split: train path: mmlu_security_studies/train-* - config_name: mmlu_sociology data_files: - split: train path: mmlu_sociology/train-* - config_name: mmlu_us_foreign_policy data_files: - split: train path: mmlu_us_foreign_policy/train-* - config_name: mmlu_virology data_files: - split: train path: mmlu_virology/train-* - config_name: mmlu_world_religions data_files: - split: train path: mmlu_world_religions/train-* - config_name: models data_files: - split: train path: models/train-* - config_name: truthfulqa_mc_0 data_files: - split: train path: truthfulqa_mc_0/train-* - config_name: winogrande data_files: - split: train path: winogrande/train-*
---

# DISCO model outputs

Tabular release of **per-model, per-item correctness and answer scores** used to train and evaluate [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://huggingface.co/papers/2510.07959). The paper studies cheap benchmark performance prediction from a small subset of evaluation items; this dataset supplies the raw harness-style outputs for MMLU (57 subjects), HellaSwag, Winogrande, ARC, and related tasks from the Open LLM Leaderboard ecosystem.

## Paper

- **Hugging Face Papers:** [2510.07959](https://huggingface.co/papers/2510.07959)
- **arXiv:** [2510.07959](https://arxiv.org/abs/2510.07959)

## How this dataset is built (source pipeline)

The on-disk artifact in the [DISCO codebase](https://github.com/arubique/disco-public) is `data/model_outputs.pickle`.
It can be **downloaded** from the Hub (see `scripts/download_model_outputs.py`) or **rebuilt from Open LLM Leaderboard snapshots** using the same steps as in the project README:

1. **Extended leaderboard snapshot** (tinyBenchmarks-style Open LLM Leaderboard data), on the order of many hours to fetch:
   `python ./scripts/download_leaderboard.py --lb_type openllm_leaderboard --lb_savepath ./data/lb_raw_extended.pickle`
2. **MMLU-fields snapshot** (additional models / fields), on the order of ~1 hour:
   `python ./scripts/download_leaderboard.py --lb_type mmlu_fields --lb_savepath ./data/lb_raw.pickle`
3. **Merge and extract** into the ordered pickle consumed by DISCO (~20 minutes):
   `python scripts/extract_model_outputs_from_raw_data.py`

That pipeline produces `model_outputs.pickle` with a list of model identifiers and, for each harness task, dense arrays of **correctness** and **per-choice scores** (logits / likelihood-style values as stored by the harness). The Hub upload script flattens those arrays into viewer-friendly tables.

## Hub layout (configs and columns)

This repository is a **multi-config** dataset. Each config corresponds to one logical table; within each config the split is named **`train`**.

| Config | Role |
|--------|------|
| `manifest` | Maps Hub config names to original harness keys (`task_split_name` → `original_data_key`). |
| `models` | `model_idx`, `model_name`: one row per model. |
| *task configs* | e.g. `hellaswag`, `mmlu_abstract_algebra`, …; long format: `sample_idx`, `model_idx`, `correctness`, and `logit_0` … `logit_{K-1}` for each answer choice. |

Pick a **subset** in the dataset viewer, then open the **`train`** split to inspect rows.
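Because the task configs are stored in long format (one row per `sample_idx`, `model_idx` pair), a common first step is to pivot them back into a dense model-by-item table. The sketch below is only an illustration, not the DISCO codebase's own loader: the helper names `pivot_correctness` and `accuracy_per_model` and the toy rows are hypothetical, and the commented `load_dataset` call assumes the `datasets` package and network access.

```python
# To read real rows instead of the toy rows below (assumption: the
# `datasets` package is installed and the Hub is reachable):
#   from datasets import load_dataset
#   rows = load_dataset("arubique/disco-model-outputs", "hellaswag", split="train")


def pivot_correctness(rows):
    """Turn long-format rows into a dense table indexed [model][sample].

    Returns the sorted model ids, the sorted sample ids, and a nested list
    whose entries are `correctness` values (None where a pair is missing).
    """
    rows = list(rows)
    model_ids = sorted({r["model_idx"] for r in rows})
    sample_ids = sorted({r["sample_idx"] for r in rows})
    m_pos = {m: i for i, m in enumerate(model_ids)}
    s_pos = {s: j for j, s in enumerate(sample_ids)}
    table = [[None] * len(sample_ids) for _ in model_ids]
    for r in rows:
        table[m_pos[r["model_idx"]]][s_pos[r["sample_idx"]]] = r["correctness"]
    return model_ids, sample_ids, table


def accuracy_per_model(table):
    """Mean correctness of each model over the items it has values for."""
    result = []
    for row in table:
        present = [v for v in row if v is not None]
        result.append(sum(present) / len(present))
    return result


# Hypothetical toy rows with the same column names as a two-choice task config.
toy_rows = [
    {"sample_idx": 0, "model_idx": 0, "correctness": 1.0, "logit_0": -0.3, "logit_1": -1.2},
    {"sample_idx": 1, "model_idx": 0, "correctness": 0.0, "logit_0": -0.9, "logit_1": -0.4},
    {"sample_idx": 2, "model_idx": 0, "correctness": 1.0, "logit_0": -1.5, "logit_1": -0.2},
    {"sample_idx": 0, "model_idx": 1, "correctness": 1.0, "logit_0": -0.1, "logit_1": -2.0},
    {"sample_idx": 1, "model_idx": 1, "correctness": 1.0, "logit_0": -0.2, "logit_1": -1.1},
    {"sample_idx": 2, "model_idx": 1, "correctness": 1.0, "logit_0": -1.8, "logit_1": -0.3},
]

model_ids, sample_ids, table = pivot_correctness(toy_rows)
acc = accuracy_per_model(table)
print(model_ids, sample_ids, acc)
```

Iterating row by row in pure Python is fine for small configs but will be slow on the largest ones; converting the split to a columnar structure first is likely a better fit there.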
## Code and documentation

- Repository: [github.com/arubique/disco-public](https://github.com/arubique/disco-public)
- Hub upload / download helpers: `scripts/model_outputs_hf.py`, `scripts/upload_model_outputs_to_hf.py`, `scripts/download_model_outputs.py`
- Extra notes: [`docs/datasets.md`](https://github.com/arubique/disco-public/blob/main/docs/datasets.md) (paths relative to the GitHub repo)

## License

This card uses `license: other` because the release aggregates **derived statistics** from public Open LLM Leaderboard–style evaluations; confirm any reuse constraints with the original benchmark and leaderboard terms.

## Citation

If you use this dataset, please cite the DISCO paper (see the Hugging Face Papers page above for bibliographic metadata).

Provider: arubique