arubique/disco-model-outputs
Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/arubique/disco-model-outputs
Resource overview:
---
language:
- en
license: other
task_categories:
- other
pretty_name: DISCO Model Outputs (Open LLM Leaderboard)
tags:
- disco
- leaderboard
- mmlu
- hellaswag
- winogrande
- arc
- model-evaluation
dataset_info:
- config_name: arc_challenge
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
- name: logit_4
dtype: float64
splits:
- name: train
num_bytes: 31878400
num_examples: 498100
download_size: 16330950
dataset_size: 31878400
- config_name: hellaswag
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 238999600
num_examples: 4267850
download_size: 136887441
dataset_size: 238999600
- config_name: manifest
features:
- name: format_version
dtype: int64
- name: model_split_name
dtype: string
- name: task_split_name
dtype: string
- name: original_data_key
dtype: string
- name: prediction_width
dtype: int64
splits:
- name: train
num_bytes: 5825
num_examples: 61
download_size: 4450
dataset_size: 5825
- config_name: mmlu_abstract_algebra
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1412734
dataset_size: 2380000
- config_name: mmlu_anatomy
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3213000
num_examples: 57375
download_size: 1923208
dataset_size: 3213000
- config_name: mmlu_astronomy
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3617600
num_examples: 64600
download_size: 2169321
dataset_size: 3617600
- config_name: mmlu_business_ethics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1427584
dataset_size: 2380000
- config_name: mmlu_clinical_knowledge
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 6307000
num_examples: 112625
download_size: 3778231
dataset_size: 6307000
- config_name: mmlu_college_biology
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3427200
num_examples: 61200
download_size: 2056264
dataset_size: 3427200
- config_name: mmlu_college_chemistry
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1418641
dataset_size: 2380000
- config_name: mmlu_college_computer_science
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1420806
dataset_size: 2380000
- config_name: mmlu_college_mathematics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1413541
dataset_size: 2380000
- config_name: mmlu_college_medicine
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4117400
num_examples: 73525
download_size: 2469001
dataset_size: 4117400
- config_name: mmlu_college_physics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2427600
num_examples: 43350
download_size: 1447730
dataset_size: 2427600
- config_name: mmlu_computer_security
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1428622
dataset_size: 2380000
- config_name: mmlu_conceptual_physics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5593000
num_examples: 99875
download_size: 3345346
dataset_size: 5593000
- config_name: mmlu_econometrics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2713200
num_examples: 48450
download_size: 1621991
dataset_size: 2713200
- config_name: mmlu_electrical_engineering
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3451000
num_examples: 61625
download_size: 2063011
dataset_size: 3451000
- config_name: mmlu_elementary_mathematics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 8996400
num_examples: 160650
download_size: 5366853
dataset_size: 8996400
- config_name: mmlu_formal_logic
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2998800
num_examples: 53550
download_size: 1788904
dataset_size: 2998800
- config_name: mmlu_global_facts
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1418484
dataset_size: 2380000
- config_name: mmlu_high_school_biology
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 7378000
num_examples: 131750
download_size: 4426535
dataset_size: 7378000
- config_name: mmlu_high_school_chemistry
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4831400
num_examples: 86275
download_size: 2886551
dataset_size: 4831400
- config_name: mmlu_high_school_computer_science
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1427919
dataset_size: 2380000
- config_name: mmlu_high_school_european_history
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3927000
num_examples: 70125
download_size: 2360586
dataset_size: 3927000
- config_name: mmlu_high_school_geography
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4712400
num_examples: 84150
download_size: 2825192
dataset_size: 4712400
- config_name: mmlu_high_school_government_and_politics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4593400
num_examples: 82025
download_size: 2752593
dataset_size: 4593400
- config_name: mmlu_high_school_macroeconomics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 9282000
num_examples: 165750
download_size: 5557886
dataset_size: 9282000
- config_name: mmlu_high_school_mathematics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 6426000
num_examples: 114750
download_size: 3808662
dataset_size: 6426000
- config_name: mmlu_high_school_microeconomics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5664400
num_examples: 101150
download_size: 3396626
dataset_size: 5664400
- config_name: mmlu_high_school_physics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3593800
num_examples: 64175
download_size: 2141131
dataset_size: 3593800
- config_name: mmlu_high_school_psychology
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 12971000
num_examples: 231625
download_size: 7774585
dataset_size: 12971000
- config_name: mmlu_high_school_statistics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5140800
num_examples: 91800
download_size: 3068728
dataset_size: 5140800
- config_name: mmlu_high_school_us_history
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4855200
num_examples: 86700
download_size: 2916112
dataset_size: 4855200
- config_name: mmlu_high_school_world_history
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5640600
num_examples: 100725
download_size: 3387962
dataset_size: 5640600
- config_name: mmlu_human_aging
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5307400
num_examples: 94775
download_size: 3182651
dataset_size: 5307400
- config_name: mmlu_human_sexuality
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3117800
num_examples: 55675
download_size: 1870905
dataset_size: 3117800
- config_name: mmlu_international_law
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2879800
num_examples: 51425
download_size: 1729487
dataset_size: 2879800
- config_name: mmlu_jurisprudence
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2570400
num_examples: 45900
download_size: 1539580
dataset_size: 2570400
- config_name: mmlu_logical_fallacies
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3879400
num_examples: 69275
download_size: 2329103
dataset_size: 3879400
- config_name: mmlu_machine_learning
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2665600
num_examples: 47600
download_size: 1592247
dataset_size: 2665600
- config_name: mmlu_management
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2451400
num_examples: 43775
download_size: 1470893
dataset_size: 2451400
- config_name: mmlu_marketing
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5569200
num_examples: 99450
download_size: 3342579
dataset_size: 5569200
- config_name: mmlu_medical_genetics
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1429051
dataset_size: 2380000
- config_name: mmlu_miscellaneous
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 18635400
num_examples: 332775
download_size: 11163996
dataset_size: 18635400
- config_name: mmlu_moral_disputes
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 8234800
num_examples: 147050
download_size: 4935460
dataset_size: 8234800
- config_name: mmlu_moral_scenarios
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 21301000
num_examples: 380375
download_size: 12671392
dataset_size: 21301000
- config_name: mmlu_nutrition
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 7282800
num_examples: 130050
download_size: 4363477
dataset_size: 7282800
- config_name: mmlu_philosophy
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 7401800
num_examples: 132175
download_size: 4438968
dataset_size: 7401800
- config_name: mmlu_prehistory
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 7711200
num_examples: 137700
download_size: 4625616
dataset_size: 7711200
- config_name: mmlu_professional_accounting
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 6711600
num_examples: 119850
download_size: 4001714
dataset_size: 6711600
- config_name: mmlu_professional_law
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 36509200
num_examples: 651950
download_size: 21828379
dataset_size: 36509200
- config_name: mmlu_professional_medicine
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 6473600
num_examples: 115600
download_size: 3884673
dataset_size: 6473600
- config_name: mmlu_professional_psychology
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 14565600
num_examples: 260100
download_size: 8738906
dataset_size: 14565600
- config_name: mmlu_public_relations
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2618000
num_examples: 46750
download_size: 1568670
dataset_size: 2618000
- config_name: mmlu_security_studies
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 5831000
num_examples: 104125
download_size: 3501017
dataset_size: 5831000
- config_name: mmlu_sociology
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4783800
num_examples: 85425
download_size: 2871854
dataset_size: 4783800
- config_name: mmlu_us_foreign_policy
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 2380000
num_examples: 42500
download_size: 1429944
dataset_size: 2380000
- config_name: mmlu_virology
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 3950800
num_examples: 70550
download_size: 2366657
dataset_size: 3950800
- config_name: mmlu_world_religions
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
splits:
- name: train
num_bytes: 4069800
num_examples: 72675
download_size: 2437317
dataset_size: 4069800
- config_name: models
features:
- name: model_idx
dtype: int64
- name: model_name
dtype: string
splits:
- name: train
num_bytes: 31075
num_examples: 425
download_size: 14058
dataset_size: 31075
- config_name: truthfulqa_mc_0
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
- name: logit_2
dtype: float64
- name: logit_3
dtype: float64
- name: logit_4
dtype: float64
- name: logit_5
dtype: float64
- name: logit_6
dtype: float64
- name: logit_7
dtype: float64
- name: logit_8
dtype: float64
- name: logit_9
dtype: float64
- name: logit_10
dtype: float64
- name: logit_11
dtype: float64
- name: logit_12
dtype: float64
- name: logit_13
dtype: float64
- name: logit_14
dtype: float64
- name: logit_15
dtype: float64
- name: logit_16
dtype: float64
- name: logit_17
dtype: float64
- name: logit_18
dtype: float64
- name: logit_19
dtype: float64
- name: logit_20
dtype: float64
- name: logit_21
dtype: float64
- name: logit_22
dtype: float64
- name: logit_23
dtype: float64
- name: logit_24
dtype: float64
- name: logit_25
dtype: float64
- name: logit_26
dtype: float64
- name: logit_27
dtype: float64
- name: logit_28
dtype: float64
- name: logit_29
dtype: float64
- name: logit_30
dtype: float64
splits:
- name: train
num_bytes: 94445200
num_examples: 347225
download_size: 36543082
dataset_size: 94445200
- config_name: winogrande
features:
- name: sample_idx
dtype: int64
- name: model_idx
dtype: int64
- name: correctness
dtype: float64
- name: logit_0
dtype: float64
- name: logit_1
dtype: float64
splits:
- name: train
num_bytes: 21539000
num_examples: 538475
download_size: 9700053
dataset_size: 21539000
configs:
- config_name: arc_challenge
data_files:
- split: train
path: arc_challenge/train-*
- config_name: default
data_files:
- split: manifest
path: data/manifest-*
- split: models
path: data/models-*
- config_name: hellaswag
data_files:
- split: train
path: hellaswag/train-*
- config_name: manifest
data_files:
- split: train
path: manifest/train-*
- config_name: mmlu_abstract_algebra
data_files:
- split: train
path: mmlu_abstract_algebra/train-*
- config_name: mmlu_anatomy
data_files:
- split: train
path: mmlu_anatomy/train-*
- config_name: mmlu_astronomy
data_files:
- split: train
path: mmlu_astronomy/train-*
- config_name: mmlu_business_ethics
data_files:
- split: train
path: mmlu_business_ethics/train-*
- config_name: mmlu_clinical_knowledge
data_files:
- split: train
path: mmlu_clinical_knowledge/train-*
- config_name: mmlu_college_biology
data_files:
- split: train
path: mmlu_college_biology/train-*
- config_name: mmlu_college_chemistry
data_files:
- split: train
path: mmlu_college_chemistry/train-*
- config_name: mmlu_college_computer_science
data_files:
- split: train
path: mmlu_college_computer_science/train-*
- config_name: mmlu_college_mathematics
data_files:
- split: train
path: mmlu_college_mathematics/train-*
- config_name: mmlu_college_medicine
data_files:
- split: train
path: mmlu_college_medicine/train-*
- config_name: mmlu_college_physics
data_files:
- split: train
path: mmlu_college_physics/train-*
- config_name: mmlu_computer_security
data_files:
- split: train
path: mmlu_computer_security/train-*
- config_name: mmlu_conceptual_physics
data_files:
- split: train
path: mmlu_conceptual_physics/train-*
- config_name: mmlu_econometrics
data_files:
- split: train
path: mmlu_econometrics/train-*
- config_name: mmlu_electrical_engineering
data_files:
- split: train
path: mmlu_electrical_engineering/train-*
- config_name: mmlu_elementary_mathematics
data_files:
- split: train
path: mmlu_elementary_mathematics/train-*
- config_name: mmlu_formal_logic
data_files:
- split: train
path: mmlu_formal_logic/train-*
- config_name: mmlu_global_facts
data_files:
- split: train
path: mmlu_global_facts/train-*
- config_name: mmlu_high_school_biology
data_files:
- split: train
path: mmlu_high_school_biology/train-*
- config_name: mmlu_high_school_chemistry
data_files:
- split: train
path: mmlu_high_school_chemistry/train-*
- config_name: mmlu_high_school_computer_science
data_files:
- split: train
path: mmlu_high_school_computer_science/train-*
- config_name: mmlu_high_school_european_history
data_files:
- split: train
path: mmlu_high_school_european_history/train-*
- config_name: mmlu_high_school_geography
data_files:
- split: train
path: mmlu_high_school_geography/train-*
- config_name: mmlu_high_school_government_and_politics
data_files:
- split: train
path: mmlu_high_school_government_and_politics/train-*
- config_name: mmlu_high_school_macroeconomics
data_files:
- split: train
path: mmlu_high_school_macroeconomics/train-*
- config_name: mmlu_high_school_mathematics
data_files:
- split: train
path: mmlu_high_school_mathematics/train-*
- config_name: mmlu_high_school_microeconomics
data_files:
- split: train
path: mmlu_high_school_microeconomics/train-*
- config_name: mmlu_high_school_physics
data_files:
- split: train
path: mmlu_high_school_physics/train-*
- config_name: mmlu_high_school_psychology
data_files:
- split: train
path: mmlu_high_school_psychology/train-*
- config_name: mmlu_high_school_statistics
data_files:
- split: train
path: mmlu_high_school_statistics/train-*
- config_name: mmlu_high_school_us_history
data_files:
- split: train
path: mmlu_high_school_us_history/train-*
- config_name: mmlu_high_school_world_history
data_files:
- split: train
path: mmlu_high_school_world_history/train-*
- config_name: mmlu_human_aging
data_files:
- split: train
path: mmlu_human_aging/train-*
- config_name: mmlu_human_sexuality
data_files:
- split: train
path: mmlu_human_sexuality/train-*
- config_name: mmlu_international_law
data_files:
- split: train
path: mmlu_international_law/train-*
- config_name: mmlu_jurisprudence
data_files:
- split: train
path: mmlu_jurisprudence/train-*
- config_name: mmlu_logical_fallacies
data_files:
- split: train
path: mmlu_logical_fallacies/train-*
- config_name: mmlu_machine_learning
data_files:
- split: train
path: mmlu_machine_learning/train-*
- config_name: mmlu_management
data_files:
- split: train
path: mmlu_management/train-*
- config_name: mmlu_marketing
data_files:
- split: train
path: mmlu_marketing/train-*
- config_name: mmlu_medical_genetics
data_files:
- split: train
path: mmlu_medical_genetics/train-*
- config_name: mmlu_miscellaneous
data_files:
- split: train
path: mmlu_miscellaneous/train-*
- config_name: mmlu_moral_disputes
data_files:
- split: train
path: mmlu_moral_disputes/train-*
- config_name: mmlu_moral_scenarios
data_files:
- split: train
path: mmlu_moral_scenarios/train-*
- config_name: mmlu_nutrition
data_files:
- split: train
path: mmlu_nutrition/train-*
- config_name: mmlu_philosophy
data_files:
- split: train
path: mmlu_philosophy/train-*
- config_name: mmlu_prehistory
data_files:
- split: train
path: mmlu_prehistory/train-*
- config_name: mmlu_professional_accounting
data_files:
- split: train
path: mmlu_professional_accounting/train-*
- config_name: mmlu_professional_law
data_files:
- split: train
path: mmlu_professional_law/train-*
- config_name: mmlu_professional_medicine
data_files:
- split: train
path: mmlu_professional_medicine/train-*
- config_name: mmlu_professional_psychology
data_files:
- split: train
path: mmlu_professional_psychology/train-*
- config_name: mmlu_public_relations
data_files:
- split: train
path: mmlu_public_relations/train-*
- config_name: mmlu_security_studies
data_files:
- split: train
path: mmlu_security_studies/train-*
- config_name: mmlu_sociology
data_files:
- split: train
path: mmlu_sociology/train-*
- config_name: mmlu_us_foreign_policy
data_files:
- split: train
path: mmlu_us_foreign_policy/train-*
- config_name: mmlu_virology
data_files:
- split: train
path: mmlu_virology/train-*
- config_name: mmlu_world_religions
data_files:
- split: train
path: mmlu_world_religions/train-*
- config_name: models
data_files:
- split: train
path: models/train-*
- config_name: truthfulqa_mc_0
data_files:
- split: train
path: truthfulqa_mc_0/train-*
- config_name: winogrande
data_files:
- split: train
path: winogrande/train-*
---
# DISCO model outputs
Tabular release of **per-model, per-item correctness and answer scores** used to train and evaluate [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://huggingface.co/papers/2510.07959). The paper studies cheap prediction of benchmark performance from a small subset of evaluation items; this dataset supplies the raw harness-style outputs for MMLU (57 subjects), HellaSwag, Winogrande, ARC-Challenge, and TruthfulQA from the Open LLM Leaderboard ecosystem, covering the 425 models listed in the `models` config.
## Paper
- **Hugging Face Papers:** [2510.07959](https://huggingface.co/papers/2510.07959)
- **arXiv:** [2510.07959](https://arxiv.org/abs/2510.07959)
## How this dataset is built (source pipeline)
The on-disk artifact in the [DISCO codebase](https://github.com/arubique/disco-public) is `data/model_outputs.pickle`. It can be **downloaded** from the Hub (see `scripts/download_model_outputs.py`) or **rebuilt from Open LLM Leaderboard snapshots** using the same steps as in the project README:
1. **Extended leaderboard snapshot** (tinyBenchmarks-style Open LLM Leaderboard data); fetching it can take many hours:
`python ./scripts/download_leaderboard.py --lb_type openllm_leaderboard --lb_savepath ./data/lb_raw_extended.pickle`
2. **MMLU-fields snapshot** (additional models / fields); takes roughly 1 hour:
`python ./scripts/download_leaderboard.py --lb_type mmlu_fields --lb_savepath ./data/lb_raw.pickle`
3. **Merge and extract** into the ordered pickle consumed by DISCO (~20 minutes):
`python scripts/extract_model_outputs_from_raw_data.py`
That pipeline produces `model_outputs.pickle` with a list of model identifiers and, for each harness task, dense arrays of **correctness** and **per-choice scores** (logits / likelihood-style values as stored by the harness). The Hub upload script flattens those arrays into viewer-friendly tables.
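The flattening step can be pictured with a minimal sketch. It is not the code in `scripts/upload_model_outputs_to_hf.py`; the function name `flatten_task_outputs` and the assumed array layout (correctness as `(n_models, n_samples)`, scores as `(n_models, n_samples, K)`) are illustrative assumptions, and the real pickle may be organised differently.

```python
import numpy as np
import pandas as pd


def flatten_task_outputs(correctness, scores):
    """Turn dense per-task arrays into one long-format table.

    Assumed (hypothetical) layout:
      correctness: shape (n_models, n_samples)
      scores:      shape (n_models, n_samples, K), one score per answer choice
    """
    n_models, n_samples, k = scores.shape
    # Index grids with the same (n_models, n_samples) shape as the data.
    model_idx, sample_idx = np.meshgrid(
        np.arange(n_models), np.arange(n_samples), indexing="ij"
    )
    table = {
        "sample_idx": sample_idx.ravel(),
        "model_idx": model_idx.ravel(),
        "correctness": np.asarray(correctness, dtype=np.float64).ravel(),
    }
    # One logit_j column per answer choice, matching the Hub column names.
    for j in range(k):
        table[f"logit_{j}"] = np.asarray(scores[:, :, j], dtype=np.float64).ravel()
    return pd.DataFrame(table)
```

Each (model, item) pair becomes one row, so a task with `n_models` models and `n_samples` items yields `n_models * n_samples` rows.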
## Hub layout (configs and columns)
This repository is a **multi-config** dataset. Each config corresponds to one logical table; within each config the split is named **`train`**.
| Config | Role |
|--------|------|
| `manifest` | Maps Hub config names to original harness keys (`task_split_name` → `original_data_key`). |
| `models` | `model_idx`, `model_name` — one row per model. |
| *task configs* | e.g. `hellaswag`, `mmlu_abstract_algebra`, … in long format: `sample_idx`, `model_idx`, `correctness`, and `logit_0` … `logit_{K-1}`, one column per answer choice (`K` varies by task, e.g. 4 for `hellaswag`, 5 for `arc_challenge`). |
Pick a **subset** in the dataset viewer, then open the **`train`** split to inspect rows.
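For programmatic access, the sketch below shows one way the long-format tables could be loaded and reshaped. It assumes the `datasets` and `pandas` packages are installed; the helper names (`load_config`, `correctness_matrix`, `accuracy_by_model`) are hypothetical and are not part of the DISCO codebase.

```python
import pandas as pd

REPO_ID = "arubique/disco-model-outputs"


def load_config(config_name, repo_id=REPO_ID):
    """Download one config (e.g. "hellaswag" or "models") as a DataFrame."""
    # Imported lazily so the reshaping helpers work without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split="train").to_pandas()


def correctness_matrix(task_df):
    """Pivot a long task table into a dense model x sample matrix."""
    return task_df.pivot(index="model_idx", columns="sample_idx", values="correctness")


def accuracy_by_model(task_df, models_df):
    """Mean correctness per model, joined with the `models` table for names."""
    acc = (
        task_df.groupby("model_idx")["correctness"]
        .mean()
        .rename("accuracy")
        .reset_index()
    )
    return acc.merge(models_df, on="model_idx", how="left")
```

A typical use would be `accuracy_by_model(load_config("hellaswag"), load_config("models"))`, which needs network access to the Hub.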
## Code and documentation
- Repository: [github.com/arubique/disco-public](https://github.com/arubique/disco-public)
- Hub upload / download helpers: `scripts/model_outputs_hf.py`, `scripts/upload_model_outputs_to_hf.py`, `scripts/download_model_outputs.py`
- Extra notes: [`docs/datasets.md`](https://github.com/arubique/disco-public/blob/main/docs/datasets.md) (paths relative to the GitHub repo)
## License
This card uses `license: other` because the release aggregates **derived statistics** from public Open LLM Leaderboard–style evaluations; confirm any reuse constraints with the original benchmark and leaderboard terms.
## Citation
If you use this dataset, please cite the DISCO paper (see the Hugging Face Papers page above for bibliographic metadata).
Provider:
arubique



