
open-llm-leaderboard-old/details_TeeZee__DarkSapling-7B-v2.0

Source: Hugging Face. Updated: 2024-03-10. Indexed: 2024-06-22.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_TeeZee__DarkSapling-7B-v2.0
Resource description:
--- pretty_name: Evaluation run of TeeZee/DarkSapling-7B-v2.0 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [TeeZee/DarkSapling-7B-v2.0](https://huggingface.co/TeeZee/DarkSapling-7B-v2.0)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_TeeZee__DarkSapling-7B-v2.0\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-03-10T04:57:12.333081](https://huggingface.co/datasets/open-llm-leaderboard/details_TeeZee__DarkSapling-7B-v2.0/blob/main/results_2024-03-10T04-57-12.333081.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6424579193008534,\n\ \ \"acc_stderr\": 0.032218866498356466,\n \"acc_norm\": 0.6471361795899754,\n\ \ \"acc_norm_stderr\": 0.032858397843778114,\n \"mc1\": 0.3659730722154223,\n\ \ \"mc1_stderr\": 0.016862941684088376,\n \"mc2\": 0.5221487837375264,\n\ \ \"mc2_stderr\": 0.015253502717954797\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6023890784982935,\n \"acc_stderr\": 0.01430175222327954,\n\ \ \"acc_norm\": 0.6416382252559727,\n \"acc_norm_stderr\": 0.014012883334859857\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6589324835690101,\n\ \ \"acc_stderr\": 0.004730991357194292,\n \"acc_norm\": 0.8510256920932086,\n\ \ \"acc_norm_stderr\": 0.003553354528132355\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.36,\n \"acc_stderr\": 0.04824181513244218,\n \ \ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.04824181513244218\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6296296296296297,\n\ \ \"acc_stderr\": 0.041716541613545426,\n \"acc_norm\": 0.6296296296296297,\n\ \ \"acc_norm_stderr\": 0.041716541613545426\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6710526315789473,\n \"acc_stderr\": 0.03823428969926604,\n\ \ \"acc_norm\": 0.6710526315789473,\n \"acc_norm_stderr\": 0.03823428969926604\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.58,\n\ \ \"acc_stderr\": 0.049604496374885836,\n \"acc_norm\": 0.58,\n \ \ \"acc_norm_stderr\": 0.049604496374885836\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7056603773584905,\n \"acc_stderr\": 0.028049186315695248,\n\ \ \"acc_norm\": 0.7056603773584905,\n \"acc_norm_stderr\": 0.028049186315695248\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7361111111111112,\n\ \ \"acc_stderr\": 0.03685651095897532,\n \"acc_norm\": 0.7361111111111112,\n\ \ \"acc_norm_stderr\": 
0.03685651095897532\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.5,\n \"acc_stderr\": 0.050251890762960605,\n \ \ \"acc_norm\": 0.5,\n \"acc_norm_stderr\": 0.050251890762960605\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.53,\n \"acc_stderr\": 0.05016135580465919,\n \"acc_norm\": 0.53,\n\ \ \"acc_norm_stderr\": 0.05016135580465919\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.36,\n \"acc_stderr\": 0.048241815132442176,\n \ \ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.048241815132442176\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.630057803468208,\n\ \ \"acc_stderr\": 0.0368122963339432,\n \"acc_norm\": 0.630057803468208,\n\ \ \"acc_norm_stderr\": 0.0368122963339432\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.43137254901960786,\n \"acc_stderr\": 0.04928099597287534,\n\ \ \"acc_norm\": 0.43137254901960786,\n \"acc_norm_stderr\": 0.04928099597287534\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.77,\n \"acc_stderr\": 0.042295258468165065,\n \"acc_norm\": 0.77,\n\ \ \"acc_norm_stderr\": 0.042295258468165065\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5914893617021276,\n \"acc_stderr\": 0.032134180267015755,\n\ \ \"acc_norm\": 0.5914893617021276,\n \"acc_norm_stderr\": 0.032134180267015755\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.4824561403508772,\n\ \ \"acc_stderr\": 0.0470070803355104,\n \"acc_norm\": 0.4824561403508772,\n\ \ \"acc_norm_stderr\": 0.0470070803355104\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5862068965517241,\n \"acc_stderr\": 0.04104269211806232,\n\ \ \"acc_norm\": 0.5862068965517241,\n \"acc_norm_stderr\": 0.04104269211806232\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.3994708994708995,\n \"acc_stderr\": 0.025225450284067887,\n \"\ acc_norm\": 
0.3994708994708995,\n \"acc_norm_stderr\": 0.025225450284067887\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4126984126984127,\n\ \ \"acc_stderr\": 0.04403438954768177,\n \"acc_norm\": 0.4126984126984127,\n\ \ \"acc_norm_stderr\": 0.04403438954768177\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.35,\n \"acc_stderr\": 0.047937248544110196,\n \ \ \"acc_norm\": 0.35,\n \"acc_norm_stderr\": 0.047937248544110196\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.7709677419354839,\n \"acc_stderr\": 0.023904914311782648,\n \"\ acc_norm\": 0.7709677419354839,\n \"acc_norm_stderr\": 0.023904914311782648\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.5123152709359606,\n \"acc_stderr\": 0.035169204442208966,\n \"\ acc_norm\": 0.5123152709359606,\n \"acc_norm_stderr\": 0.035169204442208966\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\"\ : 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.793939393939394,\n \"acc_stderr\": 0.03158415324047711,\n\ \ \"acc_norm\": 0.793939393939394,\n \"acc_norm_stderr\": 0.03158415324047711\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.803030303030303,\n \"acc_stderr\": 0.028335609732463355,\n \"\ acc_norm\": 0.803030303030303,\n \"acc_norm_stderr\": 0.028335609732463355\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8860103626943006,\n \"acc_stderr\": 0.022935144053919443,\n\ \ \"acc_norm\": 0.8860103626943006,\n \"acc_norm_stderr\": 0.022935144053919443\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6666666666666666,\n \"acc_stderr\": 0.023901157979402534,\n\ \ \"acc_norm\": 0.6666666666666666,\n \"acc_norm_stderr\": 0.023901157979402534\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.362962962962963,\n \"acc_stderr\": 0.029318203645206865,\n \ \ \"acc_norm\": 0.362962962962963,\n \"acc_norm_stderr\": 0.029318203645206865\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6680672268907563,\n \"acc_stderr\": 0.03058869701378364,\n \ \ \"acc_norm\": 0.6680672268907563,\n \"acc_norm_stderr\": 0.03058869701378364\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.2980132450331126,\n \"acc_stderr\": 0.037345356767871984,\n \"\ acc_norm\": 0.2980132450331126,\n \"acc_norm_stderr\": 0.037345356767871984\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8238532110091743,\n \"acc_stderr\": 0.016332882393431385,\n \"\ acc_norm\": 0.8238532110091743,\n \"acc_norm_stderr\": 0.016332882393431385\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5231481481481481,\n \"acc_stderr\": 0.03406315360711507,\n \"\ acc_norm\": 0.5231481481481481,\n \"acc_norm_stderr\": 0.03406315360711507\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7892156862745098,\n \"acc_stderr\": 0.028626547912437406,\n \"\ acc_norm\": 0.7892156862745098,\n \"acc_norm_stderr\": 0.028626547912437406\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n \ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6771300448430493,\n\ \ \"acc_stderr\": 0.03138147637575499,\n \"acc_norm\": 0.6771300448430493,\n\ \ \"acc_norm_stderr\": 0.03138147637575499\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7709923664122137,\n \"acc_stderr\": 0.036853466317118506,\n\ \ \"acc_norm\": 0.7709923664122137,\n \"acc_norm_stderr\": 0.036853466317118506\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7851239669421488,\n \"acc_stderr\": 0.037494924487096966,\n \"\ acc_norm\": 0.7851239669421488,\n \"acc_norm_stderr\": 0.037494924487096966\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7685185185185185,\n\ \ \"acc_stderr\": 0.04077494709252627,\n \"acc_norm\": 0.7685185185185185,\n\ \ \"acc_norm_stderr\": 0.04077494709252627\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7668711656441718,\n \"acc_stderr\": 0.0332201579577674,\n\ \ \"acc_norm\": 0.7668711656441718,\n \"acc_norm_stderr\": 0.0332201579577674\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.49107142857142855,\n\ \ \"acc_stderr\": 0.04745033255489123,\n \"acc_norm\": 0.49107142857142855,\n\ \ \"acc_norm_stderr\": 0.04745033255489123\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8155339805825242,\n \"acc_stderr\": 0.03840423627288276,\n\ \ \"acc_norm\": 0.8155339805825242,\n \"acc_norm_stderr\": 0.03840423627288276\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8760683760683761,\n\ \ \"acc_stderr\": 0.02158649400128139,\n \"acc_norm\": 0.8760683760683761,\n\ \ \"acc_norm_stderr\": 0.02158649400128139\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.77,\n \"acc_stderr\": 0.04229525846816508,\n \ \ \"acc_norm\": 0.77,\n \"acc_norm_stderr\": 0.04229525846816508\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8173690932311622,\n\ \ \"acc_stderr\": 0.013816335389973136,\n \"acc_norm\": 0.8173690932311622,\n\ \ \"acc_norm_stderr\": 0.013816335389973136\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7254335260115607,\n \"acc_stderr\": 0.024027745155265012,\n\ \ \"acc_norm\": 0.7254335260115607,\n \"acc_norm_stderr\": 0.024027745155265012\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3486033519553073,\n\ \ \"acc_stderr\": 0.015937484656687033,\n \"acc_norm\": 
0.3486033519553073,\n\ \ \"acc_norm_stderr\": 0.015937484656687033\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7516339869281046,\n \"acc_stderr\": 0.02473998135511359,\n\ \ \"acc_norm\": 0.7516339869281046,\n \"acc_norm_stderr\": 0.02473998135511359\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.707395498392283,\n\ \ \"acc_stderr\": 0.02583989833487798,\n \"acc_norm\": 0.707395498392283,\n\ \ \"acc_norm_stderr\": 0.02583989833487798\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7283950617283951,\n \"acc_stderr\": 0.024748624490537368,\n\ \ \"acc_norm\": 0.7283950617283951,\n \"acc_norm_stderr\": 0.024748624490537368\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.48226950354609927,\n \"acc_stderr\": 0.02980873964223777,\n \ \ \"acc_norm\": 0.48226950354609927,\n \"acc_norm_stderr\": 0.02980873964223777\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4491525423728814,\n\ \ \"acc_stderr\": 0.012704030518851488,\n \"acc_norm\": 0.4491525423728814,\n\ \ \"acc_norm_stderr\": 0.012704030518851488\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6764705882352942,\n \"acc_stderr\": 0.028418208619406755,\n\ \ \"acc_norm\": 0.6764705882352942,\n \"acc_norm_stderr\": 0.028418208619406755\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6813725490196079,\n \"acc_stderr\": 0.018850084696468712,\n \ \ \"acc_norm\": 0.6813725490196079,\n \"acc_norm_stderr\": 0.018850084696468712\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6636363636363637,\n\ \ \"acc_stderr\": 0.04525393596302506,\n \"acc_norm\": 0.6636363636363637,\n\ \ \"acc_norm_stderr\": 0.04525393596302506\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7346938775510204,\n \"acc_stderr\": 0.028263889943784593,\n\ \ \"acc_norm\": 0.7346938775510204,\n \"acc_norm_stderr\": 0.028263889943784593\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8308457711442786,\n\ \ \"acc_stderr\": 0.026508590656233268,\n \"acc_norm\": 0.8308457711442786,\n\ \ \"acc_norm_stderr\": 0.026508590656233268\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.88,\n \"acc_stderr\": 0.03265986323710906,\n \ \ \"acc_norm\": 0.88,\n \"acc_norm_stderr\": 0.03265986323710906\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5602409638554217,\n\ \ \"acc_stderr\": 0.03864139923699122,\n \"acc_norm\": 0.5602409638554217,\n\ \ \"acc_norm_stderr\": 0.03864139923699122\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8245614035087719,\n \"acc_stderr\": 0.029170885500727668,\n\ \ \"acc_norm\": 0.8245614035087719,\n \"acc_norm_stderr\": 0.029170885500727668\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3659730722154223,\n\ \ \"mc1_stderr\": 0.016862941684088376,\n \"mc2\": 0.5221487837375264,\n\ \ \"mc2_stderr\": 0.015253502717954797\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7861089187056038,\n \"acc_stderr\": 0.011524466954090248\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.4541319181197877,\n \ \ \"acc_stderr\": 0.01371441094526456\n }\n}\n```" repo_url: https://huggingface.co/TeeZee/DarkSapling-7B-v2.0 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|arc:challenge|25_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-03-10T04-57-12.333081.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|gsm8k|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_03_10T04_57_12.333081 path: - 
'**/details_harness|hellaswag|10_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-10T04-57-12.333081.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-10T04-57-12.333081.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-10T04-57-12.333081.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-10T04-57-12.333081.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-10T04-57-12.333081.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-10T04-57-12.333081.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-10T04-57-12.333081.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-management|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-virology|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-10T04-57-12.333081.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|truthfulqa:mc|0_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-03-10T04-57-12.333081.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_03_10T04_57_12.333081 path: - '**/details_harness|winogrande|5_2024-03-10T04-57-12.333081.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-03-10T04-57-12.333081.parquet' - config_name: results data_files: - split: 
2024_03_10T04_57_12.333081 path: - results_2024-03-10T04-57-12.333081.parquet - split: latest path: - results_2024-03-10T04-57-12.333081.parquet ---

# Dataset Card for Evaluation run of TeeZee/DarkSapling-7B-v2.0

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [TeeZee/DarkSapling-7B-v2.0](https://huggingface.co/TeeZee/DarkSapling-7B-v2.0) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# "latest" is the split name defined in this card's configuration metadata
data = load_dataset("open-llm-leaderboard/details_TeeZee__DarkSapling-7B-v2.0",
	"harness_winogrande_5",
	split="latest")
```

## Latest results

These are the [latest results from run 2024-03-10T04:57:12.333081](https://huggingface.co/datasets/open-llm-leaderboard/details_TeeZee__DarkSapling-7B-v2.0/blob/main/results_2024-03-10T04-57-12.333081.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6424579193008534, "acc_stderr": 0.032218866498356466, "acc_norm": 0.6471361795899754, "acc_norm_stderr": 0.032858397843778114, "mc1": 0.3659730722154223, "mc1_stderr": 0.016862941684088376, "mc2": 0.5221487837375264, "mc2_stderr": 0.015253502717954797 }, "harness|arc:challenge|25": { "acc": 0.6023890784982935, "acc_stderr": 0.01430175222327954, "acc_norm": 0.6416382252559727, "acc_norm_stderr": 0.014012883334859857 }, "harness|hellaswag|10": { "acc": 0.6589324835690101, "acc_stderr": 0.004730991357194292, "acc_norm": 0.8510256920932086, "acc_norm_stderr": 0.003553354528132355 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.36, "acc_stderr": 0.04824181513244218, "acc_norm": 0.36, "acc_norm_stderr": 0.04824181513244218 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6296296296296297, "acc_stderr": 0.041716541613545426, "acc_norm": 0.6296296296296297, "acc_norm_stderr": 0.041716541613545426 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6710526315789473, "acc_stderr": 0.03823428969926604, "acc_norm": 0.6710526315789473, "acc_norm_stderr": 0.03823428969926604 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.58, "acc_stderr": 0.049604496374885836, "acc_norm": 0.58, "acc_norm_stderr": 0.049604496374885836 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7056603773584905, "acc_stderr": 0.028049186315695248, "acc_norm": 0.7056603773584905, "acc_norm_stderr": 0.028049186315695248 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7361111111111112, "acc_stderr": 0.03685651095897532, "acc_norm": 0.7361111111111112, "acc_norm_stderr": 0.03685651095897532 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.5, "acc_stderr": 0.050251890762960605, "acc_norm": 0.5, "acc_norm_stderr": 0.050251890762960605 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.53, "acc_stderr": 0.05016135580465919, "acc_norm": 0.53, 
"acc_norm_stderr": 0.05016135580465919 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.36, "acc_stderr": 0.048241815132442176, "acc_norm": 0.36, "acc_norm_stderr": 0.048241815132442176 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.630057803468208, "acc_stderr": 0.0368122963339432, "acc_norm": 0.630057803468208, "acc_norm_stderr": 0.0368122963339432 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.43137254901960786, "acc_stderr": 0.04928099597287534, "acc_norm": 0.43137254901960786, "acc_norm_stderr": 0.04928099597287534 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.77, "acc_stderr": 0.042295258468165065, "acc_norm": 0.77, "acc_norm_stderr": 0.042295258468165065 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5914893617021276, "acc_stderr": 0.032134180267015755, "acc_norm": 0.5914893617021276, "acc_norm_stderr": 0.032134180267015755 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.4824561403508772, "acc_stderr": 0.0470070803355104, "acc_norm": 0.4824561403508772, "acc_norm_stderr": 0.0470070803355104 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5862068965517241, "acc_stderr": 0.04104269211806232, "acc_norm": 0.5862068965517241, "acc_norm_stderr": 0.04104269211806232 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.3994708994708995, "acc_stderr": 0.025225450284067887, "acc_norm": 0.3994708994708995, "acc_norm_stderr": 0.025225450284067887 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4126984126984127, "acc_stderr": 0.04403438954768177, "acc_norm": 0.4126984126984127, "acc_norm_stderr": 0.04403438954768177 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.35, "acc_stderr": 0.047937248544110196, "acc_norm": 0.35, "acc_norm_stderr": 0.047937248544110196 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7709677419354839, "acc_stderr": 0.023904914311782648, "acc_norm": 0.7709677419354839, "acc_norm_stderr": 0.023904914311782648 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5123152709359606, "acc_stderr": 0.035169204442208966, "acc_norm": 0.5123152709359606, "acc_norm_stderr": 0.035169204442208966 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.793939393939394, "acc_stderr": 0.03158415324047711, "acc_norm": 0.793939393939394, "acc_norm_stderr": 0.03158415324047711 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.803030303030303, "acc_stderr": 0.028335609732463355, "acc_norm": 0.803030303030303, "acc_norm_stderr": 0.028335609732463355 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8860103626943006, "acc_stderr": 0.022935144053919443, "acc_norm": 0.8860103626943006, "acc_norm_stderr": 0.022935144053919443 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6666666666666666, "acc_stderr": 0.023901157979402534, "acc_norm": 0.6666666666666666, "acc_norm_stderr": 0.023901157979402534 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.362962962962963, "acc_stderr": 0.029318203645206865, "acc_norm": 0.362962962962963, "acc_norm_stderr": 0.029318203645206865 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6680672268907563, "acc_stderr": 0.03058869701378364, "acc_norm": 0.6680672268907563, "acc_norm_stderr": 0.03058869701378364 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.2980132450331126, "acc_stderr": 0.037345356767871984, "acc_norm": 0.2980132450331126, "acc_norm_stderr": 0.037345356767871984 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8238532110091743, "acc_stderr": 0.016332882393431385, "acc_norm": 0.8238532110091743, "acc_norm_stderr": 0.016332882393431385 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5231481481481481, "acc_stderr": 
0.03406315360711507, "acc_norm": 0.5231481481481481, "acc_norm_stderr": 0.03406315360711507 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7892156862745098, "acc_stderr": 0.028626547912437406, "acc_norm": 0.7892156862745098, "acc_norm_stderr": 0.028626547912437406 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6771300448430493, "acc_stderr": 0.03138147637575499, "acc_norm": 0.6771300448430493, "acc_norm_stderr": 0.03138147637575499 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7709923664122137, "acc_stderr": 0.036853466317118506, "acc_norm": 0.7709923664122137, "acc_norm_stderr": 0.036853466317118506 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7851239669421488, "acc_stderr": 0.037494924487096966, "acc_norm": 0.7851239669421488, "acc_norm_stderr": 0.037494924487096966 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7685185185185185, "acc_stderr": 0.04077494709252627, "acc_norm": 0.7685185185185185, "acc_norm_stderr": 0.04077494709252627 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7668711656441718, "acc_stderr": 0.0332201579577674, "acc_norm": 0.7668711656441718, "acc_norm_stderr": 0.0332201579577674 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.49107142857142855, "acc_stderr": 0.04745033255489123, "acc_norm": 0.49107142857142855, "acc_norm_stderr": 0.04745033255489123 }, "harness|hendrycksTest-management|5": { "acc": 0.8155339805825242, "acc_stderr": 0.03840423627288276, "acc_norm": 0.8155339805825242, "acc_norm_stderr": 0.03840423627288276 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8760683760683761, "acc_stderr": 0.02158649400128139, "acc_norm": 0.8760683760683761, "acc_norm_stderr": 0.02158649400128139 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.77, "acc_stderr": 
0.04229525846816508, "acc_norm": 0.77, "acc_norm_stderr": 0.04229525846816508 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8173690932311622, "acc_stderr": 0.013816335389973136, "acc_norm": 0.8173690932311622, "acc_norm_stderr": 0.013816335389973136 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7254335260115607, "acc_stderr": 0.024027745155265012, "acc_norm": 0.7254335260115607, "acc_norm_stderr": 0.024027745155265012 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.3486033519553073, "acc_stderr": 0.015937484656687033, "acc_norm": 0.3486033519553073, "acc_norm_stderr": 0.015937484656687033 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7516339869281046, "acc_stderr": 0.02473998135511359, "acc_norm": 0.7516339869281046, "acc_norm_stderr": 0.02473998135511359 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.707395498392283, "acc_stderr": 0.02583989833487798, "acc_norm": 0.707395498392283, "acc_norm_stderr": 0.02583989833487798 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7283950617283951, "acc_stderr": 0.024748624490537368, "acc_norm": 0.7283950617283951, "acc_norm_stderr": 0.024748624490537368 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.48226950354609927, "acc_stderr": 0.02980873964223777, "acc_norm": 0.48226950354609927, "acc_norm_stderr": 0.02980873964223777 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4491525423728814, "acc_stderr": 0.012704030518851488, "acc_norm": 0.4491525423728814, "acc_norm_stderr": 0.012704030518851488 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6764705882352942, "acc_stderr": 0.028418208619406755, "acc_norm": 0.6764705882352942, "acc_norm_stderr": 0.028418208619406755 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6813725490196079, "acc_stderr": 0.018850084696468712, "acc_norm": 0.6813725490196079, "acc_norm_stderr": 0.018850084696468712 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6636363636363637, "acc_stderr": 
0.04525393596302506, "acc_norm": 0.6636363636363637, "acc_norm_stderr": 0.04525393596302506 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7346938775510204, "acc_stderr": 0.028263889943784593, "acc_norm": 0.7346938775510204, "acc_norm_stderr": 0.028263889943784593 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8308457711442786, "acc_stderr": 0.026508590656233268, "acc_norm": 0.8308457711442786, "acc_norm_stderr": 0.026508590656233268 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.88, "acc_stderr": 0.03265986323710906, "acc_norm": 0.88, "acc_norm_stderr": 0.03265986323710906 }, "harness|hendrycksTest-virology|5": { "acc": 0.5602409638554217, "acc_stderr": 0.03864139923699122, "acc_norm": 0.5602409638554217, "acc_norm_stderr": 0.03864139923699122 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8245614035087719, "acc_stderr": 0.029170885500727668, "acc_norm": 0.8245614035087719, "acc_norm_stderr": 0.029170885500727668 }, "harness|truthfulqa:mc|0": { "mc1": 0.3659730722154223, "mc1_stderr": 0.016862941684088376, "mc2": 0.5221487837375264, "mc2_stderr": 0.015253502717954797 }, "harness|winogrande|5": { "acc": 0.7861089187056038, "acc_stderr": 0.011524466954090248 }, "harness|gsm8k|5": { "acc": 0.4541319181197877, "acc_stderr": 0.01371441094526456 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. 
-->

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
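Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset summary

This dataset was created automatically during the evaluation run of the model TeeZee/DarkSapling-7B-v2.0 on the Open LLM Leaderboard.

Dataset composition

  • The dataset consists of 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a separate split within each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregate metrics on the Open LLM Leaderboard.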
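
The configuration and split names listed in this card appear to follow a simple pattern relative to the task keys and the run timestamp shown in the results JSON. Below is a minimal sketch of that mapping, inferred only from the names visible in this card. The helper names `config_name` and `split_name` are hypothetical and not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|truthfulqa:mc|0' to a configuration name.

    Assumption: '|', ':' and '-' each become '_', as the names in this card suggest.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to its split name, keeping the fractional-seconds dot.

    Assumption: '-' and ':' each become '_', as in the split names in this card.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name("2024-03-10T04:57:12.333081"))
```

Deriving the names this way avoids typing 63 configuration names by hand, but the pattern is an observation about this card, not a documented guarantee.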

Data loading example

```python
from datasets import load_dataset

# "latest" is the split name defined in the configuration metadata above
data = load_dataset(
    "open-llm-leaderboard/details_TeeZee__DarkSapling-7B-v2.0",
    "harness_winogrande_5",
    split="latest",
)
```

最新结果

以下是 2024-03-10T04:57:12.333081 运行的最新结果

```python
{
    "all": {"acc": 0.6424579193008534, "acc_stderr": 0.032218866498356466, "acc_norm": 0.6471361795899754, "acc_norm_stderr": 0.032858397843778114, "mc1": 0.3659730722154223, "mc1_stderr": 0.016862941684088376, "mc2": 0.5221487837375264, "mc2_stderr": 0.015253502717954797},
    "harness|arc:challenge|25": {"acc": 0.6023890784982935, "acc_stderr": 0.01430175222327954, "acc_norm": 0.6416382252559727, "acc_norm_stderr": 0.014012883334859857},
    "harness|hellaswag|10": {"acc": 0.6589324835690101, "acc_stderr": 0.004730991357194292, "acc_norm": 0.8510256920932086, "acc_norm_stderr": 0.003553354528132355},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.36, "acc_stderr": 0.04824181513244218, "acc_norm": 0.36, "acc_norm_stderr": 0.04824181513244218},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6296296296296297, "acc_stderr": 0.041716541613545426, "acc_norm": 0.6296296296296297, "acc_norm_stderr": 0.041716541613545426},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.6710526315789473, "acc_stderr": 0.03823428969926604, "acc_norm": 0.6710526315789473, "acc_norm_stderr": 0.03823428969926604},
    "harness|hendrycksTest-business_ethics|5": {"acc": 0.58, "acc_stderr": 0.049604496374885836, "acc_norm": 0.58, "acc_norm_stderr": 0.049604496374885836},
    "harness|hendrycksTest-clinical_knowledge|5": {"acc": 0.7056603773584905, "acc_stderr": 0.028049186315695248, "acc_norm": 0.7056603773584905, "acc_norm_stderr": 0.028049186315695248},
    "harness|hendrycksTest-college_biology|5": {"acc": 0.7361111111111112, "acc_stderr": 0.03685651095897532, "acc_norm": 0.7361111111111112, "acc_norm_stderr": 0.03685651095897532},
    "harness|hendrycksTest-college_chemistry|5": {"acc": 0.5, "acc_stderr": 0.050251890762960605, "acc_norm": 0.5, "acc_norm_stderr": 0.050251890762960605},
    "harness|hendrycksTest-college_computer_science|5": {"acc": 0.53, "acc_stderr": 0.05016135580465919, "acc_norm": 0.53, "acc_norm_stderr": 0.05016135580465919},
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.36, "acc_stderr": 0.048241815132442176, "acc_norm": 0.36, "acc_norm_stderr": 0.048241815132442176},
    "harness|hendrycksTest-college_medicine|5": {"acc": 0.630057803468208, "acc_stderr": 0.0368122963339432, "acc_norm": 0.630057803468208, "acc_norm_stderr": 0.0368122963339432},
    "harness|hendrycksTest-college_physics|5": {"acc": 0.43137254901960786, "acc_stderr": 0.04928099597287534, "acc_norm": 0.43137254901960786, "acc_norm_stderr": 0.04928099597287534},
    "harness|hendrycksTest-computer_security|5": {"acc": 0.77, "acc_stderr": 0.042295258468165065, "acc_norm": 0.77, "acc_norm_stderr": 0.042295258468165065},
    "harness|hendrycksTest-conceptual_physics|5": {"acc": 0.5914893617021276, "acc_stderr": 0.032134180267015755, "acc_norm": 0.5914893617021276, "acc_norm_stderr": 0.032134180267015755},
    "harness|hendrycksTest-econometrics|5": {"acc": 0.4824561403508772, "acc_stderr": 0.0470070803355104, "acc_norm": 0.4824561403508772, "acc_norm_stderr": 0.0470070803355104},
    "harness|hendrycksTest-electrical_engineering|5": {"acc": 0.5862068965517241, "acc_stderr": 0.04104269211806232, "acc_norm": 0.5862068965517241, "acc_norm_stderr": 0.04104269211806232},
    "harness|hendrycksTest-elementary_mathematics|5": {"acc": 0.3994708994708995, "acc_stderr": 0.025225450284067887, "acc_norm": 0.3994708994708995, "acc_norm_stderr": 0.025225450284067887},
    "harness|hendrycksTest-formal_logic|5": {"acc": 0.4126984126984127, "acc_stderr": 0.04403438954768177, "acc_norm": 0.4126984126984127, "acc_norm_stderr": 0.04403438954768177},
    "harness|hendrycksTest-global_facts|5": {"acc": 0.35, "acc_stderr": 0.047937248544110196, "acc_norm": 0.35, "acc_norm_stderr": 0.047937248544110196},
    "harness|hendrycksTest-high_school_biology|5": {"acc": 0.7709677419354839, "acc_stderr": 0.023904914311782648, "acc_norm": 0.7709677419354839, "acc_norm_stderr": 0.023904914311782648},
    "harness|hendrycksTest-high_school_chemistry|5": {"acc": 0.5123152709359606, "acc_stderr": 0.035169204442208966, "acc_norm": 0.5123152709359606, "acc_norm_stderr": 0.035169204442208966},
    "harness|hendrycksTest-high_school_computer_science|5": {"acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814},
    "harness|hendrycksTest-high_school_european_history|5": {"acc": 0.793939393939394, "acc_stderr": 0.03158415324047711, "acc_norm": 0.793939393939394, "acc_norm_stderr": 0.03158415324047711},
    "harness|hendrycksTest-high_school_geography|5": {"acc": 0.803030303030303, "acc_stderr": 0.028335609732463355, "acc_norm": 0.803030303030303, "acc_norm_stderr": 0.028335609732463355},
    "harness|hendrycksTest-high_school_government_and_politics|5": {"acc": 0.8860103626943006, "acc_stderr": 0.022935144053919443, "acc_norm": 0.8860103626943006, "acc_norm_stderr": 0.022935144053919443},
    "harness|hendrycksTest-high_school_macroeconomics|5": {"acc": 0.6666666666666666, "acc_stderr": 0.023901157979402534, "acc_norm": 0.6666666666666666, "acc_norm_stderr": 0.023901157979402534},
    "harness|hendrycksTest-high_school_mathematics|5": {"acc": 0.362962962962963, "acc_stderr": 0.029318203645206865, "acc_norm": 0.3629629
```
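The per-task scores above can be summarized by averaging a metric over a task family. The following is a minimal sketch, assuming an unweighted mean over the `hendrycksTest` (MMLU) entries; only three entries from the results are copied here, the helper name `mean_metric` is hypothetical, and the leaderboard's own aggregation may differ.

```python
from statistics import mean

# A few entries copied from the results above; the full dictionary has many more.
results = {
    "harness|arc:challenge|25": {"acc": 0.6023890784982935, "acc_norm": 0.6416382252559727},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.36},
    "harness|hendrycksTest-business_ethics|5": {"acc": 0.58},
    "harness|hendrycksTest-college_chemistry|5": {"acc": 0.5},
}


def mean_metric(results, task_prefix, metric="acc"):
    """Unweighted mean of `metric` over tasks whose key starts with `task_prefix`.

    Hypothetical helper for illustration; not part of any leaderboard tooling.
    """
    values = [
        scores[metric]
        for task, scores in results.items()
        if task.startswith(task_prefix) and metric in scores
    ]
    if not values:
        raise ValueError(f"no task matching {task_prefix!r} reports {metric!r}")
    return mean(values)


# Average accuracy over the three MMLU subjects listed in `results`.
mmlu_subset_mean = mean_metric(results, "harness|hendrycksTest-")
print(round(mmlu_subset_mean, 4))
```

Because the mean is unweighted, a subject with few questions counts as much as a subject with many; weighting by question count would need data that is not in this excerpt.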

Collected Summary
Dataset Introduction
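Construction
This dataset was generated automatically while the Open LLM Leaderboard evaluated the model TeeZee/DarkSapling-7B-v2.0. It consists of 63 configurations, each corresponding to one evaluated task. It comes from a single run. Each run's results are stored in every configuration as a split named after the run's timestamp, while the 'train' split always points to the latest results. An additional configuration named 'results' stores the aggregated results of the run and is used to compute and display the aggregate metrics on the leaderboard.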
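The relationship between run timestamps and split names can be sketched in a few lines. This is an illustration under assumptions, not the leaderboard's own code: the function names are hypothetical, the second timestamp in the example is invented, and the rule that '-' and ':' become '_' in split names is an assumed convention that should be checked against the dataset's actual split list.

```python
from datetime import datetime


def split_name_for_run(run_timestamp: str) -> str:
    """Turn an ISO run timestamp into a split name.

    ASSUMPTION: '-' and ':' are replaced by '_'; verify against the real splits.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def latest_run(run_timestamps):
    """Return the most recent timestamp, comparing parsed datetimes."""
    return max(run_timestamps, key=datetime.fromisoformat)


runs = [
    "2024-03-10T04:57:12.333081",  # the run named in this dataset card
    "2024-01-05T10:00:00.000000",  # hypothetical earlier run, for illustration only
]

newest = latest_run(runs)
print(newest, "->", split_name_for_run(newest))
```

With a single run, as in this dataset, the timestamped split and the 'train' split hold the same results.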
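Features
Each configuration corresponds to exactly one evaluation task, so researchers can fetch detailed results for a specific task on demand. Timestamped splits preserve the results of earlier runs, which supports comparison across runs and reproduction of past analyses. The 'results' configuration adds an aggregated view that gives a quick overall picture of the model's performance across tasks.
Usage
The data can be loaded with the Hugging Face datasets library. For example, calling load_dataset with the dataset name, a task configuration (such as 'harness_winogrande_5'), and a split (such as 'train') returns the latest results. Historical data is available through the corresponding timestamped split. This supports both in-depth analysis of a single task and combining results from several tasks for an overall assessment, which helps when reproducing or comparing model performance.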
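The loading step described above can be wrapped in a small helper. This is a sketch, assuming the `datasets` package is installed and the repository is reachable; the repository ID is the one used in the dataset card's own snippet, while this page's header lists the dataset under the `open-llm-leaderboard-old` namespace, so the ID is kept as a parameter. The helper name `load_task_details` is hypothetical.

```python
# Repository ID as written in the dataset card's snippet. The page header shows
# an "open-llm-leaderboard-old/..." namespace, so pass that ID if this one fails.
REPO_ID = "open-llm-leaderboard/details_TeeZee__DarkSapling-7B-v2.0"


def load_task_details(config_name="harness_winogrande_5", split="train", repo_id=REPO_ID):
    """Load the per-example details for one task configuration.

    `split="train"` points to the latest run; a timestamped split name can be
    passed instead to fetch a specific run.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split=split)


# Example call (needs network access and the `datasets` package):
# details = load_task_details()
# print(details)
```

Passing `config_name="results"` would target the aggregated configuration instead of a single task.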
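Background and Challenges
Background
As large language models (LLMs) have developed rapidly, evaluating their capabilities systematically and fairly across many dimensions has become a central question for the field. Against this background, the Hugging Face community launched the Open LLM Leaderboard project in 2023 to provide a standardized evaluation platform for open-source models. This dataset was created by the Hugging Face team (contact: Clémentine) in March 2024 and records the detailed performance of TeeZee/DarkSapling-7B-v2.0 on 63 evaluation tasks, covering benchmarks such as the ARC challenge set, HellaSwag commonsense reasoning, MMLU multi-subject knowledge, TruthfulQA truthfulness, Winogrande coreference resolution, and GSM8K mathematical reasoning. The dataset gives model developers fine-grained performance feedback, and its reproducible evaluation process supports transparent competition and iteration in the open-source LLM community.
Current Challenges
The domain challenge this dataset addresses is making LLM evaluation both comprehensive and fair. Single-task evaluation struggles to reflect a model's real ability across reasoning, knowledge, truthfulness, and other dimensions. The Open LLM Leaderboard builds a standardized evaluation suite by combining 63 heterogeneous tasks, but differences in task difficulty and the consistency of scoring criteria remain open problems. On the construction side, the challenges lie in automating the data pipeline and managing versions: the results of the 63 configurations from each evaluation run (such as 2024-03-10T04:57:12.333081) must aggregate cleanly into a unified Parquet format, while the "latest" split must always point to the newest results. Both requirements place strict demands on pipeline robustness and timestamp alignment.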
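Common Scenarios
Typical Use Cases
In LLM evaluation research, the Open LLM Leaderboard dataset serves as a core benchmark. It is built for model evaluation and covers diverse tasks such as ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, spanning reasoning, commonsense, knowledge, truthfulness, and mathematical ability. Researchers often use its fine-grained configuration and split structure to analyze a model such as DarkSapling-7B-v2.0 along several dimensions, loading the detailed results for a specific task (such as harness_winogrande_5) to examine where the model does well or poorly on that subtask.
Academic Problems Addressed
The dataset responds to the lack of a standardized, reproducible measurement framework in LLM evaluation. Comparisons of model performance have often been muddled by inconsistent benchmarks. By providing uniform task configurations, detailed metrics (such as accuracy and its standard error), and timestamped splits, the Open LLM Leaderboard dataset makes fair comparison across models and over time possible. It addresses the problem of systematically measuring a model's ability in areas such as reasoning, commonsense understanding, and arithmetic, giving the academic community a transparent and rigorous evaluation paradigm and helping to standardize research on LLM performance.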
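Reporting a standard error alongside each accuracy makes it possible to attach an uncertainty range to a score. The sketch below assumes a normal approximation (z = 1.96 for roughly 95% coverage), which is a common convention and not something the dataset card prescribes; the function name is hypothetical, and the inputs are the ARC-Challenge `acc_norm` and `acc_norm_stderr` values from the results.

```python
def confidence_interval(score: float, stderr: float, z: float = 1.96):
    """Approximate interval score ± z * stderr, clipped to the range [0, 1].

    ASSUMPTION: normal approximation; z = 1.96 corresponds to about 95% coverage.
    """
    half_width = z * stderr
    return max(0.0, score - half_width), min(1.0, score + half_width)


# ARC-Challenge (25-shot) normalized accuracy and its standard error.
low, high = confidence_interval(0.6416382252559727, 0.014012883334859857)
print(f"{low:.4f} .. {high:.4f}")  # 0.6142 .. 0.6691
```

Two models whose intervals overlap heavily on a task are hard to rank on that task alone, which is one reason the standard errors are worth keeping next to the accuracies.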
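Derived and Related Work
The dataset has prompted derivative research on LLM evaluation. Its fine-grained task configurations have encouraged work such as task difficulty analysis and mining of model failure modes: by comparing how different models perform on specific subtasks (such as high school physics or college mathematics), researchers have exposed biases in how model knowledge is distributed. The leaderboard's public ranking mechanism has spurred work on evaluation robustness and data leakage detection, including approaches such as cross-validated evaluation frameworks and adversarial benchmark construction. This research deepens understanding of model capabilities and feeds back into iterations of the dataset itself, forming a virtuous cycle for the evaluation ecosystem.
The content above was collected and summarized by 遇见数据集.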