
open-llm-leaderboard-old/details_postbot__emailgen-pythia-410m-deduped

Hugging Face | Updated 2023-11-13 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_postbot__emailgen-pythia-410m-deduped
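The dataset card that follows uses three related spellings for every evaluated task: a harness task key such as `harness|arc:challenge|25` in the results, a configuration name such as `harness_arc_challenge_25`, and a parquet glob that embeds the run timestamp. As a minimal sketch, the helpers below reproduce that naming. The rules are inferred only from the names visible in this card, and the function names `config_name`, `split_name`, and `parquet_glob` are hypothetical, not part of the leaderboard tooling.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a harness task key into a configuration name.

    Assumption: every run of non-alphanumeric characters becomes a single "_",
    which matches the config names listed in the card.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name used in the card.

    Assumption: "-" and ":" become "_", while the fractional-second "." is kept.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Build the data_files glob listed for one task and one run.

    Assumption: the file name keeps the raw task key and replaces ":" with "-"
    in the timestamp only.
    """
    file_timestamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_timestamp}.parquet"


if __name__ == "__main__":
    run = "2023-11-13T15:24:35.622872"
    for key in ("harness|arc:challenge|25", "harness|hendrycksTest-anatomy|5"):
        print(config_name(key), split_name(run), parquet_glob(key, run))
```

A configuration name produced this way is what would be passed as the second argument to `load_dataset`, with either the timestamp split or `latest` as the split.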
Resource description:
--- pretty_name: Evaluation run of postbot/emailgen-pythia-410m-deduped dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [postbot/emailgen-pythia-410m-deduped](https://huggingface.co/postbot/emailgen-pythia-410m-deduped)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_postbot__emailgen-pythia-410m-deduped_public\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-11-13T15:24:35.622872](https://huggingface.co/datasets/open-llm-leaderboard/details_postbot__emailgen-pythia-410m-deduped_public/blob/main/results_2023-11-13T15-24-35.622872.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.2739821268942055,\n\ \ \"acc_stderr\": 0.031358822799769724,\n \"acc_norm\": 0.2757926465489037,\n\ \ \"acc_norm_stderr\": 0.03219166127988676,\n \"mc1\": 0.22276621787025705,\n\ \ \"mc1_stderr\": 0.01456650696139673,\n \"mc2\": 0.3819742528315203,\n\ \ \"mc2_stderr\": 0.015246089965112817,\n \"em\": 0.00020973154362416107,\n\ \ \"em_stderr\": 0.00014829481977280738,\n \"f1\": 0.009905620805369138,\n\ \ \"f1_stderr\": 0.0005041998138971091\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.2593856655290102,\n \"acc_stderr\": 0.012808273573927102,\n\ \ \"acc_norm\": 0.2790102389078498,\n \"acc_norm_stderr\": 0.013106784883601333\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.34027086237801235,\n\ \ \"acc_stderr\": 0.004728318577835236,\n \"acc_norm\": 0.4004182433778132,\n\ \ \"acc_norm_stderr\": 0.00488981748973969\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.21,\n \"acc_stderr\": 0.040936018074033256,\n \ \ \"acc_norm\": 0.21,\n \"acc_norm_stderr\": 0.040936018074033256\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.2518518518518518,\n\ \ \"acc_stderr\": 0.037498507091740234,\n \"acc_norm\": 0.2518518518518518,\n\ \ \"acc_norm_stderr\": 0.037498507091740234\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.2894736842105263,\n \"acc_stderr\": 0.03690677986137283,\n\ \ \"acc_norm\": 0.2894736842105263,\n \"acc_norm_stderr\": 0.03690677986137283\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.23,\n\ \ \"acc_stderr\": 0.04229525846816506,\n \"acc_norm\": 0.23,\n \ \ \"acc_norm_stderr\": 0.04229525846816506\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.33584905660377357,\n \"acc_stderr\": 0.029067220146644826,\n\ \ \"acc_norm\": 0.33584905660377357,\n \"acc_norm_stderr\": 0.029067220146644826\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": 
{\n \"acc\": 0.2569444444444444,\n\ \ \"acc_stderr\": 0.036539469694421,\n \"acc_norm\": 0.2569444444444444,\n\ \ \"acc_norm_stderr\": 0.036539469694421\n },\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.25,\n \"acc_stderr\": 0.04351941398892446,\n \ \ \"acc_norm\": 0.25,\n \"acc_norm_stderr\": 0.04351941398892446\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.39,\n \"acc_stderr\": 0.04902071300001976,\n \"acc_norm\": 0.39,\n\ \ \"acc_norm_stderr\": 0.04902071300001976\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.2658959537572254,\n\ \ \"acc_stderr\": 0.033687629322594316,\n \"acc_norm\": 0.2658959537572254,\n\ \ \"acc_norm_stderr\": 0.033687629322594316\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.3137254901960784,\n \"acc_stderr\": 0.04617034827006717,\n\ \ \"acc_norm\": 0.3137254901960784,\n \"acc_norm_stderr\": 0.04617034827006717\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.21,\n \"acc_stderr\": 0.04093601807403326,\n \"acc_norm\": 0.21,\n\ \ \"acc_norm_stderr\": 0.04093601807403326\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.2978723404255319,\n \"acc_stderr\": 0.029896145682095455,\n\ \ \"acc_norm\": 0.2978723404255319,\n \"acc_norm_stderr\": 0.029896145682095455\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.23684210526315788,\n\ \ \"acc_stderr\": 0.039994238792813344,\n \"acc_norm\": 0.23684210526315788,\n\ \ \"acc_norm_stderr\": 0.039994238792813344\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.2206896551724138,\n \"acc_stderr\": 0.034559302019248096,\n\ \ \"acc_norm\": 0.2206896551724138,\n \"acc_norm_stderr\": 0.034559302019248096\n\ \ },\n 
\"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.25132275132275134,\n \"acc_stderr\": 0.022340482339643895,\n \"\ acc_norm\": 0.25132275132275134,\n \"acc_norm_stderr\": 0.022340482339643895\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.3333333333333333,\n\ \ \"acc_stderr\": 0.04216370213557836,\n \"acc_norm\": 0.3333333333333333,\n\ \ \"acc_norm_stderr\": 0.04216370213557836\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.22903225806451613,\n\ \ \"acc_stderr\": 0.02390491431178265,\n \"acc_norm\": 0.22903225806451613,\n\ \ \"acc_norm_stderr\": 0.02390491431178265\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.27586206896551724,\n \"acc_stderr\": 0.031447125816782426,\n\ \ \"acc_norm\": 0.27586206896551724,\n \"acc_norm_stderr\": 0.031447125816782426\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.25,\n \"acc_stderr\": 0.04351941398892446,\n \"acc_norm\"\ : 0.25,\n \"acc_norm_stderr\": 0.04351941398892446\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.26666666666666666,\n \"acc_stderr\": 0.03453131801885415,\n\ \ \"acc_norm\": 0.26666666666666666,\n \"acc_norm_stderr\": 0.03453131801885415\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.3181818181818182,\n \"acc_stderr\": 0.03318477333845331,\n \"\ acc_norm\": 0.3181818181818182,\n \"acc_norm_stderr\": 0.03318477333845331\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.35751295336787564,\n \"acc_stderr\": 0.034588160421810045,\n\ \ \"acc_norm\": 0.35751295336787564,\n \"acc_norm_stderr\": 0.034588160421810045\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 
0.36153846153846153,\n \"acc_stderr\": 0.024359581465396987,\n\ \ \"acc_norm\": 0.36153846153846153,\n \"acc_norm_stderr\": 0.024359581465396987\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.2740740740740741,\n \"acc_stderr\": 0.027195934804085622,\n \ \ \"acc_norm\": 0.2740740740740741,\n \"acc_norm_stderr\": 0.027195934804085622\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.3445378151260504,\n \"acc_stderr\": 0.03086868260412163,\n \ \ \"acc_norm\": 0.3445378151260504,\n \"acc_norm_stderr\": 0.03086868260412163\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.31788079470198677,\n \"acc_stderr\": 0.038020397601079024,\n \"\ acc_norm\": 0.31788079470198677,\n \"acc_norm_stderr\": 0.038020397601079024\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.344954128440367,\n \"acc_stderr\": 0.02038060540506697,\n \"acc_norm\"\ : 0.344954128440367,\n \"acc_norm_stderr\": 0.02038060540506697\n },\n\ \ \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\": 0.4166666666666667,\n\ \ \"acc_stderr\": 0.033622774366080424,\n \"acc_norm\": 0.4166666666666667,\n\ \ \"acc_norm_stderr\": 0.033622774366080424\n },\n \"harness|hendrycksTest-high_school_us_history|5\"\ : {\n \"acc\": 0.2549019607843137,\n \"acc_stderr\": 0.03058759135160425,\n\ \ \"acc_norm\": 0.2549019607843137,\n \"acc_norm_stderr\": 0.03058759135160425\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.2109704641350211,\n \"acc_stderr\": 0.02655837250266192,\n \ \ \"acc_norm\": 0.2109704641350211,\n \"acc_norm_stderr\": 0.02655837250266192\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.12556053811659193,\n\ \ \"acc_stderr\": 0.022238985469323774,\n \"acc_norm\": 0.12556053811659193,\n\ \ \"acc_norm_stderr\": 0.022238985469323774\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.366412213740458,\n \"acc_stderr\": 
0.04225875451969638,\n\ \ \"acc_norm\": 0.366412213740458,\n \"acc_norm_stderr\": 0.04225875451969638\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.23140495867768596,\n \"acc_stderr\": 0.0384985609879409,\n \"\ acc_norm\": 0.23140495867768596,\n \"acc_norm_stderr\": 0.0384985609879409\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.23148148148148148,\n\ \ \"acc_stderr\": 0.04077494709252628,\n \"acc_norm\": 0.23148148148148148,\n\ \ \"acc_norm_stderr\": 0.04077494709252628\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.25766871165644173,\n \"acc_stderr\": 0.03436150827846917,\n\ \ \"acc_norm\": 0.25766871165644173,\n \"acc_norm_stderr\": 0.03436150827846917\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.15178571428571427,\n\ \ \"acc_stderr\": 0.034057028381856945,\n \"acc_norm\": 0.15178571428571427,\n\ \ \"acc_norm_stderr\": 0.034057028381856945\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.36893203883495146,\n \"acc_stderr\": 0.047776151811567386,\n\ \ \"acc_norm\": 0.36893203883495146,\n \"acc_norm_stderr\": 0.047776151811567386\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.21367521367521367,\n\ \ \"acc_stderr\": 0.026853450377009154,\n \"acc_norm\": 0.21367521367521367,\n\ \ \"acc_norm_stderr\": 0.026853450377009154\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.27,\n \"acc_stderr\": 0.04461960433384741,\n \ \ \"acc_norm\": 0.27,\n \"acc_norm_stderr\": 0.04461960433384741\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.22988505747126436,\n\ \ \"acc_stderr\": 0.015046301846691807,\n \"acc_norm\": 0.22988505747126436,\n\ \ \"acc_norm_stderr\": 0.015046301846691807\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.21098265895953758,\n \"acc_stderr\": 0.021966309947043117,\n\ \ \"acc_norm\": 0.21098265895953758,\n \"acc_norm_stderr\": 0.021966309947043117\n\ \ },\n 
\"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.27150837988826815,\n\ \ \"acc_stderr\": 0.014874252168095273,\n \"acc_norm\": 0.27150837988826815,\n\ \ \"acc_norm_stderr\": 0.014874252168095273\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.2647058823529412,\n \"acc_stderr\": 0.025261691219729498,\n\ \ \"acc_norm\": 0.2647058823529412,\n \"acc_norm_stderr\": 0.025261691219729498\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.2347266881028939,\n\ \ \"acc_stderr\": 0.024071805887677045,\n \"acc_norm\": 0.2347266881028939,\n\ \ \"acc_norm_stderr\": 0.024071805887677045\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.2345679012345679,\n \"acc_stderr\": 0.023576881744005705,\n\ \ \"acc_norm\": 0.2345679012345679,\n \"acc_norm_stderr\": 0.023576881744005705\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.24113475177304963,\n \"acc_stderr\": 0.02551873104953776,\n \ \ \"acc_norm\": 0.24113475177304963,\n \"acc_norm_stderr\": 0.02551873104953776\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.25097783572359844,\n\ \ \"acc_stderr\": 0.01107373029918723,\n \"acc_norm\": 0.25097783572359844,\n\ \ \"acc_norm_stderr\": 0.01107373029918723\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.4227941176470588,\n \"acc_stderr\": 0.030008562845003476,\n\ \ \"acc_norm\": 0.4227941176470588,\n \"acc_norm_stderr\": 0.030008562845003476\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.24183006535947713,\n \"acc_stderr\": 0.017322789207784326,\n \ \ \"acc_norm\": 0.24183006535947713,\n \"acc_norm_stderr\": 0.017322789207784326\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.19090909090909092,\n\ \ \"acc_stderr\": 0.03764425585984926,\n \"acc_norm\": 0.19090909090909092,\n\ \ \"acc_norm_stderr\": 0.03764425585984926\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.4,\n 
\"acc_stderr\": 0.031362502409358936,\n \ \ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.031362502409358936\n \ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.2537313432835821,\n\ \ \"acc_stderr\": 0.030769444967296028,\n \"acc_norm\": 0.2537313432835821,\n\ \ \"acc_norm_stderr\": 0.030769444967296028\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.26,\n \"acc_stderr\": 0.04408440022768078,\n \ \ \"acc_norm\": 0.26,\n \"acc_norm_stderr\": 0.04408440022768078\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.25301204819277107,\n\ \ \"acc_stderr\": 0.033844291552331346,\n \"acc_norm\": 0.25301204819277107,\n\ \ \"acc_norm_stderr\": 0.033844291552331346\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.22807017543859648,\n \"acc_stderr\": 0.032180937956023566,\n\ \ \"acc_norm\": 0.22807017543859648,\n \"acc_norm_stderr\": 0.032180937956023566\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.22276621787025705,\n\ \ \"mc1_stderr\": 0.01456650696139673,\n \"mc2\": 0.3819742528315203,\n\ \ \"mc2_stderr\": 0.015246089965112817\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.5209155485398579,\n \"acc_stderr\": 0.014040185494212947\n\ \ },\n \"harness|drop|3\": {\n \"em\": 0.00020973154362416107,\n \ \ \"em_stderr\": 0.00014829481977280738,\n \"f1\": 0.009905620805369138,\n\ \ \"f1_stderr\": 0.0005041998138971091\n },\n \"harness|gsm8k|5\":\ \ {\n \"acc\": 0.0,\n \"acc_stderr\": 0.0\n }\n}\n```" repo_url: https://huggingface.co/postbot/emailgen-pythia-410m-deduped leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|arc:challenge|25_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-11-13T15-24-35.622872.parquet' - config_name: harness_drop_3 data_files: - split: 
2023_11_13T15_24_35.622872 path: - '**/details_harness|drop|3_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|drop|3_2023-11-13T15-24-35.622872.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|gsm8k|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hellaswag|10_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-11-13T15-24-35.622872.parquet' - 
'**/details_harness|hendrycksTest-econometrics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-11-13T15-24-35.622872.parquet' - 
'**/details_harness|hendrycksTest-human_sexuality|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-management|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-11-13T15-24-35.622872.parquet' - 
'**/details_harness|hendrycksTest-virology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-11-13T15-24-35.622872.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-management|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-11-13T15-24-35.622872.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-11-13T15-24-35.622872.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - 
'**/details_harness|hendrycksTest-anatomy|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-college_physics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-11-13T15-24-35.622872.parquet' - config_name: 
harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 
2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-11-13T15-24-35.622872.parquet' - 
config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_world_history|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-international_law|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-11-13T15-24-35.622872.parquet' - config_name: 
harness_hendrycksTest_management_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-management|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-marketing|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-11-13T15-24-35.622872.parquet' 
- split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-professional_psychology|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-sociology|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-virology|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-11-13T15-24-35.622872.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 
2023_11_13T15_24_35.622872 path: - '**/details_harness|truthfulqa:mc|0_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-11-13T15-24-35.622872.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_11_13T15_24_35.622872 path: - '**/details_harness|winogrande|5_2023-11-13T15-24-35.622872.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-11-13T15-24-35.622872.parquet' - config_name: results data_files: - split: 2023_11_13T15_24_35.622872 path: - results_2023-11-13T15-24-35.622872.parquet - split: latest path: - results_2023-11-13T15-24-35.622872.parquet ---

# Dataset Card for Evaluation run of postbot/emailgen-pythia-410m-deduped

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/postbot/emailgen-pythia-410m-deduped
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [postbot/emailgen-pythia-410m-deduped](https://huggingface.co/postbot/emailgen-pythia-410m-deduped) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the most recent results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can, for instance, do the following:

```python
from datasets import load_dataset

# The configuration metadata defines a timestamped split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_postbot__emailgen-pythia-410m-deduped_public",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-11-13T15:24:35.622872](https://huggingface.co/datasets/open-llm-leaderboard/details_postbot__emailgen-pythia-410m-deduped_public/blob/main/results_2023-11-13T15-24-35.622872.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split of each eval):

```python
{ "all": { "acc": 0.2739821268942055, "acc_stderr": 0.031358822799769724, "acc_norm": 0.2757926465489037, "acc_norm_stderr": 0.03219166127988676, "mc1": 0.22276621787025705, "mc1_stderr": 0.01456650696139673, "mc2": 0.3819742528315203, "mc2_stderr": 0.015246089965112817, "em": 0.00020973154362416107, "em_stderr": 0.00014829481977280738, "f1": 0.009905620805369138, "f1_stderr": 0.0005041998138971091 }, "harness|arc:challenge|25": { "acc": 0.2593856655290102, "acc_stderr": 0.012808273573927102, "acc_norm": 0.2790102389078498, "acc_norm_stderr": 0.013106784883601333 }, "harness|hellaswag|10": { "acc": 0.34027086237801235, "acc_stderr": 0.004728318577835236, "acc_norm": 0.4004182433778132, "acc_norm_stderr": 0.00488981748973969 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.21, "acc_stderr": 0.040936018074033256, "acc_norm": 0.21, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.2518518518518518, "acc_stderr": 0.037498507091740234, "acc_norm": 0.2518518518518518, "acc_norm_stderr": 0.037498507091740234 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.2894736842105263, "acc_stderr": 0.03690677986137283, "acc_norm": 0.2894736842105263, "acc_norm_stderr": 0.03690677986137283 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.23,
"acc_stderr": 0.04229525846816506, "acc_norm": 0.23, "acc_norm_stderr": 0.04229525846816506 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.33584905660377357, "acc_stderr": 0.029067220146644826, "acc_norm": 0.33584905660377357, "acc_norm_stderr": 0.029067220146644826 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.2569444444444444, "acc_stderr": 0.036539469694421, "acc_norm": 0.2569444444444444, "acc_norm_stderr": 0.036539469694421 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.25, "acc_stderr": 0.04351941398892446, "acc_norm": 0.25, "acc_norm_stderr": 0.04351941398892446 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.39, "acc_stderr": 0.04902071300001976, "acc_norm": 0.39, "acc_norm_stderr": 0.04902071300001976 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.2658959537572254, "acc_stderr": 0.033687629322594316, "acc_norm": 0.2658959537572254, "acc_norm_stderr": 0.033687629322594316 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.3137254901960784, "acc_stderr": 0.04617034827006717, "acc_norm": 0.3137254901960784, "acc_norm_stderr": 0.04617034827006717 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.21, "acc_stderr": 0.04093601807403326, "acc_norm": 0.21, "acc_norm_stderr": 0.04093601807403326 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.2978723404255319, "acc_stderr": 0.029896145682095455, "acc_norm": 0.2978723404255319, "acc_norm_stderr": 0.029896145682095455 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.23684210526315788, "acc_stderr": 0.039994238792813344, "acc_norm": 0.23684210526315788, "acc_norm_stderr": 0.039994238792813344 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.2206896551724138, "acc_stderr": 0.034559302019248096, "acc_norm": 0.2206896551724138, "acc_norm_stderr": 
0.034559302019248096 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.25132275132275134, "acc_stderr": 0.022340482339643895, "acc_norm": 0.25132275132275134, "acc_norm_stderr": 0.022340482339643895 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.3333333333333333, "acc_stderr": 0.04216370213557836, "acc_norm": 0.3333333333333333, "acc_norm_stderr": 0.04216370213557836 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.22903225806451613, "acc_stderr": 0.02390491431178265, "acc_norm": 0.22903225806451613, "acc_norm_stderr": 0.02390491431178265 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.27586206896551724, "acc_stderr": 0.031447125816782426, "acc_norm": 0.27586206896551724, "acc_norm_stderr": 0.031447125816782426 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.25, "acc_stderr": 0.04351941398892446, "acc_norm": 0.25, "acc_norm_stderr": 0.04351941398892446 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.26666666666666666, "acc_stderr": 0.03453131801885415, "acc_norm": 0.26666666666666666, "acc_norm_stderr": 0.03453131801885415 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.3181818181818182, "acc_stderr": 0.03318477333845331, "acc_norm": 0.3181818181818182, "acc_norm_stderr": 0.03318477333845331 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.35751295336787564, "acc_stderr": 0.034588160421810045, "acc_norm": 0.35751295336787564, "acc_norm_stderr": 0.034588160421810045 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.36153846153846153, "acc_stderr": 0.024359581465396987, "acc_norm": 0.36153846153846153, "acc_norm_stderr": 0.024359581465396987 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.2740740740740741, "acc_stderr": 0.027195934804085622, "acc_norm": 
0.2740740740740741, "acc_norm_stderr": 0.027195934804085622 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.3445378151260504, "acc_stderr": 0.03086868260412163, "acc_norm": 0.3445378151260504, "acc_norm_stderr": 0.03086868260412163 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.31788079470198677, "acc_stderr": 0.038020397601079024, "acc_norm": 0.31788079470198677, "acc_norm_stderr": 0.038020397601079024 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.344954128440367, "acc_stderr": 0.02038060540506697, "acc_norm": 0.344954128440367, "acc_norm_stderr": 0.02038060540506697 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4166666666666667, "acc_stderr": 0.033622774366080424, "acc_norm": 0.4166666666666667, "acc_norm_stderr": 0.033622774366080424 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.2549019607843137, "acc_stderr": 0.03058759135160425, "acc_norm": 0.2549019607843137, "acc_norm_stderr": 0.03058759135160425 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.2109704641350211, "acc_stderr": 0.02655837250266192, "acc_norm": 0.2109704641350211, "acc_norm_stderr": 0.02655837250266192 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.12556053811659193, "acc_stderr": 0.022238985469323774, "acc_norm": 0.12556053811659193, "acc_norm_stderr": 0.022238985469323774 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.366412213740458, "acc_stderr": 0.04225875451969638, "acc_norm": 0.366412213740458, "acc_norm_stderr": 0.04225875451969638 }, "harness|hendrycksTest-international_law|5": { "acc": 0.23140495867768596, "acc_stderr": 0.0384985609879409, "acc_norm": 0.23140495867768596, "acc_norm_stderr": 0.0384985609879409 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.23148148148148148, "acc_stderr": 0.04077494709252628, "acc_norm": 0.23148148148148148, "acc_norm_stderr": 0.04077494709252628 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.25766871165644173, 
"acc_stderr": 0.03436150827846917, "acc_norm": 0.25766871165644173, "acc_norm_stderr": 0.03436150827846917 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.15178571428571427, "acc_stderr": 0.034057028381856945, "acc_norm": 0.15178571428571427, "acc_norm_stderr": 0.034057028381856945 }, "harness|hendrycksTest-management|5": { "acc": 0.36893203883495146, "acc_stderr": 0.047776151811567386, "acc_norm": 0.36893203883495146, "acc_norm_stderr": 0.047776151811567386 }, "harness|hendrycksTest-marketing|5": { "acc": 0.21367521367521367, "acc_stderr": 0.026853450377009154, "acc_norm": 0.21367521367521367, "acc_norm_stderr": 0.026853450377009154 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.27, "acc_stderr": 0.04461960433384741, "acc_norm": 0.27, "acc_norm_stderr": 0.04461960433384741 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.22988505747126436, "acc_stderr": 0.015046301846691807, "acc_norm": 0.22988505747126436, "acc_norm_stderr": 0.015046301846691807 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.21098265895953758, "acc_stderr": 0.021966309947043117, "acc_norm": 0.21098265895953758, "acc_norm_stderr": 0.021966309947043117 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.27150837988826815, "acc_stderr": 0.014874252168095273, "acc_norm": 0.27150837988826815, "acc_norm_stderr": 0.014874252168095273 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.2647058823529412, "acc_stderr": 0.025261691219729498, "acc_norm": 0.2647058823529412, "acc_norm_stderr": 0.025261691219729498 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.2347266881028939, "acc_stderr": 0.024071805887677045, "acc_norm": 0.2347266881028939, "acc_norm_stderr": 0.024071805887677045 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.2345679012345679, "acc_stderr": 0.023576881744005705, "acc_norm": 0.2345679012345679, "acc_norm_stderr": 0.023576881744005705 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.24113475177304963, "acc_stderr": 
0.02551873104953776, "acc_norm": 0.24113475177304963, "acc_norm_stderr": 0.02551873104953776 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.25097783572359844, "acc_stderr": 0.01107373029918723, "acc_norm": 0.25097783572359844, "acc_norm_stderr": 0.01107373029918723 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.4227941176470588, "acc_stderr": 0.030008562845003476, "acc_norm": 0.4227941176470588, "acc_norm_stderr": 0.030008562845003476 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.24183006535947713, "acc_stderr": 0.017322789207784326, "acc_norm": 0.24183006535947713, "acc_norm_stderr": 0.017322789207784326 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.19090909090909092, "acc_stderr": 0.03764425585984926, "acc_norm": 0.19090909090909092, "acc_norm_stderr": 0.03764425585984926 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.4, "acc_stderr": 0.031362502409358936, "acc_norm": 0.4, "acc_norm_stderr": 0.031362502409358936 }, "harness|hendrycksTest-sociology|5": { "acc": 0.2537313432835821, "acc_stderr": 0.030769444967296028, "acc_norm": 0.2537313432835821, "acc_norm_stderr": 0.030769444967296028 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.26, "acc_stderr": 0.04408440022768078, "acc_norm": 0.26, "acc_norm_stderr": 0.04408440022768078 }, "harness|hendrycksTest-virology|5": { "acc": 0.25301204819277107, "acc_stderr": 0.033844291552331346, "acc_norm": 0.25301204819277107, "acc_norm_stderr": 0.033844291552331346 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.22807017543859648, "acc_stderr": 0.032180937956023566, "acc_norm": 0.22807017543859648, "acc_norm_stderr": 0.032180937956023566 }, "harness|truthfulqa:mc|0": { "mc1": 0.22276621787025705, "mc1_stderr": 0.01456650696139673, "mc2": 0.3819742528315203, "mc2_stderr": 0.015246089965112817 }, "harness|winogrande|5": { "acc": 0.5209155485398579, "acc_stderr": 0.014040185494212947 }, "harness|drop|3": { "em": 0.00020973154362416107, 
"em_stderr": 0.00014829481977280738, "f1": 0.009905620805369138, "f1_stderr": 0.0005041998138971091 }, "harness|gsm8k|5": { "acc": 0.0, "acc_stderr": 0.0 } } ``` ### Supported Tasks and Leaderboards [More Information Needed] ### Languages [More Information Needed] ## Dataset Structure ### Data Instances [More Information Needed] ### Data Fields [More Information Needed] ### Data Splits [More Information Needed] ## Dataset Creation ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization [More Information Needed] #### Who are the source language producers? [More Information Needed] ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? [More Information Needed] ### Personal and Sensitive Information [More Information Needed] ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed] ### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators [More Information Needed] ### Licensing Information [More Information Needed] ### Citation Information [More Information Needed] ### Contributions [More Information Needed]
Provider:
open-llm-leaderboard-old
Summary of original information

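Dataset overview

Dataset introduction

The dataset was created automatically during the evaluation run of the model postbot/emailgen-pythia-410m-deduped on the Open LLM Leaderboard. It contains 64 configurations, each corresponding to one evaluated task.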

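Dataset structure

The dataset was created from 1 run. Each run is stored as a separate split in every configuration, and the split is named after the run's timestamp. The "latest" split always points to the most recent results.

Additional configuration

An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.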

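Data loading example

```python
from datasets import load_dataset

# The configuration metadata defines a timestamped split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_postbot__emailgen-pythia-410m-deduped",
    "harness_winogrande_5",
    split="latest",
)
```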

Latest results

The latest results come from run 2023-11-13T15:24:35.622872. The full per-task metrics (acc, acc_norm, mc1, mc2, em, f1 and their standard errors) are listed under "Latest results" in the dataset card above.

Summary
Dataset Introduction
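Construction
This dataset was created automatically to record the evaluation results of the model postbot/emailgen-pythia-410m-deduped on the Open LLM Leaderboard. It consists of 64 configurations, each corresponding to one evaluated task. The data comes from a single run. Each run is stored in its configuration as a split named after the run's timestamp, and the "train" split always points to the latest results. A separate "results" configuration stores the aggregated metrics of the run, which are used to compute and display the aggregate scores on the leaderboard.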
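Features
The dataset's key feature is its structured organization, which records the model's fine-grained performance on multiple benchmark tasks. Each task configuration is stored independently, which makes it easy to analyze a specific capability (such as common-sense reasoning, mathematical computation, or domain knowledge) in depth. Timestamped splits preserve the history of evaluation runs, while the "latest" split is updated automatically to point to the most recent results, so the data stays both current and traceable. The aggregated "results" configuration gives an overview of overall performance, letting researchers quickly compare a model's strengths and weaknesses along different dimensions.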
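Usage
Researchers can load the required configuration with the Hugging Face datasets library. For example, calling load_dataset with a task name (such as "harness_winogrande_5") and a split (such as "train") returns the latest detailed results for that task. To go back to an earlier run, select the split named after the corresponding timestamp. The dataset is stored in Parquet format, which fits efficient data-processing pipelines and suits later performance analysis, model comparison, or visualization.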
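The loading step described above can be sketched as follows. The repository name and configuration follow the snippet in the dataset card. The helper names `details_repo_id` and `load_task_details` are hypothetical, and the repository may have been moved or renamed since the card was written (this page lists it under `open-llm-leaderboard-old`), so treat this as a sketch under those assumptions rather than a guaranteed working path.

```python
def details_repo_id(model_id: str, org: str = "open-llm-leaderboard") -> str:
    """Build the details repository name following the pattern in the dataset card.

    Assumption: "/" in the model id becomes "__" and the name ends in "_public",
    as in the card's own load_dataset example.
    """
    return f"{org}/details_{model_id.replace('/', '__')}_public"


def load_task_details(model_id: str,
                      config: str = "harness_winogrande_5",
                      split: str = "train"):
    """Load one task configuration; per the card, "train" points to the latest run."""
    # Imported lazily so details_repo_id works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(details_repo_id(model_id), config, split=split)


if __name__ == "__main__":
    # Only prints the repository name; call load_task_details to download data.
    print(details_repo_id("postbot/emailgen-pythia-410m-deduped"))
```

Passing a timestamp-named split instead of "train" would select an earlier run, provided that split exists in the chosen configuration.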
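Background and Challenges
Background
With the rapid development of large language models (LLMs), systematically evaluating how models perform across diverse tasks has become a central question in the field. Against this background, the Hugging Face team launched the Open LLM Leaderboard in 2023 to build a public, reproducible benchmark for model evaluation. This dataset was generated automatically from the leaderboard's evaluation of the model postbot/emailgen-pythia-410m-deduped, and it records detailed performance metrics across the evaluated tasks, including the 57 MMLU subjects that span areas such as mathematics, medicine, and law. With it, researchers can examine the model's strengths and limitations on specific tasks, which in turn supports improvements to both models and evaluation methods. The dataset gives the LLM community a quantitative reference and promotes standardized, transparent evaluation.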
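Current Challenges
The challenges reflected in this dataset fall into two groups. On the domain side, large language models still face clear bottlenecks on complex reasoning and knowledge-intensive tasks. For example, the model's accuracy on the GSM8K mathematical reasoning task is 0%, and its F1 score on the DROP reading comprehension task is also very low, which points to weak symbolic computation and precise information extraction. On the construction side, keeping evaluation results standardized and reproducible is difficult. The project has to put results from different tasks into a uniform format, manage results from multiple runs across timestamps, and maintain the "latest" split so that it points to the newest data. This requires a highly automated data pipeline to avoid version confusion and information loss. Aggregating and visualizing the model's performance across the 57 MMLU subjects adds further data-processing complexity.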
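Common Scenarios
Typical Use Cases
In large language model (LLM) evaluation, this dataset is an automated evaluation product of the Open LLM Leaderboard and is typically used to measure a model's generalization across multiple kinds of tasks. It covers the benchmarks ARC-Challenge, HellaSwag, MMLU (57 subjects), TruthfulQA, Winogrande, DROP, and GSM8K, which together assess the model on common-sense reasoning, knowledge understanding, mathematical computation, and text generation. Researchers often load a specific configuration (such as harness_winogrande_5) and a timestamped split to reproduce the details of one evaluation run, and then compare different models, or the same model at different training stages, side by side.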
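As a minimal sketch of working with the per-task results, the snippet below parses result keys of the form `harness|<task>|<n_shot>` and averages accuracy per benchmark. The values are a small excerpt copied from the results shown in the dataset card, not the full set. The function names are hypothetical, and `hendrycksTest` is the evaluation harness's name for the MMLU subjects.

```python
from collections import defaultdict
from statistics import mean

# Excerpt of the card's results; the real file has many more entries.
RESULTS = {
    "all": {"acc": 0.2739821268942055},
    "harness|arc:challenge|25": {"acc": 0.2593856655290102},
    "harness|hellaswag|10": {"acc": 0.34027086237801235},
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.31},
    "harness|hendrycksTest-computer_security|5": {"acc": 0.21},
    "harness|hendrycksTest-formal_logic|5": {"acc": 0.3333333333333333},
}


def parse_key(key):
    """Split 'harness|<task>|<n_shot>' into (benchmark, subtask, n_shot)."""
    _, task, shots = key.split("|")
    # MMLU keys look like 'hendrycksTest-<subject>'; other tasks have no '-'.
    benchmark, _, subtask = task.partition("-")
    return benchmark, subtask or None, int(shots)


def mean_acc_by_benchmark(results):
    """Average the 'acc' metric over all entries of each benchmark."""
    grouped = defaultdict(list)
    for key, metrics in results.items():
        if key == "all" or "acc" not in metrics:
            continue  # skip the aggregate entry and tasks without accuracy
        benchmark, _, _ = parse_key(key)
        grouped[benchmark].append(metrics["acc"])
    return {name: mean(values) for name, values in grouped.items()}


if __name__ == "__main__":
    for name, value in mean_acc_by_benchmark(RESULTS).items():
        print(f"{name}: {value:.4f}")
```

Grouping on the text before the first hyphen is a simplification that happens to fit the key names shown in the card; a task name that itself contains a hyphen would need a more careful rule.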
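Derived Work
This dataset has given rise to several lines of derived work, including studies that use its evaluation results to analyze how task performance scales with model size, and explorations that build domain-augmented training data for specific MMLU subjects (such as law and medicine). Some work further uses the TruthfulQA and DROP results to design adversarial training strategies that improve factual consistency. The dataset has also prompted tools for comparing multiple evaluation runs, such as plots of a model's performance fluctuation across timestamps, and methods that build confidence intervals from its statistics (such as acc_stderr) to judge whether performance differences are significant.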
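The confidence-interval idea mentioned above can be sketched as follows, under assumptions that are not stated in the source: a normal approximation for the accuracy estimate, independence between the two scores being compared, and a chance level of 0.25 for four-choice MMLU questions. The function names are hypothetical; the accuracy and standard-error values are copied from the card's results for two MMLU subjects.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def confidence_interval(acc, stderr, z=Z_95):
    """Normal-approximation confidence interval: acc +/- z * stderr."""
    return acc - z * stderr, acc + z * stderr


def z_score_difference(acc_a, stderr_a, acc_b, stderr_b):
    """z statistic for the difference of two estimates assumed independent."""
    return (acc_a - acc_b) / math.hypot(stderr_a, stderr_b)


if __name__ == "__main__":
    # Values from the card: college_mathematics and computer_security (5-shot).
    math_acc, math_se = 0.31, 0.04648231987117316
    sec_acc, sec_se = 0.21, 0.04093601807403326

    low, high = confidence_interval(math_acc, math_se)
    print(f"college_mathematics 95% CI: [{low:.3f}, {high:.3f}]")
    # Assumed chance level for four-choice questions.
    print("interval contains chance (0.25):", low <= 0.25 <= high)

    z = z_score_difference(math_acc, math_se, sec_acc, sec_se)
    print(f"z for the difference: {z:.2f}; significant at 5%: {abs(z) > Z_95}")
```

With standard errors this large on individual subjects, differences of a few points between subjects or models are hard to distinguish from noise, which is one reason to compare aggregated scores as well.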
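Recent Research
Latest Research Directions
In large language model evaluation, the Open LLM Leaderboard has become an important benchmark for measuring overall model capability. Based on the leaderboard's evaluation data for postbot/emailgen-pythia-410m-deduped, recent research focuses on the limits of small models in multi-task generalization. The dataset covers evaluation tasks ranging from ARC Challenge and HellaSwag to the 57 MMLU subjects. It shows the potential of a 410M-parameter model in common-sense reasoning (for example, 52.09% accuracy on Winogrande) and in zero-shot generalization, and it also exposes clear weaknesses in mathematical reasoning (0% accuracy on GSM8K) and precise comprehension (an F1 score of only 0.99% on DROP). These findings tie into the ongoing discussion of how model scale relates to emergent capabilities, and they provide empirical evidence for later work on improving small models through data augmentation, knowledge distillation, or mixture-of-experts architectures.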
The content above was collected and summarized by 遇见数据集.