
open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B

Hugging Face | Updated 2024-03-24 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B
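The dataset card reproduced under the resource description names each configuration after a harness task key, and each timestamped split after the run timestamp. The sketch below shows that naming pattern as it appears in the card's `configs` entries (for example `harness|arc:challenge|25` next to `harness_arc_challenge_25`). The helper names `config_name`, `split_name`, and `parquet_pattern` are hypothetical, and the rules are inferred from the names listed in the card, not taken from an official specification.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to the configuration name seen in the card.

    Assumption: every character that is not a letter or digit
    (such as '|', ':' or '-') is replaced by an underscore.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp, as written in file names, to its split name.

    Assumption: hyphens become underscores and the fractional-seconds
    dot is kept, as in the split names listed in the card.
    """
    return run_timestamp.replace("-", "_")


def parquet_pattern(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern that a config's data_files entry points to."""
    return f"**/details_{task_key}_{run_timestamp}.parquet"


if __name__ == "__main__":
    run = "2024-03-24T14-57-27.448442"
    for key in ("harness|arc:challenge|25", "harness|hendrycksTest-anatomy|5"):
        print(config_name(key), split_name(run), parquet_pattern(key, run))
```

A configuration name produced this way can be passed as the second argument to `load_dataset`, in the same position as `harness_winogrande_5` in the card's own loading example.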
Resource description:
--- pretty_name: Evaluation run of ResplendentAI/DaturaCookie_7B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [ResplendentAI/DaturaCookie_7B](https://huggingface.co/ResplendentAI/DaturaCookie_7B)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-03-24T14:57:27.448442](https://huggingface.co/datasets/open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B/blob/main/results_2024-03-24T14-57-27.448442.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks.
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6479647730793441,\n\ \ \"acc_stderr\": 0.032130582997157306,\n \"acc_norm\": 0.6479675894625477,\n\ \ \"acc_norm_stderr\": 0.03279175471682202,\n \"mc1\": 0.5287637698898409,\n\ \ \"mc1_stderr\": 0.017474513848525518,\n \"mc2\": 0.6848002786802394,\n\ \ \"mc2_stderr\": 0.015189401847464286\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6911262798634812,\n \"acc_stderr\": 0.013501770929344003,\n\ \ \"acc_norm\": 0.712457337883959,\n \"acc_norm_stderr\": 0.013226719056266127\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.716391157140012,\n\ \ \"acc_stderr\": 0.004498280244494498,\n \"acc_norm\": 0.8800039832702649,\n\ \ \"acc_norm_stderr\": 0.0032429275808698566\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.04760952285695235\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6222222222222222,\n\ \ \"acc_stderr\": 0.04188307537595853,\n \"acc_norm\": 0.6222222222222222,\n\ \ \"acc_norm_stderr\": 0.04188307537595853\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7171052631578947,\n \"acc_stderr\": 0.03665349695640767,\n\ \ \"acc_norm\": 0.7171052631578947,\n \"acc_norm_stderr\": 0.03665349695640767\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.61,\n\ \ \"acc_stderr\": 0.04902071300001975,\n \"acc_norm\": 0.61,\n \ \ \"acc_norm_stderr\": 0.04902071300001975\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7094339622641509,\n \"acc_stderr\": 0.027943219989337135,\n\ \ \"acc_norm\": 0.7094339622641509,\n \"acc_norm_stderr\": 0.027943219989337135\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7430555555555556,\n\ \ \"acc_stderr\": 0.03653946969442099,\n \"acc_norm\": 0.7430555555555556,\n\ \ \"acc_norm_stderr\": 0.03653946969442099\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.45,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.45,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-college_computer_science|5\"\ : {\n \"acc\": 0.54,\n \"acc_stderr\": 0.05009082659620333,\n \ \ \"acc_norm\": 0.54,\n \"acc_norm_stderr\": 0.05009082659620333\n \ \ },\n \"harness|hendrycksTest-college_mathematics|5\": {\n \"acc\": 0.28,\n\ \ \"acc_stderr\": 0.04512608598542127,\n \"acc_norm\": 0.28,\n \ \ \"acc_norm_stderr\": 0.04512608598542127\n },\n \"harness|hendrycksTest-college_medicine|5\"\ : {\n \"acc\": 0.6763005780346821,\n \"acc_stderr\": 0.0356760379963917,\n\ \ \"acc_norm\": 0.6763005780346821,\n \"acc_norm_stderr\": 0.0356760379963917\n\ \ },\n \"harness|hendrycksTest-college_physics|5\": {\n \"acc\": 0.39215686274509803,\n\ \ \"acc_stderr\": 0.04858083574266344,\n \"acc_norm\": 0.39215686274509803,\n\ \ \"acc_norm_stderr\": 0.04858083574266344\n },\n \"harness|hendrycksTest-computer_security|5\"\ : {\n \"acc\": 0.76,\n \"acc_stderr\": 0.042923469599092816,\n \ \ \"acc_norm\": 0.76,\n \"acc_norm_stderr\": 0.042923469599092816\n \ \ },\n \"harness|hendrycksTest-conceptual_physics|5\": {\n \"acc\":\ \ 0.5787234042553191,\n \"acc_stderr\": 0.03227834510146267,\n \"\ acc_norm\": 0.5787234042553191,\n \"acc_norm_stderr\": 0.03227834510146267\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.49122807017543857,\n\ \ \"acc_stderr\": 0.04702880432049615,\n \"acc_norm\": 0.49122807017543857,\n\ \ \"acc_norm_stderr\": 0.04702880432049615\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5724137931034483,\n \"acc_stderr\": 0.04122737111370332,\n\ \ \"acc_norm\": 0.5724137931034483,\n \"acc_norm_stderr\": 0.04122737111370332\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.40476190476190477,\n \"acc_stderr\": 0.025279850397404904,\n \"\ acc_norm\": 0.40476190476190477,\n \"acc_norm_stderr\": 0.025279850397404904\n\ 
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4523809523809524,\n\ \ \"acc_stderr\": 0.044518079590553275,\n \"acc_norm\": 0.4523809523809524,\n\ \ \"acc_norm_stderr\": 0.044518079590553275\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.7774193548387097,\n \"acc_stderr\": 0.023664216671642518,\n \"\ acc_norm\": 0.7774193548387097,\n \"acc_norm_stderr\": 0.023664216671642518\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.5024630541871922,\n \"acc_stderr\": 0.035179450386910616,\n \"\ acc_norm\": 0.5024630541871922,\n \"acc_norm_stderr\": 0.035179450386910616\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.67,\n \"acc_stderr\": 0.04725815626252607,\n \"acc_norm\"\ : 0.67,\n \"acc_norm_stderr\": 0.04725815626252607\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7757575757575758,\n \"acc_stderr\": 0.03256866661681102,\n\ \ \"acc_norm\": 0.7757575757575758,\n \"acc_norm_stderr\": 0.03256866661681102\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.797979797979798,\n \"acc_stderr\": 0.028606204289229865,\n \"\ acc_norm\": 0.797979797979798,\n \"acc_norm_stderr\": 0.028606204289229865\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8911917098445595,\n \"acc_stderr\": 0.02247325333276877,\n\ \ \"acc_norm\": 0.8911917098445595,\n \"acc_norm_stderr\": 0.02247325333276877\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6666666666666666,\n \"acc_stderr\": 0.023901157979402538,\n\ \ \"acc_norm\": 0.6666666666666666,\n \"acc_norm_stderr\": 0.023901157979402538\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 
0.3296296296296296,\n \"acc_stderr\": 0.02866120111652456,\n \ \ \"acc_norm\": 0.3296296296296296,\n \"acc_norm_stderr\": 0.02866120111652456\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6932773109243697,\n \"acc_stderr\": 0.02995382389188704,\n \ \ \"acc_norm\": 0.6932773109243697,\n \"acc_norm_stderr\": 0.02995382389188704\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3708609271523179,\n \"acc_stderr\": 0.03943966699183629,\n \"\ acc_norm\": 0.3708609271523179,\n \"acc_norm_stderr\": 0.03943966699183629\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8458715596330275,\n \"acc_stderr\": 0.015480826865374303,\n \"\ acc_norm\": 0.8458715596330275,\n \"acc_norm_stderr\": 0.015480826865374303\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.49537037037037035,\n \"acc_stderr\": 0.03409825519163572,\n \"\ acc_norm\": 0.49537037037037035,\n \"acc_norm_stderr\": 0.03409825519163572\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8333333333333334,\n \"acc_stderr\": 0.026156867523931045,\n \"\ acc_norm\": 0.8333333333333334,\n \"acc_norm_stderr\": 0.026156867523931045\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.810126582278481,\n \"acc_stderr\": 0.02553010046023349,\n \ \ \"acc_norm\": 0.810126582278481,\n \"acc_norm_stderr\": 0.02553010046023349\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6771300448430493,\n\ \ \"acc_stderr\": 0.031381476375754995,\n \"acc_norm\": 0.6771300448430493,\n\ \ \"acc_norm_stderr\": 0.031381476375754995\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7938931297709924,\n \"acc_stderr\": 0.035477710041594654,\n\ \ \"acc_norm\": 0.7938931297709924,\n \"acc_norm_stderr\": 0.035477710041594654\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7603305785123967,\n \"acc_stderr\": 
0.03896878985070417,\n \"\ acc_norm\": 0.7603305785123967,\n \"acc_norm_stderr\": 0.03896878985070417\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7962962962962963,\n\ \ \"acc_stderr\": 0.03893542518824847,\n \"acc_norm\": 0.7962962962962963,\n\ \ \"acc_norm_stderr\": 0.03893542518824847\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7484662576687117,\n \"acc_stderr\": 0.03408997886857529,\n\ \ \"acc_norm\": 0.7484662576687117,\n \"acc_norm_stderr\": 0.03408997886857529\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.4375,\n\ \ \"acc_stderr\": 0.04708567521880525,\n \"acc_norm\": 0.4375,\n \ \ \"acc_norm_stderr\": 0.04708567521880525\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7766990291262136,\n \"acc_stderr\": 0.04123553189891431,\n\ \ \"acc_norm\": 0.7766990291262136,\n \"acc_norm_stderr\": 0.04123553189891431\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8846153846153846,\n\ \ \"acc_stderr\": 0.02093019318517933,\n \"acc_norm\": 0.8846153846153846,\n\ \ \"acc_norm_stderr\": 0.02093019318517933\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8237547892720306,\n\ \ \"acc_stderr\": 0.013625556907993462,\n \"acc_norm\": 0.8237547892720306,\n\ \ \"acc_norm_stderr\": 0.013625556907993462\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7225433526011561,\n \"acc_stderr\": 0.02410571260775431,\n\ \ \"acc_norm\": 0.7225433526011561,\n \"acc_norm_stderr\": 0.02410571260775431\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.47150837988826816,\n\ \ \"acc_stderr\": 0.016695329746015793,\n \"acc_norm\": 0.47150837988826816,\n\ \ \"acc_norm_stderr\": 0.016695329746015793\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 
0.7352941176470589,\n \"acc_stderr\": 0.02526169121972948,\n\ \ \"acc_norm\": 0.7352941176470589,\n \"acc_norm_stderr\": 0.02526169121972948\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7041800643086816,\n\ \ \"acc_stderr\": 0.025922371788818763,\n \"acc_norm\": 0.7041800643086816,\n\ \ \"acc_norm_stderr\": 0.025922371788818763\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7438271604938271,\n \"acc_stderr\": 0.024288533637726095,\n\ \ \"acc_norm\": 0.7438271604938271,\n \"acc_norm_stderr\": 0.024288533637726095\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.48226950354609927,\n \"acc_stderr\": 0.02980873964223777,\n \ \ \"acc_norm\": 0.48226950354609927,\n \"acc_norm_stderr\": 0.02980873964223777\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4641460234680574,\n\ \ \"acc_stderr\": 0.012737361318730583,\n \"acc_norm\": 0.4641460234680574,\n\ \ \"acc_norm_stderr\": 0.012737361318730583\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6911764705882353,\n \"acc_stderr\": 0.02806499816704009,\n\ \ \"acc_norm\": 0.6911764705882353,\n \"acc_norm_stderr\": 0.02806499816704009\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6421568627450981,\n \"acc_stderr\": 0.019393058402355442,\n \ \ \"acc_norm\": 0.6421568627450981,\n \"acc_norm_stderr\": 0.019393058402355442\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\ \ \"acc_stderr\": 0.044612721759105085,\n \"acc_norm\": 0.6818181818181818,\n\ \ \"acc_norm_stderr\": 0.044612721759105085\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7183673469387755,\n \"acc_stderr\": 0.028795185574291296,\n\ \ \"acc_norm\": 0.7183673469387755,\n \"acc_norm_stderr\": 0.028795185574291296\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8557213930348259,\n\ \ \"acc_stderr\": 0.02484575321230604,\n \"acc_norm\": 
0.8557213930348259,\n\ \ \"acc_norm_stderr\": 0.02484575321230604\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.84,\n \"acc_stderr\": 0.0368452949177471,\n \ \ \"acc_norm\": 0.84,\n \"acc_norm_stderr\": 0.0368452949177471\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5240963855421686,\n\ \ \"acc_stderr\": 0.03887971849597264,\n \"acc_norm\": 0.5240963855421686,\n\ \ \"acc_norm_stderr\": 0.03887971849597264\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8421052631578947,\n \"acc_stderr\": 0.027966785859160893,\n\ \ \"acc_norm\": 0.8421052631578947,\n \"acc_norm_stderr\": 0.027966785859160893\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.5287637698898409,\n\ \ \"mc1_stderr\": 0.017474513848525518,\n \"mc2\": 0.6848002786802394,\n\ \ \"mc2_stderr\": 0.015189401847464286\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8279400157853196,\n \"acc_stderr\": 0.010607731615247015\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6527672479150872,\n \ \ \"acc_stderr\": 0.013113898382146879\n }\n}\n```" repo_url: https://huggingface.co/ResplendentAI/DaturaCookie_7B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|arc:challenge|25_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-03-24T14-57-27.448442.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|gsm8k|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hellaswag|10_2024-03-24T14-57-27.448442.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-24T14-57-27.448442.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-24T14-57-27.448442.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-03-24T14-57-27.448442.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-24T14-57-27.448442.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-management|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-03-24T14-57-27.448442.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-03-24T14-57-27.448442.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-24T14-57-27.448442.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-management|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-marketing|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-virology|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-03-24T14-57-27.448442.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|truthfulqa:mc|0_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-03-24T14-57-27.448442.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_03_24T14_57_27.448442 path: - '**/details_harness|winogrande|5_2024-03-24T14-57-27.448442.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-03-24T14-57-27.448442.parquet' - config_name: results data_files: - split: 
2024_03_24T14_57_27.448442 path: - results_2024-03-24T14-57-27.448442.parquet - split: latest path: - results_2024-03-24T14-57-27.448442.parquet ---

# Dataset Card for Evaluation run of ResplendentAI/DaturaCookie_7B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [ResplendentAI/DaturaCookie_7B](https://huggingface.co/ResplendentAI/DaturaCookie_7B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2024-03-24T14:57:27.448442](https://huggingface.co/datasets/open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B/blob/main/results_2024-03-24T14-57-27.448442.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6479647730793441, "acc_stderr": 0.032130582997157306, "acc_norm": 0.6479675894625477, "acc_norm_stderr": 0.03279175471682202, "mc1": 0.5287637698898409, "mc1_stderr": 0.017474513848525518, "mc2": 0.6848002786802394, "mc2_stderr": 0.015189401847464286 }, "harness|arc:challenge|25": { "acc": 0.6911262798634812, "acc_stderr": 0.013501770929344003, "acc_norm": 0.712457337883959, "acc_norm_stderr": 0.013226719056266127 }, "harness|hellaswag|10": { "acc": 0.716391157140012, "acc_stderr": 0.004498280244494498, "acc_norm": 0.8800039832702649, "acc_norm_stderr": 0.0032429275808698566 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.34, "acc_stderr": 0.04760952285695235, "acc_norm": 0.34, "acc_norm_stderr": 0.04760952285695235 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6222222222222222, "acc_stderr": 0.04188307537595853, "acc_norm": 0.6222222222222222, "acc_norm_stderr": 0.04188307537595853 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7171052631578947, "acc_stderr": 0.03665349695640767, "acc_norm": 0.7171052631578947, "acc_norm_stderr": 0.03665349695640767 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.61, "acc_stderr": 0.04902071300001975, "acc_norm": 0.61, "acc_norm_stderr": 0.04902071300001975 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7094339622641509, "acc_stderr": 0.027943219989337135, "acc_norm": 0.7094339622641509, "acc_norm_stderr": 0.027943219989337135 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7430555555555556, "acc_stderr": 0.03653946969442099, "acc_norm": 0.7430555555555556, "acc_norm_stderr": 0.03653946969442099 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.45, "acc_stderr": 0.05, "acc_norm": 0.45, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.54, "acc_stderr": 0.05009082659620333, "acc_norm": 0.54, "acc_norm_stderr": 0.05009082659620333 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.28, "acc_stderr": 0.04512608598542127, "acc_norm": 0.28, "acc_norm_stderr": 0.04512608598542127 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6763005780346821, "acc_stderr": 0.0356760379963917, "acc_norm": 0.6763005780346821, "acc_norm_stderr": 0.0356760379963917 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.39215686274509803, "acc_stderr": 0.04858083574266344, "acc_norm": 0.39215686274509803, "acc_norm_stderr": 0.04858083574266344 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.76, "acc_stderr": 0.042923469599092816, "acc_norm": 0.76, "acc_norm_stderr": 0.042923469599092816 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5787234042553191, "acc_stderr": 0.03227834510146267, "acc_norm": 0.5787234042553191, "acc_norm_stderr": 0.03227834510146267 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.49122807017543857, "acc_stderr": 0.04702880432049615, "acc_norm": 0.49122807017543857, "acc_norm_stderr": 0.04702880432049615 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5724137931034483, "acc_stderr": 0.04122737111370332, "acc_norm": 0.5724137931034483, "acc_norm_stderr": 0.04122737111370332 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.40476190476190477, "acc_stderr": 0.025279850397404904, "acc_norm": 0.40476190476190477, "acc_norm_stderr": 0.025279850397404904 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4523809523809524, "acc_stderr": 0.044518079590553275, "acc_norm": 0.4523809523809524, "acc_norm_stderr": 0.044518079590553275 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7774193548387097, "acc_stderr": 0.023664216671642518, "acc_norm": 0.7774193548387097, "acc_norm_stderr": 0.023664216671642518 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.5024630541871922, "acc_stderr": 0.035179450386910616, "acc_norm": 0.5024630541871922, "acc_norm_stderr": 0.035179450386910616 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.67, "acc_stderr": 0.04725815626252607, "acc_norm": 0.67, "acc_norm_stderr": 0.04725815626252607 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7757575757575758, "acc_stderr": 0.03256866661681102, "acc_norm": 0.7757575757575758, "acc_norm_stderr": 0.03256866661681102 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.797979797979798, "acc_stderr": 0.028606204289229865, "acc_norm": 0.797979797979798, "acc_norm_stderr": 0.028606204289229865 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8911917098445595, "acc_stderr": 0.02247325333276877, "acc_norm": 0.8911917098445595, "acc_norm_stderr": 0.02247325333276877 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6666666666666666, "acc_stderr": 0.023901157979402538, "acc_norm": 0.6666666666666666, "acc_norm_stderr": 0.023901157979402538 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3296296296296296, "acc_stderr": 0.02866120111652456, "acc_norm": 0.3296296296296296, "acc_norm_stderr": 0.02866120111652456 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6932773109243697, "acc_stderr": 0.02995382389188704, "acc_norm": 0.6932773109243697, "acc_norm_stderr": 0.02995382389188704 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3708609271523179, "acc_stderr": 0.03943966699183629, "acc_norm": 0.3708609271523179, "acc_norm_stderr": 0.03943966699183629 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8458715596330275, "acc_stderr": 0.015480826865374303, "acc_norm": 0.8458715596330275, "acc_norm_stderr": 0.015480826865374303 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.49537037037037035, "acc_stderr": 0.03409825519163572, "acc_norm": 0.49537037037037035, "acc_norm_stderr": 
0.03409825519163572 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8333333333333334, "acc_stderr": 0.026156867523931045, "acc_norm": 0.8333333333333334, "acc_norm_stderr": 0.026156867523931045 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.810126582278481, "acc_stderr": 0.02553010046023349, "acc_norm": 0.810126582278481, "acc_norm_stderr": 0.02553010046023349 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6771300448430493, "acc_stderr": 0.031381476375754995, "acc_norm": 0.6771300448430493, "acc_norm_stderr": 0.031381476375754995 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7938931297709924, "acc_stderr": 0.035477710041594654, "acc_norm": 0.7938931297709924, "acc_norm_stderr": 0.035477710041594654 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7603305785123967, "acc_stderr": 0.03896878985070417, "acc_norm": 0.7603305785123967, "acc_norm_stderr": 0.03896878985070417 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7962962962962963, "acc_stderr": 0.03893542518824847, "acc_norm": 0.7962962962962963, "acc_norm_stderr": 0.03893542518824847 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7484662576687117, "acc_stderr": 0.03408997886857529, "acc_norm": 0.7484662576687117, "acc_norm_stderr": 0.03408997886857529 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.4375, "acc_stderr": 0.04708567521880525, "acc_norm": 0.4375, "acc_norm_stderr": 0.04708567521880525 }, "harness|hendrycksTest-management|5": { "acc": 0.7766990291262136, "acc_stderr": 0.04123553189891431, "acc_norm": 0.7766990291262136, "acc_norm_stderr": 0.04123553189891431 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8846153846153846, "acc_stderr": 0.02093019318517933, "acc_norm": 0.8846153846153846, "acc_norm_stderr": 0.02093019318517933 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8237547892720306, "acc_stderr": 0.013625556907993462, "acc_norm": 0.8237547892720306, "acc_norm_stderr": 0.013625556907993462 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7225433526011561, "acc_stderr": 0.02410571260775431, "acc_norm": 0.7225433526011561, "acc_norm_stderr": 0.02410571260775431 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.47150837988826816, "acc_stderr": 0.016695329746015793, "acc_norm": 0.47150837988826816, "acc_norm_stderr": 0.016695329746015793 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7352941176470589, "acc_stderr": 0.02526169121972948, "acc_norm": 0.7352941176470589, "acc_norm_stderr": 0.02526169121972948 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7041800643086816, "acc_stderr": 0.025922371788818763, "acc_norm": 0.7041800643086816, "acc_norm_stderr": 0.025922371788818763 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7438271604938271, "acc_stderr": 0.024288533637726095, "acc_norm": 0.7438271604938271, "acc_norm_stderr": 0.024288533637726095 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.48226950354609927, "acc_stderr": 0.02980873964223777, "acc_norm": 0.48226950354609927, "acc_norm_stderr": 0.02980873964223777 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4641460234680574, "acc_stderr": 0.012737361318730583, "acc_norm": 0.4641460234680574, "acc_norm_stderr": 0.012737361318730583 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6911764705882353, "acc_stderr": 0.02806499816704009, "acc_norm": 0.6911764705882353, "acc_norm_stderr": 0.02806499816704009 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6421568627450981, "acc_stderr": 0.019393058402355442, "acc_norm": 0.6421568627450981, "acc_norm_stderr": 0.019393058402355442 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6818181818181818, "acc_stderr": 0.044612721759105085, "acc_norm": 0.6818181818181818, "acc_norm_stderr": 
0.044612721759105085 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7183673469387755, "acc_stderr": 0.028795185574291296, "acc_norm": 0.7183673469387755, "acc_norm_stderr": 0.028795185574291296 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8557213930348259, "acc_stderr": 0.02484575321230604, "acc_norm": 0.8557213930348259, "acc_norm_stderr": 0.02484575321230604 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.84, "acc_stderr": 0.0368452949177471, "acc_norm": 0.84, "acc_norm_stderr": 0.0368452949177471 }, "harness|hendrycksTest-virology|5": { "acc": 0.5240963855421686, "acc_stderr": 0.03887971849597264, "acc_norm": 0.5240963855421686, "acc_norm_stderr": 0.03887971849597264 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8421052631578947, "acc_stderr": 0.027966785859160893, "acc_norm": 0.8421052631578947, "acc_norm_stderr": 0.027966785859160893 }, "harness|truthfulqa:mc|0": { "mc1": 0.5287637698898409, "mc1_stderr": 0.017474513848525518, "mc2": 0.6848002786802394, "mc2_stderr": 0.015189401847464286 }, "harness|winogrande|5": { "acc": 0.8279400157853196, "acc_stderr": 0.010607731615247015 }, "harness|gsm8k|5": { "acc": 0.6527672479150872, "acc_stderr": 0.013113898382146879 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
--> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
Provided by:
open-llm-leaderboard
Summary of original information

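Dataset overview

This dataset was created automatically during the evaluation run of the model ResplendentAI/DaturaCookie_7B on the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.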
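The naming pattern behind these configurations and splits can be sketched in a few lines. The mapping below is inferred from the names listed in this card (for example `harness|truthfulqa:mc|0` next to `harness_truthfulqa_mc_0`), not taken from the leaderboard's own code, and the helper names `config_name` and `split_name` are hypothetical.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|truthfulqa:mc|0' to a configuration name.

    Assumption inferred from this card: every '|', ':' and '-' in the key
    shows up as '_' in the configuration name.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the name of its timestamped split.

    Assumption inferred from this card: '-' and ':' become '_',
    while the '.' before the fractional seconds is kept.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name("2024-03-24T14:57:27.448442"))
```

Keeping the two helpers separate reflects that the task key and the run timestamp vary independently: a second run would add a new split to every configuration without changing any configuration name.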

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B",
    "harness_winogrande_5",
    split="train",
)
```

最新结果

以下是 2024-03-24T14:57:27.448442 运行的最新结果

```python
{
    "all": {
        "acc": 0.6479647730793441,
        "acc_stderr": 0.032130582997157306,
        "acc_norm": 0.6479675894625477,
        "acc_norm_stderr": 0.03279175471682202,
        "mc1": 0.5287637698898409,
        "mc1_stderr": 0.017474513848525518,
        "mc2": 0.6848002786802394,
        "mc2_stderr": 0.015189401847464286
    },
    "harness|arc:challenge|25": {
        "acc": 0.6911262798634812,
        "acc_stderr": 0.013501770929344003,
        "acc_norm": 0.712457337883959,
        "acc_norm_stderr": 0.013226719056266127
    },
    "harness|hellaswag|10": {
        "acc": 0.716391157140012,
        "acc_stderr": 0.004498280244494498,
        "acc_norm": 0.8800039832702649,
        "acc_norm_stderr": 0.0032429275808698566
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.34,
        "acc_stderr": 0.04760952285695235,
        "acc_norm": 0.34,
        "acc_norm_stderr": 0.04760952285695235
    },
    "harness|hendrycksTest-anatomy|5": {
        "acc": 0.6222222222222222,
        "acc_stderr": 0.04188307537595853,
        "acc_norm": 0.6222222222222222,
        "acc_norm_stderr": 0.04188307537595853
    },
    "harness|hendrycksTest-astronomy|5": {
        "acc": 0.7171052631578947,
        "acc_stderr": 0.03665349695640767,
        "acc_norm": 0.7171052631578947,
        "acc_norm_stderr": 0.03665349695640767
    },
    "harness|hendrycksTest-business_ethics|5": {
        "acc": 0.61,
        "acc_stderr": 0.04902071300001975,
        "acc_norm": 0.61,
        "acc_norm_stderr": 0.04902071300001975
    },
    "harness|hendrycksTest-clinical_knowledge|5": {
        "acc": 0.7094339622641509,
        "acc_stderr": 0.027943219989337135,
        "acc_norm": 0.7094339622641509,
        "acc_norm_stderr": 0.027943219989337135
    },
    "harness|hendrycksTest-college_biology|5": {
        "acc": 0.7430555555555556,
        "acc_stderr": 0.03653946969442099,
        "acc_norm": 0.7430555555555556,
        "acc_norm_stderr": 0.03653946969442099
    },
    "harness|hendrycksTest-college_chemistry|5": {
        "acc": 0.45,
        "acc_stderr": 0.05,
        "acc_norm": 0.45,
        "acc_norm_stderr": 0.05
    },
    "harness|hendrycksTest-college_computer_science|5": {
        "acc": 0.54,
        "acc_stderr": 0.05009082659620333,
        "acc_norm": 0.54,
        "acc_norm_stderr": 0.05009082659620333
    },
    "harness|hendrycksTest-college_mathematics|5": {
        "acc": 0.28,
        "acc_stderr": 0.04512608598542127,
        "acc_norm": 0.28,
        "acc_norm_stderr": 0.04512608598542127
    },
    "harness|hendrycksTest-college_medicine|5": {
        "acc": 0.6763005780346821,
        "acc_stderr": 0.0356760379963917,
        "acc_norm": 0.6763005780346821,
        "acc_norm_stderr": 0.0356760379963917
    },
    "harness|hendrycksTest-college_physics|5": {
        "acc": 0.39215686274509803,
        "acc_stderr": 0.04858083574266344,
        "acc_norm": 0.39215686274509803,
        "acc_norm_stderr": 0.04858083574266344
    },
    "harness|hendrycksTest-computer_security|5": {
        "acc": 0.76,
        "acc_stderr": 0.042923469599092816,
        "acc_norm": 0.76,
        "acc_norm_stderr": 0.042923469599092816
    },
    "harness|hendrycksTest-conceptual_physics|5": {
        "acc": 0.5787234042553191,
        "acc_stderr": 0.03227834510146267,
        "acc_norm": 0.5787234042553191,
        "acc_norm_stderr": 0.03227834510146267
    },
    "harness|hendrycksTest-econometrics|5": {
        "acc": 0.49122807017543857,
        "acc_stderr": 0.04702880432049615,
        "acc_norm": 0.49122807017543857,
        "acc_norm_stderr": 0.04702880432049615
    },
    "harness|hendrycksTest-electrical_engineering|5": {
        "acc": 0.5724137931034483,
        "acc_stderr": 0.04122737111370332,
        "acc_norm": 0.5724137931034483,
        "acc_norm_stderr": 0.04122737111370332
    },
    "harness|hendrycksTest-elementary_mathematics|5": {
        "acc": 0.40476190476190477,
        "acc_stderr": 0.025279850397404904,
        "acc_norm": 0.40476190476190477,
        "acc_norm_stderr": 0.025279850397404904
    },
    "harness|hendrycksTest-formal_logic|5": {
        "acc": 0.4523809523809524,
        "acc_stderr": 0.044518079590553275,
        "acc_norm": 0.4523809523809524,
        "acc_norm_stderr": 0.044518079590553275
    },
    "harness|hendrycksTest-global_facts|5": {
        "acc": 0.32,
        "acc_stderr": 0.046882617226215034,
        "acc_norm": 0.32,
        "acc_norm_stderr": 0.046882617226215034
    },
    "harness|hendrycksTest-high_school_biology|5": {
        "acc": 0.7774193548387097,
        "acc_stderr": 0.023664216671642518,
        "acc_norm": 0.7774193548387097,
        "acc_norm_stderr": 0.023664216671642518
    },
    "harness|hendrycksTest-high_school_chemistry|5": {
        "acc": 0.5024630541871922,
        "acc_stderr": 0.035179450386910616,
        "acc_norm": 0.5024630541871922,
        "acc_norm_stderr": 0.035179450386910616
    },
    "harness|hendrycksTest-high_school_computer_science|5": {
        "acc": 0.67,
        "acc_stderr": 0.04725815626252607,
        "acc_norm": 0.67,
        "acc_norm_stderr": 0.04725815626252607
    },
    "harness|hendrycksTest-high_school_european_history|5": {
        "acc": 0.7757575757575758,
        "acc_stderr": 0.03256866661681102,
        "acc_norm": 0.7757575757575758,
        "acc_norm_stderr": 0.03256866661681102
    },
    "harness|hendrycksTest-high_school_geography|5": {
        "acc": 0.797979797979798,
        "acc_stderr": 0.028606204289229865,
        "acc_norm": 0.797979797979798,
        "acc_norm_stderr": 0.028606204289229865
    },
    "harness|hendrycksTest-high_school_government_and_politics|5": {
        "acc": 0.8911917098445595,
        "acc_stderr": 0.02247325333276877,
        "acc_norm": 0.8911917098445595,
        "acc_norm_stderr": 0.02247325333276877
    },
    "harness|hendrycksTest-high_school_macroeconomics|5": {
        "acc": 0.6666666666666666,
        "acc_stderr": 0.023901157979402538,
        "acc_norm": 0.6666666666666666,
        "acc_norm_stderr": 0.023901157979402538
    },
    "harness|hendrycksTest-high_school_mathematics|5": {
        "acc": 0.3296296296296296,
        "acc_stderr": 0.02866120111652456,
        "acc_norm": 0.3296296296296296,
        "acc_norm_stderr": 0.02866120111652456
    },
    "harness|hendrycksTest-high_school_microeconom
```
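Each per-task entry pairs a score with its standard error, so a rough uncertainty range can be read off directly. The following is a minimal sketch, not part of the original card: it assumes a normal approximation (score ± 1.96 × stderr) for a 95% interval, and it reads the trailing number in each key as the few-shot count. The helper names `parse_key` and `confidence_interval` are hypothetical; the three sample entries are copied from the results listed here.

```python
# Sample entries copied from the results; only acc_norm fields are kept.
RESULTS = {
    "harness|arc:challenge|25": {
        "acc_norm": 0.712457337883959,
        "acc_norm_stderr": 0.013226719056266127,
    },
    "harness|hellaswag|10": {
        "acc_norm": 0.8800039832702649,
        "acc_norm_stderr": 0.0032429275808698566,
    },
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc_norm": 0.34,
        "acc_norm_stderr": 0.04760952285695235,
    },
}


def parse_key(key):
    """Split 'suite|task|n' into its parts (n read as the few-shot count)."""
    suite, task, shots = key.split("|")
    return suite, task, int(shots)


def confidence_interval(score, stderr, z=1.96):
    """Normal-approximation interval, clipped to the valid range [0, 1]."""
    half_width = z * stderr
    return max(0.0, score - half_width), min(1.0, score + half_width)


if __name__ == "__main__":
    for key, metrics in RESULTS.items():
        _, task, shots = parse_key(key)
        low, high = confidence_interval(
            metrics["acc_norm"], metrics["acc_norm_stderr"]
        )
        print(f"{task} ({shots}-shot): {metrics['acc_norm']:.3f} "
              f"[{low:.3f}, {high:.3f}]")
```

The interval is much wider for small MMLU subsets such as abstract_algebra than for HellaSwag, which is worth keeping in mind when comparing per-subject scores between models.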

Collected Summary
Dataset Introduction
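Construction
This dataset was generated automatically during the evaluation of the ResplendentAI/DaturaCookie_7B model on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task. The data comes from a single run. The results of each run are stored as a split in the corresponding configuration, named after the run's timestamp, and the 'train' split always points to the latest results. An additional configuration named 'results' stores the aggregated metrics of the run and is used to compute and display overall performance on the Leaderboard.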
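Because runs are identified by timestamp and 'train' follows the most recent one, picking the latest run amounts to taking the maximum timestamp. A minimal sketch under stated assumptions: the timestamps are ISO 8601 strings like the one in the results file name, the helper `latest_run` is hypothetical, and the earlier timestamp in the example is invented purely for illustration (this dataset contains only one run).

```python
from datetime import datetime


def latest_run(timestamps):
    """Return the most recent ISO 8601 timestamp string from a list."""
    if not timestamps:
        raise ValueError("no runs available")
    # Compare as datetimes rather than raw strings to avoid format pitfalls.
    return max(timestamps, key=datetime.fromisoformat)


if __name__ == "__main__":
    runs = [
        "2024-03-20T09:00:00.000000",  # hypothetical earlier run
        "2024-03-24T14:57:27.448442",  # the run recorded for this dataset
    ]
    print(latest_run(runs))
```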
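Features
The dataset is organized hierarchically: each task configuration contains the raw evaluation details together with standardized metrics such as accuracy and its standard error. The evaluation covers a diverse set of benchmarks, including ARC Challenge, HellaSwag, MMLU (spanning dozens of subjects from abstract algebra to virology), TruthfulQA, Winogrande, and GSM8K, and measures the model's reasoning, knowledge, commonsense, and mathematical ability. All results are stored in Parquet format for efficient retrieval and analysis.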
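Usage
Users can load the data with the Hugging Face datasets library, for example with `load_dataset('open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B', 'harness_winogrande_5', split='train')` to obtain the evaluation results for a given task. By specifying a different configuration name and split (a timestamped split or 'latest'), researchers can go back to earlier evaluations or fetch the most recent results for model comparison and error analysis.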
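The loading call can be wrapped so that a task key from the results (such as `harness|winogrande|5`) is turned into a configuration name first. This is a sketch, not the library's own API: `config_name` and `load_task_details` are hypothetical helpers, and the rule "replace every run of non-alphanumeric characters with an underscore" is an assumption generalized from the one confirmed example, `harness_winogrande_5`. Only `load_dataset` itself comes from the datasets library, and calling it requires network access.

```python
import re

REPO_ID = "open-llm-leaderboard/details_ResplendentAI__DaturaCookie_7B"


def config_name(task_key):
    """Map a task key to a configuration name (assumed naming convention).

    Example: 'harness|winogrande|5' -> 'harness_winogrande_5'.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key).strip("_")


def load_task_details(task_key, split="train"):
    """Load the per-sample details for one task; 'train' is the latest run."""
    # Imported lazily so config_name can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task_key), split=split)


if __name__ == "__main__":
    print(config_name("harness|winogrande|5"))
    # details = load_task_details("harness|winogrande|5")  # needs network
```

If a derived name does not exist in the repository, the configuration list shown on the dataset page is the authoritative source.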
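Background and Challenges
Background Overview
With the rapid development of large language models (LLMs) in natural language processing, evaluating their performance systematically and fairly has become a central concern for both academia and industry. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the open-source community a standardized, transparent model evaluation platform; the researchers behind it include Clémentine Fourrier and others. This dataset is the evaluation record of the DaturaCookie_7B model on the Leaderboard. Its 63 configurations cover ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, reflecting the model's ability in reasoning, commonsense, mathematics, and knowledge understanding. The evaluation framework has helped standardize side-by-side model comparison, supported iterative improvement of open-source LLMs and community collaboration, and contributed to a reproducible, trustworthy evaluation ecosystem.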
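Current Challenges
The challenges facing this dataset fall into two groups. At the level of the problem domain, LLM evaluation must reconcile task diversity with uniform scoring: different tasks (for example mathematical reasoning versus commonsense question answering) demand very different abilities, a single metric cannot fully characterize a model, and cross-task comparison must also guard against data leakage and overfitting. At the level of construction, the main challenge is the reliability of automated data processing. Results have to be aggregated across runs, but the set of tasks covered at different timestamps may differ, so the "latest" split must be updated dynamically to point to the newest results. In addition, Parquet storage and the index management of many task configurations make data loading more complex, and keeping each task's results complete and traceable is a key difficulty.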
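Common Scenarios
Typical Use Cases
In open LLM evaluation, this dataset serves as a standardized Open LLM Leaderboard evaluation record and is used to quantify model performance across 63 fine-grained tasks. Typical use cases span commonsense reasoning (HellaSwag, Winogrande), mathematical problem solving (GSM8K), multi-subject knowledge question answering (the 57 subject areas of MMLU), and truthfulness checking (TruthfulQA), giving researchers a fine-grained, reproducible framework for assessing model capability.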
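Derived and Related Work
Work building on evaluation results of this kind includes scaling-law studies relating model size to capability, targeted optimization methods for specific tasks such as mathematical reasoning, and explorations of meta-learning evaluation metrics that exploit the multi-task structure. The Open LLM Leaderboard runs its benchmarks with EleutherAI's LM Evaluation Harness, as the `harness|` prefix in the task keys suggests, so the data format follows that tool's output and can serve as a template for building scalable evaluation pipelines.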
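Recent Research on the Dataset
Latest Research Directions
In LLM evaluation, the Open LLM Leaderboard has become a widely used benchmark for overall model capability. For ResplendentAI/DaturaCookie_7B, a 7B-parameter model, the evaluation dataset systematically covers the core tasks ARC-Challenge, HellaSwag, MMLU (57 subjects), TruthfulQA, Winogrande, and GSM8K, reflecting the model's performance in commonsense reasoning, world knowledge, mathematical logic, and factual consistency. The latest results show a normalized accuracy of 88% on HellaSwag and a score of 82.8% on Winogrande, indicating solid language understanding and commonsense inference. Publishing this evaluation dataset gives the community a reproducible, fine-grained reference, makes the performance of 7B-class models across diverse tasks more transparent and comparable, and offers a useful reference for studying the capability limits of small models and how to improve them.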
The content above was collected and summarized by 遇见数据集.