
open-llm-leaderboard-old/details_cloudyu__Pluto_24B_DPO_200

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_cloudyu__Pluto_24B_DPO_200
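The dataset card on this page names its configurations and splits after the evaluated task and the run timestamp (for example, the results key `harness|arc:challenge|25` appears as the configuration `harness_arc_challenge_25`, and the run `2024-01-18T17:18:01.366806` appears as the split `2024_01_18T17_18_01.366806`). The following is a minimal sketch of that mapping, inferred only from the names visible in the card. The helper names `config_name`, `split_name`, and `load_details` are hypothetical, and the repository id is taken from the page title; it is an assumption that the data still resolves under that id.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key such as 'harness|arc:challenge|25' into a config name.

    Assumption: every character that is not a letter or digit becomes '_',
    which matches the config names listed in the card.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name used in the card's configs."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def load_details(repo_id: str, task_key: str, split: str = "latest"):
    """Load per-sample details for one task (requires network access).

    The import is local so the pure helpers above work without `datasets`.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, config_name(task_key), split=split)


if __name__ == "__main__":
    # Repository id as shown in the page title (assumed to be current).
    repo = "open-llm-leaderboard-old/details_cloudyu__Pluto_24B_DPO_200"
    print(repo, config_name("harness|gsm8k|5"), split_name("2024-01-18T17:18:01.366806"))
```

Calling `load_details(repo, "harness|gsm8k|5")` would then request the `latest` split of the `harness_gsm8k_5` configuration; passing `split=split_name(...)` selects a specific run instead.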
Resource summary:
--- pretty_name: Evaluation run of cloudyu/Pluto_24B_DPO_200 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [cloudyu/Pluto_24B_DPO_200](https://huggingface.co/cloudyu/Pluto_24B_DPO_200)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_cloudyu__Pluto_24B_DPO_200\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-18T17:18:01.366806](https://huggingface.co/datasets/open-llm-leaderboard/details_cloudyu__Pluto_24B_DPO_200/blob/main/results_2024-01-18T17-18-01.366806.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6487883183265996,\n\ \ \"acc_stderr\": 0.03206766377553213,\n \"acc_norm\": 0.649809388886223,\n\ \ \"acc_norm_stderr\": 0.03271483221046768,\n \"mc1\": 0.5128518971848225,\n\ \ \"mc1_stderr\": 0.017497717944299822,\n \"mc2\": 0.6986184584005906,\n\ \ \"mc2_stderr\": 0.014631943760685329\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6373720136518771,\n \"acc_stderr\": 0.014049106564955003,\n\ \ \"acc_norm\": 0.6561433447098977,\n \"acc_norm_stderr\": 0.013880644570156213\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6717785301732723,\n\ \ \"acc_stderr\": 0.004686062421158146,\n \"acc_norm\": 0.8637721569408484,\n\ \ \"acc_norm_stderr\": 0.0034232928816321398\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \ \ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6222222222222222,\n\ \ \"acc_stderr\": 0.04188307537595852,\n \"acc_norm\": 0.6222222222222222,\n\ \ \"acc_norm_stderr\": 0.04188307537595852\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7039473684210527,\n \"acc_stderr\": 0.03715062154998905,\n\ \ \"acc_norm\": 0.7039473684210527,\n \"acc_norm_stderr\": 0.03715062154998905\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.61,\n\ \ \"acc_stderr\": 0.04902071300001975,\n \"acc_norm\": 0.61,\n \ \ \"acc_norm_stderr\": 0.04902071300001975\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7056603773584905,\n \"acc_stderr\": 0.028049186315695248,\n\ \ \"acc_norm\": 0.7056603773584905,\n \"acc_norm_stderr\": 0.028049186315695248\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7430555555555556,\n\ \ \"acc_stderr\": 0.03653946969442099,\n \"acc_norm\": 0.7430555555555556,\n\ \ \"acc_norm_stderr\": 0.03653946969442099\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.46,\n \"acc_stderr\": 0.05009082659620333,\n \ \ \"acc_norm\": 0.46,\n \"acc_norm_stderr\": 0.05009082659620333\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.54,\n \"acc_stderr\": 0.05009082659620333,\n \"acc_norm\": 0.54,\n\ \ \"acc_norm_stderr\": 0.05009082659620333\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.04688261722621504,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.04688261722621504\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6705202312138728,\n\ \ \"acc_stderr\": 0.03583901754736412,\n \"acc_norm\": 0.6705202312138728,\n\ \ \"acc_norm_stderr\": 0.03583901754736412\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.4215686274509804,\n \"acc_stderr\": 0.04913595201274498,\n\ \ \"acc_norm\": 0.4215686274509804,\n \"acc_norm_stderr\": 0.04913595201274498\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.76,\n \"acc_stderr\": 0.04292346959909283,\n \"acc_norm\": 0.76,\n\ \ \"acc_norm_stderr\": 0.04292346959909283\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.6,\n \"acc_stderr\": 0.03202563076101735,\n \ \ \"acc_norm\": 0.6,\n \"acc_norm_stderr\": 0.03202563076101735\n },\n\ \ \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.49122807017543857,\n\ \ \"acc_stderr\": 0.04702880432049615,\n \"acc_norm\": 0.49122807017543857,\n\ \ \"acc_norm_stderr\": 0.04702880432049615\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5517241379310345,\n \"acc_stderr\": 0.04144311810878151,\n\ \ \"acc_norm\": 0.5517241379310345,\n \"acc_norm_stderr\": 0.04144311810878151\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.4126984126984127,\n \"acc_stderr\": 0.02535574126305526,\n \"\ acc_norm\": 0.4126984126984127,\n \"acc_norm_stderr\": 0.02535574126305526\n\ \ 
},\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4523809523809524,\n\ \ \"acc_stderr\": 0.044518079590553275,\n \"acc_norm\": 0.4523809523809524,\n\ \ \"acc_norm_stderr\": 0.044518079590553275\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.04688261722621504,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.04688261722621504\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7806451612903226,\n\ \ \"acc_stderr\": 0.023540799358723295,\n \"acc_norm\": 0.7806451612903226,\n\ \ \"acc_norm_stderr\": 0.023540799358723295\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5073891625615764,\n \"acc_stderr\": 0.035176035403610105,\n\ \ \"acc_norm\": 0.5073891625615764,\n \"acc_norm_stderr\": 0.035176035403610105\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \"acc_norm\"\ : 0.71,\n \"acc_norm_stderr\": 0.045604802157206845\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.8,\n \"acc_stderr\": 0.031234752377721175,\n \ \ \"acc_norm\": 0.8,\n \"acc_norm_stderr\": 0.031234752377721175\n \ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7676767676767676,\n \"acc_stderr\": 0.030088629490217487,\n \"\ acc_norm\": 0.7676767676767676,\n \"acc_norm_stderr\": 0.030088629490217487\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8963730569948186,\n \"acc_stderr\": 0.02199531196364424,\n\ \ \"acc_norm\": 0.8963730569948186,\n \"acc_norm_stderr\": 0.02199531196364424\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6692307692307692,\n \"acc_stderr\": 0.02385479568097112,\n \ \ \"acc_norm\": 0.6692307692307692,\n \"acc_norm_stderr\": 0.02385479568097112\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.31851851851851853,\n 
\"acc_stderr\": 0.02840653309060846,\n \ \ \"acc_norm\": 0.31851851851851853,\n \"acc_norm_stderr\": 0.02840653309060846\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.7100840336134454,\n \"acc_stderr\": 0.029472485833136098,\n\ \ \"acc_norm\": 0.7100840336134454,\n \"acc_norm_stderr\": 0.029472485833136098\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3576158940397351,\n \"acc_stderr\": 0.03913453431177258,\n \"\ acc_norm\": 0.3576158940397351,\n \"acc_norm_stderr\": 0.03913453431177258\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8422018348623853,\n \"acc_stderr\": 0.01563002297009244,\n \"\ acc_norm\": 0.8422018348623853,\n \"acc_norm_stderr\": 0.01563002297009244\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5185185185185185,\n \"acc_stderr\": 0.03407632093854051,\n \"\ acc_norm\": 0.5185185185185185,\n \"acc_norm_stderr\": 0.03407632093854051\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8186274509803921,\n \"acc_stderr\": 0.027044621719474082,\n \"\ acc_norm\": 0.8186274509803921,\n \"acc_norm_stderr\": 0.027044621719474082\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7974683544303798,\n \"acc_stderr\": 0.026160568246601446,\n \ \ \"acc_norm\": 0.7974683544303798,\n \"acc_norm_stderr\": 0.026160568246601446\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.695067264573991,\n\ \ \"acc_stderr\": 0.030898610882477515,\n \"acc_norm\": 0.695067264573991,\n\ \ \"acc_norm_stderr\": 0.030898610882477515\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7557251908396947,\n \"acc_stderr\": 0.037683359597287434,\n\ \ \"acc_norm\": 0.7557251908396947,\n \"acc_norm_stderr\": 0.037683359597287434\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7851239669421488,\n \"acc_stderr\": 0.037494924487096966,\n \"\ 
acc_norm\": 0.7851239669421488,\n \"acc_norm_stderr\": 0.037494924487096966\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7777777777777778,\n\ \ \"acc_stderr\": 0.0401910747255735,\n \"acc_norm\": 0.7777777777777778,\n\ \ \"acc_norm_stderr\": 0.0401910747255735\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7668711656441718,\n \"acc_stderr\": 0.0332201579577674,\n\ \ \"acc_norm\": 0.7668711656441718,\n \"acc_norm_stderr\": 0.0332201579577674\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.48214285714285715,\n\ \ \"acc_stderr\": 0.047427623612430116,\n \"acc_norm\": 0.48214285714285715,\n\ \ \"acc_norm_stderr\": 0.047427623612430116\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7766990291262136,\n \"acc_stderr\": 0.04123553189891431,\n\ \ \"acc_norm\": 0.7766990291262136,\n \"acc_norm_stderr\": 0.04123553189891431\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8888888888888888,\n\ \ \"acc_stderr\": 0.020588491316092368,\n \"acc_norm\": 0.8888888888888888,\n\ \ \"acc_norm_stderr\": 0.020588491316092368\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.79,\n \"acc_stderr\": 0.040936018074033256,\n \ \ \"acc_norm\": 0.79,\n \"acc_norm_stderr\": 0.040936018074033256\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8237547892720306,\n\ \ \"acc_stderr\": 0.013625556907993457,\n \"acc_norm\": 0.8237547892720306,\n\ \ \"acc_norm_stderr\": 0.013625556907993457\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7456647398843931,\n \"acc_stderr\": 0.02344582627654554,\n\ \ \"acc_norm\": 0.7456647398843931,\n \"acc_norm_stderr\": 0.02344582627654554\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.38324022346368714,\n\ \ \"acc_stderr\": 0.016260159604429128,\n \"acc_norm\": 0.38324022346368714,\n\ \ \"acc_norm_stderr\": 0.016260159604429128\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n 
\"acc\": 0.7287581699346405,\n \"acc_stderr\": 0.025457756696667888,\n\ \ \"acc_norm\": 0.7287581699346405,\n \"acc_norm_stderr\": 0.025457756696667888\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7170418006430869,\n\ \ \"acc_stderr\": 0.025583062489984813,\n \"acc_norm\": 0.7170418006430869,\n\ \ \"acc_norm_stderr\": 0.025583062489984813\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7530864197530864,\n \"acc_stderr\": 0.02399350170904211,\n\ \ \"acc_norm\": 0.7530864197530864,\n \"acc_norm_stderr\": 0.02399350170904211\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.4787234042553192,\n \"acc_stderr\": 0.029800481645628693,\n \ \ \"acc_norm\": 0.4787234042553192,\n \"acc_norm_stderr\": 0.029800481645628693\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.47196870925684486,\n\ \ \"acc_stderr\": 0.012750151802922438,\n \"acc_norm\": 0.47196870925684486,\n\ \ \"acc_norm_stderr\": 0.012750151802922438\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6764705882352942,\n \"acc_stderr\": 0.028418208619406755,\n\ \ \"acc_norm\": 0.6764705882352942,\n \"acc_norm_stderr\": 0.028418208619406755\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6683006535947712,\n \"acc_stderr\": 0.019047485239360378,\n \ \ \"acc_norm\": 0.6683006535947712,\n \"acc_norm_stderr\": 0.019047485239360378\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\ \ \"acc_stderr\": 0.044612721759105085,\n \"acc_norm\": 0.6818181818181818,\n\ \ \"acc_norm_stderr\": 0.044612721759105085\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7306122448979592,\n \"acc_stderr\": 0.02840125202902294,\n\ \ \"acc_norm\": 0.7306122448979592,\n \"acc_norm_stderr\": 0.02840125202902294\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8557213930348259,\n\ \ \"acc_stderr\": 0.024845753212306046,\n 
\"acc_norm\": 0.8557213930348259,\n\ \ \"acc_norm_stderr\": 0.024845753212306046\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.83,\n \"acc_stderr\": 0.0377525168068637,\n \ \ \"acc_norm\": 0.83,\n \"acc_norm_stderr\": 0.0377525168068637\n },\n\ \ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5180722891566265,\n\ \ \"acc_stderr\": 0.03889951252827216,\n \"acc_norm\": 0.5180722891566265,\n\ \ \"acc_norm_stderr\": 0.03889951252827216\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.847953216374269,\n \"acc_stderr\": 0.027539122889061456,\n\ \ \"acc_norm\": 0.847953216374269,\n \"acc_norm_stderr\": 0.027539122889061456\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.5128518971848225,\n\ \ \"mc1_stderr\": 0.017497717944299822,\n \"mc2\": 0.6986184584005906,\n\ \ \"mc2_stderr\": 0.014631943760685329\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7892659826361483,\n \"acc_stderr\": 0.011462046419710683\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6588324488248674,\n \ \ \"acc_stderr\": 0.013059111935831497\n }\n}\n```" repo_url: https://huggingface.co/cloudyu/Pluto_24B_DPO_200 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|arc:challenge|25_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-18T17-18-01.366806.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|gsm8k|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hellaswag|10_2024-01-18T17-18-01.366806.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T17-18-01.366806.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T17-18-01.366806.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T17-18-01.366806.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T17-18-01.366806.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T17-18-01.366806.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-18T17-18-01.366806.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T17-18-01.366806.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-management|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T17-18-01.366806.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|truthfulqa:mc|0_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-18T17-18-01.366806.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_18T17_18_01.366806 path: - '**/details_harness|winogrande|5_2024-01-18T17-18-01.366806.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-18T17-18-01.366806.parquet' - config_name: results data_files: - split: 
2024_01_18T17_18_01.366806 path: - results_2024-01-18T17-18-01.366806.parquet - split: latest path: - results_2024-01-18T17-18-01.366806.parquet
---

# Dataset Card for Evaluation run of cloudyu/Pluto_24B_DPO_200

<!-- Provide a quick summary of the dataset. -->

This dataset was created automatically during the evaluation run of the model [cloudyu/Pluto_24B_DPO_200](https://huggingface.co/cloudyu/Pluto_24B_DPO_200) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset was created from 1 run. Each run is stored as a separate split in each configuration, and the split is named after the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_cloudyu__Pluto_24B_DPO_200",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-18T17:18:01.366806](https://huggingface.co/datasets/open-llm-leaderboard/details_cloudyu__Pluto_24B_DPO_200/blob/main/results_2024-01-18T17-18-01.366806.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6487883183265996, "acc_stderr": 0.03206766377553213, "acc_norm": 0.649809388886223, "acc_norm_stderr": 0.03271483221046768, "mc1": 0.5128518971848225, "mc1_stderr": 0.017497717944299822, "mc2": 0.6986184584005906, "mc2_stderr": 0.014631943760685329 }, "harness|arc:challenge|25": { "acc": 0.6373720136518771, "acc_stderr": 0.014049106564955003, "acc_norm": 0.6561433447098977, "acc_norm_stderr": 0.013880644570156213 }, "harness|hellaswag|10": { "acc": 0.6717785301732723, "acc_stderr": 0.004686062421158146, "acc_norm": 0.8637721569408484, "acc_norm_stderr": 0.0034232928816321398 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.31, "acc_stderr": 0.04648231987117316, "acc_norm": 0.31, "acc_norm_stderr": 0.04648231987117316 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6222222222222222, "acc_stderr": 0.04188307537595852, "acc_norm": 0.6222222222222222, "acc_norm_stderr": 0.04188307537595852 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7039473684210527, "acc_stderr": 0.03715062154998905, "acc_norm": 0.7039473684210527, "acc_norm_stderr": 0.03715062154998905 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.61, "acc_stderr": 0.04902071300001975, "acc_norm": 0.61, "acc_norm_stderr": 0.04902071300001975 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7056603773584905, "acc_stderr": 0.028049186315695248, "acc_norm": 0.7056603773584905, "acc_norm_stderr": 0.028049186315695248 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7430555555555556, "acc_stderr": 0.03653946969442099, "acc_norm": 0.7430555555555556, "acc_norm_stderr": 0.03653946969442099 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.46, "acc_stderr": 0.05009082659620333, "acc_norm": 0.46, "acc_norm_stderr": 0.05009082659620333 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.54, "acc_stderr": 0.05009082659620333, "acc_norm": 0.54, "acc_norm_stderr": 
0.05009082659620333 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.32, "acc_stderr": 0.04688261722621504, "acc_norm": 0.32, "acc_norm_stderr": 0.04688261722621504 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6705202312138728, "acc_stderr": 0.03583901754736412, "acc_norm": 0.6705202312138728, "acc_norm_stderr": 0.03583901754736412 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4215686274509804, "acc_stderr": 0.04913595201274498, "acc_norm": 0.4215686274509804, "acc_norm_stderr": 0.04913595201274498 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.76, "acc_stderr": 0.04292346959909283, "acc_norm": 0.76, "acc_norm_stderr": 0.04292346959909283 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.6, "acc_stderr": 0.03202563076101735, "acc_norm": 0.6, "acc_norm_stderr": 0.03202563076101735 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.49122807017543857, "acc_stderr": 0.04702880432049615, "acc_norm": 0.49122807017543857, "acc_norm_stderr": 0.04702880432049615 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5517241379310345, "acc_stderr": 0.04144311810878151, "acc_norm": 0.5517241379310345, "acc_norm_stderr": 0.04144311810878151 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.4126984126984127, "acc_stderr": 0.02535574126305526, "acc_norm": 0.4126984126984127, "acc_norm_stderr": 0.02535574126305526 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4523809523809524, "acc_stderr": 0.044518079590553275, "acc_norm": 0.4523809523809524, "acc_norm_stderr": 0.044518079590553275 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.32, "acc_stderr": 0.04688261722621504, "acc_norm": 0.32, "acc_norm_stderr": 0.04688261722621504 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7806451612903226, "acc_stderr": 0.023540799358723295, "acc_norm": 0.7806451612903226, "acc_norm_stderr": 0.023540799358723295 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.5073891625615764, "acc_stderr": 0.035176035403610105, "acc_norm": 0.5073891625615764, "acc_norm_stderr": 0.035176035403610105 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.71, "acc_stderr": 0.045604802157206845, "acc_norm": 0.71, "acc_norm_stderr": 0.045604802157206845 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.8, "acc_stderr": 0.031234752377721175, "acc_norm": 0.8, "acc_norm_stderr": 0.031234752377721175 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7676767676767676, "acc_stderr": 0.030088629490217487, "acc_norm": 0.7676767676767676, "acc_norm_stderr": 0.030088629490217487 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8963730569948186, "acc_stderr": 0.02199531196364424, "acc_norm": 0.8963730569948186, "acc_norm_stderr": 0.02199531196364424 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6692307692307692, "acc_stderr": 0.02385479568097112, "acc_norm": 0.6692307692307692, "acc_norm_stderr": 0.02385479568097112 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.31851851851851853, "acc_stderr": 0.02840653309060846, "acc_norm": 0.31851851851851853, "acc_norm_stderr": 0.02840653309060846 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.7100840336134454, "acc_stderr": 0.029472485833136098, "acc_norm": 0.7100840336134454, "acc_norm_stderr": 0.029472485833136098 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3576158940397351, "acc_stderr": 0.03913453431177258, "acc_norm": 0.3576158940397351, "acc_norm_stderr": 0.03913453431177258 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8422018348623853, "acc_stderr": 0.01563002297009244, "acc_norm": 0.8422018348623853, "acc_norm_stderr": 0.01563002297009244 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5185185185185185, "acc_stderr": 0.03407632093854051, "acc_norm": 0.5185185185185185, "acc_norm_stderr": 0.03407632093854051 }, 
"harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8186274509803921, "acc_stderr": 0.027044621719474082, "acc_norm": 0.8186274509803921, "acc_norm_stderr": 0.027044621719474082 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7974683544303798, "acc_stderr": 0.026160568246601446, "acc_norm": 0.7974683544303798, "acc_norm_stderr": 0.026160568246601446 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.695067264573991, "acc_stderr": 0.030898610882477515, "acc_norm": 0.695067264573991, "acc_norm_stderr": 0.030898610882477515 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7557251908396947, "acc_stderr": 0.037683359597287434, "acc_norm": 0.7557251908396947, "acc_norm_stderr": 0.037683359597287434 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7851239669421488, "acc_stderr": 0.037494924487096966, "acc_norm": 0.7851239669421488, "acc_norm_stderr": 0.037494924487096966 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7777777777777778, "acc_stderr": 0.0401910747255735, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.0401910747255735 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7668711656441718, "acc_stderr": 0.0332201579577674, "acc_norm": 0.7668711656441718, "acc_norm_stderr": 0.0332201579577674 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.48214285714285715, "acc_stderr": 0.047427623612430116, "acc_norm": 0.48214285714285715, "acc_norm_stderr": 0.047427623612430116 }, "harness|hendrycksTest-management|5": { "acc": 0.7766990291262136, "acc_stderr": 0.04123553189891431, "acc_norm": 0.7766990291262136, "acc_norm_stderr": 0.04123553189891431 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8888888888888888, "acc_stderr": 0.020588491316092368, "acc_norm": 0.8888888888888888, "acc_norm_stderr": 0.020588491316092368 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.79, "acc_stderr": 0.040936018074033256, "acc_norm": 0.79, "acc_norm_stderr": 0.040936018074033256 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8237547892720306, "acc_stderr": 0.013625556907993457, "acc_norm": 0.8237547892720306, "acc_norm_stderr": 0.013625556907993457 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7456647398843931, "acc_stderr": 0.02344582627654554, "acc_norm": 0.7456647398843931, "acc_norm_stderr": 0.02344582627654554 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.38324022346368714, "acc_stderr": 0.016260159604429128, "acc_norm": 0.38324022346368714, "acc_norm_stderr": 0.016260159604429128 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7287581699346405, "acc_stderr": 0.025457756696667888, "acc_norm": 0.7287581699346405, "acc_norm_stderr": 0.025457756696667888 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7170418006430869, "acc_stderr": 0.025583062489984813, "acc_norm": 0.7170418006430869, "acc_norm_stderr": 0.025583062489984813 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7530864197530864, "acc_stderr": 0.02399350170904211, "acc_norm": 0.7530864197530864, "acc_norm_stderr": 0.02399350170904211 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4787234042553192, "acc_stderr": 0.029800481645628693, "acc_norm": 0.4787234042553192, "acc_norm_stderr": 0.029800481645628693 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.47196870925684486, "acc_stderr": 0.012750151802922438, "acc_norm": 0.47196870925684486, "acc_norm_stderr": 0.012750151802922438 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6764705882352942, "acc_stderr": 0.028418208619406755, "acc_norm": 0.6764705882352942, "acc_norm_stderr": 0.028418208619406755 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6683006535947712, "acc_stderr": 0.019047485239360378, "acc_norm": 0.6683006535947712, "acc_norm_stderr": 0.019047485239360378 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6818181818181818, "acc_stderr": 0.044612721759105085, "acc_norm": 0.6818181818181818, "acc_norm_stderr": 
0.044612721759105085 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.7306122448979592, "acc_stderr": 0.02840125202902294, "acc_norm": 0.7306122448979592, "acc_norm_stderr": 0.02840125202902294 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8557213930348259, "acc_stderr": 0.024845753212306046, "acc_norm": 0.8557213930348259, "acc_norm_stderr": 0.024845753212306046 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.83, "acc_stderr": 0.0377525168068637, "acc_norm": 0.83, "acc_norm_stderr": 0.0377525168068637 }, "harness|hendrycksTest-virology|5": { "acc": 0.5180722891566265, "acc_stderr": 0.03889951252827216, "acc_norm": 0.5180722891566265, "acc_norm_stderr": 0.03889951252827216 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.847953216374269, "acc_stderr": 0.027539122889061456, "acc_norm": 0.847953216374269, "acc_norm_stderr": 0.027539122889061456 }, "harness|truthfulqa:mc|0": { "mc1": 0.5128518971848225, "mc1_stderr": 0.017497717944299822, "mc2": 0.6986184584005906, "mc2_stderr": 0.014631943760685329 }, "harness|winogrande|5": { "acc": 0.7892659826361483, "acc_stderr": 0.011462046419710683 }, "harness|gsm8k|5": { "acc": 0.6588324488248674, "acc_stderr": 0.013059111935831497 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
--> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
Provider:
open-llm-leaderboard-old
Summary of original information

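Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model cloudyu/Pluto_24B_DPO_200 for the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluation task.
  • The dataset was created from 1 run. Each run appears as a specific split in every configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the latest results.
  • An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.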
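The naming scheme behind these configurations and splits can be sketched in a few lines. This is a minimal illustration inferred from the YAML entries in the card, not code from the leaderboard itself: the helper names `config_name`, `split_name`, and `parquet_glob` are hypothetical, and the character substitutions are an assumption based only on the patterns visible in this card.

```python
import re


def config_name(task_id: str) -> str:
    """Map a harness task id to the configuration name seen in the card's YAML.

    Assumption: every "|", ":" and "-" in the task id becomes "_".
    """
    return re.sub(r"[|:\-]", "_", task_id)


def split_name(run_timestamp: str) -> str:
    """Map an ISO-style run timestamp to the timestamped split name."""
    return re.sub(r"[-:]", "_", run_timestamp)


def parquet_glob(task_id: str, run_timestamp: str) -> str:
    """Build the data-file pattern listed under a configuration's `path:` key."""
    return f"**/details_{task_id}_{run_timestamp.replace(':', '-')}.parquet"


run = "2024-01-18T17:18:01.366806"
print(config_name("harness|truthfulqa:mc|0"))
print(split_name(run))
print(parquet_glob("harness|winogrande|5", run))
```

Keeping the three mappings separate mirrors the card: the configuration name, the split name, and the file pattern each encode the same task id and timestamp with different separators.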

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_cloudyu__Pluto_24B_DPO_200",
    "harness_winogrande_5",
    split="latest",
)
```

Latest results

The latest results come from run 2024-01-18T17:18:01.366806. The aggregate ("all") metrics are repeated here; the per-task values are listed in full in the dataset card above.

```python
{
    "all": {
        "acc": 0.6487883183265996,
        "acc_stderr": 0.03206766377553213,
        "acc_norm": 0.649809388886223,
        "acc_norm_stderr": 0.03271483221046768,
        "mc1": 0.5128518971848225,
        "mc1_stderr": 0.017497717944299822,
        "mc2": 0.6986184584005906,
        "mc2_stderr": 0.014631943760685329
    }
}
```

Collected Summary
Dataset Introduction
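Construction Method
This dataset records a standardized, automated evaluation of a large language model. It was generated automatically during the evaluation run of the cloudyu/Pluto_24B_DPO_200 model on the Open LLM Leaderboard. It comprises 63 configurations, each corresponding to one evaluated task, such as the ARC Challenge, HellaSwag, and the MMLU (hendrycksTest) subject tests. The data come from a single evaluation run. Each run is stored as a separate split named after the run's timestamp, which keeps results traceable. An additional configuration named "results" stores the aggregated metrics of the run and provides structured data for quantifying the model's overall performance.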
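The correspondence between evaluated tasks and configurations can be sketched as follows. Result keys in the card have the form `harness|<task>|<n_shot>`, and the card's loading example uses the configuration name `harness_winogrande_5`. The helper below assumes that a configuration name is obtained by replacing every run of non-alphanumeric characters in the key with an underscore. That rule is inferred from the single configuration name shown and is not stated by the card, and `task_to_config` and `split_task_key` are hypothetical names.

```python
import re
from typing import Tuple


def task_to_config(task_key: str) -> str:
    """Map a result key such as 'harness|winogrande|5' to a configuration name.

    Assumption: separators ('|', ':', '-') become underscores, as the
    configuration name 'harness_winogrande_5' in the dataset card suggests.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key).strip("_")


def split_task_key(task_key: str) -> Tuple[str, str, int]:
    """Split 'harness|<task>|<n_shot>' into suite, task name, and shot count."""
    suite, task, n_shot = task_key.split("|")
    return suite, task, int(n_shot)


if __name__ == "__main__":
    for key in (
        "harness|winogrande|5",
        "harness|hendrycksTest-anatomy|5",
        "harness|arc:challenge|25",
    ):
        print(key, "->", task_to_config(key), split_task_key(key))
```

The trailing number in each key is the few-shot count used for that task, so `split_task_key` keeps it as an integer for filtering or grouping.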
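Features
The dataset is structured along several dimensions. It covers a broad range of evaluation tasks, including commonsense reasoning, subject knowledge, and mathematical problem solving. It is organized by configuration, with each configuration corresponding to one evaluation task, which supports fine-grained analysis of model capabilities. Results are stored in timestamp-named splits, so historical runs are preserved while a dedicated split always points to the latest results (the dataset card names it "train" and also refers to a "latest" split). The dataset reports accuracy (`acc`) and its standard error (`acc_stderr`), together with normalized accuracy (`acc_norm`) and its standard error, which supports analysis of how robust the measured performance is.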
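To make the accuracy and standard-error fields concrete, the sketch below reproduces a reported `acc_stderr` from `acc` and turns it into an approximate 95% confidence interval. It assumes that the standard error is the sample standard error of a 0/1 outcome, sqrt(p(1 - p) / (n - 1)), and that the `global_facts` task has n = 100 questions. Neither assumption is stated in the card, but together they reproduce the reported value 0.04688261722621504 for acc = 0.32. The multiplier 1.96 is the usual normal approximation, and both function names are hypothetical.

```python
import math
from typing import Tuple


def binomial_stderr(acc: float, n: int) -> float:
    """Sample standard error of the mean of n 0/1 outcomes with mean acc.

    Assumption: the n - 1 denominator (sample variance) is what the
    evaluation harness uses; this is inferred from the reported numbers.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval acc +/- z * stderr."""
    return acc - z * stderr, acc + z * stderr


if __name__ == "__main__":
    acc = 0.32                       # reported acc for global_facts
    reported = 0.04688261722621504   # reported acc_stderr for global_facts
    computed = binomial_stderr(acc, n=100)  # n = 100 is an assumption
    print("computed stderr:", computed, "reported:", reported)
    print("approx. 95% interval:", confidence_interval(acc, reported))
```

With standard errors of this size on the smaller subject tests, differences of a few accuracy points between two models on a single subject may not be meaningful.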
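Usage
The dataset can be loaded directly with the Hugging Face `datasets` library. Calling `load_dataset` with the dataset name, a configuration (for example "harness_winogrande_5"), and a split (for example "train") returns the evaluation details for that task. For aggregated results, use the "results" configuration, which holds the model's overall metrics across tasks. This modular access lets researchers extract the evaluation data for a specific task and compare it across models or across runs over time.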
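The loading call described above can be wrapped in a small helper, shown here as a sketch. The repository id, configuration name, and split are taken from the card's own example. The wrapper names `load_details` and `load_aggregated` are hypothetical, and using split "train" for the "results" configuration is an assumption based on the card's statement that "train" points to the latest results. Note that this page lists the repository under `open-llm-leaderboard-old/`, while the card's example uses `open-llm-leaderboard/`, so `repo_id` may need adjusting.

```python
# Repository id as written in the dataset card's loading example.
REPO_ID = "open-llm-leaderboard/details_cloudyu__Pluto_24B_DPO_200"


def load_details(config: str = "harness_winogrande_5",
                 split: str = "train",
                 repo_id: str = REPO_ID):
    """Load the per-example evaluation details for one task configuration."""
    # Imported lazily so that defining this helper does not require the
    # `datasets` package or network access.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=split)


def load_aggregated(split: str = "train", repo_id: str = REPO_ID):
    """Load the aggregated metrics from the "results" configuration.

    Assumption: the "results" configuration exposes the same split name as
    the task configurations; the card also mentions a "latest" split.
    """
    return load_details(config="results", split=split, repo_id=repo_id)
```

Calling `load_details()` with no arguments reproduces the card's example. Passing another configuration name selects a different task.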
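Background and Challenges
Background Overview
As large language model (LLM) technology has developed rapidly, systematic and standardized evaluation of model performance has become important to progress in the field. Against this background, Hugging Face launched the Open LLM Leaderboard in 2023 to give the community a transparent, reproducible benchmark platform for model capabilities. The dataset open-llm-leaderboard-old/details_cloudyu__Pluto_24B_DPO_200 is the detailed record of results generated automatically when the platform evaluated the model cloudyu/Pluto_24B_DPO_200 on January 18, 2024. The dataset was built under the lead of the Hugging Face team. Its core research question is how a multi-task, multi-dimensional evaluation framework can objectively measure an LLM's combined abilities in commonsense reasoning, specialized knowledge, mathematical computation, and truthfulness, so as to provide solid data for model optimization and comparison and to support more standardized, rigorous LLM evaluation.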
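Current Challenges
The domain challenge this dataset addresses is the fragmentation and lack of comparability in LLM evaluation. Traditional evaluations often rely on a single task or a narrow domain and therefore fail to reflect a model's generalization ability and breadth of knowledge. The Open LLM Leaderboard integrates several established benchmarks, including the ARC Challenge, HellaSwag, MMLU (the hendrycksTest series), and TruthfulQA, into a combined evaluation suite covering reasoning, knowledge, ethics, and other dimensions. The difficulty lies in designing a fair, unbiased evaluation procedure and ensuring that scores are comparable across models and tasks. Building the dataset also poses technical challenges: keeping the automated evaluation pipeline stable, storing and organizing a large volume of results (spanning 63 configurations) efficiently, and handling the version-control complexity introduced by timestamped evaluation runs. All of these place strict demands on the dataset's reliability, accessibility, and maintainability.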
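Common Scenarios
Classic Use Cases
As part of the Open LLM Leaderboard, the dataset is typically used for fine-grained analysis of model performance. Because it covers diverse benchmark tasks such as the ARC Challenge, HellaSwag, MMLU, and TruthfulQA, it supports systematic evaluation of a model's commonsense reasoning, language understanding, specialized knowledge, and truthfulness. This multi-dimensional framework lets researchers see how a model's performance differs across tasks and points to where optimization effort is most needed.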
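A minimal sketch of such a per-task comparison is shown below. The accuracies are copied from the results excerpt in the dataset card. The helper names `rank_tasks` and `mean_accuracy` are hypothetical, and the mean over this hand-picked subset is only illustrative: it is not the leaderboard's aggregate score.

```python
from typing import Dict, List, Tuple

# Accuracies ("acc") copied from the results excerpt in the dataset card.
SCORES: Dict[str, float] = {
    "harness|hendrycksTest-college_physics|5": 0.4215686274509804,
    "harness|hendrycksTest-computer_security|5": 0.76,
    "harness|hendrycksTest-global_facts|5": 0.32,
    "harness|hendrycksTest-high_school_biology|5": 0.7806451612903226,
    "harness|hendrycksTest-high_school_government_and_politics|5": 0.8963730569948186,
}


def rank_tasks(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (task, accuracy) pairs sorted from weakest to strongest."""
    return sorted(scores.items(), key=lambda item: item[1])


def mean_accuracy(scores: Dict[str, float]) -> float:
    """Unweighted mean accuracy over the given tasks."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    for task, acc in rank_tasks(SCORES):
        print(f"{acc:.3f}  {task}")
    print("mean over this subset:", round(mean_accuracy(SCORES), 3))
```

Sorting from weakest to strongest puts the subjects that most need attention at the top of the listing.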
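Derived Related Work
Work derived from this dataset centers on diagnosing model capabilities, improving evaluation methods, and extending the leaderboard ecosystem. For example, analyses based on the fine-grained results examine why model performance differs across knowledge domains. The dataset has also prompted research on evaluation bias, robustness to prompt engineering, and few-shot learning efficiency. In addition, its open evaluation framework has encouraged the community to develop more varied evaluation tasks, moving LLM evaluation as a whole toward broader coverage and greater fairness.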
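Recent Research on the Dataset
Latest Research Directions
In LLM evaluation, the open-llm-leaderboard datasets serve as a benchmark for model performance, and recent research focuses on combining multi-dimensional capability evaluation with model optimization strategies. Current directions concern model performance in complex reasoning, specialized knowledge, and ethical alignment, for example using diverse tasks such as MMLU and HellaSwag to assess generalization and depth of knowledge. A notable development is the open-source community's push for transparent evaluation, reflected in the wide adoption of the Hugging Face Open LLM Leaderboard, which has promoted standardized comparison and iterative improvement of models. This trend helps models move into academic and industrial use and supplies data for research on model safety and reliability, supporting the responsible development of AI.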
The content above was collected and summarized by 遇见数据集.