
open-llm-leaderboard-old/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51

Hugging Face | Updated 2024-01-19 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51
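
The dataset card reproduced further down lists one configuration per evaluated task, with one split per run timestamp. Judging only from the names visible in that card, configuration and split names appear to be formed by replacing separator characters in the harness task key and timestamp with underscores. The sketch below illustrates that inferred mapping; the helper names `task_to_config_name` and `timestamp_to_split_name` are hypothetical and not part of the `datasets` library, and the rule is an assumption that may not hold for every task.

```python
import re


def task_to_config_name(task_key: str) -> str:
    """Turn a harness task key into the config name seen in the card.

    Assumption (inferred from the card, not from library docs):
    each '|', ':' and '-' in the task key becomes '_'.
    """
    return re.sub(r"[|:-]", "_", task_key)


def timestamp_to_split_name(timestamp: str) -> str:
    """Turn a run timestamp into the split name seen in the card.

    Assumption: each '-' and ':' becomes '_'; the '.' before the
    fractional seconds is kept.
    """
    return re.sub(r"[-:]", "_", timestamp)


if __name__ == "__main__":
    # Task keys and timestamp taken from the card's results listing.
    print(task_to_config_name("harness|arc:challenge|25"))
    print(task_to_config_name("harness|hendrycksTest-abstract_algebra|5"))
    print(timestamp_to_split_name("2024-01-19T08:19:43.491405"))
```

A name built this way could then be passed as the configuration argument to `load_dataset`, as in the card's own loading example.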
Resource summary:
--- pretty_name: Evaluation run of ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51](https://huggingface.co/ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-19T08:19:43.491405](https://huggingface.co/datasets/open-llm-leaderboard/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51/blob/main/results_2024-01-19T08-19-43.491405.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5956914302139066,\n\ \ \"acc_stderr\": 0.03334872123940361,\n \"acc_norm\": 0.6014998966219293,\n\ \ \"acc_norm_stderr\": 0.03404340642551802,\n \"mc1\": 0.2802937576499388,\n\ \ \"mc1_stderr\": 0.015723139524608767,\n \"mc2\": 0.4145856205202453,\n\ \ \"mc2_stderr\": 0.015440318842858623\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.5648464163822525,\n \"acc_stderr\": 0.01448798619718604,\n\ \ \"acc_norm\": 0.5972696245733788,\n \"acc_norm_stderr\": 0.014332236306790149\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6258713403704441,\n\ \ \"acc_stderr\": 0.0048290815328265015,\n \"acc_norm\": 0.8252340171280621,\n\ \ \"acc_norm_stderr\": 0.003789906792644689\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.5851851851851851,\n\ \ \"acc_stderr\": 0.04256193767901408,\n \"acc_norm\": 0.5851851851851851,\n\ \ \"acc_norm_stderr\": 0.04256193767901408\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.625,\n \"acc_stderr\": 0.039397364351956274,\n \ \ \"acc_norm\": 0.625,\n \"acc_norm_stderr\": 0.039397364351956274\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.51,\n\ \ \"acc_stderr\": 0.05024183937956912,\n \"acc_norm\": 0.51,\n \ \ \"acc_norm_stderr\": 0.05024183937956912\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6716981132075471,\n \"acc_stderr\": 0.02890159361241178,\n\ \ \"acc_norm\": 0.6716981132075471,\n \"acc_norm_stderr\": 0.02890159361241178\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.6666666666666666,\n\ \ \"acc_stderr\": 0.03942082639927213,\n \"acc_norm\": 0.6666666666666666,\n\ \ \"acc_norm_stderr\": 0.03942082639927213\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.41,\n \"acc_stderr\": 0.049431107042371025,\n \ \ \"acc_norm\": 0.41,\n \"acc_norm_stderr\": 0.049431107042371025\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"\ acc\": 0.51,\n \"acc_stderr\": 0.05024183937956912,\n \"acc_norm\"\ : 0.51,\n \"acc_norm_stderr\": 0.05024183937956912\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.35,\n \"acc_stderr\": 0.04793724854411019,\n \ \ \"acc_norm\": 0.35,\n \"acc_norm_stderr\": 0.04793724854411019\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6127167630057804,\n\ \ \"acc_stderr\": 0.03714325906302065,\n \"acc_norm\": 0.6127167630057804,\n\ \ \"acc_norm_stderr\": 0.03714325906302065\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.38235294117647056,\n \"acc_stderr\": 0.04835503696107223,\n\ \ \"acc_norm\": 0.38235294117647056,\n \"acc_norm_stderr\": 0.04835503696107223\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.77,\n \"acc_stderr\": 0.04229525846816506,\n \"acc_norm\": 0.77,\n\ \ \"acc_norm_stderr\": 0.04229525846816506\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5191489361702127,\n \"acc_stderr\": 0.032662042990646796,\n\ \ \"acc_norm\": 0.5191489361702127,\n \"acc_norm_stderr\": 0.032662042990646796\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.40350877192982454,\n\ \ \"acc_stderr\": 0.046151869625837026,\n \"acc_norm\": 0.40350877192982454,\n\ \ \"acc_norm_stderr\": 0.046151869625837026\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5586206896551724,\n \"acc_stderr\": 0.04137931034482757,\n\ \ \"acc_norm\": 0.5586206896551724,\n \"acc_norm_stderr\": 0.04137931034482757\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.3783068783068783,\n \"acc_stderr\": 0.02497695405315524,\n \"\ acc_norm\": 0.3783068783068783,\n 
\"acc_norm_stderr\": 0.02497695405315524\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.373015873015873,\n\ \ \"acc_stderr\": 0.04325506042017086,\n \"acc_norm\": 0.373015873015873,\n\ \ \"acc_norm_stderr\": 0.04325506042017086\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.42,\n \"acc_stderr\": 0.049604496374885836,\n \ \ \"acc_norm\": 0.42,\n \"acc_norm_stderr\": 0.049604496374885836\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\ : 0.7129032258064516,\n \"acc_stderr\": 0.025736542745594525,\n \"\ acc_norm\": 0.7129032258064516,\n \"acc_norm_stderr\": 0.025736542745594525\n\ \ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\ : 0.5024630541871922,\n \"acc_stderr\": 0.03517945038691063,\n \"\ acc_norm\": 0.5024630541871922,\n \"acc_norm_stderr\": 0.03517945038691063\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.65,\n \"acc_stderr\": 0.047937248544110196,\n \"acc_norm\"\ : 0.65,\n \"acc_norm_stderr\": 0.047937248544110196\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7454545454545455,\n \"acc_stderr\": 0.03401506715249039,\n\ \ \"acc_norm\": 0.7454545454545455,\n \"acc_norm_stderr\": 0.03401506715249039\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7323232323232324,\n \"acc_stderr\": 0.03154449888270285,\n \"\ acc_norm\": 0.7323232323232324,\n \"acc_norm_stderr\": 0.03154449888270285\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.7875647668393783,\n \"acc_stderr\": 0.02951928261681723,\n\ \ \"acc_norm\": 0.7875647668393783,\n \"acc_norm_stderr\": 0.02951928261681723\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.5820512820512821,\n \"acc_stderr\": 0.02500732988246121,\n \ \ \"acc_norm\": 0.5820512820512821,\n \"acc_norm_stderr\": 0.02500732988246121\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.3296296296296296,\n \"acc_stderr\": 0.028661201116524565,\n \ \ \"acc_norm\": 0.3296296296296296,\n \"acc_norm_stderr\": 0.028661201116524565\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6428571428571429,\n \"acc_stderr\": 0.031124619309328177,\n\ \ \"acc_norm\": 0.6428571428571429,\n \"acc_norm_stderr\": 0.031124619309328177\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.33112582781456956,\n \"acc_stderr\": 0.038425817186598696,\n \"\ acc_norm\": 0.33112582781456956,\n \"acc_norm_stderr\": 0.038425817186598696\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8110091743119267,\n \"acc_stderr\": 0.016785481159203624,\n \"\ acc_norm\": 0.8110091743119267,\n \"acc_norm_stderr\": 0.016785481159203624\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4398148148148148,\n \"acc_stderr\": 0.03385177976044811,\n \"\ acc_norm\": 0.4398148148148148,\n \"acc_norm_stderr\": 0.03385177976044811\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.7303921568627451,\n \"acc_stderr\": 0.031145570659486782,\n \"\ acc_norm\": 0.7303921568627451,\n \"acc_norm_stderr\": 0.031145570659486782\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.729957805907173,\n \"acc_stderr\": 0.028900721906293426,\n \ \ \"acc_norm\": 0.729957805907173,\n \"acc_norm_stderr\": 0.028900721906293426\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6457399103139013,\n\ \ \"acc_stderr\": 0.032100621541349864,\n \"acc_norm\": 0.6457399103139013,\n\ \ \"acc_norm_stderr\": 0.032100621541349864\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.6564885496183206,\n \"acc_stderr\": 0.041649760719448786,\n\ \ \"acc_norm\": 0.6564885496183206,\n \"acc_norm_stderr\": 0.041649760719448786\n\ \ },\n 
\"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7768595041322314,\n \"acc_stderr\": 0.03800754475228733,\n \"\ acc_norm\": 0.7768595041322314,\n \"acc_norm_stderr\": 0.03800754475228733\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7777777777777778,\n\ \ \"acc_stderr\": 0.040191074725573483,\n \"acc_norm\": 0.7777777777777778,\n\ \ \"acc_norm_stderr\": 0.040191074725573483\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.6932515337423313,\n \"acc_stderr\": 0.03623089915724147,\n\ \ \"acc_norm\": 0.6932515337423313,\n \"acc_norm_stderr\": 0.03623089915724147\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.44642857142857145,\n\ \ \"acc_stderr\": 0.04718471485219588,\n \"acc_norm\": 0.44642857142857145,\n\ \ \"acc_norm_stderr\": 0.04718471485219588\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7475728155339806,\n \"acc_stderr\": 0.04301250399690878,\n\ \ \"acc_norm\": 0.7475728155339806,\n \"acc_norm_stderr\": 0.04301250399690878\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8247863247863247,\n\ \ \"acc_stderr\": 0.024904439098918242,\n \"acc_norm\": 0.8247863247863247,\n\ \ \"acc_norm_stderr\": 0.024904439098918242\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.776500638569604,\n\ \ \"acc_stderr\": 0.01489723522945071,\n \"acc_norm\": 0.776500638569604,\n\ \ \"acc_norm_stderr\": 0.01489723522945071\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.684971098265896,\n \"acc_stderr\": 0.025009313790069706,\n\ \ \"acc_norm\": 0.684971098265896,\n \"acc_norm_stderr\": 0.025009313790069706\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.38324022346368714,\n\ \ \"acc_stderr\": 0.016260159604429128,\n \"acc_norm\": 
0.38324022346368714,\n\ \ \"acc_norm_stderr\": 0.016260159604429128\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.6535947712418301,\n \"acc_stderr\": 0.02724561304721536,\n\ \ \"acc_norm\": 0.6535947712418301,\n \"acc_norm_stderr\": 0.02724561304721536\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6688102893890675,\n\ \ \"acc_stderr\": 0.026730620728004906,\n \"acc_norm\": 0.6688102893890675,\n\ \ \"acc_norm_stderr\": 0.026730620728004906\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.6728395061728395,\n \"acc_stderr\": 0.026105673861409828,\n\ \ \"acc_norm\": 0.6728395061728395,\n \"acc_norm_stderr\": 0.026105673861409828\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.4397163120567376,\n \"acc_stderr\": 0.029609912075594113,\n \ \ \"acc_norm\": 0.4397163120567376,\n \"acc_norm_stderr\": 0.029609912075594113\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4211212516297262,\n\ \ \"acc_stderr\": 0.012610325733489905,\n \"acc_norm\": 0.4211212516297262,\n\ \ \"acc_norm_stderr\": 0.012610325733489905\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.5992647058823529,\n \"acc_stderr\": 0.029768263528933105,\n\ \ \"acc_norm\": 0.5992647058823529,\n \"acc_norm_stderr\": 0.029768263528933105\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6143790849673203,\n \"acc_stderr\": 0.01969145905235403,\n \ \ \"acc_norm\": 0.6143790849673203,\n \"acc_norm_stderr\": 0.01969145905235403\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.5363636363636364,\n\ \ \"acc_stderr\": 0.04776449162396197,\n \"acc_norm\": 0.5363636363636364,\n\ \ \"acc_norm_stderr\": 0.04776449162396197\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.6612244897959184,\n \"acc_stderr\": 0.030299506562154185,\n\ \ \"acc_norm\": 0.6612244897959184,\n \"acc_norm_stderr\": 0.030299506562154185\n\ \ },\n 
\"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8059701492537313,\n\ \ \"acc_stderr\": 0.02796267760476892,\n \"acc_norm\": 0.8059701492537313,\n\ \ \"acc_norm_stderr\": 0.02796267760476892\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.81,\n \"acc_stderr\": 0.039427724440366255,\n \ \ \"acc_norm\": 0.81,\n \"acc_norm_stderr\": 0.039427724440366255\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5,\n \ \ \"acc_stderr\": 0.03892494720807614,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.03892494720807614\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.7953216374269005,\n \"acc_stderr\": 0.03094445977853321,\n\ \ \"acc_norm\": 0.7953216374269005,\n \"acc_norm_stderr\": 0.03094445977853321\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.2802937576499388,\n\ \ \"mc1_stderr\": 0.015723139524608767,\n \"mc2\": 0.4145856205202453,\n\ \ \"mc2_stderr\": 0.015440318842858623\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7719021310181531,\n \"acc_stderr\": 0.01179301581766359\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.30856709628506446,\n \ \ \"acc_stderr\": 0.012723076049815884\n }\n}\n```" repo_url: https://huggingface.co/ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|arc:challenge|25_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-19T08-19-43.491405.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|gsm8k|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_19T08_19_43.491405 path: - 
'**/details_harness|hellaswag|10_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-19T08-19-43.491405.parquet' - 
'**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-19T08-19-43.491405.parquet' 
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-19T08-19-43.491405.parquet' - 
'**/details_harness|hendrycksTest-college_biology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-19T08-19-43.491405.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-19T08-19-43.491405.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-19T08-19-43.491405.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-19T08-19-43.491405.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-management|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-19T08-19-43.491405.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|truthfulqa:mc|0_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-19T08-19-43.491405.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_19T08_19_43.491405 path: - '**/details_harness|winogrande|5_2024-01-19T08-19-43.491405.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-19T08-19-43.491405.parquet' - config_name: results data_files: - split: 
2024_01_19T08_19_43.491405 path: - results_2024-01-19T08-19-43.491405.parquet - split: latest path: - results_2024-01-19T08-19-43.491405.parquet --- # Dataset Card for Evaluation run of ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51 <!-- Provide a quick summary of the dataset. --> Dataset automatically created during the evaluation run of model [ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51](https://huggingface.co/ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results. An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). To load the details from a run, you can for instance do the following: ```python from datasets import load_dataset data = load_dataset("open-llm-leaderboard/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51", "harness_winogrande_5", split="latest") ``` ## Latest results These are the [latest results from run 2024-01-19T08:19:43.491405](https://huggingface.co/datasets/open-llm-leaderboard/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51/blob/main/results_2024-01-19T08-19-43.491405.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.5956914302139066, "acc_stderr": 0.03334872123940361, "acc_norm": 0.6014998966219293, "acc_norm_stderr": 0.03404340642551802, "mc1": 0.2802937576499388, "mc1_stderr": 0.015723139524608767, "mc2": 0.4145856205202453, "mc2_stderr": 0.015440318842858623 }, "harness|arc:challenge|25": { "acc": 0.5648464163822525, "acc_stderr": 0.01448798619718604, "acc_norm": 0.5972696245733788, "acc_norm_stderr": 0.014332236306790149 }, "harness|hellaswag|10": { "acc": 0.6258713403704441, "acc_stderr": 0.0048290815328265015, "acc_norm": 0.8252340171280621, "acc_norm_stderr": 0.003789906792644689 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.5851851851851851, "acc_stderr": 0.04256193767901408, "acc_norm": 0.5851851851851851, "acc_norm_stderr": 0.04256193767901408 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.625, "acc_stderr": 0.039397364351956274, "acc_norm": 0.625, "acc_norm_stderr": 0.039397364351956274 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.51, "acc_stderr": 0.05024183937956912, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6716981132075471, "acc_stderr": 0.02890159361241178, "acc_norm": 0.6716981132075471, "acc_norm_stderr": 0.02890159361241178 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.6666666666666666, "acc_stderr": 0.03942082639927213, "acc_norm": 0.6666666666666666, "acc_norm_stderr": 0.03942082639927213 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.41, "acc_stderr": 0.049431107042371025, "acc_norm": 0.41, "acc_norm_stderr": 0.049431107042371025 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.51, "acc_stderr": 0.05024183937956912, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956912 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.35, "acc_stderr": 0.04793724854411019, "acc_norm": 0.35, "acc_norm_stderr": 0.04793724854411019 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6127167630057804, "acc_stderr": 0.03714325906302065, "acc_norm": 0.6127167630057804, "acc_norm_stderr": 0.03714325906302065 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.38235294117647056, "acc_stderr": 0.04835503696107223, "acc_norm": 0.38235294117647056, "acc_norm_stderr": 0.04835503696107223 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.77, "acc_stderr": 0.04229525846816506, "acc_norm": 0.77, "acc_norm_stderr": 0.04229525846816506 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5191489361702127, "acc_stderr": 0.032662042990646796, "acc_norm": 0.5191489361702127, "acc_norm_stderr": 0.032662042990646796 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.40350877192982454, "acc_stderr": 0.046151869625837026, "acc_norm": 0.40350877192982454, "acc_norm_stderr": 0.046151869625837026 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5586206896551724, "acc_stderr": 0.04137931034482757, "acc_norm": 0.5586206896551724, "acc_norm_stderr": 0.04137931034482757 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.3783068783068783, "acc_stderr": 0.02497695405315524, "acc_norm": 0.3783068783068783, "acc_norm_stderr": 0.02497695405315524 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.373015873015873, "acc_stderr": 0.04325506042017086, "acc_norm": 0.373015873015873, "acc_norm_stderr": 0.04325506042017086 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.42, "acc_stderr": 0.049604496374885836, "acc_norm": 0.42, "acc_norm_stderr": 0.049604496374885836 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7129032258064516, "acc_stderr": 0.025736542745594525, "acc_norm": 0.7129032258064516, "acc_norm_stderr": 0.025736542745594525 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.5024630541871922, "acc_stderr": 0.03517945038691063, "acc_norm": 0.5024630541871922, "acc_norm_stderr": 0.03517945038691063 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.65, "acc_stderr": 0.047937248544110196, "acc_norm": 0.65, "acc_norm_stderr": 0.047937248544110196 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7454545454545455, "acc_stderr": 0.03401506715249039, "acc_norm": 0.7454545454545455, "acc_norm_stderr": 0.03401506715249039 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7323232323232324, "acc_stderr": 0.03154449888270285, "acc_norm": 0.7323232323232324, "acc_norm_stderr": 0.03154449888270285 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.7875647668393783, "acc_stderr": 0.02951928261681723, "acc_norm": 0.7875647668393783, "acc_norm_stderr": 0.02951928261681723 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.5820512820512821, "acc_stderr": 0.02500732988246121, "acc_norm": 0.5820512820512821, "acc_norm_stderr": 0.02500732988246121 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3296296296296296, "acc_stderr": 0.028661201116524565, "acc_norm": 0.3296296296296296, "acc_norm_stderr": 0.028661201116524565 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6428571428571429, "acc_stderr": 0.031124619309328177, "acc_norm": 0.6428571428571429, "acc_norm_stderr": 0.031124619309328177 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.33112582781456956, "acc_stderr": 0.038425817186598696, "acc_norm": 0.33112582781456956, "acc_norm_stderr": 0.038425817186598696 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8110091743119267, "acc_stderr": 0.016785481159203624, "acc_norm": 0.8110091743119267, "acc_norm_stderr": 0.016785481159203624 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4398148148148148, "acc_stderr": 0.03385177976044811, "acc_norm": 0.4398148148148148, "acc_norm_stderr": 
0.03385177976044811 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.7303921568627451, "acc_stderr": 0.031145570659486782, "acc_norm": 0.7303921568627451, "acc_norm_stderr": 0.031145570659486782 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.729957805907173, "acc_stderr": 0.028900721906293426, "acc_norm": 0.729957805907173, "acc_norm_stderr": 0.028900721906293426 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6457399103139013, "acc_stderr": 0.032100621541349864, "acc_norm": 0.6457399103139013, "acc_norm_stderr": 0.032100621541349864 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.6564885496183206, "acc_stderr": 0.041649760719448786, "acc_norm": 0.6564885496183206, "acc_norm_stderr": 0.041649760719448786 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7768595041322314, "acc_stderr": 0.03800754475228733, "acc_norm": 0.7768595041322314, "acc_norm_stderr": 0.03800754475228733 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7777777777777778, "acc_stderr": 0.040191074725573483, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.040191074725573483 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.6932515337423313, "acc_stderr": 0.03623089915724147, "acc_norm": 0.6932515337423313, "acc_norm_stderr": 0.03623089915724147 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.44642857142857145, "acc_stderr": 0.04718471485219588, "acc_norm": 0.44642857142857145, "acc_norm_stderr": 0.04718471485219588 }, "harness|hendrycksTest-management|5": { "acc": 0.7475728155339806, "acc_stderr": 0.04301250399690878, "acc_norm": 0.7475728155339806, "acc_norm_stderr": 0.04301250399690878 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8247863247863247, "acc_stderr": 0.024904439098918242, "acc_norm": 0.8247863247863247, "acc_norm_stderr": 0.024904439098918242 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 
0.046056618647183814 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.776500638569604, "acc_stderr": 0.01489723522945071, "acc_norm": 0.776500638569604, "acc_norm_stderr": 0.01489723522945071 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.684971098265896, "acc_stderr": 0.025009313790069706, "acc_norm": 0.684971098265896, "acc_norm_stderr": 0.025009313790069706 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.38324022346368714, "acc_stderr": 0.016260159604429128, "acc_norm": 0.38324022346368714, "acc_norm_stderr": 0.016260159604429128 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.6535947712418301, "acc_stderr": 0.02724561304721536, "acc_norm": 0.6535947712418301, "acc_norm_stderr": 0.02724561304721536 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6688102893890675, "acc_stderr": 0.026730620728004906, "acc_norm": 0.6688102893890675, "acc_norm_stderr": 0.026730620728004906 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.6728395061728395, "acc_stderr": 0.026105673861409828, "acc_norm": 0.6728395061728395, "acc_norm_stderr": 0.026105673861409828 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4397163120567376, "acc_stderr": 0.029609912075594113, "acc_norm": 0.4397163120567376, "acc_norm_stderr": 0.029609912075594113 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4211212516297262, "acc_stderr": 0.012610325733489905, "acc_norm": 0.4211212516297262, "acc_norm_stderr": 0.012610325733489905 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.5992647058823529, "acc_stderr": 0.029768263528933105, "acc_norm": 0.5992647058823529, "acc_norm_stderr": 0.029768263528933105 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6143790849673203, "acc_stderr": 0.01969145905235403, "acc_norm": 0.6143790849673203, "acc_norm_stderr": 0.01969145905235403 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.5363636363636364, "acc_stderr": 0.04776449162396197, "acc_norm": 0.5363636363636364, 
"acc_norm_stderr": 0.04776449162396197 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.6612244897959184, "acc_stderr": 0.030299506562154185, "acc_norm": 0.6612244897959184, "acc_norm_stderr": 0.030299506562154185 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8059701492537313, "acc_stderr": 0.02796267760476892, "acc_norm": 0.8059701492537313, "acc_norm_stderr": 0.02796267760476892 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.81, "acc_stderr": 0.039427724440366255, "acc_norm": 0.81, "acc_norm_stderr": 0.039427724440366255 }, "harness|hendrycksTest-virology|5": { "acc": 0.5, "acc_stderr": 0.03892494720807614, "acc_norm": 0.5, "acc_norm_stderr": 0.03892494720807614 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.7953216374269005, "acc_stderr": 0.03094445977853321, "acc_norm": 0.7953216374269005, "acc_norm_stderr": 0.03094445977853321 }, "harness|truthfulqa:mc|0": { "mc1": 0.2802937576499388, "mc1_stderr": 0.015723139524608767, "mc2": 0.4145856205202453, "mc2_stderr": 0.015440318842858623 }, "harness|winogrande|5": { "acc": 0.7719021310181531, "acc_stderr": 0.01179301581766359 }, "harness|gsm8k|5": { "acc": 0.30856709628506446, "acc_stderr": 0.012723076049815884 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
--> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
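Provider:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51 on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run is stored as a separate split in every configuration, and the split is named after the run's timestamp.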
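  • The "latest" split always points to the most recent results.
  • An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.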
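
The naming scheme behind these configurations and splits can be sketched in a few lines of Python. The helper names `config_name_for` and `split_name_for` are hypothetical, and the substitution rules are inferred from the task keys, configuration names, and split names visible in this card's metadata; they are not taken from the leaderboard's own code.

```python
import re


def config_name_for(task_key: str) -> str:
    """Turn a results key such as 'harness|winogrande|5' into a config name.

    Assumption: every character that is not a letter, digit, or underscore
    ('|', '-', ':') becomes an underscore, as the names in this card suggest.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name_for(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name used for that run.

    Assumption: '-' and ':' become '_' while the fractional-second dot stays.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


if __name__ == "__main__":
    print(config_name_for("harness|hendrycksTest-college_physics|5"))
    print(split_name_for("2024-01-19T08:19:43.491405"))
```

Under these assumptions, a key from the results JSON maps directly to the configuration to request, and the run timestamp maps to the per-run split that sits next to `latest`.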

Data loading example

    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51", "harness_winogrande_5", split="latest")

Latest results

The latest results, from run 2024-01-19T08:19:43.491405, are:

python { "all": { "acc": 0.5956914302139066, "acc_stderr": 0.03334872123940361, "acc_norm": 0.6014998966219293, "acc_norm_stderr": 0.03404340642551802, "mc1": 0.2802937576499388, "mc1_stderr": 0.015723139524608767, "mc2": 0.4145856205202453, "mc2_stderr": 0.015440318842858623 }, "harness|arc:challenge|25": { "acc": 0.5648464163822525, "acc_stderr": 0.01448798619718604, "acc_norm": 0.5972696245733788, "acc_norm_stderr": 0.014332236306790149 }, "harness|hellaswag|10": { "acc": 0.6258713403704441, "acc_stderr": 0.0048290815328265015, "acc_norm": 0.8252340171280621, "acc_norm_stderr": 0.003789906792644689 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.5851851851851851, "acc_stderr": 0.04256193767901408, "acc_norm": 0.5851851851851851, "acc_norm_stderr": 0.04256193767901408 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.625, "acc_stderr": 0.039397364351956274, "acc_norm": 0.625, "acc_norm_stderr": 0.039397364351956274 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.51, "acc_stderr": 0.05024183937956912, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6716981132075471, "acc_stderr": 0.02890159361241178, "acc_norm": 0.6716981132075471, "acc_norm_stderr": 0.02890159361241178 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.6666666666666666, "acc_stderr": 0.03942082639927213, "acc_norm": 0.6666666666666666, "acc_norm_stderr": 0.03942082639927213 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.41, "acc_stderr": 0.049431107042371025, "acc_norm": 0.41, "acc_norm_stderr": 0.049431107042371025 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.51, "acc_stderr": 0.05024183937956912, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.35, 
"acc_stderr": 0.04793724854411019, "acc_norm": 0.35, "acc_norm_stderr": 0.04793724854411019 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6127167630057804, "acc_stderr": 0.03714325906302065, "acc_norm": 0.6127167630057804, "acc_norm_stderr": 0.03714325906302065 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.38235294117647056, "acc_stderr": 0.04835503696107223, "acc_norm": 0.38235294117647056, "acc_norm_stderr": 0.04835503696107223 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.77, "acc_stderr": 0.04229525846816506, "acc_norm": 0.77, "acc_norm_stderr": 0.04229525846816506 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5191489361702127, "acc_stderr": 0.032662042990646796, "acc_norm": 0.5191489361702127, "acc_norm_stderr": 0.032662042990646796 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.40350877192982454, "acc_stderr": 0.046151869625837026, "acc_norm": 0.40350877192982454, "acc_norm_stderr": 0.046151869625837026 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5586206896551724, "acc_stderr": 0.04137931034482757, "acc_norm": 0.5586206896551724, "acc_norm_stderr": 0.04137931034482757 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.3783068783068783, "acc_stderr": 0.02497695405315524, "acc_norm": 0.3783068783068783, "acc_norm_stderr": 0.02497695405315524 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.373015873015873, "acc_stderr": 0.04325506042017086, "acc_norm": 0.373015873015873, "acc_norm_stderr": 0.04325506042017086 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.42, "acc_stderr": 0.049604496374885836, "acc_norm": 0.42, "acc_norm_stderr": 0.049604496374885836 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7129032258064516, "acc_stderr": 0.025736542745594525, "acc_norm": 0.7129032258064516, "acc_norm_stderr": 0.025736542745594525 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5024630541871922, "acc_stderr": 0.03517945038691063, "acc_norm": 
0.5024630541871922, "acc_norm_stderr": 0.03517945038691063 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.65, "acc_stderr": 0.047937248544110196, "acc_norm": 0.65, "acc_norm_stderr": 0.047937248544110196 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7454545454545455, "acc_stderr": 0.03401506715249039, "acc_norm": 0.7454545454545455, "acc_norm_stderr": 0.03401506715249039 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7323232323232324, "acc_stderr": 0.03154449888270285, "acc_norm": 0.7323232323232324, "acc_norm_stderr": 0.03154449888270285 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.7875647668393783, "acc_stderr": 0.02951928261681723, "acc_norm": 0.7875647668393783, "acc_norm_stderr": 0.02951928261681723 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.5820512820512821, "acc_stderr": 0.02500732988246121, "acc_norm": 0.5820512820512821, "acc_norm_stderr": 0.02500732988246121 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3296296296296296, "acc_stderr": 0.028661201116524565, "acc_norm": 0.3

Collected Summary
Dataset Introduction
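Construction
This dataset is an automated product of the Open LLM Leaderboard. It was built from a systematic evaluation of the model 'ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51' within the leaderboard framework. The dataset was generated automatically from a single complete evaluation run, with each of the 63 evaluation tasks mapped to its own configuration. Within each configuration, the results of a run are stored as a split named after the run's timestamp, and the "train" split always points to the latest results. A dedicated "results" configuration aggregates the metrics of all runs and is used to compute the leaderboard's aggregated metrics.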
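The convention that the latest run is the one with the most recent timestamp can be illustrated with a small sketch that picks the newest run from results file names. The first file name is the one linked from the dataset card; the second is a hypothetical earlier run added only for illustration (the card reports a single run), and `parse_run_timestamp` and `latest_run` are illustrative helpers, not part of any library.

```python
from datetime import datetime

# File-name pattern inferred from the one results file linked on the dataset card.
RESULTS_PATTERN = "results_%Y-%m-%dT%H-%M-%S.%f.json"


def parse_run_timestamp(filename):
    """Parse the run timestamp embedded in a results file name."""
    return datetime.strptime(filename, RESULTS_PATTERN)


def latest_run(filenames):
    """Return the file name whose embedded timestamp is the most recent."""
    return max(filenames, key=parse_run_timestamp)


files = [
    "results_2024-01-19T08-19-43.491405.json",  # from the dataset card
    "results_2024-01-18T10-00-00.000000.json",  # hypothetical earlier run
]
print(latest_run(files))
```

Parsing the timestamp rather than comparing strings keeps the sketch correct even if file names were to differ in zero-padding.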
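Features
The dataset's defining feature is its fine-grained structure as a quantitative record of model capability. It systematically records the model's performance on a broad range of cognitive tasks, spanning commonsense reasoning, subject knowledge, and mathematical computation. The dataset is organized into multiple configurations, each corresponding to one evaluation task, such as ARC Challenge, HellaSwag, or GSM8K, which lets researchers analyze performance task by task. Its design keeps evaluation results traceable: every run is saved as a separate split, which makes it straightforward to compare results over time and to reproduce them, and provides a structured basis for diagnosing model capability in depth.
Usage
To use the dataset for evaluation analysis, researchers can load it with the Hugging Face `datasets` library. The typical approach is to call `load_dataset` with the dataset name, the target configuration (for example `harness_winogrande_5`), and the desired split (for example `train` for the latest results). This retrieves the detailed records of the model's outputs and the performance metrics for that task. The structured data can then be used to produce analysis reports, to compare models, or as empirical evidence when examining a model's strengths and limitations in a particular capability.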
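The loading step can be sketched as follows. The repository name and the `harness_winogrande_5` configuration are taken from the dataset card's own snippet; this page lists the repository under `open-llm-leaderboard-old`, so the name may need adjusting. The `config_name` helper and its rule for turning a results key such as `harness|winogrande|5` into a configuration name are assumptions for illustration, not documented behaviour. Loading requires the `datasets` package and network access.

```python
import re

# Repository name as written in the dataset card's loading snippet.
REPO_ID = "open-llm-leaderboard/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51"


def config_name(results_key):
    """Turn a results key into a configuration name.

    Assumption: every character outside [A-Za-z0-9] becomes "_".
    Only "harness_winogrande_5" is confirmed by the dataset card.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", results_key)


def load_task_details(results_key, split="train"):
    """Load the details for one task; "train" points to the latest run."""
    # Imported lazily so the helper above works without the package installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(results_key), split=split)
```

Calling `load_task_details("harness|winogrande|5")` would then request the same configuration and split as the snippet in the dataset card.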
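Background and Challenges
Background Overview
Amid the rapid development of large language models (LLMs), evaluating their overall capability has become key to driving progress. The Open LLM Leaderboard hosted on Hugging Face is an evaluation system built to measure the performance of different models in a systematic, standardized way. The dataset "open-llm-leaderboard-old/details_ewqr2130__alignment-handbook-zephyr-7b_ppo_5e7step_51" is part of that leaderboard. It was generated automatically in January 2024 from an evaluation run of the model ewqr2130/alignment-handbook-zephyr-7b_ppo_5e7step_51 and records that model's detailed results across diverse benchmarks. It covers 63 tasks, including ARC Challenge, HellaSwag, MMLU, and TruthfulQA, and provides empirical data for the core research question of how to objectively quantify a model's performance in commonsense reasoning, specialized knowledge, mathematical computation, and truthfulness. This makes it a useful reference for research on model alignment and optimization.
Current Challenges
The domain challenge this dataset addresses is that evaluating large language models is itself a complex, multidimensional problem. A model must show robust generalization, deep domain knowledge, and reliable reasoning across a wide range of tasks, yet existing benchmarks often fail to capture subtle weaknesses in real-world settings, such as fluctuating performance on specialized subjects (for example advanced mathematics or formal logic) or potential bias in the truthfulness of generated content. On the construction side, the challenge lies in automating the evaluation pipeline and integrating the data. Ensuring that results from different tasks and different runs are collected accurately, stored with versioning, and presented in a uniform format requires careful engineering and data management, so that no information is lost or mixed up and the results stay reliable and reproducible.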
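Common Scenarios
Classic Use Cases
This dataset holds the results of an Open LLM Leaderboard evaluation run, and its classic use is to give researchers a fine-grained view of model performance. Because it covers multiple benchmarks, including ARC Challenge, HellaSwag, MMLU, and TruthfulQA, it supports side-by-side comparison of models on commonsense reasoning, knowledge question answering, and truthfulness, providing empirical grounds for model optimization and selection.
Practical Applications
In practice, the dataset gives model developers and enterprise users key decision support. By analyzing the model's performance on specific tasks (such as mathematical reasoning or specialized domain knowledge), a team can identify its strengths and weaknesses and then fine-tune or deploy it accordingly. For example, when building a question-answering system that requires high reliability, the truthfulness results in the dataset can be used to screen for a suitable model.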
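As a minimal sketch of this kind of strengths-and-weaknesses analysis, the snippet below ranks a handful of tasks by accuracy. The accuracies and standard errors are copied from the results shown in the dataset card; the `rank_tasks` helper and the choice of a rough 95% interval (accuracy plus or minus 1.96 standard errors, a normal approximation) are illustrative assumptions rather than the leaderboard's own method.

```python
# Per-task metrics copied from the results listed in the dataset card.
results = {
    "harness|hendrycksTest-computer_security|5": {
        "acc": 0.77, "acc_stderr": 0.04229525846816506},
    "harness|hendrycksTest-high_school_government_and_politics|5": {
        "acc": 0.7875647668393783, "acc_stderr": 0.02951928261681723},
    "harness|hendrycksTest-college_physics|5": {
        "acc": 0.38235294117647056, "acc_stderr": 0.04835503696107223},
    "harness|hendrycksTest-formal_logic|5": {
        "acc": 0.373015873015873, "acc_stderr": 0.04325506042017086},
    "harness|hendrycksTest-high_school_mathematics|5": {
        "acc": 0.3296296296296296, "acc_stderr": 0.028661201116524565},
}


def rank_tasks(task_results):
    """Return (task, acc, stderr) tuples sorted from weakest to strongest."""
    rows = [(task, m["acc"], m["acc_stderr"]) for task, m in task_results.items()]
    return sorted(rows, key=lambda row: row[1])


ranked = rank_tasks(results)
for task, acc, stderr in ranked:
    # Rough 95% interval under a normal approximation (an assumption).
    print(f"{task:62s} {acc:.3f} +/- {1.96 * stderr:.3f}")
```

Sorting in ascending order puts the weakest tasks first, which is usually where fine-tuning effort would be directed.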
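Derived and Related Work
Work derived from this dataset centers on improving evaluation methods and analyzing model capability in more depth. For example, follow-up research building on the Open LLM Leaderboard evaluation framework has proposed finer-grained task decomposition, construction of adversarial test sets, and analysis of evaluation bias. This work has enriched the evaluation ecosystem for large language models and pushed evaluation standards toward being more comprehensive and fairer.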
The content above was collected and summarized by 遇见数据集.