
open-llm-leaderboard-old/details_222gate__BrurryDog-7b-v0.1

Hugging Face | Updated 2024-01-20 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_222gate__BrurryDog-7b-v0.1
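The dataset card for this repository names one configuration per evaluated task and one split per run timestamp. The sketch below shows how those names appear to be derived, judging only from the examples visible in the card (`harness|arc:challenge|25` alongside `harness_arc_challenge_25`, and run `2024-01-20T03:26:36.549937` alongside split `2024_01_20T03_26_36.549937`). The helper names `config_name` and `split_name` are hypothetical, and the substitution rules are an assumption inferred from those examples, not a documented API.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: every character other than letters, digits and '_' is
    replaced with '_', which matches the examples shown in the card.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-01-20T03:26:36.549937' to a split name.

    Assumption: '-' and ':' become '_' while the '.' before the
    fractional seconds is kept, as in the card's split names.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


print(config_name("harness|arc:challenge|25"))
print(split_name("2024-01-20T03:26:36.549937"))
```

The resulting strings are the kind of values one would pass as the configuration name and `split` argument when loading a single task's details with `datasets.load_dataset`.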
Resource description:
--- pretty_name: Evaluation run of 222gate/BrurryDog-7b-v0.1 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [222gate/BrurryDog-7b-v0.1](https://huggingface.co/222gate/BrurryDog-7b-v0.1)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_222gate__BrurryDog-7b-v0.1\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-20T03:26:36.549937](https://huggingface.co/datasets/open-llm-leaderboard/details_222gate__BrurryDog-7b-v0.1/blob/main/results_2024-01-20T03-26-36.549937.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each one in the results and in the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6527842705257364,\n\ \ \"acc_stderr\": 0.03215067286738653,\n \"acc_norm\": 0.652698073988589,\n\ \ \"acc_norm_stderr\": 0.03281396676337054,\n \"mc1\": 0.5777233782129743,\n\ \ \"mc1_stderr\": 0.017290733254248177,\n \"mc2\": 0.7004617737856811,\n\ \ \"mc2_stderr\": 0.01511981164818303\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.7005119453924915,\n \"acc_stderr\": 0.01338502163731357,\n\ \ \"acc_norm\": 0.7252559726962458,\n \"acc_norm_stderr\": 0.013044617212771227\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.7216689902409879,\n\ \ \"acc_stderr\": 0.004472613148508909,\n \"acc_norm\": 0.8836885082652858,\n\ \ \"acc_norm_stderr\": 0.0031994286759858682\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6370370370370371,\n\ \ \"acc_stderr\": 0.041539484047423976,\n \"acc_norm\": 0.6370370370370371,\n\ \ \"acc_norm_stderr\": 0.041539484047423976\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.6973684210526315,\n \"acc_stderr\": 0.03738520676119669,\n\ \ \"acc_norm\": 0.6973684210526315,\n \"acc_norm_stderr\": 0.03738520676119669\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.65,\n\ \ \"acc_stderr\": 0.0479372485441102,\n \"acc_norm\": 0.65,\n \ \ \"acc_norm_stderr\": 0.0479372485441102\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7245283018867924,\n \"acc_stderr\": 0.027495663683724057,\n\ \ \"acc_norm\": 0.7245283018867924,\n \"acc_norm_stderr\": 0.027495663683724057\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7708333333333334,\n\ \ \"acc_stderr\": 0.03514697467862388,\n \"acc_norm\": 0.7708333333333334,\n\ \ \"acc_norm_stderr\": 0.03514697467862388\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.45,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.45,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-college_computer_science|5\"\ : {\n \"acc\": 0.51,\n \"acc_stderr\": 0.05024183937956911,\n \ \ \"acc_norm\": 0.51,\n \"acc_norm_stderr\": 0.05024183937956911\n \ \ },\n \"harness|hendrycksTest-college_mathematics|5\": {\n \"acc\": 0.32,\n\ \ \"acc_stderr\": 0.04688261722621504,\n \"acc_norm\": 0.32,\n \ \ \"acc_norm_stderr\": 0.04688261722621504\n },\n \"harness|hendrycksTest-college_medicine|5\"\ : {\n \"acc\": 0.6763005780346821,\n \"acc_stderr\": 0.035676037996391706,\n\ \ \"acc_norm\": 0.6763005780346821,\n \"acc_norm_stderr\": 0.035676037996391706\n\ \ },\n \"harness|hendrycksTest-college_physics|5\": {\n \"acc\": 0.4215686274509804,\n\ \ \"acc_stderr\": 0.04913595201274498,\n \"acc_norm\": 0.4215686274509804,\n\ \ \"acc_norm_stderr\": 0.04913595201274498\n },\n \"harness|hendrycksTest-computer_security|5\"\ : {\n \"acc\": 0.77,\n \"acc_stderr\": 0.04229525846816507,\n \ \ \"acc_norm\": 0.77,\n \"acc_norm_stderr\": 0.04229525846816507\n \ \ },\n \"harness|hendrycksTest-conceptual_physics|5\": {\n \"acc\": 0.5914893617021276,\n\ \ \"acc_stderr\": 0.032134180267015755,\n \"acc_norm\": 0.5914893617021276,\n\ \ \"acc_norm_stderr\": 0.032134180267015755\n },\n \"harness|hendrycksTest-econometrics|5\"\ : {\n \"acc\": 0.47368421052631576,\n \"acc_stderr\": 0.046970851366478626,\n\ \ \"acc_norm\": 0.47368421052631576,\n \"acc_norm_stderr\": 0.046970851366478626\n\ \ },\n \"harness|hendrycksTest-electrical_engineering|5\": {\n \"acc\"\ : 0.5724137931034483,\n \"acc_stderr\": 0.04122737111370333,\n \"\ acc_norm\": 0.5724137931034483,\n \"acc_norm_stderr\": 0.04122737111370333\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.41798941798941797,\n \"acc_stderr\": 0.025402555503260912,\n \"\ acc_norm\": 0.41798941798941797,\n \"acc_norm_stderr\": 
0.025402555503260912\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4444444444444444,\n\ \ \"acc_stderr\": 0.04444444444444449,\n \"acc_norm\": 0.4444444444444444,\n\ \ \"acc_norm_stderr\": 0.04444444444444449\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.37,\n \"acc_stderr\": 0.048523658709391,\n \ \ \"acc_norm\": 0.37,\n \"acc_norm_stderr\": 0.048523658709391\n },\n\ \ \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7838709677419354,\n\ \ \"acc_stderr\": 0.02341529343356852,\n \"acc_norm\": 0.7838709677419354,\n\ \ \"acc_norm_stderr\": 0.02341529343356852\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5123152709359606,\n \"acc_stderr\": 0.035169204442208966,\n\ \ \"acc_norm\": 0.5123152709359606,\n \"acc_norm_stderr\": 0.035169204442208966\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.68,\n \"acc_stderr\": 0.04688261722621505,\n \"acc_norm\"\ : 0.68,\n \"acc_norm_stderr\": 0.04688261722621505\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7878787878787878,\n \"acc_stderr\": 0.03192271569548301,\n\ \ \"acc_norm\": 0.7878787878787878,\n \"acc_norm_stderr\": 0.03192271569548301\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7828282828282829,\n \"acc_stderr\": 0.02937661648494563,\n \"\ acc_norm\": 0.7828282828282829,\n \"acc_norm_stderr\": 0.02937661648494563\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.9015544041450777,\n \"acc_stderr\": 0.021500249576033477,\n\ \ \"acc_norm\": 0.9015544041450777,\n \"acc_norm_stderr\": 0.021500249576033477\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6743589743589744,\n \"acc_stderr\": 0.02375966576741229,\n \ \ \"acc_norm\": 0.6743589743589744,\n \"acc_norm_stderr\": 0.02375966576741229\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ 
acc\": 0.3148148148148148,\n \"acc_stderr\": 0.02831753349606649,\n \ \ \"acc_norm\": 0.3148148148148148,\n \"acc_norm_stderr\": 0.02831753349606649\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6764705882352942,\n \"acc_stderr\": 0.03038835355188679,\n \ \ \"acc_norm\": 0.6764705882352942,\n \"acc_norm_stderr\": 0.03038835355188679\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3841059602649007,\n \"acc_stderr\": 0.03971301814719197,\n \"\ acc_norm\": 0.3841059602649007,\n \"acc_norm_stderr\": 0.03971301814719197\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8458715596330275,\n \"acc_stderr\": 0.015480826865374303,\n \"\ acc_norm\": 0.8458715596330275,\n \"acc_norm_stderr\": 0.015480826865374303\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5509259259259259,\n \"acc_stderr\": 0.03392238405321617,\n \"\ acc_norm\": 0.5509259259259259,\n \"acc_norm_stderr\": 0.03392238405321617\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8431372549019608,\n \"acc_stderr\": 0.02552472232455334,\n \"\ acc_norm\": 0.8431372549019608,\n \"acc_norm_stderr\": 0.02552472232455334\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7974683544303798,\n \"acc_stderr\": 0.026160568246601446,\n \ \ \"acc_norm\": 0.7974683544303798,\n \"acc_norm_stderr\": 0.026160568246601446\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6860986547085202,\n\ \ \"acc_stderr\": 0.031146796482972465,\n \"acc_norm\": 0.6860986547085202,\n\ \ \"acc_norm_stderr\": 0.031146796482972465\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7938931297709924,\n \"acc_stderr\": 0.03547771004159464,\n\ \ \"acc_norm\": 0.7938931297709924,\n \"acc_norm_stderr\": 0.03547771004159464\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.7768595041322314,\n \"acc_stderr\": 
0.03800754475228732,\n \"\ acc_norm\": 0.7768595041322314,\n \"acc_norm_stderr\": 0.03800754475228732\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7777777777777778,\n\ \ \"acc_stderr\": 0.0401910747255735,\n \"acc_norm\": 0.7777777777777778,\n\ \ \"acc_norm_stderr\": 0.0401910747255735\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7668711656441718,\n \"acc_stderr\": 0.0332201579577674,\n\ \ \"acc_norm\": 0.7668711656441718,\n \"acc_norm_stderr\": 0.0332201579577674\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.4642857142857143,\n\ \ \"acc_stderr\": 0.04733667890053756,\n \"acc_norm\": 0.4642857142857143,\n\ \ \"acc_norm_stderr\": 0.04733667890053756\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7572815533980582,\n \"acc_stderr\": 0.04245022486384495,\n\ \ \"acc_norm\": 0.7572815533980582,\n \"acc_norm_stderr\": 0.04245022486384495\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8803418803418803,\n\ \ \"acc_stderr\": 0.021262719400406964,\n \"acc_norm\": 0.8803418803418803,\n\ \ \"acc_norm_stderr\": 0.021262719400406964\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.68,\n \"acc_stderr\": 0.04688261722621504,\n \ \ \"acc_norm\": 0.68,\n \"acc_norm_stderr\": 0.04688261722621504\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8339719029374202,\n\ \ \"acc_stderr\": 0.013306478243066302,\n \"acc_norm\": 0.8339719029374202,\n\ \ \"acc_norm_stderr\": 0.013306478243066302\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7427745664739884,\n \"acc_stderr\": 0.023532925431044283,\n\ \ \"acc_norm\": 0.7427745664739884,\n \"acc_norm_stderr\": 0.023532925431044283\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4480446927374302,\n\ \ \"acc_stderr\": 0.016631976628930595,\n \"acc_norm\": 0.4480446927374302,\n\ \ \"acc_norm_stderr\": 0.016631976628930595\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7091503267973857,\n \"acc_stderr\": 0.02600480036395213,\n\ \ \"acc_norm\": 0.7091503267973857,\n \"acc_norm_stderr\": 0.02600480036395213\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7170418006430869,\n\ \ \"acc_stderr\": 0.02558306248998481,\n \"acc_norm\": 0.7170418006430869,\n\ \ \"acc_norm_stderr\": 0.02558306248998481\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7469135802469136,\n \"acc_stderr\": 0.024191808600713,\n\ \ \"acc_norm\": 0.7469135802469136,\n \"acc_norm_stderr\": 0.024191808600713\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5,\n \"acc_stderr\": 0.029827499313594685,\n \"acc_norm\"\ : 0.5,\n \"acc_norm_stderr\": 0.029827499313594685\n },\n \"harness|hendrycksTest-professional_law|5\"\ : {\n \"acc\": 0.4654498044328553,\n \"acc_stderr\": 0.0127397115540457,\n\ \ \"acc_norm\": 0.4654498044328553,\n \"acc_norm_stderr\": 0.0127397115540457\n\ \ },\n \"harness|hendrycksTest-professional_medicine|5\": {\n \"acc\"\ : 0.6691176470588235,\n \"acc_stderr\": 0.02858270975389845,\n \"\ acc_norm\": 0.6691176470588235,\n \"acc_norm_stderr\": 0.02858270975389845\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6617647058823529,\n \"acc_stderr\": 0.019139943748487046,\n \ \ \"acc_norm\": 0.6617647058823529,\n \"acc_norm_stderr\": 0.019139943748487046\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\ \ \"acc_stderr\": 0.04461272175910509,\n \"acc_norm\": 0.6818181818181818,\n\ \ \"acc_norm_stderr\": 0.04461272175910509\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7551020408163265,\n \"acc_stderr\": 0.02752963744017493,\n\ \ \"acc_norm\": 0.7551020408163265,\n \"acc_norm_stderr\": 0.02752963744017493\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8258706467661692,\n\ \ \"acc_stderr\": 0.026814951200421603,\n \"acc_norm\": 
0.8258706467661692,\n\ \ \"acc_norm_stderr\": 0.026814951200421603\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.84,\n \"acc_stderr\": 0.03684529491774709,\n \ \ \"acc_norm\": 0.84,\n \"acc_norm_stderr\": 0.03684529491774709\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.536144578313253,\n\ \ \"acc_stderr\": 0.038823108508905954,\n \"acc_norm\": 0.536144578313253,\n\ \ \"acc_norm_stderr\": 0.038823108508905954\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8304093567251462,\n \"acc_stderr\": 0.02878210810540171,\n\ \ \"acc_norm\": 0.8304093567251462,\n \"acc_norm_stderr\": 0.02878210810540171\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.5777233782129743,\n\ \ \"mc1_stderr\": 0.017290733254248177,\n \"mc2\": 0.7004617737856811,\n\ \ \"mc2_stderr\": 0.01511981164818303\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8287292817679558,\n \"acc_stderr\": 0.010588417294962524\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.66868840030326,\n \ \ \"acc_stderr\": 0.012964999679688664\n }\n}\n```" repo_url: https://huggingface.co/222gate/BrurryDog-7b-v0.1 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|arc:challenge|25_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-20T03-26-36.549937.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|gsm8k|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hellaswag|10_2024-01-20T03-26-36.549937.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T03-26-36.549937.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T03-26-36.549937.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T03-26-36.549937.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T03-26-36.549937.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T03-26-36.549937.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-20T03-26-36.549937.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T03-26-36.549937.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-management|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-20T03-26-36.549937.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|truthfulqa:mc|0_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-20T03-26-36.549937.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_20T03_26_36.549937 path: - '**/details_harness|winogrande|5_2024-01-20T03-26-36.549937.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-20T03-26-36.549937.parquet' - config_name: results data_files: - split: 
2024_01_20T03_26_36.549937 path: - results_2024-01-20T03-26-36.549937.parquet - split: latest path: - results_2024-01-20T03-26-36.549937.parquet
---

# Dataset Card for Evaluation run of 222gate/BrurryDog-7b-v0.1

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [222gate/BrurryDog-7b-v0.1](https://huggingface.co/222gate/BrurryDog-7b-v0.1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_222gate__BrurryDog-7b-v0.1",
	"harness_winogrande_5",
	split="train")
```

## Latest results

These are the [latest results from run 2024-01-20T03:26:36.549937](https://huggingface.co/datasets/open-llm-leaderboard/details_222gate__BrurryDog-7b-v0.1/blob/main/results_2024-01-20T03-26-36.549937.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6527842705257364, "acc_stderr": 0.03215067286738653, "acc_norm": 0.652698073988589, "acc_norm_stderr": 0.03281396676337054, "mc1": 0.5777233782129743, "mc1_stderr": 0.017290733254248177, "mc2": 0.7004617737856811, "mc2_stderr": 0.01511981164818303 }, "harness|arc:challenge|25": { "acc": 0.7005119453924915, "acc_stderr": 0.01338502163731357, "acc_norm": 0.7252559726962458, "acc_norm_stderr": 0.013044617212771227 }, "harness|hellaswag|10": { "acc": 0.7216689902409879, "acc_stderr": 0.004472613148508909, "acc_norm": 0.8836885082652858, "acc_norm_stderr": 0.0031994286759858682 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6370370370370371, "acc_stderr": 0.041539484047423976, "acc_norm": 0.6370370370370371, "acc_norm_stderr": 0.041539484047423976 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.6973684210526315, "acc_stderr": 0.03738520676119669, "acc_norm": 0.6973684210526315, "acc_norm_stderr": 0.03738520676119669 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.65, "acc_stderr": 0.0479372485441102, "acc_norm": 0.65, "acc_norm_stderr": 0.0479372485441102 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7245283018867924, "acc_stderr": 0.027495663683724057, "acc_norm": 0.7245283018867924, "acc_norm_stderr": 0.027495663683724057 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7708333333333334, "acc_stderr": 0.03514697467862388, "acc_norm": 0.7708333333333334, "acc_norm_stderr": 0.03514697467862388 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.45, "acc_stderr": 0.05, "acc_norm": 0.45, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.51, "acc_stderr": 0.05024183937956911, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956911 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.32, "acc_stderr": 0.04688261722621504, "acc_norm": 0.32, "acc_norm_stderr": 0.04688261722621504 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6763005780346821, "acc_stderr": 0.035676037996391706, "acc_norm": 0.6763005780346821, "acc_norm_stderr": 0.035676037996391706 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4215686274509804, "acc_stderr": 0.04913595201274498, "acc_norm": 0.4215686274509804, "acc_norm_stderr": 0.04913595201274498 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.77, "acc_stderr": 0.04229525846816507, "acc_norm": 0.77, "acc_norm_stderr": 0.04229525846816507 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5914893617021276, "acc_stderr": 0.032134180267015755, "acc_norm": 0.5914893617021276, "acc_norm_stderr": 0.032134180267015755 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.47368421052631576, "acc_stderr": 0.046970851366478626, "acc_norm": 0.47368421052631576, "acc_norm_stderr": 0.046970851366478626 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5724137931034483, "acc_stderr": 0.04122737111370333, "acc_norm": 0.5724137931034483, "acc_norm_stderr": 0.04122737111370333 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.41798941798941797, "acc_stderr": 0.025402555503260912, "acc_norm": 0.41798941798941797, "acc_norm_stderr": 0.025402555503260912 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4444444444444444, "acc_stderr": 0.04444444444444449, "acc_norm": 0.4444444444444444, "acc_norm_stderr": 0.04444444444444449 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.37, "acc_stderr": 0.048523658709391, "acc_norm": 0.37, "acc_norm_stderr": 0.048523658709391 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7838709677419354, "acc_stderr": 0.02341529343356852, "acc_norm": 0.7838709677419354, "acc_norm_stderr": 0.02341529343356852 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.5123152709359606, "acc_stderr": 0.035169204442208966, "acc_norm": 0.5123152709359606, "acc_norm_stderr": 0.035169204442208966 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.68, "acc_stderr": 0.04688261722621505, "acc_norm": 0.68, "acc_norm_stderr": 0.04688261722621505 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7878787878787878, "acc_stderr": 0.03192271569548301, "acc_norm": 0.7878787878787878, "acc_norm_stderr": 0.03192271569548301 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7828282828282829, "acc_stderr": 0.02937661648494563, "acc_norm": 0.7828282828282829, "acc_norm_stderr": 0.02937661648494563 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.9015544041450777, "acc_stderr": 0.021500249576033477, "acc_norm": 0.9015544041450777, "acc_norm_stderr": 0.021500249576033477 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6743589743589744, "acc_stderr": 0.02375966576741229, "acc_norm": 0.6743589743589744, "acc_norm_stderr": 0.02375966576741229 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3148148148148148, "acc_stderr": 0.02831753349606649, "acc_norm": 0.3148148148148148, "acc_norm_stderr": 0.02831753349606649 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6764705882352942, "acc_stderr": 0.03038835355188679, "acc_norm": 0.6764705882352942, "acc_norm_stderr": 0.03038835355188679 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3841059602649007, "acc_stderr": 0.03971301814719197, "acc_norm": 0.3841059602649007, "acc_norm_stderr": 0.03971301814719197 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8458715596330275, "acc_stderr": 0.015480826865374303, "acc_norm": 0.8458715596330275, "acc_norm_stderr": 0.015480826865374303 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5509259259259259, "acc_stderr": 0.03392238405321617, "acc_norm": 0.5509259259259259, "acc_norm_stderr": 
0.03392238405321617 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8431372549019608, "acc_stderr": 0.02552472232455334, "acc_norm": 0.8431372549019608, "acc_norm_stderr": 0.02552472232455334 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7974683544303798, "acc_stderr": 0.026160568246601446, "acc_norm": 0.7974683544303798, "acc_norm_stderr": 0.026160568246601446 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6860986547085202, "acc_stderr": 0.031146796482972465, "acc_norm": 0.6860986547085202, "acc_norm_stderr": 0.031146796482972465 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7938931297709924, "acc_stderr": 0.03547771004159464, "acc_norm": 0.7938931297709924, "acc_norm_stderr": 0.03547771004159464 }, "harness|hendrycksTest-international_law|5": { "acc": 0.7768595041322314, "acc_stderr": 0.03800754475228732, "acc_norm": 0.7768595041322314, "acc_norm_stderr": 0.03800754475228732 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7777777777777778, "acc_stderr": 0.0401910747255735, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.0401910747255735 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7668711656441718, "acc_stderr": 0.0332201579577674, "acc_norm": 0.7668711656441718, "acc_norm_stderr": 0.0332201579577674 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.4642857142857143, "acc_stderr": 0.04733667890053756, "acc_norm": 0.4642857142857143, "acc_norm_stderr": 0.04733667890053756 }, "harness|hendrycksTest-management|5": { "acc": 0.7572815533980582, "acc_stderr": 0.04245022486384495, "acc_norm": 0.7572815533980582, "acc_norm_stderr": 0.04245022486384495 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8803418803418803, "acc_stderr": 0.021262719400406964, "acc_norm": 0.8803418803418803, "acc_norm_stderr": 0.021262719400406964 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.68, "acc_stderr": 0.04688261722621504, "acc_norm": 0.68, "acc_norm_stderr": 0.04688261722621504 }, 
"harness|hendrycksTest-miscellaneous|5": { "acc": 0.8339719029374202, "acc_stderr": 0.013306478243066302, "acc_norm": 0.8339719029374202, "acc_norm_stderr": 0.013306478243066302 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7427745664739884, "acc_stderr": 0.023532925431044283, "acc_norm": 0.7427745664739884, "acc_norm_stderr": 0.023532925431044283 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.4480446927374302, "acc_stderr": 0.016631976628930595, "acc_norm": 0.4480446927374302, "acc_norm_stderr": 0.016631976628930595 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7091503267973857, "acc_stderr": 0.02600480036395213, "acc_norm": 0.7091503267973857, "acc_norm_stderr": 0.02600480036395213 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7170418006430869, "acc_stderr": 0.02558306248998481, "acc_norm": 0.7170418006430869, "acc_norm_stderr": 0.02558306248998481 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7469135802469136, "acc_stderr": 0.024191808600713, "acc_norm": 0.7469135802469136, "acc_norm_stderr": 0.024191808600713 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5, "acc_stderr": 0.029827499313594685, "acc_norm": 0.5, "acc_norm_stderr": 0.029827499313594685 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4654498044328553, "acc_stderr": 0.0127397115540457, "acc_norm": 0.4654498044328553, "acc_norm_stderr": 0.0127397115540457 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6691176470588235, "acc_stderr": 0.02858270975389845, "acc_norm": 0.6691176470588235, "acc_norm_stderr": 0.02858270975389845 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6617647058823529, "acc_stderr": 0.019139943748487046, "acc_norm": 0.6617647058823529, "acc_norm_stderr": 0.019139943748487046 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6818181818181818, "acc_stderr": 0.04461272175910509, "acc_norm": 0.6818181818181818, "acc_norm_stderr": 0.04461272175910509 }, 
"harness|hendrycksTest-security_studies|5": { "acc": 0.7551020408163265, "acc_stderr": 0.02752963744017493, "acc_norm": 0.7551020408163265, "acc_norm_stderr": 0.02752963744017493 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8258706467661692, "acc_stderr": 0.026814951200421603, "acc_norm": 0.8258706467661692, "acc_norm_stderr": 0.026814951200421603 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.84, "acc_stderr": 0.03684529491774709, "acc_norm": 0.84, "acc_norm_stderr": 0.03684529491774709 }, "harness|hendrycksTest-virology|5": { "acc": 0.536144578313253, "acc_stderr": 0.038823108508905954, "acc_norm": 0.536144578313253, "acc_norm_stderr": 0.038823108508905954 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8304093567251462, "acc_stderr": 0.02878210810540171, "acc_norm": 0.8304093567251462, "acc_norm_stderr": 0.02878210810540171 }, "harness|truthfulqa:mc|0": { "mc1": 0.5777233782129743, "mc1_stderr": 0.017290733254248177, "mc2": 0.7004617737856811, "mc2_stderr": 0.01511981164818303 }, "harness|winogrande|5": { "acc": 0.8287292817679558, "acc_stderr": 0.010588417294962524 }, "harness|gsm8k|5": { "acc": 0.66868840030326, "acc_stderr": 0.012964999679688664 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
Provided by:
open-llm-leaderboard-old
Summary of original information

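Dataset overview

Dataset introduction

This dataset was created automatically during an evaluation run of the model 222gate/BrurryDog-7b-v0.1; the evaluation results are published on the Open LLM Leaderboard.

Dataset structure

  • Number of configurations: 63, each corresponding to one evaluated task.
  • Number of runs: the dataset comes from 1 run. Each run is stored as a specific split in every configuration, and the split is named after the run's timestamp.
  • Latest results: the "train" split always points to the latest results.
  • Aggregated results: an additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregate metrics on the Open LLM Leaderboard.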
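
The naming scheme described above can be sketched in a few lines. This is only an illustration inferred from the configuration names, split names, and file paths listed in this card's YAML metadata; the helper names `config_name`, `split_name`, and `details_file` are hypothetical and are not taken from the leaderboard's own code.

```python
import re

RUN_TIMESTAMP = "2024-01-20T03:26:36.549937"  # run timestamp quoted in this card


def config_name(task_key: str) -> str:
    """Turn a harness task key into a configuration name.

    Assumption: every character other than a letter, digit, or underscore
    becomes an underscore, which matches the names listed in the card.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name ('-' and ':' become '_')."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def details_file(task_key: str, run_timestamp: str) -> str:
    """Build the Parquet file name for one task (':' in the timestamp becomes '-')."""
    return f"details_{task_key}_{run_timestamp.replace(':', '-')}.parquet"


print(config_name("harness|hendrycksTest-college_physics|5"))
print(split_name(RUN_TIMESTAMP))
print(details_file("harness|winogrande|5", RUN_TIMESTAMP))
```

Keeping the three mappings as separate pure functions makes it easy to check them against the names that actually appear in the metadata before relying on them.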

Data loading example

from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_222gate__BrurryDog-7b-v0.1", "harness_winogrande_5", split="train")

Latest results

The latest results come from run 2024-01-20T03:26:36.549937. The aggregate ("all") metrics are: acc 0.6527842705257364 (stderr 0.03215067286738653), acc_norm 0.652698073988589 (stderr 0.03281396676337054), mc1 0.5777233782129743 (stderr 0.017290733254248177), and mc2 0.7004617737856811 (stderr 0.01511981164818303). The per-task metrics are listed in full in the "Latest results" section of the dataset card above.

Collected Summary
Dataset Introduction
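Construction
This dataset was generated automatically when the model 222gate/BrurryDog-7b-v0.1 was evaluated on the Open LLM Leaderboard. It is built from a single complete evaluation run and stores the model's results on 63 tasks in Parquet format, with one configuration (config) per task. Each configuration contains splits named after the run timestamp, and the 'train' split always points to the latest results. An additional 'results' configuration stores the aggregated results of the run and is used to compute and display the aggregated metrics on the leaderboard.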
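Features
The dataset's core feature is its multi-dimensional, fine-grained evaluation structure. It covers a broad range of tasks: commonsense reasoning (e.g., ARC, HellaSwag), mathematical reasoning (e.g., GSM8K), multi-subject knowledge (e.g., the 57 MMLU subjects, which appear under the hendrycksTest prefix in the results), and language understanding and truthfulness (e.g., Winogrande, TruthfulQA). Tasks report statistics such as accuracy (acc) and its standard error (acc_stderr). This design lets researchers analyze how the model performs across different capability dimensions instead of looking only at a single overall score.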
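The relationship between acc and acc_stderr can be checked with a short calculation. The sketch below assumes that the standard error is the sample standard deviation of per-question 0/1 scores divided by the square root of the number of questions, and that the task has 100 questions. Neither the formula nor the question count is stated on this page, so both are assumptions; `accuracy_stderr` is a hypothetical helper name.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores.

    Uses the sample variance (n - 1 denominator). This choice is an
    assumption for illustration, not something documented on this page.
    """
    if n < 2:
        raise ValueError("need at least two samples")
    # Sample variance of n Bernoulli outcomes with observed mean `acc`.
    sample_variance = acc * (1.0 - acc) * n / (n - 1)
    return math.sqrt(sample_variance / n)


# n = 100 is a hypothetical question count used only for illustration.
print(accuracy_stderr(0.45, 100))
print(accuracy_stderr(0.32, 100))
```

Under these assumptions the two calls give roughly 0.05 and 0.0469, which is in line with the acc_stderr values reported above for college_chemistry (acc 0.45) and abstract_algebra (acc 0.32).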
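Usage
The data can be loaded with the Hugging Face datasets library. For example, to load the 'train' split of the Winogrande task, call load_dataset with the dataset name, the configuration name (e.g., 'harness_winogrande_5'), and split='train'. The data under each configuration is organized as Parquet files, and the timestamp-named splits allow earlier evaluation runs to be retrieved, which supports comparing model performance over time and reproducing analyses.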
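The loading step described above can be sketched as follows. The helper names `config_name_for` and `load_task_details` are hypothetical, and the rule for turning a results key into a configuration name (replacing '|', ':' and '-' with '_') is an assumption inferred from the one confirmed example, 'harness_winogrande_5'. The default repository ID is the one used in the dataset card's own snippet; this page lists the dataset under open-llm-leaderboard-old, so the ID may need adjusting.

```python
def config_name_for(task_key: str) -> str:
    """Turn a results key such as 'harness|winogrande|5' into a config name.

    Assumption: config names replace '|', ':' and '-' with '_'.
    """
    for ch in "|:-":
        task_key = task_key.replace(ch, "_")
    return task_key


def load_task_details(
    task_key: str,
    repo_id: str = "open-llm-leaderboard/details_222gate__BrurryDog-7b-v0.1",
):
    """Load the latest ('train') split for one task. Requires network access."""
    # Imported lazily so config_name_for can be used without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name_for(task_key), split="train")


print(config_name_for("harness|winogrande|5"))  # harness_winogrande_5
```

Calling `load_task_details("harness|winogrande|5")` would then be equivalent to the load_dataset call shown in the dataset card, provided the repository ID resolves.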
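Background and Challenges
Background Overview
As research on large language models (LLMs) expands rapidly, standardized evaluation of model performance has become a key driver of progress in the field. The Open LLM Leaderboard was created by the Hugging Face team in 2023 to give open-source LLMs a unified, transparent evaluation platform. This dataset records the detailed evaluation results of the model 222gate/BrurryDog-7b-v0.1 across 63 task configurations, covering commonsense reasoning, mathematical reasoning, knowledge question answering, and other dimensions. As part of a community-driven benchmark, the dataset gives researchers reproducible performance data, and its automated pipeline lowers the barrier to evaluation and supports fair comparison between models. The leaderboard has become an important reference for the open-source community when measuring model capability and has made model development and iteration more transparent.
Current Challenges
The problem this dataset addresses is the lack of a unified standard for LLM evaluation: different studies use different task sets and metrics, which makes results hard to compare directly. The specific challenges are: 1) complexity caused by task diversity: the evaluation must cover 57 subject areas (such as medicine, law, and mathematics) and several reasoning types (such as commonsense and mathematical reasoning), and each task needs a suitable prompt template and scoring scheme; 2) technical challenges in construction: the automated evaluation pipeline must handle differences in output format between models while keeping results reproducible; 3) statistical challenges in aggregation: deciding how to compute overall metrics (such as accuracy and its standard error) from many task scores and how to weight tasks so that no single task dominates the overall ranking.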
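The weighting question in point 3 can be illustrated with a small comparison of two aggregation schemes. This is a sketch only: the page does not say how the leaderboard computes its "all" entry, the function names are hypothetical, and the task sizes below are made-up numbers chosen to show the effect. The three accuracies are taken from the results above.

```python
def macro_average(task_acc: dict) -> float:
    """Unweighted mean: every task counts equally, whatever its size."""
    return sum(task_acc.values()) / len(task_acc)


def size_weighted_average(task_acc: dict, task_sizes: dict) -> float:
    """Mean weighted by question count: larger tasks count for more."""
    total = sum(task_sizes[name] for name in task_acc)
    return sum(task_acc[name] * task_sizes[name] for name in task_acc) / total


# Accuracies copied from the results listing.
task_acc = {
    "abstract_algebra": 0.32,
    "college_chemistry": 0.45,
    "computer_security": 0.77,
}
# Hypothetical question counts, for illustration only.
task_sizes = {
    "abstract_algebra": 100,
    "college_chemistry": 100,
    "computer_security": 300,
}

print(macro_average(task_acc))
print(size_weighted_average(task_acc, task_sizes))
```

With these made-up sizes the size-weighted score is pulled toward the largest task, while the macro average is not, which is the trade-off the paragraph above describes.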
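Common Scenarios
Typical Use Cases
As an automatic output of the Open LLM Leaderboard evaluation pipeline, this dataset mainly serves as a standardized, multi-dimensional performance benchmark for open-source LLMs. By loading a specific configuration such as 'harness_winogrande_5', researchers can reproduce the model's fine-grained results on 63 tasks spanning commonsense reasoning, scientific knowledge, and mathematics, and so pinpoint its strengths and weaknesses.
Practical Applications
In practice, the dataset gives model developers a quick diagnostic tool. By comparing the fine-grained results, developers can see how the model performs on specific tasks (such as high school geography or professional law) and then adjust training data or architecture accordingly. It also supports model selection in companies, helping teams choose the base model that best fits their use case from its multi-task scores.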
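The diagnostic use described above can be sketched as a ranking of tasks by accuracy. The function name `weakest_tasks` is hypothetical, and the `results` dictionary holds only a few entries copied from the results listing; in practice it would be the full parsed results JSON.

```python
def weakest_tasks(results: dict, k: int = 3, metric: str = "acc") -> list:
    """Return the k lowest-scoring tasks as (task_key, score) pairs."""
    scored = [
        (name, metrics[metric])
        for name, metrics in results.items()
        # Skip the aggregate entry and tasks that lack the chosen metric.
        if name != "all" and metric in metrics
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# A small excerpt of the results shown in the listing.
results = {
    "all": {"acc": 0.6527842705257364},
    "harness|hendrycksTest-global_facts|5": {"acc": 0.37},
    "harness|hendrycksTest-high_school_geography|5": {"acc": 0.7828282828282829},
    "harness|hendrycksTest-high_school_government_and_politics|5": {"acc": 0.9015544041450777},
    "harness|hendrycksTest-high_school_mathematics|5": {"acc": 0.3148148148148148},
}

for name, score in weakest_tasks(results, k=2):
    print(f"{name}: {score:.3f}")
```

On this excerpt the two lowest-scoring tasks are high_school_mathematics and global_facts, which is the kind of signal a developer might use to decide where more training data is needed.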
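Derived and Related Work
Work derived from this dataset includes model analysis reports based on its evaluation results, comparative studies of multi-task capability, and fine-tuning strategies designed to improve performance on specific tasks. It also informed later, more comprehensive benchmarks such as Open LLM Leaderboard v2 and encouraged the development of more efficient methods for predicting model performance, contributing to a more standardized open-source LLM ecosystem.
The content above was collected and summarized by 遇见数据集.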