open-llm-leaderboard-old/details_mayacinka__NeuralZephyr-Beagle-7B

Source: Hugging Face. Updated 2024-02-17; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_mayacinka__NeuralZephyr-Beagle-7B
Resource description:
--- pretty_name: Evaluation run of mayacinka/NeuralZephyr-Beagle-7B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [mayacinka/NeuralZephyr-Beagle-7B](https://huggingface.co/mayacinka/NeuralZephyr-Beagle-7B)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-02-17T00:55:18.728023](https://huggingface.co/datasets/open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B/blob/main/results_2024-02-17T00-55-18.728023.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6499857973810573,\n\ \ \"acc_stderr\": 0.0322179748198474,\n \"acc_norm\": 0.651015316120668,\n\ \ \"acc_norm_stderr\": 0.032873289143278944,\n \"mc1\": 0.4944920440636475,\n\ \ \"mc1_stderr\": 0.01750243899045106,\n \"mc2\": 0.6516576799205165,\n\ \ \"mc2_stderr\": 0.01520679103207334\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.6578498293515358,\n \"acc_stderr\": 0.013864152159177275,\n\ \ \"acc_norm\": 0.6860068259385665,\n \"acc_norm_stderr\": 0.013562691224726297\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6852220673172674,\n\ \ \"acc_stderr\": 0.004634782156128581,\n \"acc_norm\": 0.8637721569408484,\n\ \ \"acc_norm_stderr\": 0.0034232928816321498\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.047609522856952365,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.047609522856952365\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6444444444444445,\n\ \ \"acc_stderr\": 0.04135176749720385,\n \"acc_norm\": 0.6444444444444445,\n\ \ \"acc_norm_stderr\": 0.04135176749720385\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7039473684210527,\n \"acc_stderr\": 0.03715062154998904,\n\ \ \"acc_norm\": 0.7039473684210527,\n \"acc_norm_stderr\": 0.03715062154998904\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.61,\n\ \ \"acc_stderr\": 0.04902071300001975,\n \"acc_norm\": 0.61,\n \ \ \"acc_norm_stderr\": 0.04902071300001975\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7169811320754716,\n \"acc_stderr\": 0.027724236492700918,\n\ \ \"acc_norm\": 0.7169811320754716,\n \"acc_norm_stderr\": 0.027724236492700918\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7361111111111112,\n\ \ \"acc_stderr\": 0.03685651095897532,\n \"acc_norm\": 0.7361111111111112,\n\ \ \"acc_norm_stderr\": 0.03685651095897532\n 
},\n \"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.49,\n \"acc_stderr\": 0.05024183937956912,\n \ \ \"acc_norm\": 0.49,\n \"acc_norm_stderr\": 0.05024183937956912\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.52,\n \"acc_stderr\": 0.050211673156867795,\n \"acc_norm\": 0.52,\n\ \ \"acc_norm_stderr\": 0.050211673156867795\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \ \ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6820809248554913,\n\ \ \"acc_stderr\": 0.0355068398916558,\n \"acc_norm\": 0.6820809248554913,\n\ \ \"acc_norm_stderr\": 0.0355068398916558\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.4117647058823529,\n \"acc_stderr\": 0.048971049527263666,\n\ \ \"acc_norm\": 0.4117647058823529,\n \"acc_norm_stderr\": 0.048971049527263666\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.77,\n \"acc_stderr\": 0.042295258468165065,\n \"acc_norm\": 0.77,\n\ \ \"acc_norm_stderr\": 0.042295258468165065\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5914893617021276,\n \"acc_stderr\": 0.032134180267015755,\n\ \ \"acc_norm\": 0.5914893617021276,\n \"acc_norm_stderr\": 0.032134180267015755\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5175438596491229,\n\ \ \"acc_stderr\": 0.04700708033551038,\n \"acc_norm\": 0.5175438596491229,\n\ \ \"acc_norm_stderr\": 0.04700708033551038\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5724137931034483,\n \"acc_stderr\": 0.04122737111370333,\n\ \ \"acc_norm\": 0.5724137931034483,\n \"acc_norm_stderr\": 0.04122737111370333\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.41005291005291006,\n \"acc_stderr\": 0.02533120243894444,\n \"\ acc_norm\": 0.41005291005291006,\n 
\"acc_norm_stderr\": 0.02533120243894444\n\ \ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4444444444444444,\n\ \ \"acc_stderr\": 0.044444444444444495,\n \"acc_norm\": 0.4444444444444444,\n\ \ \"acc_norm_stderr\": 0.044444444444444495\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.04760952285695235\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7903225806451613,\n\ \ \"acc_stderr\": 0.023157879349083525,\n \"acc_norm\": 0.7903225806451613,\n\ \ \"acc_norm_stderr\": 0.023157879349083525\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.5024630541871922,\n \"acc_stderr\": 0.035179450386910616,\n\ \ \"acc_norm\": 0.5024630541871922,\n \"acc_norm_stderr\": 0.035179450386910616\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \"acc_norm\"\ : 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7696969696969697,\n \"acc_stderr\": 0.0328766675860349,\n\ \ \"acc_norm\": 0.7696969696969697,\n \"acc_norm_stderr\": 0.0328766675860349\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7929292929292929,\n \"acc_stderr\": 0.02886977846026705,\n \"\ acc_norm\": 0.7929292929292929,\n \"acc_norm_stderr\": 0.02886977846026705\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8808290155440415,\n \"acc_stderr\": 0.02338193534812142,\n\ \ \"acc_norm\": 0.8808290155440415,\n \"acc_norm_stderr\": 0.02338193534812142\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.6692307692307692,\n \"acc_stderr\": 0.02385479568097112,\n \ \ \"acc_norm\": 0.6692307692307692,\n \"acc_norm_stderr\": 0.02385479568097112\n\ \ },\n 
\"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.3592592592592593,\n \"acc_stderr\": 0.029252905927251972,\n \ \ \"acc_norm\": 0.3592592592592593,\n \"acc_norm_stderr\": 0.029252905927251972\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.7016806722689075,\n \"acc_stderr\": 0.029719142876342856,\n\ \ \"acc_norm\": 0.7016806722689075,\n \"acc_norm_stderr\": 0.029719142876342856\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3576158940397351,\n \"acc_stderr\": 0.03913453431177258,\n \"\ acc_norm\": 0.3576158940397351,\n \"acc_norm_stderr\": 0.03913453431177258\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.8330275229357799,\n \"acc_stderr\": 0.01599015488507338,\n \"\ acc_norm\": 0.8330275229357799,\n \"acc_norm_stderr\": 0.01599015488507338\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.5555555555555556,\n \"acc_stderr\": 0.03388857118502325,\n \"\ acc_norm\": 0.5555555555555556,\n \"acc_norm_stderr\": 0.03388857118502325\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.8333333333333334,\n \"acc_stderr\": 0.02615686752393104,\n \"\ acc_norm\": 0.8333333333333334,\n \"acc_norm_stderr\": 0.02615686752393104\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n \ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6816143497757847,\n\ \ \"acc_stderr\": 0.03126580522513713,\n \"acc_norm\": 0.6816143497757847,\n\ \ \"acc_norm_stderr\": 0.03126580522513713\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7786259541984732,\n \"acc_stderr\": 0.03641297081313728,\n\ \ \"acc_norm\": 0.7786259541984732,\n \"acc_norm_stderr\": 0.03641297081313728\n\ \ },\n \"harness|hendrycksTest-international_law|5\": 
{\n \"acc\":\ \ 0.768595041322314,\n \"acc_stderr\": 0.03849856098794088,\n \"acc_norm\"\ : 0.768595041322314,\n \"acc_norm_stderr\": 0.03849856098794088\n },\n\ \ \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7592592592592593,\n\ \ \"acc_stderr\": 0.04133119440243838,\n \"acc_norm\": 0.7592592592592593,\n\ \ \"acc_norm_stderr\": 0.04133119440243838\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7791411042944786,\n \"acc_stderr\": 0.03259177392742178,\n\ \ \"acc_norm\": 0.7791411042944786,\n \"acc_norm_stderr\": 0.03259177392742178\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.45535714285714285,\n\ \ \"acc_stderr\": 0.047268355537191,\n \"acc_norm\": 0.45535714285714285,\n\ \ \"acc_norm_stderr\": 0.047268355537191\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7766990291262136,\n \"acc_stderr\": 0.04123553189891431,\n\ \ \"acc_norm\": 0.7766990291262136,\n \"acc_norm_stderr\": 0.04123553189891431\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8717948717948718,\n\ \ \"acc_stderr\": 0.02190190511507333,\n \"acc_norm\": 0.8717948717948718,\n\ \ \"acc_norm_stderr\": 0.02190190511507333\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.72,\n \"acc_stderr\": 0.04512608598542128,\n \ \ \"acc_norm\": 0.72,\n \"acc_norm_stderr\": 0.04512608598542128\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8237547892720306,\n\ \ \"acc_stderr\": 0.013625556907993452,\n \"acc_norm\": 0.8237547892720306,\n\ \ \"acc_norm_stderr\": 0.013625556907993452\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.7312138728323699,\n \"acc_stderr\": 0.02386800326250011,\n\ \ \"acc_norm\": 0.7312138728323699,\n \"acc_norm_stderr\": 0.02386800326250011\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.43798882681564244,\n\ \ \"acc_stderr\": 0.016593394227564843,\n \"acc_norm\": 0.43798882681564244,\n\ \ \"acc_norm_stderr\": 
0.016593394227564843\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7189542483660131,\n \"acc_stderr\": 0.025738854797818737,\n\ \ \"acc_norm\": 0.7189542483660131,\n \"acc_norm_stderr\": 0.025738854797818737\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7138263665594855,\n\ \ \"acc_stderr\": 0.02567025924218893,\n \"acc_norm\": 0.7138263665594855,\n\ \ \"acc_norm_stderr\": 0.02567025924218893\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7376543209876543,\n \"acc_stderr\": 0.02447722285613511,\n\ \ \"acc_norm\": 0.7376543209876543,\n \"acc_norm_stderr\": 0.02447722285613511\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.5,\n \"acc_stderr\": 0.029827499313594685,\n \"acc_norm\"\ : 0.5,\n \"acc_norm_stderr\": 0.029827499313594685\n },\n \"harness|hendrycksTest-professional_law|5\"\ : {\n \"acc\": 0.4654498044328553,\n \"acc_stderr\": 0.012739711554045702,\n\ \ \"acc_norm\": 0.4654498044328553,\n \"acc_norm_stderr\": 0.012739711554045702\n\ \ },\n \"harness|hendrycksTest-professional_medicine|5\": {\n \"acc\"\ : 0.6764705882352942,\n \"acc_stderr\": 0.02841820861940676,\n \"\ acc_norm\": 0.6764705882352942,\n \"acc_norm_stderr\": 0.02841820861940676\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6617647058823529,\n \"acc_stderr\": 0.019139943748487046,\n \ \ \"acc_norm\": 0.6617647058823529,\n \"acc_norm_stderr\": 0.019139943748487046\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6727272727272727,\n\ \ \"acc_stderr\": 0.0449429086625209,\n \"acc_norm\": 0.6727272727272727,\n\ \ \"acc_norm_stderr\": 0.0449429086625209\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.726530612244898,\n \"acc_stderr\": 0.028535560337128445,\n\ \ \"acc_norm\": 0.726530612244898,\n \"acc_norm_stderr\": 0.028535560337128445\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8407960199004975,\n\ \ \"acc_stderr\": 
0.02587064676616913,\n \"acc_norm\": 0.8407960199004975,\n\ \ \"acc_norm_stderr\": 0.02587064676616913\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.85,\n \"acc_stderr\": 0.03588702812826371,\n \ \ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.03588702812826371\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.536144578313253,\n\ \ \"acc_stderr\": 0.038823108508905954,\n \"acc_norm\": 0.536144578313253,\n\ \ \"acc_norm_stderr\": 0.038823108508905954\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.847953216374269,\n \"acc_stderr\": 0.027539122889061456,\n\ \ \"acc_norm\": 0.847953216374269,\n \"acc_norm_stderr\": 0.027539122889061456\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.4944920440636475,\n\ \ \"mc1_stderr\": 0.01750243899045106,\n \"mc2\": 0.6516576799205165,\n\ \ \"mc2_stderr\": 0.01520679103207334\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8113654301499605,\n \"acc_stderr\": 0.010995172318019816\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6345716451857468,\n \ \ \"acc_stderr\": 0.013264282030266633\n }\n}\n```" repo_url: https://huggingface.co/mayacinka/NeuralZephyr-Beagle-7B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|arc:challenge|25_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-02-17T00-55-18.728023.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|gsm8k|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hellaswag|10_2024-02-17T00-55-18.728023.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-55-18.728023.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-55-18.728023.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-55-18.728023.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-55-18.728023.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-55-18.728023.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-55-18.728023.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-55-18.728023.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-02-17T00-55-18.728023.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|truthfulqa:mc|0_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-02-17T00-55-18.728023.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_02_17T00_55_18.728023 path: - '**/details_harness|winogrande|5_2024-02-17T00-55-18.728023.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-02-17T00-55-18.728023.parquet' - config_name: results data_files: - split: 
2024_02_17T00_55_18.728023 path: - results_2024-02-17T00-55-18.728023.parquet - split: latest path: - results_2024-02-17T00-55-18.728023.parquet
---

# Dataset Card for Evaluation run of mayacinka/NeuralZephyr-Beagle-7B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [mayacinka/NeuralZephyr-Beagle-7B](https://huggingface.co/mayacinka/NeuralZephyr-Beagle-7B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2024-02-17T00:55:18.728023](https://huggingface.co/datasets/open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B/blob/main/results_2024-02-17T00-55-18.728023.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6499857973810573, "acc_stderr": 0.0322179748198474, "acc_norm": 0.651015316120668, "acc_norm_stderr": 0.032873289143278944, "mc1": 0.4944920440636475, "mc1_stderr": 0.01750243899045106, "mc2": 0.6516576799205165, "mc2_stderr": 0.01520679103207334 }, "harness|arc:challenge|25": { "acc": 0.6578498293515358, "acc_stderr": 0.013864152159177275, "acc_norm": 0.6860068259385665, "acc_norm_stderr": 0.013562691224726297 }, "harness|hellaswag|10": { "acc": 0.6852220673172674, "acc_stderr": 0.004634782156128581, "acc_norm": 0.8637721569408484, "acc_norm_stderr": 0.0034232928816321498 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.34, "acc_stderr": 0.047609522856952365, "acc_norm": 0.34, "acc_norm_stderr": 0.047609522856952365 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6444444444444445, "acc_stderr": 0.04135176749720385, "acc_norm": 0.6444444444444445, "acc_norm_stderr": 0.04135176749720385 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7039473684210527, "acc_stderr": 0.03715062154998904, "acc_norm": 0.7039473684210527, "acc_norm_stderr": 0.03715062154998904 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.61, "acc_stderr": 0.04902071300001975, "acc_norm": 0.61, "acc_norm_stderr": 0.04902071300001975 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7169811320754716, "acc_stderr": 0.027724236492700918, "acc_norm": 0.7169811320754716, "acc_norm_stderr": 0.027724236492700918 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.7361111111111112, "acc_stderr": 0.03685651095897532, "acc_norm": 0.7361111111111112, "acc_norm_stderr": 0.03685651095897532 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.49, "acc_stderr": 0.05024183937956912, "acc_norm": 0.49, "acc_norm_stderr": 0.05024183937956912 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.52, "acc_stderr": 0.050211673156867795, "acc_norm": 0.52, "acc_norm_stderr": 
0.050211673156867795 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.3, "acc_stderr": 0.046056618647183814, "acc_norm": 0.3, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6820809248554913, "acc_stderr": 0.0355068398916558, "acc_norm": 0.6820809248554913, "acc_norm_stderr": 0.0355068398916558 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4117647058823529, "acc_stderr": 0.048971049527263666, "acc_norm": 0.4117647058823529, "acc_norm_stderr": 0.048971049527263666 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.77, "acc_stderr": 0.042295258468165065, "acc_norm": 0.77, "acc_norm_stderr": 0.042295258468165065 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5914893617021276, "acc_stderr": 0.032134180267015755, "acc_norm": 0.5914893617021276, "acc_norm_stderr": 0.032134180267015755 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.5175438596491229, "acc_stderr": 0.04700708033551038, "acc_norm": 0.5175438596491229, "acc_norm_stderr": 0.04700708033551038 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5724137931034483, "acc_stderr": 0.04122737111370333, "acc_norm": 0.5724137931034483, "acc_norm_stderr": 0.04122737111370333 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.41005291005291006, "acc_stderr": 0.02533120243894444, "acc_norm": 0.41005291005291006, "acc_norm_stderr": 0.02533120243894444 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4444444444444444, "acc_stderr": 0.044444444444444495, "acc_norm": 0.4444444444444444, "acc_norm_stderr": 0.044444444444444495 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.34, "acc_stderr": 0.04760952285695235, "acc_norm": 0.34, "acc_norm_stderr": 0.04760952285695235 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7903225806451613, "acc_stderr": 0.023157879349083525, "acc_norm": 0.7903225806451613, "acc_norm_stderr": 0.023157879349083525 }, 
"harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.5024630541871922, "acc_stderr": 0.035179450386910616, "acc_norm": 0.5024630541871922, "acc_norm_stderr": 0.035179450386910616 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.7, "acc_stderr": 0.046056618647183814, "acc_norm": 0.7, "acc_norm_stderr": 0.046056618647183814 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7696969696969697, "acc_stderr": 0.0328766675860349, "acc_norm": 0.7696969696969697, "acc_norm_stderr": 0.0328766675860349 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7929292929292929, "acc_stderr": 0.02886977846026705, "acc_norm": 0.7929292929292929, "acc_norm_stderr": 0.02886977846026705 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8808290155440415, "acc_stderr": 0.02338193534812142, "acc_norm": 0.8808290155440415, "acc_norm_stderr": 0.02338193534812142 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.6692307692307692, "acc_stderr": 0.02385479568097112, "acc_norm": 0.6692307692307692, "acc_norm_stderr": 0.02385479568097112 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3592592592592593, "acc_stderr": 0.029252905927251972, "acc_norm": 0.3592592592592593, "acc_norm_stderr": 0.029252905927251972 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.7016806722689075, "acc_stderr": 0.029719142876342856, "acc_norm": 0.7016806722689075, "acc_norm_stderr": 0.029719142876342856 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3576158940397351, "acc_stderr": 0.03913453431177258, "acc_norm": 0.3576158940397351, "acc_norm_stderr": 0.03913453431177258 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.8330275229357799, "acc_stderr": 0.01599015488507338, "acc_norm": 0.8330275229357799, "acc_norm_stderr": 0.01599015488507338 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5555555555555556, "acc_stderr": 0.03388857118502325, 
"acc_norm": 0.5555555555555556, "acc_norm_stderr": 0.03388857118502325 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.8333333333333334, "acc_stderr": 0.02615686752393104, "acc_norm": 0.8333333333333334, "acc_norm_stderr": 0.02615686752393104 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6816143497757847, "acc_stderr": 0.03126580522513713, "acc_norm": 0.6816143497757847, "acc_norm_stderr": 0.03126580522513713 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7786259541984732, "acc_stderr": 0.03641297081313728, "acc_norm": 0.7786259541984732, "acc_norm_stderr": 0.03641297081313728 }, "harness|hendrycksTest-international_law|5": { "acc": 0.768595041322314, "acc_stderr": 0.03849856098794088, "acc_norm": 0.768595041322314, "acc_norm_stderr": 0.03849856098794088 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7592592592592593, "acc_stderr": 0.04133119440243838, "acc_norm": 0.7592592592592593, "acc_norm_stderr": 0.04133119440243838 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7791411042944786, "acc_stderr": 0.03259177392742178, "acc_norm": 0.7791411042944786, "acc_norm_stderr": 0.03259177392742178 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.45535714285714285, "acc_stderr": 0.047268355537191, "acc_norm": 0.45535714285714285, "acc_norm_stderr": 0.047268355537191 }, "harness|hendrycksTest-management|5": { "acc": 0.7766990291262136, "acc_stderr": 0.04123553189891431, "acc_norm": 0.7766990291262136, "acc_norm_stderr": 0.04123553189891431 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8717948717948718, "acc_stderr": 0.02190190511507333, "acc_norm": 0.8717948717948718, "acc_norm_stderr": 0.02190190511507333 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.72, "acc_stderr": 0.04512608598542128, "acc_norm": 0.72, 
"acc_norm_stderr": 0.04512608598542128 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8237547892720306, "acc_stderr": 0.013625556907993452, "acc_norm": 0.8237547892720306, "acc_norm_stderr": 0.013625556907993452 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.7312138728323699, "acc_stderr": 0.02386800326250011, "acc_norm": 0.7312138728323699, "acc_norm_stderr": 0.02386800326250011 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.43798882681564244, "acc_stderr": 0.016593394227564843, "acc_norm": 0.43798882681564244, "acc_norm_stderr": 0.016593394227564843 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7189542483660131, "acc_stderr": 0.025738854797818737, "acc_norm": 0.7189542483660131, "acc_norm_stderr": 0.025738854797818737 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.7138263665594855, "acc_stderr": 0.02567025924218893, "acc_norm": 0.7138263665594855, "acc_norm_stderr": 0.02567025924218893 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7376543209876543, "acc_stderr": 0.02447722285613511, "acc_norm": 0.7376543209876543, "acc_norm_stderr": 0.02447722285613511 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.5, "acc_stderr": 0.029827499313594685, "acc_norm": 0.5, "acc_norm_stderr": 0.029827499313594685 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4654498044328553, "acc_stderr": 0.012739711554045702, "acc_norm": 0.4654498044328553, "acc_norm_stderr": 0.012739711554045702 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6764705882352942, "acc_stderr": 0.02841820861940676, "acc_norm": 0.6764705882352942, "acc_norm_stderr": 0.02841820861940676 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6617647058823529, "acc_stderr": 0.019139943748487046, "acc_norm": 0.6617647058823529, "acc_norm_stderr": 0.019139943748487046 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6727272727272727, "acc_stderr": 0.0449429086625209, "acc_norm": 0.6727272727272727, "acc_norm_stderr": 
0.0449429086625209 }, "harness|hendrycksTest-security_studies|5": { "acc": 0.726530612244898, "acc_stderr": 0.028535560337128445, "acc_norm": 0.726530612244898, "acc_norm_stderr": 0.028535560337128445 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8407960199004975, "acc_stderr": 0.02587064676616913, "acc_norm": 0.8407960199004975, "acc_norm_stderr": 0.02587064676616913 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.85, "acc_stderr": 0.03588702812826371, "acc_norm": 0.85, "acc_norm_stderr": 0.03588702812826371 }, "harness|hendrycksTest-virology|5": { "acc": 0.536144578313253, "acc_stderr": 0.038823108508905954, "acc_norm": 0.536144578313253, "acc_norm_stderr": 0.038823108508905954 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.847953216374269, "acc_stderr": 0.027539122889061456, "acc_norm": 0.847953216374269, "acc_norm_stderr": 0.027539122889061456 }, "harness|truthfulqa:mc|0": { "mc1": 0.4944920440636475, "mc1_stderr": 0.01750243899045106, "mc2": 0.6516576799205165, "mc2_stderr": 0.01520679103207334 }, "harness|winogrande|5": { "acc": 0.8113654301499605, "acc_stderr": 0.010995172318019816 }, "harness|gsm8k|5": { "acc": 0.6345716451857468, "acc_stderr": 0.013264282030266633 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]

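This dataset was created automatically during the evaluation run of the model mayacinka/NeuralZephyr-Beagle-7B for the Open LLM Leaderboard. It contains 63 configurations, each corresponding to one evaluated task. Each configuration contains a split named after the run timestamp, and the "train" split always points to the latest results. In addition, the "results" configuration stores all the aggregated results of the run, which are used to compute and display the aggregated metrics on the leaderboard. The dataset also provides the latest results of the run, listing the metrics for each task in detail.
Provider:
open-llm-leaderboard-old
Summary of original information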

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model mayacinka/NeuralZephyr-Beagle-7B on the Open LLM Leaderboard.

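Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the timestamp of the run.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores all the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.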
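
The naming scheme described in the list above can be sketched in a few lines. The rule below is an assumption inferred only from the names visible in this card (for example, the task key `harness|truthfulqa:mc|0` appears next to the configuration `harness_truthfulqa_mc_0`, and the run `2024-02-17T00:55:18.728023` next to the split `2024_02_17T00_55_18.728023`); the helper names `config_name` and `split_name` are hypothetical and not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to a configuration name.

    Assumed rule, inferred from the names in this card:
    every '|', ':' and '-' is replaced by '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to a split name.

    Assumed rule, inferred from the names in this card:
    every '-' and ':' is replaced by '_'; the '.' is kept.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


# Print the derived names for two keys that appear in the card.
print(config_name("harness|truthfulqa:mc|0"))
print(split_name("2024-02-17T00:55:18.728023"))
```

Comparing the printed names against the configuration list in the card metadata is a quick way to check whether the assumed rule holds for a given task.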

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B",
    "harness_winogrande_5",
    split="train",
)
```

Latest results

The latest results come from run 2024-02-17T00:55:18.728023. The full per-task metrics are listed in the "Latest results" block of the dataset card above.

Collected Summary
Dataset Introduction
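Construction
In large language model evaluation, precise measurement of model performance is essential. This dataset was generated automatically while the Open LLM Leaderboard evaluated the model mayacinka/NeuralZephyr-Beagle-7B. It is built from a single complete evaluation run and comprises 63 configurations, each corresponding to one evaluated task. Every configuration contains a split named after the timestamp of the run, while the split named "train" always points to the latest results. In addition, a separate configuration named "results" stores all aggregated metrics of the run; these are used to compute and display the model's aggregate performance on the leaderboard.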
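The relationship between timestamp-named splits and the "latest" result can be sketched as follows. This is only an illustration: the card reports a single run (2024-02-17T00:55:18.728023), so the second timestamp below is hypothetical, and `latest_run` is an illustrative helper name rather than anything provided by the leaderboard tooling.

```python
from datetime import datetime


def latest_run(timestamps):
    """Return the most recent run timestamp from ISO-8601 strings."""
    # Parse each string so the comparison is chronological, not textual.
    return max(timestamps, key=datetime.fromisoformat)


runs = [
    "2024-02-17T00:55:18.728023",  # run reported in the dataset card
    "2024-01-05T12:00:00.000000",  # hypothetical earlier run, for illustration only
]

# The "train" split is described as pointing at the results of this run.
print(latest_run(runs))
```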
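Features
The dataset's most distinctive feature is its structured multi-task evaluation layout. Rather than a one-dimensional test collection, its 63 independent configurations systematically cover a broad range of capabilities, from commonsense reasoning (e.g. HellaSwag, Winogrande) and mathematical problem solving (GSM8K) to the MMLU suite spanning 57 subject areas (e.g. high school biology, college physics, international law). This design lets researchers see how the model performs across different cognitive dimensions and knowledge domains. The dataset also provides detailed logs for each evaluation together with aggregated statistics (accuracy and its standard error), which supports in-depth analysis of model behavior.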
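To make the accuracy and standard-error pair concrete, the sketch below recomputes the standard error reported for `hendrycksTest-abstract_algebra` (acc 0.34, acc_stderr 0.047609522856952365). It assumes the task has 100 test questions and that the standard error is the sample standard deviation of the 0/1 scores divided by the square root of n; neither assumption is stated in the card, although the reported value is consistent with both. The function names are illustrative.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary scores, using the (n - 1) sample variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96):
    """Approximate 95% normal confidence interval around an accuracy."""
    return acc - z * stderr, acc + z * stderr


reported_stderr = 0.047609522856952365      # value from the results listing
estimate = accuracy_stderr(0.34, 100)       # n = 100 is an assumption
low, high = confidence_interval(0.34, reported_stderr)

print(f"recomputed stderr: {estimate:.6f}")
print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

With a sample this small the interval spans roughly nine percentage points on either side, so per-subject differences of a few points should be read with caution.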
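Usage
Researchers can access the dataset through the Hugging Face datasets library. When loading, specify the task configuration name (e.g. "harness_winogrande_5") and the desired split (e.g. "train" for the latest results). This returns the detailed evaluation log of the model's outputs for that task. To obtain the aggregated results for all tasks, load the configuration named "results" instead. The following Python code is recommended for loading: from datasets import load_dataset; data = load_dataset("open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B", "harness_winogrande_5", split="train").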
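A minimal sketch of the loading steps above, wrapped in small helpers. The rule that maps a task key such as `harness|winogrande|5` to a configuration name (replace `|`, `:` and `-` with `_`) is an assumption inferred from the single example `harness_winogrande_5` in the card; `task_key_to_config`, `load_task_details` and `load_aggregated_results` are hypothetical helper names, not part of the `datasets` library. Only `load_dataset` itself is a library call.

```python
import re

REPO_ID = "open-llm-leaderboard/details_mayacinka__NeuralZephyr-Beagle-7B"


def task_key_to_config(task_key: str) -> str:
    """Assumed naming rule: replace every '|', ':' and '-' with '_'."""
    return re.sub(r"[|:\-]", "_", task_key)


def load_task_details(task_key: str, split: str = "train"):
    """Load the per-example details for one task (requires network access)."""
    # Imported lazily so the naming helper works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, task_key_to_config(task_key), split=split)


def load_aggregated_results(split: str = "train"):
    """Load the 'results' configuration holding the aggregated metrics."""
    from datasets import load_dataset

    return load_dataset(REPO_ID, "results", split=split)


if __name__ == "__main__":
    print(task_key_to_config("harness|winogrande|5"))
```

Calling `load_task_details("harness|winogrande|5")` is then equivalent to the one-line call shown in the text, provided the assumed naming rule holds for the task in question.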
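Background and Challenges
Background Overview
The rapid development of large language models has created an urgent need for systematic evaluation, and the Open LLM Leaderboard emerged in response as an important benchmark platform for measuring model performance across diverse natural language processing tasks. This dataset was created in 2024 under the lead of the HuggingFace team to provide a detailed evaluation record for the model mayacinka/NeuralZephyr-Beagle-7B. It spans 63 task configurations, from commonsense reasoning and mathematical problem solving to multi-domain knowledge question answering, and brings together core benchmarks such as ARC-Challenge, HellaSwag, GSM8K, and the Massive Multitask Language Understanding (MMLU) test covering 57 subjects. The dataset gives researchers a transparent, reproducible snapshot of model performance, supports a standardized understanding of model capabilities in the open-source community, and informs both the credibility assessment and the iterative improvement of language models.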
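Current Challenges
The domain problem this dataset addresses is that capability evaluation for large language models has long lacked a unified, multi-dimensional quantitative standard: results on heterogeneous tasks are hard to compare across models, which leaves research progress without an objective yardstick. The core challenges in building it are threefold. First, designing a task set that covers a broad range of cognitive dimensions, so that it reflects the model's actual level in reasoning, knowledge, and comprehension and avoids a one-sided evaluation. Second, ensuring that results are reproducible and fair, by using a unified runtime environment and standardized evaluation scripts to remove performance differences caused by framework variations. Third, managing and versioning large volumes of evaluation data, so that results from multiple evaluation rounds can be stored efficiently and updated dynamically to track how model performance evolves.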
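Common Scenarios
Classic Use Cases
Within the evaluation ecosystem for large language models, this dataset records the complete results of NeuralZephyr-Beagle-7B on the Open LLM Leaderboard benchmarks. Its classic use is to give researchers fine-grained, task-level performance data covering the ARC Challenge, HellaSwag commonsense reasoning, GSM8K mathematical problem solving, and the MMLU test across 57 subjects. By loading the evaluation details of individual configurations, researchers can analyze where the model is strong or weak in specific domains and use that to guide subsequent optimization and fine-tuning.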
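The kind of per-domain analysis described above can be sketched with a few entries copied from the results listing in this card. `rank_subjects` is an illustrative helper, and the five subjects are an arbitrary excerpt chosen for brevity, not the full set of 57.

```python
# Accuracy values copied from the results listing (excerpt only).
results = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.34},
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.3},
    "harness|hendrycksTest-formal_logic|5": {"acc": 0.4444444444444444},
    "harness|hendrycksTest-computer_security|5": {"acc": 0.77},
    "harness|hendrycksTest-high_school_government_and_politics|5": {"acc": 0.8808290155440415},
}


def rank_subjects(results, prefix="harness|hendrycksTest-"):
    """Return (subject, accuracy) pairs for MMLU tasks, weakest first."""
    rows = []
    for key, metrics in results.items():
        if key.startswith(prefix):
            # Strip the harness prefix and the trailing few-shot count.
            subject = key[len(prefix):].split("|")[0]
            rows.append((subject, metrics["acc"]))
    return sorted(rows, key=lambda row: row[1])


ranked = rank_subjects(results)
for subject, acc in ranked:
    print(f"{subject:40s} {acc:.3f}")
```

In this excerpt the mathematics-heavy subjects sit at the bottom of the ranking, which is the sort of signal a researcher might use to pick targets for further fine-tuning.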
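Derived Related Work
A body of work on analyzing and improving large language model performance has built on this dataset. For example, researchers have used its fine-grained results to propose targeted fine-tuning methods for knowledge gaps in specific subjects, and other work has drawn on its multi-task evaluation framework to build more comprehensive maps of model capabilities. The standardized evaluation pipeline it uses has also informed the continued evolution of community benchmarks such as the Open LLM Leaderboard, making it an important reference point for measuring model progress.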
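Recent Research on the Dataset
Latest Research Directions
In large language model evaluation, the Open LLM Leaderboard has become an important benchmark platform for measuring overall model capability. For the evaluation dataset of NeuralZephyr-Beagle-7B, recent research focuses on assessing generalization across multiple tasks and dimensions, covering commonsense reasoning (e.g. HellaSwag, Winogrande), mathematical problem solving (e.g. GSM8K), domain knowledge (e.g. the MMLU subfields), and factual consistency (e.g. TruthfulQA). The dataset records the model's fine-grained performance across 63 configurations through a standardized pipeline, providing key evidence of its strengths and weaknesses in complex reasoning, domain depth, and adversarial settings. Current directions include using such fine-grained results to guide fine-tuning and alignment, and studying capability transfer and bottlenecks across tasks, both of which matter for building more reliable and more general language systems.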
The content above was collected and summarized by 遇见数据集.