
open-llm-leaderboard-old/details_freecs__ThetaWave-7B-v0.1

Hugging Face | Updated 2024-01-24 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_freecs__ThetaWave-7B-v0.1
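The dataset card reproduced on this page names each configuration, split, and parquet file after the harness task key and the run timestamp. The following is a minimal sketch of that naming scheme, inferred only from the examples visible in the card (for instance `harness|arc:challenge|25` appearing as `harness_arc_challenge_25`). The helper names `config_name`, `split_name`, and `parquet_glob` are hypothetical and are not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a harness task key into a config name.

    Assumption: '|', ':' and '-' are each replaced by '_',
    which matches the examples shown in the card.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the timestamped split name.

    Assumption: '-' and ':' become '_', the fractional seconds stay as-is.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Build the parquet path pattern listed under a config's data_files."""
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2024-01-24T07:54:32.474467"
print(config_name("harness|arc:challenge|25"))
print(split_name(run))
print(parquet_glob("harness|gsm8k|5", run))
```

A config name and split produced this way are the kind of values the card's own `load_dataset` snippet passes as its second argument and `split` argument.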
Resource description:
--- pretty_name: Evaluation run of freecs/ThetaWave-7B-v0.1 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [freecs/ThetaWave-7B-v0.1](https://huggingface.co/freecs/ThetaWave-7B-v0.1) on\ \ the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_freecs__ThetaWave-7B-v0.1\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-01-24T07:54:32.474467](https://huggingface.co/datasets/open-llm-leaderboard/details_freecs__ThetaWave-7B-v0.1/blob/main/results_2024-01-24T07-54-32.474467.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6369850549630417,\n\ \ \"acc_stderr\": 0.03254410038117921,\n \"acc_norm\": 0.6388746936562116,\n\ \ \"acc_norm_stderr\": 0.03320176501025405,\n \"mc1\": 0.4357405140758874,\n\ \ \"mc1_stderr\": 0.017358345398863124,\n \"mc2\": 0.6024313355556469,\n\ \ \"mc2_stderr\": 0.01530031296029918\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.621160409556314,\n \"acc_stderr\": 0.014175915490000326,\n\ \ \"acc_norm\": 0.6629692832764505,\n \"acc_norm_stderr\": 0.013813476652902274\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6620195180242979,\n\ \ \"acc_stderr\": 0.004720551323547126,\n \"acc_norm\": 0.8540131447918742,\n\ \ \"acc_norm_stderr\": 0.0035237141526513\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.046882617226215034,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.046882617226215034\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6148148148148148,\n\ \ \"acc_stderr\": 0.04203921040156279,\n \"acc_norm\": 0.6148148148148148,\n\ \ \"acc_norm_stderr\": 0.04203921040156279\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.7039473684210527,\n \"acc_stderr\": 0.03715062154998904,\n\ \ \"acc_norm\": 0.7039473684210527,\n \"acc_norm_stderr\": 0.03715062154998904\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.59,\n\ \ \"acc_stderr\": 0.04943110704237102,\n \"acc_norm\": 0.59,\n \ \ \"acc_norm_stderr\": 0.04943110704237102\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.7132075471698113,\n \"acc_stderr\": 0.02783491252754406,\n\ \ \"acc_norm\": 0.7132075471698113,\n \"acc_norm_stderr\": 0.02783491252754406\n\ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.75,\n\ \ \"acc_stderr\": 0.03621034121889507,\n \"acc_norm\": 0.75,\n \ \ \"acc_norm_stderr\": 0.03621034121889507\n },\n 
\"harness|hendrycksTest-college_chemistry|5\"\ : {\n \"acc\": 0.45,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.45,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-college_computer_science|5\"\ : {\n \"acc\": 0.55,\n \"acc_stderr\": 0.05,\n \"acc_norm\"\ : 0.55,\n \"acc_norm_stderr\": 0.05\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.33,\n \"acc_stderr\": 0.04725815626252604,\n \ \ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.04725815626252604\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6473988439306358,\n\ \ \"acc_stderr\": 0.036430371689585475,\n \"acc_norm\": 0.6473988439306358,\n\ \ \"acc_norm_stderr\": 0.036430371689585475\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.46078431372549017,\n \"acc_stderr\": 0.04959859966384181,\n\ \ \"acc_norm\": 0.46078431372549017,\n \"acc_norm_stderr\": 0.04959859966384181\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.79,\n \"acc_stderr\": 0.040936018074033256,\n \"acc_norm\": 0.79,\n\ \ \"acc_norm_stderr\": 0.040936018074033256\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.5574468085106383,\n \"acc_stderr\": 0.032469569197899575,\n\ \ \"acc_norm\": 0.5574468085106383,\n \"acc_norm_stderr\": 0.032469569197899575\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.4649122807017544,\n\ \ \"acc_stderr\": 0.046920083813689104,\n \"acc_norm\": 0.4649122807017544,\n\ \ \"acc_norm_stderr\": 0.046920083813689104\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5793103448275863,\n \"acc_stderr\": 0.0411391498118926,\n\ \ \"acc_norm\": 0.5793103448275863,\n \"acc_norm_stderr\": 0.0411391498118926\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.4126984126984127,\n \"acc_stderr\": 0.025355741263055266,\n \"\ acc_norm\": 0.4126984126984127,\n \"acc_norm_stderr\": 0.025355741263055266\n\ \ },\n 
\"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4444444444444444,\n\ \ \"acc_stderr\": 0.044444444444444495,\n \"acc_norm\": 0.4444444444444444,\n\ \ \"acc_norm_stderr\": 0.044444444444444495\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.36,\n \"acc_stderr\": 0.04824181513244218,\n \ \ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.04824181513244218\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7290322580645161,\n\ \ \"acc_stderr\": 0.025284416114900152,\n \"acc_norm\": 0.7290322580645161,\n\ \ \"acc_norm_stderr\": 0.025284416114900152\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.4827586206896552,\n \"acc_stderr\": 0.035158955511657,\n\ \ \"acc_norm\": 0.4827586206896552,\n \"acc_norm_stderr\": 0.035158955511657\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.63,\n \"acc_stderr\": 0.04852365870939098,\n \"acc_norm\"\ : 0.63,\n \"acc_norm_stderr\": 0.04852365870939098\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7696969696969697,\n \"acc_stderr\": 0.0328766675860349,\n\ \ \"acc_norm\": 0.7696969696969697,\n \"acc_norm_stderr\": 0.0328766675860349\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7777777777777778,\n \"acc_stderr\": 0.02962022787479049,\n \"\ acc_norm\": 0.7777777777777778,\n \"acc_norm_stderr\": 0.02962022787479049\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.8704663212435233,\n \"acc_stderr\": 0.024233532297758723,\n\ \ \"acc_norm\": 0.8704663212435233,\n \"acc_norm_stderr\": 0.024233532297758723\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.617948717948718,\n \"acc_stderr\": 0.024635549163908234,\n \ \ \"acc_norm\": 0.617948717948718,\n \"acc_norm_stderr\": 0.024635549163908234\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 0.362962962962963,\n 
\"acc_stderr\": 0.02931820364520686,\n \ \ \"acc_norm\": 0.362962962962963,\n \"acc_norm_stderr\": 0.02931820364520686\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6554621848739496,\n \"acc_stderr\": 0.03086868260412162,\n \ \ \"acc_norm\": 0.6554621848739496,\n \"acc_norm_stderr\": 0.03086868260412162\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.37748344370860926,\n \"acc_stderr\": 0.03958027231121569,\n \"\ acc_norm\": 0.37748344370860926,\n \"acc_norm_stderr\": 0.03958027231121569\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.818348623853211,\n \"acc_stderr\": 0.01653061740926685,\n \"acc_norm\"\ : 0.818348623853211,\n \"acc_norm_stderr\": 0.01653061740926685\n },\n\ \ \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\": 0.5,\n\ \ \"acc_stderr\": 0.034099716973523674,\n \"acc_norm\": 0.5,\n \ \ \"acc_norm_stderr\": 0.034099716973523674\n },\n \"harness|hendrycksTest-high_school_us_history|5\"\ : {\n \"acc\": 0.7990196078431373,\n \"acc_stderr\": 0.028125972265654373,\n\ \ \"acc_norm\": 0.7990196078431373,\n \"acc_norm_stderr\": 0.028125972265654373\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7805907172995781,\n \"acc_stderr\": 0.026939106581553945,\n \ \ \"acc_norm\": 0.7805907172995781,\n \"acc_norm_stderr\": 0.026939106581553945\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6636771300448431,\n\ \ \"acc_stderr\": 0.031708824268455,\n \"acc_norm\": 0.6636771300448431,\n\ \ \"acc_norm_stderr\": 0.031708824268455\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.7557251908396947,\n \"acc_stderr\": 0.03768335959728744,\n\ \ \"acc_norm\": 0.7557251908396947,\n \"acc_norm_stderr\": 0.03768335959728744\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.8181818181818182,\n \"acc_stderr\": 0.03520893951097652,\n \"\ acc_norm\": 0.8181818181818182,\n 
\"acc_norm_stderr\": 0.03520893951097652\n\ \ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7592592592592593,\n\ \ \"acc_stderr\": 0.04133119440243839,\n \"acc_norm\": 0.7592592592592593,\n\ \ \"acc_norm_stderr\": 0.04133119440243839\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.7361963190184049,\n \"acc_stderr\": 0.03462419931615624,\n\ \ \"acc_norm\": 0.7361963190184049,\n \"acc_norm_stderr\": 0.03462419931615624\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.48214285714285715,\n\ \ \"acc_stderr\": 0.047427623612430116,\n \"acc_norm\": 0.48214285714285715,\n\ \ \"acc_norm_stderr\": 0.047427623612430116\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.7281553398058253,\n \"acc_stderr\": 0.044052680241409216,\n\ \ \"acc_norm\": 0.7281553398058253,\n \"acc_norm_stderr\": 0.044052680241409216\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.9017094017094017,\n\ \ \"acc_stderr\": 0.019503444900757567,\n \"acc_norm\": 0.9017094017094017,\n\ \ \"acc_norm_stderr\": 0.019503444900757567\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.74,\n \"acc_stderr\": 0.04408440022768078,\n \ \ \"acc_norm\": 0.74,\n \"acc_norm_stderr\": 0.04408440022768078\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8250319284802043,\n\ \ \"acc_stderr\": 0.01358661921990333,\n \"acc_norm\": 0.8250319284802043,\n\ \ \"acc_norm_stderr\": 0.01358661921990333\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.6791907514450867,\n \"acc_stderr\": 0.025131000233647897,\n\ \ \"acc_norm\": 0.6791907514450867,\n \"acc_norm_stderr\": 0.025131000233647897\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4134078212290503,\n\ \ \"acc_stderr\": 0.016469814928406164,\n \"acc_norm\": 0.4134078212290503,\n\ \ \"acc_norm_stderr\": 0.016469814928406164\n },\n \"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.7124183006535948,\n 
\"acc_stderr\": 0.02591780611714716,\n\ \ \"acc_norm\": 0.7124183006535948,\n \"acc_norm_stderr\": 0.02591780611714716\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6913183279742765,\n\ \ \"acc_stderr\": 0.02623696588115327,\n \"acc_norm\": 0.6913183279742765,\n\ \ \"acc_norm_stderr\": 0.02623696588115327\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.7098765432098766,\n \"acc_stderr\": 0.025251173936495033,\n\ \ \"acc_norm\": 0.7098765432098766,\n \"acc_norm_stderr\": 0.025251173936495033\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.4716312056737589,\n \"acc_stderr\": 0.02977945095730307,\n \ \ \"acc_norm\": 0.4716312056737589,\n \"acc_norm_stderr\": 0.02977945095730307\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.44654498044328556,\n\ \ \"acc_stderr\": 0.012697046024399675,\n \"acc_norm\": 0.44654498044328556,\n\ \ \"acc_norm_stderr\": 0.012697046024399675\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.6433823529411765,\n \"acc_stderr\": 0.029097209568411952,\n\ \ \"acc_norm\": 0.6433823529411765,\n \"acc_norm_stderr\": 0.029097209568411952\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.6454248366013072,\n \"acc_stderr\": 0.019353360547553697,\n \ \ \"acc_norm\": 0.6454248366013072,\n \"acc_norm_stderr\": 0.019353360547553697\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.7272727272727273,\n\ \ \"acc_stderr\": 0.04265792110940588,\n \"acc_norm\": 0.7272727272727273,\n\ \ \"acc_norm_stderr\": 0.04265792110940588\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.7183673469387755,\n \"acc_stderr\": 0.028795185574291296,\n\ \ \"acc_norm\": 0.7183673469387755,\n \"acc_norm_stderr\": 0.028795185574291296\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8009950248756219,\n\ \ \"acc_stderr\": 0.028231365092758406,\n \"acc_norm\": 0.8009950248756219,\n\ \ 
\"acc_norm_stderr\": 0.028231365092758406\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.83,\n \"acc_stderr\": 0.03775251680686371,\n \ \ \"acc_norm\": 0.83,\n \"acc_norm_stderr\": 0.03775251680686371\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5240963855421686,\n\ \ \"acc_stderr\": 0.03887971849597264,\n \"acc_norm\": 0.5240963855421686,\n\ \ \"acc_norm_stderr\": 0.03887971849597264\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.8421052631578947,\n \"acc_stderr\": 0.027966785859160882,\n\ \ \"acc_norm\": 0.8421052631578947,\n \"acc_norm_stderr\": 0.027966785859160882\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.4357405140758874,\n\ \ \"mc1_stderr\": 0.017358345398863124,\n \"mc2\": 0.6024313355556469,\n\ \ \"mc2_stderr\": 0.01530031296029918\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.8018942383583267,\n \"acc_stderr\": 0.011201862744487052\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.5943896891584534,\n \ \ \"acc_stderr\": 0.013524848894462115\n }\n}\n```" repo_url: https://huggingface.co/freecs/ThetaWave-7B-v0.1 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|arc:challenge|25_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-01-24T07-54-32.474467.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|gsm8k|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hellaswag|10_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2024-01-24T07-54-32.474467.parquet' - 
config_name: harness_hendrycksTest_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-24T07-54-32.474467.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-24T07-54-32.474467.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-01-24T07-54-32.474467.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-24T07-54-32.474467.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-management|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-01-24T07-54-32.474467.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-01-24T07-54-32.474467.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-24T07-54-32.474467.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-management|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-marketing|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-virology|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-01-24T07-54-32.474467.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|truthfulqa:mc|0_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-01-24T07-54-32.474467.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_01_24T07_54_32.474467 path: - '**/details_harness|winogrande|5_2024-01-24T07-54-32.474467.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-01-24T07-54-32.474467.parquet' - config_name: results data_files: - split: 
2024_01_24T07_54_32.474467 path: - results_2024-01-24T07-54-32.474467.parquet - split: latest path: - results_2024-01-24T07-54-32.474467.parquet
---

# Dataset Card for Evaluation run of freecs/ThetaWave-7B-v0.1

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [freecs/ThetaWave-7B-v0.1](https://huggingface.co/freecs/ThetaWave-7B-v0.1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_freecs__ThetaWave-7B-v0.1",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-01-24T07:54:32.474467](https://huggingface.co/datasets/open-llm-leaderboard/details_freecs__ThetaWave-7B-v0.1/blob/main/results_2024-01-24T07-54-32.474467.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.6369850549630417, "acc_stderr": 0.03254410038117921, "acc_norm": 0.6388746936562116, "acc_norm_stderr": 0.03320176501025405, "mc1": 0.4357405140758874, "mc1_stderr": 0.017358345398863124, "mc2": 0.6024313355556469, "mc2_stderr": 0.01530031296029918 }, "harness|arc:challenge|25": { "acc": 0.621160409556314, "acc_stderr": 0.014175915490000326, "acc_norm": 0.6629692832764505, "acc_norm_stderr": 0.013813476652902274 }, "harness|hellaswag|10": { "acc": 0.6620195180242979, "acc_stderr": 0.004720551323547126, "acc_norm": 0.8540131447918742, "acc_norm_stderr": 0.0035237141526513 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.32, "acc_stderr": 0.046882617226215034, "acc_norm": 0.32, "acc_norm_stderr": 0.046882617226215034 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.6148148148148148, "acc_stderr": 0.04203921040156279, "acc_norm": 0.6148148148148148, "acc_norm_stderr": 0.04203921040156279 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.7039473684210527, "acc_stderr": 0.03715062154998904, "acc_norm": 0.7039473684210527, "acc_norm_stderr": 0.03715062154998904 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.59, "acc_stderr": 0.04943110704237102, "acc_norm": 0.59, "acc_norm_stderr": 0.04943110704237102 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.7132075471698113, "acc_stderr": 0.02783491252754406, "acc_norm": 0.7132075471698113, "acc_norm_stderr": 0.02783491252754406 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.75, "acc_stderr": 0.03621034121889507, "acc_norm": 0.75, "acc_norm_stderr": 0.03621034121889507 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.45, "acc_stderr": 0.05, "acc_norm": 0.45, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.55, "acc_stderr": 0.05, "acc_norm": 0.55, "acc_norm_stderr": 0.05 }, "harness|hendrycksTest-college_mathematics|5": { "acc": 0.33, 
"acc_stderr": 0.04725815626252604, "acc_norm": 0.33, "acc_norm_stderr": 0.04725815626252604 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.6473988439306358, "acc_stderr": 0.036430371689585475, "acc_norm": 0.6473988439306358, "acc_norm_stderr": 0.036430371689585475 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.46078431372549017, "acc_stderr": 0.04959859966384181, "acc_norm": 0.46078431372549017, "acc_norm_stderr": 0.04959859966384181 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.79, "acc_stderr": 0.040936018074033256, "acc_norm": 0.79, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.5574468085106383, "acc_stderr": 0.032469569197899575, "acc_norm": 0.5574468085106383, "acc_norm_stderr": 0.032469569197899575 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.4649122807017544, "acc_stderr": 0.046920083813689104, "acc_norm": 0.4649122807017544, "acc_norm_stderr": 0.046920083813689104 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5793103448275863, "acc_stderr": 0.0411391498118926, "acc_norm": 0.5793103448275863, "acc_norm_stderr": 0.0411391498118926 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.4126984126984127, "acc_stderr": 0.025355741263055266, "acc_norm": 0.4126984126984127, "acc_norm_stderr": 0.025355741263055266 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.4444444444444444, "acc_stderr": 0.044444444444444495, "acc_norm": 0.4444444444444444, "acc_norm_stderr": 0.044444444444444495 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.36, "acc_stderr": 0.04824181513244218, "acc_norm": 0.36, "acc_norm_stderr": 0.04824181513244218 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.7290322580645161, "acc_stderr": 0.025284416114900152, "acc_norm": 0.7290322580645161, "acc_norm_stderr": 0.025284416114900152 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 0.4827586206896552, "acc_stderr": 0.035158955511657, 
"acc_norm": 0.4827586206896552, "acc_norm_stderr": 0.035158955511657 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.63, "acc_stderr": 0.04852365870939098, "acc_norm": 0.63, "acc_norm_stderr": 0.04852365870939098 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7696969696969697, "acc_stderr": 0.0328766675860349, "acc_norm": 0.7696969696969697, "acc_norm_stderr": 0.0328766675860349 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7777777777777778, "acc_stderr": 0.02962022787479049, "acc_norm": 0.7777777777777778, "acc_norm_stderr": 0.02962022787479049 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.8704663212435233, "acc_stderr": 0.024233532297758723, "acc_norm": 0.8704663212435233, "acc_norm_stderr": 0.024233532297758723 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.617948717948718, "acc_stderr": 0.024635549163908234, "acc_norm": 0.617948717948718, "acc_norm_stderr": 0.024635549163908234 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.362962962962963, "acc_stderr": 0.02931820364520686, "acc_norm": 0.362962962962963, "acc_norm_stderr": 0.02931820364520686 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6554621848739496, "acc_stderr": 0.03086868260412162, "acc_norm": 0.6554621848739496, "acc_norm_stderr": 0.03086868260412162 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.37748344370860926, "acc_stderr": 0.03958027231121569, "acc_norm": 0.37748344370860926, "acc_norm_stderr": 0.03958027231121569 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.818348623853211, "acc_stderr": 0.01653061740926685, "acc_norm": 0.818348623853211, "acc_norm_stderr": 0.01653061740926685 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.5, "acc_stderr": 0.034099716973523674, "acc_norm": 0.5, "acc_norm_stderr": 0.034099716973523674 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 
0.7990196078431373, "acc_stderr": 0.028125972265654373, "acc_norm": 0.7990196078431373, "acc_norm_stderr": 0.028125972265654373 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7805907172995781, "acc_stderr": 0.026939106581553945, "acc_norm": 0.7805907172995781, "acc_norm_stderr": 0.026939106581553945 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6636771300448431, "acc_stderr": 0.031708824268455, "acc_norm": 0.6636771300448431, "acc_norm_stderr": 0.031708824268455 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.7557251908396947, "acc_stderr": 0.03768335959728744, "acc_norm": 0.7557251908396947, "acc_norm_stderr": 0.03768335959728744 }, "harness|hendrycksTest-international_law|5": { "acc": 0.8181818181818182, "acc_stderr": 0.03520893951097652, "acc_norm": 0.8181818181818182, "acc_norm_stderr": 0.03520893951097652 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.7592592592592593, "acc_stderr": 0.04133119440243839, "acc_norm": 0.7592592592592593, "acc_norm_stderr": 0.04133119440243839 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.7361963190184049, "acc_stderr": 0.03462419931615624, "acc_norm": 0.7361963190184049, "acc_norm_stderr": 0.03462419931615624 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.48214285714285715, "acc_stderr": 0.047427623612430116, "acc_norm": 0.48214285714285715, "acc_norm_stderr": 0.047427623612430116 }, "harness|hendrycksTest-management|5": { "acc": 0.7281553398058253, "acc_stderr": 0.044052680241409216, "acc_norm": 0.7281553398058253, "acc_norm_stderr": 0.044052680241409216 }, "harness|hendrycksTest-marketing|5": { "acc": 0.9017094017094017, "acc_stderr": 0.019503444900757567, "acc_norm": 0.9017094017094017, "acc_norm_stderr": 0.019503444900757567 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.74, "acc_stderr": 0.04408440022768078, "acc_norm": 0.74, "acc_norm_stderr": 0.04408440022768078 }, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.8250319284802043, 
"acc_stderr": 0.01358661921990333, "acc_norm": 0.8250319284802043, "acc_norm_stderr": 0.01358661921990333 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.6791907514450867, "acc_stderr": 0.025131000233647897, "acc_norm": 0.6791907514450867, "acc_norm_stderr": 0.025131000233647897 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.4134078212290503, "acc_stderr": 0.016469814928406164, "acc_norm": 0.4134078212290503, "acc_norm_stderr": 0.016469814928406164 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.7124183006535948, "acc_stderr": 0.02591780611714716, "acc_norm": 0.7124183006535948, "acc_norm_stderr": 0.02591780611714716 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6913183279742765, "acc_stderr": 0.02623696588115327, "acc_norm": 0.6913183279742765, "acc_norm_stderr": 0.02623696588115327 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.7098765432098766, "acc_stderr": 0.025251173936495033, "acc_norm": 0.7098765432098766, "acc_norm_stderr": 0.025251173936495033 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.4716312056737589, "acc_stderr": 0.02977945095730307, "acc_norm": 0.4716312056737589, "acc_norm_stderr": 0.02977945095730307 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.44654498044328556, "acc_stderr": 0.012697046024399675, "acc_norm": 0.44654498044328556, "acc_norm_stderr": 0.012697046024399675 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.6433823529411765, "acc_stderr": 0.029097209568411952, "acc_norm": 0.6433823529411765, "acc_norm_stderr": 0.029097209568411952 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.6454248366013072, "acc_stderr": 0.019353360547553697, "acc_norm": 0.6454248366013072, "acc_norm_stderr": 0.019353360547553697 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.7272727272727273, "acc_stderr": 0.04265792110940588, "acc_norm": 0.7272727272727273, "acc_norm_stderr": 0.04265792110940588 }, "harness|hendrycksTest-security_studies|5": { "acc": 
0.7183673469387755, "acc_stderr": 0.028795185574291296, "acc_norm": 0.7183673469387755, "acc_norm_stderr": 0.028795185574291296 }, "harness|hendrycksTest-sociology|5": { "acc": 0.8009950248756219, "acc_stderr": 0.028231365092758406, "acc_norm": 0.8009950248756219, "acc_norm_stderr": 0.028231365092758406 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.83, "acc_stderr": 0.03775251680686371, "acc_norm": 0.83, "acc_norm_stderr": 0.03775251680686371 }, "harness|hendrycksTest-virology|5": { "acc": 0.5240963855421686, "acc_stderr": 0.03887971849597264, "acc_norm": 0.5240963855421686, "acc_norm_stderr": 0.03887971849597264 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.8421052631578947, "acc_stderr": 0.027966785859160882, "acc_norm": 0.8421052631578947, "acc_norm_stderr": 0.027966785859160882 }, "harness|truthfulqa:mc|0": { "mc1": 0.4357405140758874, "mc1_stderr": 0.017358345398863124, "mc2": 0.6024313355556469, "mc2_stderr": 0.01530031296029918 }, "harness|winogrande|5": { "acc": 0.8018942383583267, "acc_stderr": 0.011201862744487052 }, "harness|gsm8k|5": { "acc": 0.5943896891584534, "acc_stderr": 0.013524848894462115 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
--> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
Provider:
open-llm-leaderboard-old
Summary of original information

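Dataset overview

The dataset was created automatically during the evaluation run of the model freecs/ThetaWave-7B-v0.1 on the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 63 configurations, each corresponding to one evaluated task.
  • The dataset comes from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional "results" configuration stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.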
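
The naming pattern visible in the configuration metadata can be sketched in a few lines of Python. The helpers below are hypothetical (`config_name_for` and `split_name_for` are not part of any library). They assume that a configuration name is the harness task key with every `|`, `-`, and `:` replaced by an underscore, and that a split name is the run timestamp with `-` and `:` replaced the same way, which is what the listed entries suggest.

```python
import re


def config_name_for(task_key: str) -> str:
    """Map a harness task key to the configuration name seen in the metadata.

    Assumption: every '|', '-' and ':' becomes '_', matching entries such as
    'harness|truthfulqa:mc|0' -> 'harness_truthfulqa_mc_0'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name_for(run_timestamp: str) -> str:
    """Map a run timestamp to its split name (assumed: '-' and ':' become '_')."""
    return re.sub(r"[:\-]", "_", run_timestamp)


print(config_name_for("harness|hendrycksTest-college_physics|5"))
print(split_name_for("2024-01-24T07:54:32.474467"))
```

The resulting strings can be passed as the configuration name and `split` argument when loading, assuming the repository follows the pattern shown in its metadata.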

Data loading example

    from datasets import load_dataset

    data = load_dataset("open-llm-leaderboard/details_freecs__ThetaWave-7B-v0.1", "harness_winogrande_5", split="latest")

Latest results

The latest results come from run 2024-01-24T07:54:32.474467. The full per-task metrics are listed in the "Latest results" section of the dataset card above.

Collected Summary
Dataset Introduction
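Construction
The Open LLM Leaderboard provides a standardized platform for quantifying the performance of large language models. This dataset was created automatically during the evaluation of the freecs/ThetaWave-7B-v0.1 model on that leaderboard. It consists of 63 configurations, each corresponding to one evaluated task, and the tasks range from commonsense reasoning to mathematical problem solving. The data come from a single evaluation run. Each run is stored as its own split within a configuration, and the split is named after the run's timestamp so that different runs can be told apart. The "train" split always points to the latest results, while the additional "results" configuration stores the aggregated results of the run and is what the leaderboard uses to compute and display the model's overall metrics.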
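The timestamp-based naming described above can be sketched as follows. This is a minimal illustration rather than the leaderboard's own code: the run timestamp and the results file name are taken from the dataset card, while the helper names and the second, earlier timestamp are hypothetical.

```python
from datetime import datetime


def results_filename(run_timestamp: str) -> str:
    """Build the results file name for a run timestamp.

    Mirrors the pattern visible in the dataset card, where the run
    2024-01-24T07:54:32.474467 is stored as
    results_2024-01-24T07-54-32.474467.json (colons become hyphens).
    """
    return f"results_{run_timestamp.replace(':', '-')}.json"


def latest_run(run_timestamps: list[str]) -> str:
    """Return the most recent run timestamp from a list of ISO 8601 strings."""
    return max(run_timestamps, key=datetime.fromisoformat)


runs = [
    "2023-12-01T10:00:00.000000",  # hypothetical earlier run, for illustration only
    "2024-01-24T07:54:32.474467",  # the run documented in the dataset card
]
newest = latest_run(runs)
print(newest)
print(results_filename(newest))
```

Comparing parsed `datetime` values instead of raw strings keeps the selection correct even if the timestamps were ever written with differing precision.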
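Features
The dataset's main features are its fine-grained breakdown by task and its timestamp-based versioning. The 63 configurations correspond to evaluation tasks such as ARC-Challenge, HellaSwag, and GSM8K, and each task stores detailed statistics such as accuracy and its standard error. Because splits are named by timestamp, the dataset can retain the history of evaluations, which lets researchers trace how a model's scores change over time. The "latest" split ensures that users can always retrieve the most recent results, and the "results" configuration consolidates cross-task scores, such as overall accuracy and normalized accuracy, to give a common basis for comparing models. This layered design serves both per-task analysis and a high-level performance overview.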
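As a rough sketch of how per-task scores might be rolled up into per-benchmark figures, the snippet below groups an excerpt of the results by benchmark and averages accuracy. The accuracy values are copied from the dataset card; the helper names are hypothetical, the excerpt covers only three MMLU (`hendrycksTest`) subjects, and the use of an unweighted mean is an assumption rather than a confirmed description of the leaderboard's aggregation.

```python
from collections import defaultdict
from statistics import mean

# Excerpt of the "results" JSON shown in the dataset card (accuracy only).
results_excerpt = {
    "harness|arc:challenge|25": {"acc": 0.621160409556314},
    "harness|hellaswag|10": {"acc": 0.6620195180242979},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.32},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6148148148148148},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.7039473684210527},
}


def parse_task_key(key: str) -> tuple[str, str, int]:
    """Split 'harness|<task>|<n_shot>' into (benchmark, subtask, n_shot)."""
    _, task, n_shot = key.split("|")
    # 'hendrycksTest-anatomy' -> ('hendrycksTest', 'anatomy'); no hyphen -> empty subtask.
    benchmark, _, subtask = task.partition("-")
    return benchmark, subtask, int(n_shot)


def mean_acc_by_benchmark(results: dict) -> dict[str, float]:
    """Unweighted mean accuracy per benchmark (an assumed aggregation)."""
    grouped = defaultdict(list)
    for key, metrics in results.items():
        if key == "all":  # skip the pre-aggregated entry
            continue
        benchmark, _, _ = parse_task_key(key)
        grouped[benchmark].append(metrics["acc"])
    return {name: mean(values) for name, values in grouped.items()}


summary = mean_acc_by_benchmark(results_excerpt)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The trailing number in each key is parsed as the few-shot count, which matches the 25-shot ARC and 10-shot HellaSwag keys shown in the card.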
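Usage
The dataset can be loaded with the Hugging Face datasets library. Call load_dataset with the dataset name and the target configuration. For example, to load the latest results for the Winogrande task, set the configuration name to "harness_winogrande_5" and the split to "train". The data in each configuration are stored in Parquet format, which supports efficient reads. For cross-task analysis, users can iterate over all 63 configurations and extract metrics such as accuracy for each task to build a complete picture of the model's performance. The "results" configuration provides the aggregated results in JSON form, which can be used directly for reports or visualizations and reduces the post-processing needed.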
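The loading step described above might look like the sketch below. `load_dataset` and the `"harness_winogrande_5"` configuration come from the dataset card; the two helper functions are hypothetical, and the rule that turns a results key into a configuration name (replacing `|`, `:` and `-` with `_`) is an assumption that the card confirms only for Winogrande. The repository id is copied from the card's example; the page header lists the dataset under `open-llm-leaderboard-old`, so the id may need adjusting.

```python
import re


def task_key_to_config(task_key: str) -> str:
    """Map a results key such as 'harness|winogrande|5' to a configuration name.

    Assumption: '|', ':' and '-' all become '_'. The dataset card confirms
    the resulting name only for 'harness_winogrande_5'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def load_task_details(repo_id: str, task_key: str, split: str = "train"):
    """Load the per-sample details for one task (requires network access)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, task_key_to_config(task_key), split=split)


print(task_key_to_config("harness|winogrande|5"))  # harness_winogrande_5

# Example call, commented out because it downloads data:
# details = load_task_details(
#     "open-llm-leaderboard/details_freecs__ThetaWave-7B-v0.1",
#     "harness|winogrande|5",
# )
```

To iterate over every configuration instead of deriving names by hand, `datasets.get_dataset_config_names(repo_id)` returns the configuration names that actually exist in the repository.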
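Background and Challenges
Background Overview
As large language models have become widely used in natural language processing, systematic and standardized evaluation of their performance has become essential to progress in the field. Against this background, the Hugging Face team launched the Open LLM Leaderboard project in 2023 to measure, within a single benchmarking framework, how different models perform across a range of tasks. This dataset records the results of the evaluation run of the freecs/ThetaWave-7B-v0.1 model on January 24, 2024, with construction led by Clémentine and others. It contains 63 configurations corresponding to tasks such as ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, which together reflect the model's abilities in reasoning, commonsense, mathematics, and knowledge understanding. The evaluation setup gives researchers a reproducible reference for model performance and supports fair comparison and iterative improvement of open-source large models.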
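Current Challenges
The problem this dataset addresses is the lack of a unified standard for evaluating large models: different studies use different benchmarks, so their results are hard to compare. Building it involved three challenges. First, the task set has to cover several dimensions of ability for the evaluation to be comprehensive. For example, ARC-Challenge tests scientific reasoning, GSM8K tests mathematical ability, and MMLU covers knowledge in 57 subjects. Second, the results have to be standardized and reproducible, which is handled by storing the details of each task in its own configuration (63 in total) and managing the data in Parquet format with timestamp-named splits. Third, the scores carry statistical uncertainty: the standard errors of the TruthfulQA MC1 and MC2 metrics are about 1.7% and 1.5% respectively, so differences of that size should be read with care, and averaging over multiple runs can make the evaluation more robust.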
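The size of that uncertainty can be illustrated with a short calculation. This is a sketch under assumptions, not the evaluation harness's code: it treats each score as a mean of n independent 0/1 outcomes and uses the sample-variance form of the standard error. With an assumed n = 100 questions it gives 0.05 for an accuracy of 0.45, the same value the card reports for college_chemistry, which is consistent with this formula but does not prove it is how the figures were produced.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of a mean of n 0/1 scores, using the sample variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval (z = 1.96 gives roughly 95% coverage)."""
    return acc - z * stderr, acc + z * stderr


# acc = 0.45 is from the card; n = 100 is an assumed question count, not stated there.
se = accuracy_stderr(0.45, 100)
print(f"stderr for acc=0.45, n=100: {se:.4f}")

# MC1 score and standard error copied from the "all" entry of the card.
low, high = confidence_interval(0.4357405140758874, 0.017358345398863124)
print(f"MC1 95% interval: [{low:.3f}, {high:.3f}]")
```

An interval several points wide means that two models whose MC1 scores differ by a point or two cannot be reliably ranked on that metric alone.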
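Common Scenarios
Typical Use Case
With large language models (LLMs) developing rapidly, evaluating them fairly and comprehensively has become a shared concern of academia and industry. As the output of an Open LLM Leaderboard evaluation run, this dataset records the fine-grained performance of the ThetaWave-7B-v0.1 model across 63 task configurations. Its most typical use is to give researchers a standardized, reproducible profile of the model's capabilities, covering ARC-Challenge, HellaSwag commonsense reasoning, GSM8K mathematical problem solving, and the MMLU test spanning 57 subjects, so that the model's reasoning, knowledge, and multi-domain understanding can be examined along several dimensions.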
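Academic Problems Addressed
The dataset addresses the fragmentation and poor reproducibility that are common in LLM evaluation. By managing each task's raw scores, standard errors, and normalized metrics in one place, it makes performance comparisons across models and across points in time possible. Academic studies often use data of this kind to check whether a newly proposed training strategy or architectural change brings a real improvement, for example by analyzing what ThetaWave-7B's 80.19% accuracy on Winogrande suggests about its commonsense reasoning, which in turn supports theoretical work on model robustness and generalization. Its value lies in giving the open-source community a transparent and trustworthy basis for evaluation.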
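Derived Work
Evaluation datasets of this kind have led to a range of follow-up work. On one hand, researchers have built meta-analyses of model rankings from the leaderboard's public results, showing how parameter count and training-data mix affect performance. On the other hand, the dataset has encouraged the development of automated evaluation toolchains, for example using its structured format to reproduce evaluation pipelines quickly, and has spurred iteration on standardized testing frameworks such as the Harness. Together, this work forms an evaluation ecosystem that helps move LLMs from the lab toward large-scale, trustworthy deployment.
The content above was collected and summarized by 遇见数据集.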