
open-llm-leaderboard-old/details_Qwen__Qwen2-1.5B

Hugging Face · Updated 2024-05-30 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_Qwen__Qwen2-1.5B
Resource description:
--- pretty_name: Evaluation run of Qwen/Qwen2-1.5B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 63 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"train\" split is always pointing to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Qwen__Qwen2-1.5B_private\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"train\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2024-05-30T12:39:22.414033](https://huggingface.co/datasets/open-llm-leaderboard/details_Qwen__Qwen2-1.5B_private/blob/main/results_2024-05-30T12-39-22.414033.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks.
You can find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.5695668154598998,\n\ \ \"acc_stderr\": 0.033895050381280295,\n \"acc_norm\": 0.571987728994217,\n\ \ \"acc_norm_stderr\": 0.03458235640417673,\n \"mc1\": 0.3023255813953488,\n\ \ \"mc1_stderr\": 0.016077509266133026,\n \"mc2\": 0.4591892724530115,\n\ \ \"mc2_stderr\": 0.014394535619493496\n },\n \"harness|arc:challenge|25\"\ : {\n \"acc\": 0.4104095563139932,\n \"acc_stderr\": 0.01437492219264266,\n\ \ \"acc_norm\": 0.44283276450511944,\n \"acc_norm_stderr\": 0.014515573873348902\n\ \ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.48864767974507073,\n\ \ \"acc_stderr\": 0.004988495127747281,\n \"acc_norm\": 0.6666998605855408,\n\ \ \"acc_norm_stderr\": 0.0047042938987299\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\ : {\n \"acc\": 0.34,\n \"acc_stderr\": 0.047609522856952365,\n \ \ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.047609522856952365\n \ \ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.4888888888888889,\n\ \ \"acc_stderr\": 0.04318275491977976,\n \"acc_norm\": 0.4888888888888889,\n\ \ \"acc_norm_stderr\": 0.04318275491977976\n },\n \"harness|hendrycksTest-astronomy|5\"\ : {\n \"acc\": 0.5526315789473685,\n \"acc_stderr\": 0.04046336883978251,\n\ \ \"acc_norm\": 0.5526315789473685,\n \"acc_norm_stderr\": 0.04046336883978251\n\ \ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.67,\n\ \ \"acc_stderr\": 0.047258156262526066,\n \"acc_norm\": 0.67,\n \ \ \"acc_norm_stderr\": 0.047258156262526066\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\ : {\n \"acc\": 0.6,\n \"acc_stderr\": 0.030151134457776285,\n \ \ \"acc_norm\": 0.6,\n \"acc_norm_stderr\": 0.030151134457776285\n \ \ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.5625,\n\ \ \"acc_stderr\": 0.04148415739394154,\n \"acc_norm\": 0.5625,\n \ \ \"acc_norm_stderr\": 0.04148415739394154\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.38,\n \"acc_stderr\": 0.04878317312145633,\n \ \ \"acc_norm\": 0.38,\n \"acc_norm_stderr\": 0.04878317312145633\n \ \ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\ : 0.51,\n \"acc_stderr\": 0.05024183937956911,\n \"acc_norm\": 0.51,\n\ \ \"acc_norm_stderr\": 0.05024183937956911\n },\n \"harness|hendrycksTest-college_mathematics|5\"\ : {\n \"acc\": 0.32,\n \"acc_stderr\": 0.04688261722621504,\n \ \ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.04688261722621504\n \ \ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.4913294797687861,\n\ \ \"acc_stderr\": 0.03811890988940412,\n \"acc_norm\": 0.4913294797687861,\n\ \ \"acc_norm_stderr\": 0.03811890988940412\n },\n \"harness|hendrycksTest-college_physics|5\"\ : {\n \"acc\": 0.4019607843137255,\n \"acc_stderr\": 0.04878608714466996,\n\ \ \"acc_norm\": 0.4019607843137255,\n \"acc_norm_stderr\": 0.04878608714466996\n\ \ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\ \ 0.66,\n \"acc_stderr\": 0.04760952285695237,\n \"acc_norm\": 0.66,\n\ \ \"acc_norm_stderr\": 0.04760952285695237\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\ : {\n \"acc\": 0.502127659574468,\n \"acc_stderr\": 0.03268572658667492,\n\ \ \"acc_norm\": 0.502127659574468,\n \"acc_norm_stderr\": 0.03268572658667492\n\ \ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.38596491228070173,\n\ \ \"acc_stderr\": 0.04579639422070434,\n \"acc_norm\": 0.38596491228070173,\n\ \ \"acc_norm_stderr\": 0.04579639422070434\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\ : {\n \"acc\": 0.5793103448275863,\n \"acc_stderr\": 0.0411391498118926,\n\ \ \"acc_norm\": 0.5793103448275863,\n \"acc_norm_stderr\": 0.0411391498118926\n\ \ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\ : 0.42328042328042326,\n \"acc_stderr\": 0.025446365634406765,\n \"\ acc_norm\": 0.42328042328042326,\n \"acc_norm_stderr\": 0.025446365634406765\n\ \ },\n 
\"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.3888888888888889,\n\ \ \"acc_stderr\": 0.04360314860077459,\n \"acc_norm\": 0.3888888888888889,\n\ \ \"acc_norm_stderr\": 0.04360314860077459\n },\n \"harness|hendrycksTest-global_facts|5\"\ : {\n \"acc\": 0.36,\n \"acc_stderr\": 0.04824181513244218,\n \ \ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.04824181513244218\n \ \ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.6419354838709678,\n\ \ \"acc_stderr\": 0.02727389059430064,\n \"acc_norm\": 0.6419354838709678,\n\ \ \"acc_norm_stderr\": 0.02727389059430064\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\ : {\n \"acc\": 0.49261083743842365,\n \"acc_stderr\": 0.03517603540361008,\n\ \ \"acc_norm\": 0.49261083743842365,\n \"acc_norm_stderr\": 0.03517603540361008\n\ \ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \ \ \"acc\": 0.56,\n \"acc_stderr\": 0.04988876515698589,\n \"acc_norm\"\ : 0.56,\n \"acc_norm_stderr\": 0.04988876515698589\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\ : {\n \"acc\": 0.7090909090909091,\n \"acc_stderr\": 0.03546563019624336,\n\ \ \"acc_norm\": 0.7090909090909091,\n \"acc_norm_stderr\": 0.03546563019624336\n\ \ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\ : 0.7373737373737373,\n \"acc_stderr\": 0.031353050095330855,\n \"\ acc_norm\": 0.7373737373737373,\n \"acc_norm_stderr\": 0.031353050095330855\n\ \ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\ \ \"acc\": 0.7875647668393783,\n \"acc_stderr\": 0.02951928261681723,\n\ \ \"acc_norm\": 0.7875647668393783,\n \"acc_norm_stderr\": 0.02951928261681723\n\ \ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \ \ \"acc\": 0.5846153846153846,\n \"acc_stderr\": 0.02498535492310234,\n \ \ \"acc_norm\": 0.5846153846153846,\n \"acc_norm_stderr\": 0.02498535492310234\n\ \ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\ acc\": 
0.3592592592592593,\n \"acc_stderr\": 0.029252905927251983,\n \ \ \"acc_norm\": 0.3592592592592593,\n \"acc_norm_stderr\": 0.029252905927251983\n\ \ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \ \ \"acc\": 0.6218487394957983,\n \"acc_stderr\": 0.031499305777849054,\n\ \ \"acc_norm\": 0.6218487394957983,\n \"acc_norm_stderr\": 0.031499305777849054\n\ \ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\ : 0.3576158940397351,\n \"acc_stderr\": 0.03913453431177258,\n \"\ acc_norm\": 0.3576158940397351,\n \"acc_norm_stderr\": 0.03913453431177258\n\ \ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\ : 0.7522935779816514,\n \"acc_stderr\": 0.018508143602547825,\n \"\ acc_norm\": 0.7522935779816514,\n \"acc_norm_stderr\": 0.018508143602547825\n\ \ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\ : 0.4444444444444444,\n \"acc_stderr\": 0.03388857118502326,\n \"\ acc_norm\": 0.4444444444444444,\n \"acc_norm_stderr\": 0.03388857118502326\n\ \ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\ : 0.6470588235294118,\n \"acc_stderr\": 0.03354092437591519,\n \"\ acc_norm\": 0.6470588235294118,\n \"acc_norm_stderr\": 0.03354092437591519\n\ \ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\ acc\": 0.7046413502109705,\n \"acc_stderr\": 0.029696338713422882,\n \ \ \"acc_norm\": 0.7046413502109705,\n \"acc_norm_stderr\": 0.029696338713422882\n\ \ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.6188340807174888,\n\ \ \"acc_stderr\": 0.03259625118416828,\n \"acc_norm\": 0.6188340807174888,\n\ \ \"acc_norm_stderr\": 0.03259625118416828\n },\n \"harness|hendrycksTest-human_sexuality|5\"\ : {\n \"acc\": 0.6717557251908397,\n \"acc_stderr\": 0.04118438565806298,\n\ \ \"acc_norm\": 0.6717557251908397,\n \"acc_norm_stderr\": 0.04118438565806298\n\ \ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\ \ 0.743801652892562,\n \"acc_stderr\": 
0.03984979653302871,\n \"acc_norm\"\ : 0.743801652892562,\n \"acc_norm_stderr\": 0.03984979653302871\n },\n\ \ \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.6944444444444444,\n\ \ \"acc_stderr\": 0.044531975073749834,\n \"acc_norm\": 0.6944444444444444,\n\ \ \"acc_norm_stderr\": 0.044531975073749834\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\ : {\n \"acc\": 0.6871165644171779,\n \"acc_stderr\": 0.036429145782924076,\n\ \ \"acc_norm\": 0.6871165644171779,\n \"acc_norm_stderr\": 0.036429145782924076\n\ \ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.41964285714285715,\n\ \ \"acc_stderr\": 0.04684099321077106,\n \"acc_norm\": 0.41964285714285715,\n\ \ \"acc_norm_stderr\": 0.04684099321077106\n },\n \"harness|hendrycksTest-management|5\"\ : {\n \"acc\": 0.8349514563106796,\n \"acc_stderr\": 0.036756688322331886,\n\ \ \"acc_norm\": 0.8349514563106796,\n \"acc_norm_stderr\": 0.036756688322331886\n\ \ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8034188034188035,\n\ \ \"acc_stderr\": 0.02603538609895129,\n \"acc_norm\": 0.8034188034188035,\n\ \ \"acc_norm_stderr\": 0.02603538609895129\n },\n \"harness|hendrycksTest-medical_genetics|5\"\ : {\n \"acc\": 0.62,\n \"acc_stderr\": 0.048783173121456316,\n \ \ \"acc_norm\": 0.62,\n \"acc_norm_stderr\": 0.048783173121456316\n \ \ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.7203065134099617,\n\ \ \"acc_stderr\": 0.01605079214803652,\n \"acc_norm\": 0.7203065134099617,\n\ \ \"acc_norm_stderr\": 0.01605079214803652\n },\n \"harness|hendrycksTest-moral_disputes|5\"\ : {\n \"acc\": 0.6271676300578035,\n \"acc_stderr\": 0.02603389061357628,\n\ \ \"acc_norm\": 0.6271676300578035,\n \"acc_norm_stderr\": 0.02603389061357628\n\ \ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.29720670391061454,\n\ \ \"acc_stderr\": 0.0152853133536416,\n \"acc_norm\": 0.29720670391061454,\n\ \ \"acc_norm_stderr\": 0.0152853133536416\n },\n 
\"harness|hendrycksTest-nutrition|5\"\ : {\n \"acc\": 0.6928104575163399,\n \"acc_stderr\": 0.02641560191438898,\n\ \ \"acc_norm\": 0.6928104575163399,\n \"acc_norm_stderr\": 0.02641560191438898\n\ \ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.6430868167202572,\n\ \ \"acc_stderr\": 0.027210420375934023,\n \"acc_norm\": 0.6430868167202572,\n\ \ \"acc_norm_stderr\": 0.027210420375934023\n },\n \"harness|hendrycksTest-prehistory|5\"\ : {\n \"acc\": 0.5895061728395061,\n \"acc_stderr\": 0.027371350925124764,\n\ \ \"acc_norm\": 0.5895061728395061,\n \"acc_norm_stderr\": 0.027371350925124764\n\ \ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\ acc\": 0.43617021276595747,\n \"acc_stderr\": 0.029583452036284073,\n \ \ \"acc_norm\": 0.43617021276595747,\n \"acc_norm_stderr\": 0.029583452036284073\n\ \ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4282920469361147,\n\ \ \"acc_stderr\": 0.012638223880313161,\n \"acc_norm\": 0.4282920469361147,\n\ \ \"acc_norm_stderr\": 0.012638223880313161\n },\n \"harness|hendrycksTest-professional_medicine|5\"\ : {\n \"acc\": 0.5036764705882353,\n \"acc_stderr\": 0.030372015885428195,\n\ \ \"acc_norm\": 0.5036764705882353,\n \"acc_norm_stderr\": 0.030372015885428195\n\ \ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\ acc\": 0.553921568627451,\n \"acc_stderr\": 0.020109864547181357,\n \ \ \"acc_norm\": 0.553921568627451,\n \"acc_norm_stderr\": 0.020109864547181357\n\ \ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6,\n\ \ \"acc_stderr\": 0.0469237132203465,\n \"acc_norm\": 0.6,\n \ \ \"acc_norm_stderr\": 0.0469237132203465\n },\n \"harness|hendrycksTest-security_studies|5\"\ : {\n \"acc\": 0.6816326530612244,\n \"acc_stderr\": 0.029822533793982066,\n\ \ \"acc_norm\": 0.6816326530612244,\n \"acc_norm_stderr\": 0.029822533793982066\n\ \ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.7562189054726368,\n\ \ \"acc_stderr\": 
0.03036049015401466,\n \"acc_norm\": 0.7562189054726368,\n\ \ \"acc_norm_stderr\": 0.03036049015401466\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\ : {\n \"acc\": 0.79,\n \"acc_stderr\": 0.040936018074033256,\n \ \ \"acc_norm\": 0.79,\n \"acc_norm_stderr\": 0.040936018074033256\n \ \ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.463855421686747,\n\ \ \"acc_stderr\": 0.03882310850890593,\n \"acc_norm\": 0.463855421686747,\n\ \ \"acc_norm_stderr\": 0.03882310850890593\n },\n \"harness|hendrycksTest-world_religions|5\"\ : {\n \"acc\": 0.7426900584795322,\n \"acc_stderr\": 0.03352799844161865,\n\ \ \"acc_norm\": 0.7426900584795322,\n \"acc_norm_stderr\": 0.03352799844161865\n\ \ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3023255813953488,\n\ \ \"mc1_stderr\": 0.016077509266133026,\n \"mc2\": 0.4591892724530115,\n\ \ \"mc2_stderr\": 0.014394535619493496\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.648776637726914,\n \"acc_stderr\": 0.01341598137054513\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.5579984836997726,\n \ \ \"acc_stderr\": 0.013679514492814574\n }\n}\n```" repo_url: https://huggingface.co/Qwen/Qwen2-1.5B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|arc:challenge|25_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2024-05-30T12-39-22.414033.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|gsm8k|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hellaswag|10_2024-05-30T12-39-22.414033.parquet' - split: latest path: - 
'**/details_harness|hellaswag|10_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-30T12-39-22.414033.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-management|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-30T12-39-22.414033.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2024-05-30T12-39-22.414033.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-30T12-39-22.414033.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-international_law|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-management|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-marketing|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2024-05-30T12-39-22.414033.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-sociology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-virology|5_2024-05-30T12-39-22.414033.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-anatomy|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-astronomy|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-college_biology|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-college_physics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-computer_security|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-econometrics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-30T12-39-22.414033.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-global_facts|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-human_aging|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-international_law|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-management|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-marketing|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-nutrition|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-philosophy|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-prehistory|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-professional_law|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-public_relations|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-sociology|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-virology|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|hendrycksTest-world_religions|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2024-05-30T12-39-22.414033.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|truthfulqa:mc|0_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2024-05-30T12-39-22.414033.parquet' - config_name: harness_winogrande_5 data_files: - split: 2024_05_30T12_39_22.414033 path: - '**/details_harness|winogrande|5_2024-05-30T12-39-22.414033.parquet' - split: latest path: - '**/details_harness|winogrande|5_2024-05-30T12-39-22.414033.parquet' - config_name: results data_files: - split: 
2024_05_30T12_39_22.414033 path: - results_2024-05-30T12-39-22.414033.parquet - split: latest path: - results_2024-05-30T12-39-22.414033.parquet
---

# Dataset Card for Evaluation run of Qwen/Qwen2-1.5B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_Qwen__Qwen2-1.5B_private",
    "harness_winogrande_5",
    split="train")
```

## Latest results

These are the [latest results from run 2024-05-30T12:39:22.414033](https://huggingface.co/datasets/open-llm-leaderboard/details_Qwen__Qwen2-1.5B_private/blob/main/results_2024-05-30T12-39-22.414033.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks.
You find each in the results and the "latest" split for each eval): ```python { "all": { "acc": 0.5695668154598998, "acc_stderr": 0.033895050381280295, "acc_norm": 0.571987728994217, "acc_norm_stderr": 0.03458235640417673, "mc1": 0.3023255813953488, "mc1_stderr": 0.016077509266133026, "mc2": 0.4591892724530115, "mc2_stderr": 0.014394535619493496 }, "harness|arc:challenge|25": { "acc": 0.4104095563139932, "acc_stderr": 0.01437492219264266, "acc_norm": 0.44283276450511944, "acc_norm_stderr": 0.014515573873348902 }, "harness|hellaswag|10": { "acc": 0.48864767974507073, "acc_stderr": 0.004988495127747281, "acc_norm": 0.6666998605855408, "acc_norm_stderr": 0.0047042938987299 }, "harness|hendrycksTest-abstract_algebra|5": { "acc": 0.34, "acc_stderr": 0.047609522856952365, "acc_norm": 0.34, "acc_norm_stderr": 0.047609522856952365 }, "harness|hendrycksTest-anatomy|5": { "acc": 0.4888888888888889, "acc_stderr": 0.04318275491977976, "acc_norm": 0.4888888888888889, "acc_norm_stderr": 0.04318275491977976 }, "harness|hendrycksTest-astronomy|5": { "acc": 0.5526315789473685, "acc_stderr": 0.04046336883978251, "acc_norm": 0.5526315789473685, "acc_norm_stderr": 0.04046336883978251 }, "harness|hendrycksTest-business_ethics|5": { "acc": 0.67, "acc_stderr": 0.047258156262526066, "acc_norm": 0.67, "acc_norm_stderr": 0.047258156262526066 }, "harness|hendrycksTest-clinical_knowledge|5": { "acc": 0.6, "acc_stderr": 0.030151134457776285, "acc_norm": 0.6, "acc_norm_stderr": 0.030151134457776285 }, "harness|hendrycksTest-college_biology|5": { "acc": 0.5625, "acc_stderr": 0.04148415739394154, "acc_norm": 0.5625, "acc_norm_stderr": 0.04148415739394154 }, "harness|hendrycksTest-college_chemistry|5": { "acc": 0.38, "acc_stderr": 0.04878317312145633, "acc_norm": 0.38, "acc_norm_stderr": 0.04878317312145633 }, "harness|hendrycksTest-college_computer_science|5": { "acc": 0.51, "acc_stderr": 0.05024183937956911, "acc_norm": 0.51, "acc_norm_stderr": 0.05024183937956911 }, 
"harness|hendrycksTest-college_mathematics|5": { "acc": 0.32, "acc_stderr": 0.04688261722621504, "acc_norm": 0.32, "acc_norm_stderr": 0.04688261722621504 }, "harness|hendrycksTest-college_medicine|5": { "acc": 0.4913294797687861, "acc_stderr": 0.03811890988940412, "acc_norm": 0.4913294797687861, "acc_norm_stderr": 0.03811890988940412 }, "harness|hendrycksTest-college_physics|5": { "acc": 0.4019607843137255, "acc_stderr": 0.04878608714466996, "acc_norm": 0.4019607843137255, "acc_norm_stderr": 0.04878608714466996 }, "harness|hendrycksTest-computer_security|5": { "acc": 0.66, "acc_stderr": 0.04760952285695237, "acc_norm": 0.66, "acc_norm_stderr": 0.04760952285695237 }, "harness|hendrycksTest-conceptual_physics|5": { "acc": 0.502127659574468, "acc_stderr": 0.03268572658667492, "acc_norm": 0.502127659574468, "acc_norm_stderr": 0.03268572658667492 }, "harness|hendrycksTest-econometrics|5": { "acc": 0.38596491228070173, "acc_stderr": 0.04579639422070434, "acc_norm": 0.38596491228070173, "acc_norm_stderr": 0.04579639422070434 }, "harness|hendrycksTest-electrical_engineering|5": { "acc": 0.5793103448275863, "acc_stderr": 0.0411391498118926, "acc_norm": 0.5793103448275863, "acc_norm_stderr": 0.0411391498118926 }, "harness|hendrycksTest-elementary_mathematics|5": { "acc": 0.42328042328042326, "acc_stderr": 0.025446365634406765, "acc_norm": 0.42328042328042326, "acc_norm_stderr": 0.025446365634406765 }, "harness|hendrycksTest-formal_logic|5": { "acc": 0.3888888888888889, "acc_stderr": 0.04360314860077459, "acc_norm": 0.3888888888888889, "acc_norm_stderr": 0.04360314860077459 }, "harness|hendrycksTest-global_facts|5": { "acc": 0.36, "acc_stderr": 0.04824181513244218, "acc_norm": 0.36, "acc_norm_stderr": 0.04824181513244218 }, "harness|hendrycksTest-high_school_biology|5": { "acc": 0.6419354838709678, "acc_stderr": 0.02727389059430064, "acc_norm": 0.6419354838709678, "acc_norm_stderr": 0.02727389059430064 }, "harness|hendrycksTest-high_school_chemistry|5": { "acc": 
0.49261083743842365, "acc_stderr": 0.03517603540361008, "acc_norm": 0.49261083743842365, "acc_norm_stderr": 0.03517603540361008 }, "harness|hendrycksTest-high_school_computer_science|5": { "acc": 0.56, "acc_stderr": 0.04988876515698589, "acc_norm": 0.56, "acc_norm_stderr": 0.04988876515698589 }, "harness|hendrycksTest-high_school_european_history|5": { "acc": 0.7090909090909091, "acc_stderr": 0.03546563019624336, "acc_norm": 0.7090909090909091, "acc_norm_stderr": 0.03546563019624336 }, "harness|hendrycksTest-high_school_geography|5": { "acc": 0.7373737373737373, "acc_stderr": 0.031353050095330855, "acc_norm": 0.7373737373737373, "acc_norm_stderr": 0.031353050095330855 }, "harness|hendrycksTest-high_school_government_and_politics|5": { "acc": 0.7875647668393783, "acc_stderr": 0.02951928261681723, "acc_norm": 0.7875647668393783, "acc_norm_stderr": 0.02951928261681723 }, "harness|hendrycksTest-high_school_macroeconomics|5": { "acc": 0.5846153846153846, "acc_stderr": 0.02498535492310234, "acc_norm": 0.5846153846153846, "acc_norm_stderr": 0.02498535492310234 }, "harness|hendrycksTest-high_school_mathematics|5": { "acc": 0.3592592592592593, "acc_stderr": 0.029252905927251983, "acc_norm": 0.3592592592592593, "acc_norm_stderr": 0.029252905927251983 }, "harness|hendrycksTest-high_school_microeconomics|5": { "acc": 0.6218487394957983, "acc_stderr": 0.031499305777849054, "acc_norm": 0.6218487394957983, "acc_norm_stderr": 0.031499305777849054 }, "harness|hendrycksTest-high_school_physics|5": { "acc": 0.3576158940397351, "acc_stderr": 0.03913453431177258, "acc_norm": 0.3576158940397351, "acc_norm_stderr": 0.03913453431177258 }, "harness|hendrycksTest-high_school_psychology|5": { "acc": 0.7522935779816514, "acc_stderr": 0.018508143602547825, "acc_norm": 0.7522935779816514, "acc_norm_stderr": 0.018508143602547825 }, "harness|hendrycksTest-high_school_statistics|5": { "acc": 0.4444444444444444, "acc_stderr": 0.03388857118502326, "acc_norm": 0.4444444444444444, "acc_norm_stderr": 
0.03388857118502326 }, "harness|hendrycksTest-high_school_us_history|5": { "acc": 0.6470588235294118, "acc_stderr": 0.03354092437591519, "acc_norm": 0.6470588235294118, "acc_norm_stderr": 0.03354092437591519 }, "harness|hendrycksTest-high_school_world_history|5": { "acc": 0.7046413502109705, "acc_stderr": 0.029696338713422882, "acc_norm": 0.7046413502109705, "acc_norm_stderr": 0.029696338713422882 }, "harness|hendrycksTest-human_aging|5": { "acc": 0.6188340807174888, "acc_stderr": 0.03259625118416828, "acc_norm": 0.6188340807174888, "acc_norm_stderr": 0.03259625118416828 }, "harness|hendrycksTest-human_sexuality|5": { "acc": 0.6717557251908397, "acc_stderr": 0.04118438565806298, "acc_norm": 0.6717557251908397, "acc_norm_stderr": 0.04118438565806298 }, "harness|hendrycksTest-international_law|5": { "acc": 0.743801652892562, "acc_stderr": 0.03984979653302871, "acc_norm": 0.743801652892562, "acc_norm_stderr": 0.03984979653302871 }, "harness|hendrycksTest-jurisprudence|5": { "acc": 0.6944444444444444, "acc_stderr": 0.044531975073749834, "acc_norm": 0.6944444444444444, "acc_norm_stderr": 0.044531975073749834 }, "harness|hendrycksTest-logical_fallacies|5": { "acc": 0.6871165644171779, "acc_stderr": 0.036429145782924076, "acc_norm": 0.6871165644171779, "acc_norm_stderr": 0.036429145782924076 }, "harness|hendrycksTest-machine_learning|5": { "acc": 0.41964285714285715, "acc_stderr": 0.04684099321077106, "acc_norm": 0.41964285714285715, "acc_norm_stderr": 0.04684099321077106 }, "harness|hendrycksTest-management|5": { "acc": 0.8349514563106796, "acc_stderr": 0.036756688322331886, "acc_norm": 0.8349514563106796, "acc_norm_stderr": 0.036756688322331886 }, "harness|hendrycksTest-marketing|5": { "acc": 0.8034188034188035, "acc_stderr": 0.02603538609895129, "acc_norm": 0.8034188034188035, "acc_norm_stderr": 0.02603538609895129 }, "harness|hendrycksTest-medical_genetics|5": { "acc": 0.62, "acc_stderr": 0.048783173121456316, "acc_norm": 0.62, "acc_norm_stderr": 0.048783173121456316 
}, "harness|hendrycksTest-miscellaneous|5": { "acc": 0.7203065134099617, "acc_stderr": 0.01605079214803652, "acc_norm": 0.7203065134099617, "acc_norm_stderr": 0.01605079214803652 }, "harness|hendrycksTest-moral_disputes|5": { "acc": 0.6271676300578035, "acc_stderr": 0.02603389061357628, "acc_norm": 0.6271676300578035, "acc_norm_stderr": 0.02603389061357628 }, "harness|hendrycksTest-moral_scenarios|5": { "acc": 0.29720670391061454, "acc_stderr": 0.0152853133536416, "acc_norm": 0.29720670391061454, "acc_norm_stderr": 0.0152853133536416 }, "harness|hendrycksTest-nutrition|5": { "acc": 0.6928104575163399, "acc_stderr": 0.02641560191438898, "acc_norm": 0.6928104575163399, "acc_norm_stderr": 0.02641560191438898 }, "harness|hendrycksTest-philosophy|5": { "acc": 0.6430868167202572, "acc_stderr": 0.027210420375934023, "acc_norm": 0.6430868167202572, "acc_norm_stderr": 0.027210420375934023 }, "harness|hendrycksTest-prehistory|5": { "acc": 0.5895061728395061, "acc_stderr": 0.027371350925124764, "acc_norm": 0.5895061728395061, "acc_norm_stderr": 0.027371350925124764 }, "harness|hendrycksTest-professional_accounting|5": { "acc": 0.43617021276595747, "acc_stderr": 0.029583452036284073, "acc_norm": 0.43617021276595747, "acc_norm_stderr": 0.029583452036284073 }, "harness|hendrycksTest-professional_law|5": { "acc": 0.4282920469361147, "acc_stderr": 0.012638223880313161, "acc_norm": 0.4282920469361147, "acc_norm_stderr": 0.012638223880313161 }, "harness|hendrycksTest-professional_medicine|5": { "acc": 0.5036764705882353, "acc_stderr": 0.030372015885428195, "acc_norm": 0.5036764705882353, "acc_norm_stderr": 0.030372015885428195 }, "harness|hendrycksTest-professional_psychology|5": { "acc": 0.553921568627451, "acc_stderr": 0.020109864547181357, "acc_norm": 0.553921568627451, "acc_norm_stderr": 0.020109864547181357 }, "harness|hendrycksTest-public_relations|5": { "acc": 0.6, "acc_stderr": 0.0469237132203465, "acc_norm": 0.6, "acc_norm_stderr": 0.0469237132203465 }, 
"harness|hendrycksTest-security_studies|5": { "acc": 0.6816326530612244, "acc_stderr": 0.029822533793982066, "acc_norm": 0.6816326530612244, "acc_norm_stderr": 0.029822533793982066 }, "harness|hendrycksTest-sociology|5": { "acc": 0.7562189054726368, "acc_stderr": 0.03036049015401466, "acc_norm": 0.7562189054726368, "acc_norm_stderr": 0.03036049015401466 }, "harness|hendrycksTest-us_foreign_policy|5": { "acc": 0.79, "acc_stderr": 0.040936018074033256, "acc_norm": 0.79, "acc_norm_stderr": 0.040936018074033256 }, "harness|hendrycksTest-virology|5": { "acc": 0.463855421686747, "acc_stderr": 0.03882310850890593, "acc_norm": 0.463855421686747, "acc_norm_stderr": 0.03882310850890593 }, "harness|hendrycksTest-world_religions|5": { "acc": 0.7426900584795322, "acc_stderr": 0.03352799844161865, "acc_norm": 0.7426900584795322, "acc_norm_stderr": 0.03352799844161865 }, "harness|truthfulqa:mc|0": { "mc1": 0.3023255813953488, "mc1_stderr": 0.016077509266133026, "mc2": 0.4591892724530115, "mc2_stderr": 0.014394535619493496 }, "harness|winogrande|5": { "acc": 0.648776637726914, "acc_stderr": 0.01341598137054513 }, "harness|gsm8k|5": { "acc": 0.5579984836997726, "acc_stderr": 0.013679514492814574 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
--> [More Information Needed] ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> [More Information Needed] ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> [More Information Needed] ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> [More Information Needed] ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> [More Information Needed] #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> [More Information Needed] ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> [More Information Needed] #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. 
--> [More Information Needed] #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> [More Information Needed] ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> [More Information Needed] ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]
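Provided by:
open-llm-leaderboard-old
Summary of original information

Dataset overview

Dataset introduction

Dataset structure

  • Number of configurations: 63
  • Task per configuration: each configuration corresponds to one evaluated task.
  • Data source: the dataset was created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • Latest results: the "train" split always points to the latest results.
  • Aggregated results: an additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.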
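
The split-naming convention described above can be sketched in a few lines. This is only an illustration inferred from the names visible in the card (the run timestamp `2024-05-30T12:39:22.414033`, the split `2024_05_30T12_39_22.414033`, and the file stamp `2024-05-30T12-39-22.414033`); the helper names are hypothetical and are not part of any library.

```python
def split_name_from_run(run_timestamp: str) -> str:
    """Derive the split name seen in the card's YAML from a run timestamp.

    Assumption: the split name is the timestamp with '-' and ':' replaced
    by '_', which matches the single run listed in this card.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_stamp_from_run(run_timestamp: str) -> str:
    """Derive the timestamp used in the parquet and JSON file names.

    Assumption: only ':' is replaced (by '-'), as in
    results_2024-05-30T12-39-22.414033.parquet.
    """
    return run_timestamp.replace(":", "-")


run = "2024-05-30T12:39:22.414033"
print(split_name_from_run(run))
print(file_stamp_from_run(run))
```

Note that the YAML section of the card lists only the timestamped split and a `latest` split for each configuration, so `latest` may be the name to try if `train` is not found.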

Data loading example

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_Qwen__Qwen2-1.5B_private", "harness_winogrande_5", split="train")
```

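Latest results

  • Run timestamp: 2024-05-30T12:39:22.414033
  • Detailed results: evaluation results for multiple tasks, including accuracy (acc), normalized accuracy (acc_norm), and their standard errors (acc_stderr, acc_norm_stderr).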
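
To make the relationship between an accuracy and its standard error concrete, the sketch below turns one reported pair into an approximate 95% interval. The normal approximation (`z = 1.96`) is an assumption made here for illustration; the card does not say how the standard errors are meant to be interpreted. The numbers are the Winogrande values from the card's latest results.

```python
def normal_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval acc ± z * stderr.

    Assumption: the estimate is roughly normally distributed, so z = 1.96
    gives about 95% coverage.
    """
    half_width = z * stderr
    return acc - half_width, acc + half_width


# Values copied from "harness|winogrande|5" in the card.
low, high = normal_interval(0.648776637726914, 0.01341598137054513)
print(f"winogrande acc interval: [{low:.3f}, {high:.3f}]")
```

Read this way, the Winogrande score is uncertain by a little under three percentage points in each direction, which is worth keeping in mind when comparing models with similar scores.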

Configuration details

  • Configuration names:
    • harness_arc_challenge_25
    • harness_gsm8k_5
    • harness_hellaswag_10
    • harness_hendrycksTest_5
  • Data files:
    • Each configuration contains multiple data files, with paths of the form **/details_harness|<task name>|5_2024-05-30T12-39-22.414033.parquet
    • The splits are 2024_05_30T12_39_22.414033 and latest
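
The mapping between a task key in the results (for example `harness|arc:challenge|25`) and its configuration name and parquet path can be sketched as follows. The rule is inferred from the names visible in the card, and both helper names are hypothetical. The trailing number is the few-shot count, which varies by task in the card (25, 10, 5, and 0), so the `|5_` in the path pattern above is only the most common case.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key to a configuration name.

    Assumption: '|', ':' and '-' all become '_', which matches e.g.
    'harness|arc:challenge|25' -> 'harness_arc_challenge_25'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def parquet_glob(task_key: str, file_stamp: str) -> str:
    """Build the glob pattern listed under data_files for one task and run."""
    return f"**/details_{task_key}_{file_stamp}.parquet"


stamp = "2024-05-30T12-39-22.414033"
for key in ("harness|arc:challenge|25",
            "harness|hendrycksTest-college_physics|5",
            "harness|truthfulqa:mc|0"):
    print(config_name(key), "->", parquet_glob(key, stamp))
```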
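Collected summary
Dataset introduction
Construction
This dataset comes from the automated evaluation of the Qwen/Qwen2-1.5B model on the Open LLM Leaderboard. It consists of 63 configurations, each corresponding to one evaluated task, such as ARC Challenge, HellaSwag, and GSM8K. The evaluation was executed as a single run; the results of each run are stored in a separate split identified by the run's timestamp, while the 'train' split always points to the results of the most recent run. In addition, a configuration named 'results' aggregates the metrics of all tasks and is used to display overall performance on the leaderboard. The data is stored in Parquet format for efficient loading and processing.
Features
The defining features of this dataset are its structure and its timeliness. It covers diverse tasks ranging from commonsense reasoning to mathematical problem solving, including ARC Challenge, HellaSwag, GSM8K, and the MMLU test set spanning 57 subjects, and so evaluates the model in both zero-shot and few-shot settings. Each configuration records detailed evaluation metrics, such as accuracy and its standard error, together with normalized accuracy, which makes the data easier to interpret. Through the timestamped splits, users can retrieve past evaluation results and track how model performance changes over time.
Usage
Users can load the dataset with the Hugging Face datasets library. For example, calling the `load_dataset` function with a configuration name (such as 'harness_winogrande_5') and a split (such as 'train') returns the latest evaluation details. Specific historical results can be accessed through the timestamped split names. The 'results' configuration provides the aggregated global metrics, which can be used directly for model comparison or secondary analysis. All data is stored in Parquet format, which supports efficient reads and suits fast retrieval and processing in large-scale evaluation settings.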
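
The usage described above might look like the sketch below. Several details are assumptions rather than facts from the card: the default repository id is taken from the page title (`open-llm-leaderboard-old/details_Qwen__Qwen2-1.5B`), whereas the card's own snippet uses a different, `_private` id; and the default split is `latest`, because that is the split name listed in the card's YAML, whereas the prose mentions `train`. The helper names are hypothetical.

```python
def build_load_args(config: str = "harness_winogrande_5",
                    split: str = "latest",
                    repo: str = "open-llm-leaderboard-old/details_Qwen__Qwen2-1.5B") -> dict:
    """Collect keyword arguments for datasets.load_dataset.

    Assumptions: the repo id and the 'latest' split name are guesses based
    on this page; adjust them if loading fails.
    """
    return {"path": repo, "name": config, "split": split}


def load_details(**overrides):
    """Load one configuration; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the helper above stays dependency-free
    return load_dataset(**build_load_args(**overrides))


# Per-sample details for one task, then the aggregated metrics:
# details = load_details(config="harness_winogrande_5")
# results = load_details(config="results")
print(build_load_args(config="results"))
```

To read a specific historical run instead of the most recent one, pass its timestamped split name, for example `split="2024_05_30T12_39_22.414033"`.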
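Background and challenges
Background overview
As large language models (LLMs) develop rapidly, evaluating a model's capabilities objectively and systematically across many dimensions has become a shared concern of academia and industry. The Open LLM Leaderboard was created by the Hugging Face team in 2023 to give the community a transparent comparison of model performance through a standardized evaluation framework. This dataset was generated from a single evaluation run of the Qwen2-1.5B model on May 30, 2024. It covers 63 task configurations spanning commonsense reasoning (such as ARC-Challenge and HellaSwag), mathematical reasoning (GSM8K), knowledge question answering (the 57 subjects of MMLU), and language understanding (Winogrande, TruthfulQA). Its central research question is where the capability boundaries of a small model (1.5B parameters) lie across diverse tasks, providing quantitative evidence for model selection and improvement. As part of the Open LLM Leaderboard ecosystem, the dataset supports the standardization of LLM evaluation by letting researchers compare models against a common benchmark.
Current challenges
The main challenges reflected by this dataset are: 1) the model remains weak on complex reasoning and factual knowledge tasks; for example, accuracy on MMLU subjects such as college mathematics and college physics is only about 32%-40%, and MC1 accuracy on TruthfulQA is only 30.2%, which exposes clear limits in adversarial fact checking and multi-step reasoning; 2) the benchmark itself has an uneven difficulty distribution: although it covers 57 subjects, accuracy on some tasks (such as management and marketing) already exceeds 80%, while on others (such as moral scenarios) it is below 30%, so the overall score is a coarse reflection of the model's real capability; 3) in terms of construction, the data comes from a single evaluation run with a single timestamp, so it lacks the statistical stability of repeated runs, and the results are stored as Parquet data files in a repository marked private, which adds complexity to reproduction and cross-model comparison.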
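
The spread between the easiest and hardest MMLU subjects can be checked directly against the published numbers. The sketch below uses a handful of `acc` values copied from the card's latest results; which subjects to include is an arbitrary choice made here for illustration.

```python
# Accuracy values copied from the card's "Latest results" block (5-shot MMLU subjects).
mmlu_acc = {
    "management": 0.8349514563106796,
    "marketing": 0.8034188034188035,
    "business_ethics": 0.67,
    "college_physics": 0.4019607843137255,
    "college_mathematics": 0.32,
    "moral_scenarios": 0.29720670391061454,
}

# Sort subjects from highest to lowest accuracy.
ranked = sorted(mmlu_acc.items(), key=lambda item: item[1], reverse=True)
best, worst = ranked[0], ranked[-1]

for subject, acc in ranked:
    print(f"{subject:22s} {acc:.1%}")
print(f"spread between {best[0]} and {worst[0]}: {best[1] - worst[1]:.1%}")
```

On this subset, only management and marketing exceed 80%, and the gap between the best and worst subject is more than 50 percentage points.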
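Common scenarios
Classic use cases
The open-llm-leaderboard-old/details_Qwen__Qwen2-1.5B dataset serves the systematic evaluation of the Qwen2-1.5B model. Its classic use is to give researchers a standardized framework for measuring model performance by combining diverse tasks, namely ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, which cover commonsense reasoning, academic knowledge, factual consistency, and mathematical problem solving. The dataset splits each evaluation run into 63 separate configurations, each corresponding to a specific task, and keeps the run timestamp so that changes in model capability can be tracked over time. This fine-grained structure lets researchers examine the model's strengths and weaknesses in specific domains and provides precise data to support later optimization and iteration.
Derived related work
The dataset has given rise to a body of work on evaluating and improving large language models. On one hand, its structured evaluation data is widely used as a benchmark and has prompted research directions such as the analysis of capability degradation and the exploration of performance limits in few-shot learning. On the other hand, based on the weaknesses the dataset exposes, researchers have developed targeted fine-tuning methods and prompt-engineering strategies, for example strengthening mathematical reasoning chains or improving knowledge retrieval to raise scores on specific tasks. In addition, the dataset is closely tied to the Open LLM Leaderboard ecosystem and provides a standardized template for later side-by-side comparison with other sizes in the Qwen2 series, which indirectly sustains community attention on model reproducibility and evaluation fairness.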
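Recent research on the dataset
Latest research directions
Performance evaluation of large language models is currently an active research topic in natural language processing, and the Open LLM Leaderboard, a widely recognized benchmark platform, provides a standardized framework for comparing model capabilities side by side. Qwen2-1.5B is a member of the Tongyi Qianwen (Qwen) series, and its evaluation data on this leaderboard shows the performance limits of small-parameter models across diverse tasks. The dataset records fine-grained results on 63 tasks, including ARC Challenge, HellaSwag commonsense reasoning, MMLU multi-subject knowledge, and GSM8K mathematical reasoning, covering areas from high school geography to professional medicine. Current research focuses on using such fine-grained evaluation data to analyze the generalization bottlenecks of models in low-resource settings. For example, on logic-intensive tasks such as abstract algebra and college mathematics, the accuracy of Qwen2-1.5B stays in the 30%-40% range, while on social-science tasks such as high school government and politics and management it exceeds 78%. This uneven distribution of capability has led researchers to explore targeted data augmentation and curriculum learning strategies to narrow the gap between symbolic reasoning and commonsense understanding. The open sharing of this evaluation dataset provides a reproducible validation benchmark for model iteration and supports the deployment of lightweight large models in vertical domains, with influence extending to areas such as AI education assessment and reliability validation of intelligent assistants.
The content above was collected and summarized by 遇见数据集.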