
open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft

Hugging Face · Updated 2023-10-27 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft
Resource description:
--- pretty_name: Evaluation run of Mikivis/gpt2-large-lora-sft dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Mikivis/gpt2-large-lora-sft](https://huggingface.co/Mikivis/gpt2-large-lora-sft)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-27T18:03:42.739284](https://huggingface.co/datasets/open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft/blob/main/results_2023-10-27T18-03-42.739284.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.0026216442953020135,\n\ \ \"em_stderr\": 0.0005236685642965714,\n \"f1\": 0.05463401845637592,\n\ \ \"f1_stderr\": 0.001420933825490078,\n \"acc\": 0.27545382794001577,\n\ \ \"acc_stderr\": 0.006989729694570417\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.0026216442953020135,\n \"em_stderr\": 0.0005236685642965714,\n\ \ \"f1\": 0.05463401845637592,\n \"f1_stderr\": 0.001420933825490078\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \"acc_stderr\"\ : 0.0\n },\n \"harness|winogrande|5\": {\n \"acc\": 0.5509076558800315,\n\ \ \"acc_stderr\": 0.013979459389140834\n }\n}\n```" repo_url: https://huggingface.co/Mikivis/gpt2-large-lora-sft leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|arc:challenge|25_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-09-05T03:15:39.228135.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_27T18_03_42.739284 path: - '**/details_harness|drop|3_2023-10-27T18-03-42.739284.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-27T18-03-42.739284.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_27T18_03_42.739284 path: - '**/details_harness|gsm8k|5_2023-10-27T18-03-42.739284.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-27T18-03-42.739284.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hellaswag|10_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - 
'**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-05T03:15:39.228135.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-05T03:15:39.228135.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-05T03:15:39.228135.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-05T03:15:39.228135.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-05T03:15:39.228135.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-05T03:15:39.228135.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-05T03:15:39.228135.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-management|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-virology|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-05T03:15:39.228135.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_09_05T03_15_39.228135 path: - '**/details_harness|truthfulqa:mc|0_2023-09-05T03:15:39.228135.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-09-05T03:15:39.228135.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_27T18_03_42.739284 path: - '**/details_harness|winogrande|5_2023-10-27T18-03-42.739284.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-27T18-03-42.739284.parquet' - config_name: results data_files: - split: 2023_09_05T03_15_39.228135 path: - results_2023-09-05T03:15:39.228135.parquet - split: 2023_10_27T18_03_42.739284 path: - results_2023-10-27T18-03-42.739284.parquet - split: latest path: - results_2023-10-27T18-03-42.739284.parquet --- # Dataset Card for Evaluation run of Mikivis/gpt2-large-lora-sft ## Dataset Description - 
**Homepage:**
- **Repository:** https://huggingface.co/Mikivis/gpt2-large-lora-sft
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [Mikivis/gpt2-large-lora-sft](https://huggingface.co/Mikivis/gpt2-large-lora-sft) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-10-27T18:03:42.739284](https://huggingface.co/datasets/open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft/blob/main/results_2023-10-27T18-03-42.739284.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split of each eval):

```python
{
    "all": {
        "em": 0.0026216442953020135,
        "em_stderr": 0.0005236685642965714,
        "f1": 0.05463401845637592,
        "f1_stderr": 0.001420933825490078,
        "acc": 0.27545382794001577,
        "acc_stderr": 0.006989729694570417
    },
    "harness|drop|3": {
        "em": 0.0026216442953020135,
        "em_stderr": 0.0005236685642965714,
        "f1": 0.05463401845637592,
        "f1_stderr": 0.001420933825490078
    },
    "harness|gsm8k|5": {
        "acc": 0.0,
        "acc_stderr": 0.0
    },
    "harness|winogrande|5": {
        "acc": 0.5509076558800315,
        "acc_stderr": 0.013979459389140834
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
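Provider:
open-llm-leaderboard
Summary of original information

Dataset overview

Dataset source

This dataset was created automatically during the evaluation run of the model Mikivis/gpt2-large-lora-sft on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 64 configurations, each corresponding to one evaluation task.
  • The dataset was created from 2 runs. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.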
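
The configuration and split layout described above can be inspected programmatically. The sketch below is a minimal illustration, not part of the original card: `group_splits_by_run` and `fetch_splits_per_config` are hypothetical helper names, and the Hub query relies on `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library, which require network access. The offline example uses three configurations copied from the listing on this page.

```python
from collections import defaultdict

REPO_ID = "open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft"


def group_splits_by_run(splits_per_config):
    """Invert {config: [split, ...]} into {run_split: [config, ...]}.

    The "latest" alias is skipped because it is not a run of its own.
    """
    runs = defaultdict(list)
    for config, splits in splits_per_config.items():
        for split in splits:
            if split != "latest":
                runs[split].append(config)
    return {run: sorted(configs) for run, configs in sorted(runs.items())}


def fetch_splits_per_config(repo_id=REPO_ID):
    """Query the Hub for every configuration and its splits (needs network)."""
    # Imported lazily so the offline helper above works without `datasets`.
    from datasets import get_dataset_config_names, get_dataset_split_names

    return {
        config: get_dataset_split_names(repo_id, config)
        for config in get_dataset_config_names(repo_id)
    }


# Offline illustration with three configurations taken from the listing.
example = {
    "harness_arc_challenge_25": ["2023_09_05T03_15_39.228135", "latest"],
    "harness_winogrande_5": ["2023_10_27T18_03_42.739284", "latest"],
    "results": [
        "2023_09_05T03_15_39.228135",
        "2023_10_27T18_03_42.739284",
        "latest",
    ],
}
runs = group_splits_by_run(example)
for run, configs in runs.items():
    print(run, configs)
```

Grouping by run makes it visible that the two runs covered different tasks, which is why only the "results" configuration carries both timestamped splits.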

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft",
    "harness_winogrande_5",
    split="latest",
)
```

Latest results

These are the latest results, from run 2023-10-27T18:03:42.739284:

```python
{
    "all": {
        "em": 0.0026216442953020135,
        "em_stderr": 0.0005236685642965714,
        "f1": 0.05463401845637592,
        "f1_stderr": 0.001420933825490078,
        "acc": 0.27545382794001577,
        "acc_stderr": 0.006989729694570417
    },
    "harness|drop|3": {
        "em": 0.0026216442953020135,
        "em_stderr": 0.0005236685642965714,
        "f1": 0.05463401845637592,
        "f1_stderr": 0.001420933825490078
    },
    "harness|gsm8k|5": {
        "acc": 0.0,
        "acc_stderr": 0.0
    },
    "harness|winogrande|5": {
        "acc": 0.5509076558800315,
        "acc_stderr": 0.013979459389140834
    }
}
```
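
The "all" entry in these results is numerically consistent with an unweighted mean of each metric over the tasks that report it. That reading is an inference from the numbers, not something the card states, and `mean_per_metric` below is a hypothetical helper written only to check it.

```python
import math

# Per-task values copied from the latest results.
per_task = {
    "harness|drop|3": {
        "em": 0.0026216442953020135,
        "em_stderr": 0.0005236685642965714,
        "f1": 0.05463401845637592,
        "f1_stderr": 0.001420933825490078,
    },
    "harness|gsm8k|5": {"acc": 0.0, "acc_stderr": 0.0},
    "harness|winogrande|5": {
        "acc": 0.5509076558800315,
        "acc_stderr": 0.013979459389140834,
    },
}

# The "all" entry as reported.
reported_all = {
    "em": 0.0026216442953020135,
    "em_stderr": 0.0005236685642965714,
    "f1": 0.05463401845637592,
    "f1_stderr": 0.001420933825490078,
    "acc": 0.27545382794001577,
    "acc_stderr": 0.006989729694570417,
}


def mean_per_metric(task_results):
    """Average each metric over the tasks that report it (unweighted)."""
    totals, counts = {}, {}
    for metrics in task_results.values():
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


recomputed = mean_per_metric(per_task)
matches = all(
    math.isclose(recomputed[name], reported_all[name], rel_tol=1e-9)
    for name in reported_all
)
print("recomputed 'all' matches the reported values:", matches)
```

Under this reading, the aggregate accuracy of about 0.275 is the average of a 0.0 on GSM8K and roughly 0.551 on WinoGrande, so it should not be read as accuracy on any single task.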

Configuration details

The configurations of the dataset are detailed below:

  • harness_arc_challenge_25

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|arc:challenge|25_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|arc:challenge|25_2023-09-05T03:15:39.228135.parquet
  • harness_drop_3

    • Split: 2023_10_27T18_03_42.739284
    • Path: **/details_harness|drop|3_2023-10-27T18-03-42.739284.parquet
    • Split: latest
    • Path: **/details_harness|drop|3_2023-10-27T18-03-42.739284.parquet
  • harness_gsm8k_5

    • Split: 2023_10_27T18_03_42.739284
    • Path: **/details_harness|gsm8k|5_2023-10-27T18-03-42.739284.parquet
    • Split: latest
    • Path: **/details_harness|gsm8k|5_2023-10-27T18-03-42.739284.parquet
  • harness_hellaswag_10

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hellaswag|10_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hellaswag|10_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet and others (40 files)
    • Split: latest
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet and others (40 files)
  • harness_hendrycksTest_abstract_algebra_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_anatomy_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_astronomy_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_business_ethics_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_clinical_knowledge_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_college_biology_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_college_chemistry_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_college_computer_science_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_college_mathematics_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_college_medicine_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_college_physics_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_computer_security_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_conceptual_physics_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_econometrics_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-09-05T03:15:39.228135.parquet
  • harness_hendrycksTest_electrical_engineering_5

    • Split: 2023_09_05T03_15_39.228135
    • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-09-05T03:15:39.228135.parquet
    • Split: latest
    • Path: `**/details_harness|hendrycks
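
The listing above follows a regular naming pattern: the harness task key (for example `harness|arc:challenge|25`) becomes the configuration name with every separator replaced by an underscore, and the run timestamp becomes the split name in the same way. The sketch below reproduces that pattern as inferred from the entries on this page; `config_name`, `split_name`, and `details_glob` are hypothetical helpers, not functions of any library. Note that the file names of the September run keep the colons of the timestamp while those of the October run use dashes, so the file stamp is passed in verbatim rather than derived.

```python
import re


def config_name(task_key):
    """Map a harness task key to its configuration name.

    Every run of non-alphanumeric characters ('|', ':', '-', '_') becomes '_'.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def split_name(timestamp):
    """Map an ISO-like run timestamp to its split name ('-' and ':' -> '_')."""
    return re.sub(r"[-:]", "_", timestamp)


def details_glob(task_key, file_stamp):
    """Build the parquet glob for a task; file_stamp is used exactly as given."""
    return f"**/details_{task_key}_{file_stamp}.parquet"


print(config_name("harness|arc:challenge|25"))
print(split_name("2023-10-27T18:03:42.739284"))
print(details_glob("harness|drop|3", "2023-10-27T18-03-42.739284"))
```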
Collected summary
Dataset introduction
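Construction
This dataset is an automated product of the Open LLM Leaderboard evaluation pipeline for large language models. It was generated from two independent evaluation runs, each identified by a timestamp and stored as a separate split. The evaluation covers 64 task configurations, each corresponding to one evaluation task, such as ARC Challenge, DROP, and GSM8K. The data is stored in Parquet format, and a "latest" split points to the most recent results, so earlier runs stay addressable while the default view tracks the newest one.
Features
The dataset is a detailed record of model performance that supports fine-grained analysis along several dimensions. It contains aggregated metrics such as accuracy (acc), exact match (em), and F1 score, together with the standard error (stderr) of each metric, which gives a statistical basis for judging how stable the measured performance is. Its structure lets users trace the evaluation history and compare runs through the timestamped splits. In addition, a configuration named "results" stores the aggregated metrics shown on the Open LLM Leaderboard, linking the raw evaluation details to the public leaderboard scores.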
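
As an illustration of how the reported standard errors can be used, the sketch below builds a normal-approximation 95% confidence interval for the WinoGrande accuracy from the latest results. The helper `normal_ci`, the choice of z = 1.96, and the 0.5 chance level (WinoGrande items offer two candidate answers) are assumptions added here, not part of the card.

```python
def normal_ci(estimate, stderr, z=1.96):
    """Return (low, high) for a normal-approximation confidence interval."""
    margin = z * stderr
    return estimate - margin, estimate + margin


# Values copied from "harness|winogrande|5" in the latest results.
acc = 0.5509076558800315
acc_stderr = 0.013979459389140834

low, high = normal_ci(acc, acc_stderr)

# Assumed chance level for a two-way choice task.
CHANCE = 0.5
above_chance = low > CHANCE

print(f"95% CI: [{low:.4f}, {high:.4f}], above chance: {above_chance}")
```

The interval sits slightly above 0.5, which suggests the model does a little better than guessing on this task, although the margin is small.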
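Usage
To analyze model performance with this dataset, researchers can load it with the Hugging Face `datasets` library. The user specifies the dataset name, a task configuration (such as `harness_winogrande_5`), and the desired split (for example "latest" or a specific run timestamp). Once loaded, the data is available in a structured form suitable for further processing, visualization, or comparison. This lets researchers examine the model's detailed behavior on a single task, or evaluate it across tasks and across runs.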
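
A minimal sketch of this workflow follows. It assumes the split names shown in the configuration listing on this page (run timestamps plus `latest`); `newest_run` and `load_details` are hypothetical helper names, and `load_details` needs network access and the `datasets` library, so it is defined but not called here.

```python
from datetime import datetime

REPO_ID = "open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft"


def parse_split_timestamp(split):
    """Parse a split name such as '2023_10_27T18_03_42.739284' into a datetime."""
    return datetime.strptime(split, "%Y_%m_%dT%H_%M_%S.%f")


def newest_run(splits):
    """Return the most recent timestamped split, ignoring the 'latest' alias."""
    stamped = [s for s in splits if s != "latest"]
    return max(stamped, key=parse_split_timestamp)


def load_details(config, split="latest", repo_id=REPO_ID):
    """Load one configuration and split from the Hub (requires network)."""
    # Imported lazily so the helpers above run without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=split)


# Splits of the "results" configuration, as listed on this page.
results_splits = [
    "2023_09_05T03_15_39.228135",
    "2023_10_27T18_03_42.739284",
    "latest",
]
print(newest_run(results_splits))

# Example call (not executed here):
# winogrande = load_details("harness_winogrande_5", split=newest_run(results_splits))
```

Passing an explicit timestamp instead of `latest` pins the analysis to one run, which keeps results reproducible if further runs are added later.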
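Background and challenges
Background overview
As large language models (LLMs) develop rapidly, evaluating their overall capability has become a central concern for both academia and industry. The Open LLM Leaderboard, an evaluation benchmark hosted on the Hugging Face platform, evaluates open-source language models systematically through a set of standardized tasks. The dataset 'open-llm-leaderboard/details_Mikivis__gpt2-large-lora-sft' is a product of this framework: it automatically records the evaluation details of the model 'Mikivis/gpt2-large-lora-sft' from runs carried out in 2023. The dataset was built by the Hugging Face team. Its core research question is how to quantify and compare the performance of different LLMs on diverse tasks such as commonsense reasoning, knowledge question answering, and mathematical problem solving, so as to provide an objective basis for model optimization and selection and to support a transparent open-source LLM ecosystem.
Current challenges
The domain challenge this dataset addresses is the complexity of evaluating LLM capabilities along many dimensions. The evaluation has to cover a wide range of tasks, from commonsense reasoning (such as ARC and HellaSwag) to specialized knowledge (such as MMLU) and mathematical problem solving (such as GSM8K), and keeping it both comprehensive and fair is difficult. Construction poses technical challenges as well: how to collect and integrate the large volume of results from multiple independent runs automatically while keeping the format uniform and traceable, and how to design a storage structure (splits per configuration and per timestamp) that supports flexible queries and result aggregation while avoiding inconsistencies when the set of evaluated tasks changes between runs.
Common scenarios
Typical use case
As a product of the Open LLM Leaderboard evaluation pipeline, the dataset is typically used to obtain a detailed, task-level performance profile of a specific model (here Mikivis/gpt2-large-lora-sft). By loading the data under different configurations (such as harness_winogrande_5), researchers can analyze how the model performs on benchmark tasks covering commonsense reasoning, mathematical problem solving, and knowledge question answering, which allows diagnosis and comparison of individual capabilities rather than reliance on a single overall score.
Practical applications
In practice, the dataset gives model developers, companies selecting a technology, and teams deploying AI applications a basis for decisions. Developers can use the detailed results to adjust model architecture and training strategy. Companies integrating a language model service can consult its performance on specific tasks (such as DROP reading comprehension and GSM8K mathematical reasoning) when choosing a model. The accumulated evaluation records also help monitor the effect of model iterations, which supports stable performance in production and lowers the risk and cost of integrating and maintaining AI systems.
Related derived work
Evaluation datasets of this kind have given rise to several lines of research. Examples include mapping model capabilities from multi-task evaluation results to show a model's strengths and weaknesses across knowledge domains; analyses of benchmark bias that use detailed error samples to propose fairer evaluation methods; and studies that use evaluation data over time to track the overall progress of the model community, or that examine how specific fine-tuning techniques (such as LoRA) affect performance on different tasks. Together, these efforts have deepened the community's understanding of how language models are evaluated and how they develop.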
The content above was collected and summarized by 遇见数据集.