open-llm-leaderboard/details_KnutJaegersberg__deacon-13b

Hugging Face · Updated 2023-10-27 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_KnutJaegersberg__deacon-13b
Resource description:
--- pretty_name: Evaluation run of KnutJaegersberg/deacon-13b dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [KnutJaegersberg/deacon-13b](https://huggingface.co/KnutJaegersberg/deacon-13b)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration, \"results\", stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_KnutJaegersberg__deacon-13b\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-27T07:52:54.857198](https://huggingface.co/datasets/open-llm-leaderboard/details_KnutJaegersberg__deacon-13b/blob/main/results_2023-10-27T07-52-54.857198.json) (note\ \ that there might be results for other tasks in the repos if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.0012583892617449664,\n\ \ \"em_stderr\": 0.0003630560893119389,\n \"f1\": 0.05671665268456401,\n\ \ \"f1_stderr\": 0.001312852180013837,\n \"acc\": 0.43354338539457016,\n\ \ \"acc_stderr\": 0.010175607297065709\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.0012583892617449664,\n \"em_stderr\": 0.0003630560893119389,\n\ \ \"f1\": 0.05671665268456401,\n \"f1_stderr\": 0.001312852180013837\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.10386656557998483,\n \ \ \"acc_stderr\": 0.008403622228924029\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7632202052091555,\n \"acc_stderr\": 0.011947592365207389\n\ \ }\n}\n```" repo_url: https://huggingface.co/KnutJaegersberg/deacon-13b leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|arc:challenge|25_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-09-22T07-24-15.341487.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_27T07_52_54.857198 path: - '**/details_harness|drop|3_2023-10-27T07-52-54.857198.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-27T07-52-54.857198.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_27T07_52_54.857198 path: - '**/details_harness|gsm8k|5_2023-10-27T07-52-54.857198.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-27T07-52-54.857198.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hellaswag|10_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 
2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-22T07-24-15.341487.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-22T07-24-15.341487.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-22T07-24-15.341487.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-22T07-24-15.341487.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-22T07-24-15.341487.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-22T07-24-15.341487.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-22T07-24-15.341487.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-management|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_09_22T07_24_15.341487 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-22T07-24-15.341487.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-22T07-24-15.341487.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_professional_law_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-professional_law|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-professional_law|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_public_relations_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-public_relations|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-public_relations|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_security_studies_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-security_studies|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-security_studies|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_sociology_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-sociology|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-sociology|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_virology_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-virology|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-virology|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_hendrycksTest_world_religions_5
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|hendrycksTest-world_religions|5_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-world_religions|5_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_truthfulqa_mc_0
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - '**/details_harness|truthfulqa:mc|0_2023-09-22T07-24-15.341487.parquet'
  - split: latest
    path:
    - '**/details_harness|truthfulqa:mc|0_2023-09-22T07-24-15.341487.parquet'
- config_name: harness_winogrande_5
  data_files:
  - split: 2023_10_27T07_52_54.857198
    path:
    - '**/details_harness|winogrande|5_2023-10-27T07-52-54.857198.parquet'
  - split: latest
    path:
    - '**/details_harness|winogrande|5_2023-10-27T07-52-54.857198.parquet'
- config_name: results
  data_files:
  - split: 2023_09_22T07_24_15.341487
    path:
    - results_2023-09-22T07-24-15.341487.parquet
  - split: 2023_10_27T07_52_54.857198
    path:
    - results_2023-10-27T07-52-54.857198.parquet
  - split: latest
    path:
    - results_2023-10-27T07-52-54.857198.parquet
---

# Dataset Card for Evaluation run of KnutJaegersberg/deacon-13b

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/KnutJaegersberg/deacon-13b
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [KnutJaegersberg/deacon-13b](https://huggingface.co/KnutJaegersberg/deacon-13b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_KnutJaegersberg__deacon-13b",
	"harness_winogrande_5",
	split="train")
```

## Latest results

These are the [latest results from run 2023-10-27T07:52:54.857198](https://huggingface.co/datasets/open-llm-leaderboard/details_KnutJaegersberg__deacon-13b/blob/main/results_2023-10-27T07-52-54.857198.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.0012583892617449664,
        "em_stderr": 0.0003630560893119389,
        "f1": 0.05671665268456401,
        "f1_stderr": 0.001312852180013837,
        "acc": 0.43354338539457016,
        "acc_stderr": 0.010175607297065709
    },
    "harness|drop|3": {
        "em": 0.0012583892617449664,
        "em_stderr": 0.0003630560893119389,
        "f1": 0.05671665268456401,
        "f1_stderr": 0.001312852180013837
    },
    "harness|gsm8k|5": {
        "acc": 0.10386656557998483,
        "acc_stderr": 0.008403622228924029
    },
    "harness|winogrande|5": {
        "acc": 0.7632202052091555,
        "acc_stderr": 0.011947592365207389
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
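Provider:
open-llm-leaderboard
Summary of Original Information

Dataset Overview

This dataset was created automatically during the evaluation run of the model KnutJaegersberg/deacon-13b on the Open LLM Leaderboard.

Dataset Composition

  • The dataset contains 64 configurations, each corresponding to one evaluation task.
  • The dataset was created from 2 runs. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.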
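
The naming convention for per-run splits can be sketched in a few lines. The rule below is inferred from the names that appear on this page (run `2023-10-27T07:52:54.857198`, split `2023_10_27T07_52_54.857198`, file suffix `2023-10-27T07-52-54.857198`); the helper names `split_name` and `file_timestamp` are hypothetical and are not part of any library.

```python
def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name seen in the configurations.

    Inferred rule (an assumption based on the names on this page):
    every '-' and ':' becomes '_', and the fractional seconds stay as they are.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_timestamp(run_timestamp: str) -> str:
    """Turn a run timestamp into the suffix used in the parquet/JSON file names.

    Inferred rule (an assumption): only ':' is replaced, by '-'.
    """
    return run_timestamp.replace(":", "-")


run = "2023-10-27T07:52:54.857198"
print(split_name(run))
print(file_timestamp(run))
```

Keeping the two conversions separate mirrors the fact that split names and file names use different separators for the same run.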

Data Loading Example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_KnutJaegersberg__deacon-13b",
    "harness_winogrande_5",
    split="train",
)
```

Latest Results

The latest results, from run 2023-10-27T07:52:54.857198:

```python
{
    "all": {
        "em": 0.0012583892617449664,
        "em_stderr": 0.0003630560893119389,
        "f1": 0.05671665268456401,
        "f1_stderr": 0.001312852180013837,
        "acc": 0.43354338539457016,
        "acc_stderr": 0.010175607297065709
    },
    "harness|drop|3": {
        "em": 0.0012583892617449664,
        "em_stderr": 0.0003630560893119389,
        "f1": 0.05671665268456401,
        "f1_stderr": 0.001312852180013837
    },
    "harness|gsm8k|5": {
        "acc": 0.10386656557998483,
        "acc_stderr": 0.008403622228924029
    },
    "harness|winogrande|5": {
        "acc": 0.7632202052091555,
        "acc_stderr": 0.011947592365207389
    }
}
```
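
For these particular numbers, each entry under "all" equals the unweighted mean of that metric over the tasks that report it. The sketch below checks this; it is an observation about the values shown here, not a confirmed description of how the leaderboard computes its aggregates, and the function name `aggregate` is hypothetical.

```python
from math import isclose
from statistics import mean

# Per-task results copied from the latest run shown above.
per_task = {
    "harness|drop|3": {
        "em": 0.0012583892617449664,
        "em_stderr": 0.0003630560893119389,
        "f1": 0.05671665268456401,
        "f1_stderr": 0.001312852180013837,
    },
    "harness|gsm8k|5": {
        "acc": 0.10386656557998483,
        "acc_stderr": 0.008403622228924029,
    },
    "harness|winogrande|5": {
        "acc": 0.7632202052091555,
        "acc_stderr": 0.011947592365207389,
    },
}


def aggregate(results: dict) -> dict:
    """Average every metric over the tasks that report it (unweighted)."""
    collected = {}
    for scores in results.values():
        for metric, value in scores.items():
            collected.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in collected.items()}


overall = aggregate(per_task)
# Compare with the reported "all" block, allowing for floating-point rounding.
print(isclose(overall["acc"], 0.43354338539457016, rel_tol=1e-12))
print(isclose(overall["acc_stderr"], 0.010175607297065709, rel_tol=1e-12))
```

Because only DROP reports `em` and `f1`, those aggregate values are identical to the DROP values.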

Configuration Details

The dataset's configurations are listed below:

  • harness_arc_challenge_25

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|arc:challenge|25_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|arc:challenge|25_2023-09-22T07-24-15.341487.parquet
  • harness_drop_3

    • Split: 2023_10_27T07_52_54.857198
    • Path: **/details_harness|drop|3_2023-10-27T07-52-54.857198.parquet
    • Split: latest
    • Path: **/details_harness|drop|3_2023-10-27T07-52-54.857198.parquet
  • harness_gsm8k_5

    • Split: 2023_10_27T07_52_54.857198
    • Path: **/details_harness|gsm8k|5_2023-10-27T07-52-54.857198.parquet
    • Split: latest
    • Path: **/details_harness|gsm8k|5_2023-10-27T07-52-54.857198.parquet
  • harness_hellaswag_10

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hellaswag|10_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hellaswag|10_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet and others (40 files)
    • Split: latest
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet and others (40 files)
  • harness_hendrycksTest_abstract_algebra_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_anatomy_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_astronomy_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_business_ethics_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_clinical_knowledge_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_college_biology_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_college_chemistry_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_college_computer_science_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_college_mathematics_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_college_medicine_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_college_physics_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_computer_security_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_conceptual_physics_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_econometrics_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-09-22T07-24-15.341487.parquet
  • harness_hendrycksTest_electrical_engineering_5

    • Split: 2023_09_22T07_24_15.341487
    • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-09-22T07-24-15.341487.parquet
    • Split: latest
    • Path:
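
The path patterns in the listing above encode the task name, the few-shot count, and the run timestamp, and the configuration names appear to follow from the first two. A minimal sketch of that relationship is shown here; the regular expression and the helper names `parse_details_path` and `config_name` are assumptions inferred from the entries on this page, not an official specification.

```python
import re
from typing import NamedTuple


class DetailsFile(NamedTuple):
    task: str          # e.g. "arc:challenge" or "hendrycksTest-anatomy"
    num_fewshot: int   # e.g. 25
    timestamp: str     # e.g. "2023-09-22T07-24-15.341487"


# Pattern inferred from names such as
# "details_harness|arc:challenge|25_2023-09-22T07-24-15.341487.parquet".
_DETAILS_RE = re.compile(
    r"details_harness\|(?P<task>[^|]+)\|(?P<shots>\d+)_(?P<ts>.+)\.parquet$"
)


def parse_details_path(path: str) -> DetailsFile:
    """Split a details file pattern into task, few-shot count and timestamp."""
    match = _DETAILS_RE.search(path)
    if match is None:
        raise ValueError(f"not a details file pattern: {path!r}")
    return DetailsFile(match["task"], int(match["shots"]), match["ts"])


def config_name(task: str, num_fewshot: int) -> str:
    """Build the configuration name; non-alphanumeric runs become '_'."""
    safe_task = re.sub(r"[^0-9A-Za-z]+", "_", task)
    return f"harness_{safe_task}_{num_fewshot}"


info = parse_details_path(
    "**/details_harness|arc:challenge|25_2023-09-22T07-24-15.341487.parquet"
)
print(info)
print(config_name(info.task, info.num_fewshot))
```

Parsing the pattern rather than hard-coding names makes it easier to group files by task or by run when working with a local copy of the repository.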
Collected Summary
Dataset Introduction
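Construction
The dataset was generated while the KnutJaegersberg/deacon-13b model was evaluated automatically under the Open LLM Leaderboard evaluation framework. It is organized around 64 evaluation-task configurations, each corresponding to one specific task, such as the ARC Challenge, DROP, GSM8K, HellaSwag, and the Hendrycks test covering 57 subjects. The data come from two independent runs. Each run's results are stored as a separate split named by the run timestamp, while the 'train' split always points to the most recent results. The dataset also includes a separate configuration named 'results', which aggregates the overall metrics of all runs and supports the computation and display of aggregated metrics on the leaderboard.
Features
The dataset's defining feature is its multi-level, multi-task structure. It comprises 64 configurations, each representing one evaluation task, spanning capabilities from commonsense reasoning and math problem solving to multi-subject knowledge question answering. Within each configuration, timestamped splits preserve the history of evaluation runs, which makes it possible to track how the model's performance changes over time, while the 'train' split points to the latest data so that users get the most recent results. The 'results' configuration acts as the aggregation point: it collects the top-level metrics for all tasks (exact match, F1, and accuracy, each with its standard error) and gives a common yardstick for comparing models.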
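
Because every score is reported together with a standard error, a rough uncertainty range can be attached to it. The sketch below uses a normal approximation (score ± 1.96 standard errors for roughly 95% coverage); this is a common rule of thumb chosen here for illustration, not something the dataset card prescribes, and `normal_interval` is a hypothetical helper.

```python
def normal_interval(score: float, stderr: float, z: float = 1.96) -> tuple:
    """Approximate confidence interval: score +/- z * stderr.

    Assumes the estimate is roughly normally distributed, which is a
    simplification for accuracies close to 0 or 1.
    """
    half_width = z * stderr
    return (score - half_width, score + half_width)


# Winogrande accuracy and standard error from the latest run.
low, high = normal_interval(0.7632202052091555, 0.011947592365207389)
print(f"Winogrande accuracy: 76.3% (approx. 95% interval {low:.1%} to {high:.1%})")
```

Reporting the interval alongside the point estimate helps when deciding whether a difference between two models on the same task is larger than the evaluation noise.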
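Usage
Researchers can load the dataset with the Hugging Face datasets library. Call load_dataset with the dataset name 'open-llm-leaderboard/details_KnutJaegersberg__deacon-13b', the configuration name of the target task (for example 'harness_winogrande_5'), and the desired split (for example 'train' for the latest results) to obtain the detailed evaluation data for that task. To analyze earlier runs, select a split named by timestamp to access the results from that specific run. To obtain aggregated performance across all tasks, load the 'results' configuration and read the accuracy, standard error, and other summary metrics for each task from it.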
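
The loading steps described above can be wrapped in a small helper. This is a sketch under stated assumptions: `load_run` and `REPO_ID` are hypothetical names, the call needs the `datasets` package and network access, and the default split is set to `latest` because that is the split name declared in the configuration list on this page, whereas the card's own snippet passes `split="train"`. Which of the two resolves may depend on the library version, so the split is left as a parameter.

```python
REPO_ID = "open-llm-leaderboard/details_KnutJaegersberg__deacon-13b"


def load_run(config_name: str, split: str = "latest", repo_id: str = REPO_ID):
    """Load one configuration of the details dataset.

    config_name: a task configuration such as "harness_winogrande_5",
                 or "results" for the aggregated metrics.
    split:       "latest" (assumed default, see the note above) or a
                 timestamped split such as "2023_10_27T07_52_54.857198".
    """
    # Imported lazily so that defining the helper does not require the package.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split=split)


# Example calls (not executed here because they download data):
# details = load_run("harness_winogrande_5")
# older = load_run("results", split="2023_09_22T07_24_15.341487")
```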
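Background and Challenges
Background
As large language models (LLMs) develop rapidly, evaluating a model's overall capabilities systematically and fairly has become a key problem for both academia and industry. The Open LLM Leaderboard, created by the Hugging Face team in 2023, addresses it by giving the community a standardized, transparent venue for comparing model performance. This dataset records the evaluation of the KnutJaegersberg/deacon-13b model on the Leaderboard. It was produced between September and October 2023, and Clementine (clementine@hf.co) is listed as its point of contact. Its central aim is to quantify the model's performance in reasoning, commonsense, mathematics, and specialized knowledge through multi-task benchmarks: the ARC Challenge, DROP, GSM8K, HellaSwag, Winogrande, and the MMLU test covering 57 subjects. The dataset gives a detailed performance profile of deacon-13b and a reference baseline for researchers comparing LLMs of different architectures, sizes, and training strategies, in support of fair evaluation and iterative improvement of open-source models.
Current Challenges
The first challenge comes from a difficulty inherent to LLM evaluation: existing benchmarks cover a wide range of tasks but cannot fully reflect how well a model generalizes, or how robust it is, in complex real-world settings. For example, deacon-13b's very low exact-match score (0.13%) and F1 score (5.67%) on DROP point to a clear weakness in detail-heavy reading comprehension, and its accuracy of only 10.39% on the GSM8K math reasoning task shows how fragile the model is in symbolic reasoning and multi-step calculation. Building the dataset is also technically demanding. Results from several evaluations, run at different times and with different configurations (such as different numbers of few-shot examples), must be merged into a single traceable format, and the "latest" split must always point to the most recent evaluation data. This update mechanism places strict requirements on data versioning, storage consistency, and API compatibility; any lapse in metadata management can lead to mixed-up results or make them hard to reproduce.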
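
The requirement that "latest" track the most recent run can be illustrated by picking the newest timestamped split by parsed time rather than by string order. The sketch below uses the two split names listed for the `results` configuration; the format string is inferred from those names, and `latest_split` is a hypothetical helper, not part of the leaderboard tooling.

```python
from datetime import datetime

# Format inferred from split names such as "2023_10_27T07_52_54.857198".
SPLIT_FORMAT = "%Y_%m_%dT%H_%M_%S.%f"


def parse_split(name: str) -> datetime:
    """Parse a timestamped split name into a datetime."""
    return datetime.strptime(name, SPLIT_FORMAT)


def latest_split(split_names: list) -> str:
    """Return the split whose timestamp is the most recent."""
    return max(split_names, key=parse_split)


runs = ["2023_09_22T07_24_15.341487", "2023_10_27T07_52_54.857198"]
print(latest_split(runs))
```

Parsing to `datetime` makes the comparison explicit and fails loudly on a malformed name, which is one way to catch the kind of metadata slip mentioned above.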
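Common Scenarios
Classic Use Cases
In large-scale language model evaluation, the Open LLM Leaderboard datasets provide a standardized benchmark for comparing models. This dataset is built from the deacon-13b model's inference results on established tasks such as ARC, HellaSwag, MMLU, and TruthfulQA. By loading the evaluation splits of different configurations, researchers can reproduce the model's performance in commonsense reasoning, knowledge question answering, and math problem solving. Its typical use is as a diagnostic tool for checking how well a pretrained or fine-tuned model generalizes across diverse NLP tasks.
Derived and Related Work
Work derived from this dataset includes model performance predictors trained on the evaluation results, methods that generate adversarial test sets to identify model weaknesses automatically, and tools for analyzing capability maps across models. Using its multi-task evaluation structure, researchers have built aids such as evaluation-result dashboards and model capability radar charts, which let non-specialist users see a model's strengths and weaknesses at a glance. Together, these efforts form a closed loop from evaluation to optimization and continue to advance language model evaluation techniques.
Recent Research on the Dataset
Latest Research Directions
In LLM performance evaluation, the Open LLM Leaderboard has become a widely used benchmark platform for measuring a model's overall capabilities. For the deacon-13b evaluation dataset, recent research focuses on fine-grained, multi-dimensional analysis that separates individual capabilities. The dataset covers tasks ranging from commonsense reasoning (76.3% accuracy on Winogrande) to math problem solving (10.4% accuracy on GSM8K), revealing a marked gap between knowledge-intensive tasks and symbolic reasoning tasks for current models at the 13B-parameter scale. Current work uses such fine-grained results to explore whether mixture-of-experts architectures or instruction fine-tuning can narrow this gap, and it also examines the bias patterns that emerge in evaluation tasks, with the aim of giving data-driven guidance for building more robust and more general LLMs.
The content above was collected and summarized by 遇见数据集.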