open-llm-leaderboard-old/details_ewof__koishi-instruct-3b

Hugging Face · Updated 2023-09-17 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_ewof__koishi-instruct-3b
Resource description:
---
pretty_name: Evaluation run of ewof/koishi-instruct-3b
dataset_summary: |
  Dataset automatically created during the evaluation run of model [ewof/koishi-instruct-3b](https://huggingface.co/ewof/koishi-instruct-3b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

  The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

  The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

  An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

  To load the details from a run, you can for instance do the following:
  ```python
  from datasets import load_dataset

  # "latest" is the split name declared in this card's configs; it points to the most recent run.
  data = load_dataset(
      "open-llm-leaderboard/details_ewof__koishi-instruct-3b",
      "harness_winogrande_5",
      split="latest",
  )
  ```

  ## Latest results

  These are the [latest results from run 2023-09-17T08:44:21.498764](https://huggingface.co/datasets/open-llm-leaderboard/details_ewof__koishi-instruct-3b/blob/main/results_2023-09-17T08-44-21.498764.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each one in the "results" configuration and in the "latest" split of each eval):

  ```python
  {
      "all": {
          "em": 0.001153523489932886,
          "em_stderr": 0.0003476179896857095,
          "f1": 0.05410444630872499,
          "f1_stderr": 0.0012841997819823922,
          "acc": 0.32612811480319515,
          "acc_stderr": 0.008201890700454486
      },
      "harness|drop|3": {
          "em": 0.001153523489932886,
          "em_stderr": 0.0003476179896857095,
          "f1": 0.05410444630872499,
          "f1_stderr": 0.0012841997819823922
      },
      "harness|gsm8k|5": {
          "acc": 0.011372251705837756,
          "acc_stderr": 0.002920666198788737
      },
      "harness|winogrande|5": {
          "acc": 0.6408839779005525,
          "acc_stderr": 0.013483115202120236
      }
  }
  ```
repo_url: https://huggingface.co/ewof/koishi-instruct-3b
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - '**/details_harness|arc:challenge|25_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|arc:challenge|25_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_drop_3
  data_files:
  - split: 2023_09_17T08_44_21.498764
    path:
    - '**/details_harness|drop|3_2023-09-17T08-44-21.498764.parquet'
  - split: latest
    path:
    - '**/details_harness|drop|3_2023-09-17T08-44-21.498764.parquet'
- config_name: harness_gsm8k_5
  data_files:
  - split: 2023_09_17T08_44_21.498764
    path:
    - '**/details_harness|gsm8k|5_2023-09-17T08-44-21.498764.parquet'
  - split: latest
    path:
    - '**/details_harness|gsm8k|5_2023-09-17T08-44-21.498764.parquet'
- config_name: harness_hellaswag_10
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - '**/details_harness|hellaswag|10_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|hellaswag|10_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_hendrycksTest_5
  data_files:
  - split: 2023_07_19T14_49_25.234956
path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:49:25.234956.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:49:25.234956.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:49:25.234956.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:49:25.234956.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T14:49:25.234956.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-19T14:49:25.234956.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:49:25.234956.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-management|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T14:49:25.234956.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T14:49:25.234956.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_07_19T14_49_25.234956 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_hendrycksTest_virology_5
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - '**/details_harness|hendrycksTest-virology|5_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-virology|5_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_hendrycksTest_world_religions_5
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_truthfulqa_mc_0
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - '**/details_harness|truthfulqa:mc|0_2023-07-19T14:49:25.234956.parquet'
  - split: latest
    path:
    - '**/details_harness|truthfulqa:mc|0_2023-07-19T14:49:25.234956.parquet'
- config_name: harness_winogrande_5
  data_files:
  - split: 2023_09_17T08_44_21.498764
    path:
    - '**/details_harness|winogrande|5_2023-09-17T08-44-21.498764.parquet'
  - split: latest
    path:
    - '**/details_harness|winogrande|5_2023-09-17T08-44-21.498764.parquet'
- config_name: results
  data_files:
  - split: 2023_07_19T14_49_25.234956
    path:
    - results_2023-07-19T14:49:25.234956.parquet
  - split: 2023_09_17T08_44_21.498764
    path:
    - results_2023-09-17T08-44-21.498764.parquet
  - split: latest
    path:
    - results_2023-09-17T08-44-21.498764.parquet
---

# Dataset Card for Evaluation run of ewof/koishi-instruct-3b

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/ewof/koishi-instruct-3b
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [ewof/koishi-instruct-3b](https://huggingface.co/ewof/koishi-instruct-3b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# "latest" is the split name declared in this card's configs; it points to the most recent run.
data = load_dataset(
    "open-llm-leaderboard/details_ewof__koishi-instruct-3b",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-09-17T08:44:21.498764](https://huggingface.co/datasets/open-llm-leaderboard/details_ewof__koishi-instruct-3b/blob/main/results_2023-09-17T08-44-21.498764.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each one in the "results" configuration and in the "latest" split of each eval):

```python
{
    "all": {
        "em": 0.001153523489932886,
        "em_stderr": 0.0003476179896857095,
        "f1": 0.05410444630872499,
        "f1_stderr": 0.0012841997819823922,
        "acc": 0.32612811480319515,
        "acc_stderr": 0.008201890700454486
    },
    "harness|drop|3": {
        "em": 0.001153523489932886,
        "em_stderr": 0.0003476179896857095,
        "f1": 0.05410444630872499,
        "f1_stderr": 0.0012841997819823922
    },
    "harness|gsm8k|5": {
        "acc": 0.011372251705837756,
        "acc_stderr": 0.002920666198788737
    },
    "harness|winogrande|5": {
        "acc": 0.6408839779005525,
        "acc_stderr": 0.013483115202120236
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
open-llm-leaderboard-old
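Summary of original information

Dataset overview

Dataset source

This dataset was created automatically during the evaluation run of the model ewof/koishi-instruct-3b on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 64 configurations, each corresponding to one evaluated task.
  • The dataset was created from 2 runs. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.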
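The naming pattern for configurations and splits can be illustrated with a small sketch. The two helper functions below are hypothetical and are inferred only from the names that appear on this page (for example `harness|arc:challenge|25` next to `harness_arc_challenge_25`, and run `2023-09-17T08:44:21.498764` next to split `2023_09_17T08_44_21.498764`); the card does not document the rule, so treat it as an assumption.

```python
import re


def config_name_for(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name."""
    # Assumption: each run of non-alphanumeric characters becomes one underscore.
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def split_name_for(run_timestamp: str) -> str:
    """Map a run timestamp such as '2023-09-17T08:44:21.498764' to a split name."""
    # Assumption: '-' and ':' become '_', while the fractional-seconds dot is kept.
    return re.sub(r"[-:]", "_", run_timestamp)


print(config_name_for("harness|winogrande|5"))
print(split_name_for("2023-09-17T08:44:21.498764"))
```

Deriving names this way is mainly useful for moving between the keys in the aggregated results and the configuration to load; if the repository uses a different rule for some task, the configuration list itself is the reference.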

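Additional configuration

  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.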

Data loading example

    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_ewof__koishi-instruct-3b",
                        "harness_winogrande_5",
                        split="train")

Latest results

The latest results, from run 2023-09-17T08:44:21.498764:

    {
        "all": {
            "em": 0.001153523489932886,
            "em_stderr": 0.0003476179896857095,
            "f1": 0.05410444630872499,
            "f1_stderr": 0.0012841997819823922,
            "acc": 0.32612811480319515,
            "acc_stderr": 0.008201890700454486
        },
        "harness|drop|3": {
            "em": 0.001153523489932886,
            "em_stderr": 0.0003476179896857095,
            "f1": 0.05410444630872499,
            "f1_stderr": 0.0012841997819823922
        },
        "harness|gsm8k|5": {
            "acc": 0.011372251705837756,
            "acc_stderr": 0.002920666198788737
        },
        "harness|winogrande|5": {
            "acc": 0.6408839779005525,
            "acc_stderr": 0.013483115202120236
        }
    }
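The "all" entry can be checked against the per-task entries. The sketch below copies the numbers reported above and tests whether each "all" value equals the unweighted mean over the tasks that report that metric. The averaging rule is an inference from these numbers, not something the card states.

```python
import math

# Values copied from the latest results shown above.
results = {
    "all": {
        "em": 0.001153523489932886,
        "f1": 0.05410444630872499,
        "acc": 0.32612811480319515,
        "acc_stderr": 0.008201890700454486,
    },
    "harness|drop|3": {
        "em": 0.001153523489932886,
        "f1": 0.05410444630872499,
    },
    "harness|gsm8k|5": {
        "acc": 0.011372251705837756,
        "acc_stderr": 0.002920666198788737,
    },
    "harness|winogrande|5": {
        "acc": 0.6408839779005525,
        "acc_stderr": 0.013483115202120236,
    },
}


def mean_over_tasks(metric: str) -> float:
    """Unweighted mean of `metric` over every task entry that reports it."""
    values = [m[metric] for task, m in results.items()
              if task != "all" and metric in m]
    return sum(values) / len(values)


# Assumption being tested: "all" is a plain average across tasks.
matches = {
    metric: math.isclose(mean_over_tasks(metric), value, rel_tol=1e-9)
    for metric, value in results["all"].items()
}
print(matches)
```

Because "em" and "f1" are reported by DROP only, their "all" values coincide with the DROP values, while "acc" and "acc_stderr" sit midway between GSM8K and Winogrande.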

Configuration details

  • harness_arc_challenge_25

    • Split: 2023_07_19T14_49_25.234956
      • Path: **/details_harness|arc:challenge|25_2023-07-19T14:49:25.234956.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-07-19T14:49:25.234956.parquet
  • harness_drop_3

    • Split: 2023_09_17T08_44_21.498764
      • Path: **/details_harness|drop|3_2023-09-17T08-44-21.498764.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-09-17T08-44-21.498764.parquet
  • harness_gsm8k_5

    • Split: 2023_09_17T08_44_21.498764
      • Path: **/details_harness|gsm8k|5_2023-09-17T08-44-21.498764.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-09-17T08-44-21.498764.parquet
  • harness_hellaswag_10

    • Split: 2023_07_19T14_49_25.234956
      • Path: **/details_harness|hellaswag|10_2023-07-19T14:49:25.234956.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-07-19T14:49:25.234956.parquet
  • harness_hendrycksTest_5

    • Split: 2023_07_19T14_49_25.234956
      • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-anatomy|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-astronomy|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-college_biology|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-college_physics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-computer_security|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-econometrics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-formal_logic|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-global_facts|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-human_aging|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-international_law|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-machine_learning|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-management|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-marketing|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-nutrition|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-philosophy|5_2023-07-19T14:49:25.234956.parquet
      • Path: **/details_harness|hendrycksTest-prehistory|5_2023-07-19T14
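Collected summary
Dataset introduction
Construction
The Open LLM Leaderboard is a benchmark platform for evaluating large language models, and this dataset organizes the fine-grained results produced by its evaluation process. It comes from two separate evaluation runs of the model ewof/koishi-instruct-3b. The results of each run are stored as a split named after the run's timestamp in every configuration that the run covered. The dataset contains 64 configurations, each corresponding to one evaluated task, such as ARC-Challenge, GSM8K, and Winogrande. A separate configuration named "results" collects the aggregated metrics of all runs and is used to compute and display final performance on the leaderboard. The "train" split points to the results of the latest run by default.
Usage
The dataset can be loaded with the Hugging Face datasets library. In Python, calling `load_dataset("open-llm-leaderboard/details_ewof__koishi-instruct-3b", "harness_winogrande_5", split="train")` returns the latest evaluation details for the Winogrande task. To access a specific run, replace the split name with that run's timestamp, for example "2023_09_17T08_44_21.498764". Loading the "results" configuration returns the aggregated results for all tasks in one call, which gives an overview of overall performance. Together these access paths support both per-task analysis of the model and reproduction of the reported results.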
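The access pattern described above can be wrapped in a small helper. This is a minimal sketch, not the leaderboard's own tooling: `load_run` is a hypothetical name, the repository id is the one written in the card (the page header lists the repository under `open-llm-leaderboard-old`, so the id may need adjusting), and the default split `"latest"` is an assumption based on the configuration listing, whereas the card's own example passes `split="train"`.

```python
from typing import Any

# Repository id as written in the card; an assumption, since the page header
# lists the dataset under the "open-llm-leaderboard-old" organization.
REPO_ID = "open-llm-leaderboard/details_ewof__koishi-instruct-3b"


def load_run(config: str, split: str = "latest", repo_id: str = REPO_ID) -> Any:
    """Load one configuration of the details dataset for a given split.

    `split` is either a timestamp split such as "2023_09_17T08_44_21.498764"
    or the alias for the newest run ("latest" in the configuration listing).
    """
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=split)
```

Usage would look like `load_run("harness_winogrande_5")` for the newest Winogrande details, `load_run("harness_winogrande_5", "2023_09_17T08_44_21.498764")` for a specific run, and `load_run("results")` for the aggregated metrics. Each call needs network access to the Hugging Face Hub.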
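Background and challenges
Background overview
As large language model (LLM) technology has developed rapidly, systematic and fair evaluation of model performance has become central to progress in the field. Against this background, the Hugging Face team launched the Open LLM Leaderboard project in 2023 to build an open, continuously updated model evaluation platform. This dataset is a by-product of evaluating the koishi-instruct-3b model. It was created by the Hugging Face research team (point of contact: Clementine) between July and September 2023, and its core research question is how to measure the performance of different LLMs on diverse tasks in a standardized, reproducible way. By combining benchmarks such as ARC, HellaSwag, and MMLU, the dataset gives the community a resource for fine-grained analysis of model performance and supports more transparent and consistent LLM evaluation practice.
Current challenges
The domain problem this dataset addresses is that LLM evaluations are hard to compare across models because of differences in task selection, metrics, and implementation details. The Open LLM Leaderboard lowers this barrier by using a single evaluation framework, the Language Model Evaluation Harness. Specific challenges include: 1) balancing breadth and depth of coverage, since model ability has to be measured along several dimensions, from commonsense reasoning (HellaSwag) and mathematical reasoning (GSM8K) to specialized knowledge (the 57 subdomains of MMLU); 2) managing versions and reproducibility during construction, since evaluation involves multiple runs (this dataset records 2), which makes timestamp-named splits and an automatic pointer to the latest results necessary for data provenance and consistent results.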
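Common scenarios
Typical use
As a by-product of the Open LLM Leaderboard evaluation pipeline, the dataset is mainly used to record and reproduce the fine-grained performance of a specific model (here ewof/koishi-instruct-3b) on a range of natural language understanding and reasoning tasks. It contains 64 separate configurations covering benchmarks such as ARC-Challenge, HellaSwag, MMLU, GSM8K, and Winogrande. By loading a specific task configuration (such as harness_winogrande_5) together with a timestamp split, researchers can trace the model's scores and their statistical errors in areas such as commonsense reasoning, mathematical problem solving, and knowledge question answering, which provides transparent, verifiable baseline data for model comparison and iteration.
Academic problems addressed
The dataset addresses two common problems in LLM evaluation: poor reproducibility and the lack of fine-grained analysis. By recording per-task metrics from multiple evaluation runs in a standard form (such as accuracy, F1 score, and their standard errors), it lets researchers look past a single aggregate score and examine where the model's abilities end at different levels of difficulty, from commonsense reasoning to specialized subjects. This supports empirical work on model generalization, robustness, and bias diagnosis, and the recorded runs can serve as a reference point when evaluating later models.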
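Since each score is reported with a standard error, a rough interval can be attached to it. The sketch below is an illustration under a normal-approximation assumption (the card does not say how the standard errors were computed or how they should be interpreted); `normal_interval` is a hypothetical helper and the inputs are the Winogrande and GSM8K values reported on this page.

```python
def normal_interval(value: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) = value -/+ z * stderr, a normal-approximation interval.

    z = 1.96 corresponds to roughly 95% coverage under that assumption.
    """
    half_width = z * stderr
    return value - half_width, value + half_width


# Accuracy and standard error for two tasks, as reported in the latest run.
winogrande = normal_interval(0.6408839779005525, 0.013483115202120236)
gsm8k = normal_interval(0.011372251705837756, 0.002920666198788737)

print(f"winogrande acc: [{winogrande[0]:.4f}, {winogrande[1]:.4f}]")
print(f"gsm8k acc:      [{gsm8k[0]:.4f}, {gsm8k[1]:.4f}]")
```

An interval like this is a quick way to judge whether a difference between two models or two runs is larger than the reported sampling noise.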
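Practical applications
In practice, the dataset provides quantitative evidence for model selection and deployment decisions. Development teams can use the per-task scores to judge whether koishi-instruct-3b is suitable for settings that require commonsense understanding (Winogrande) or mathematical reasoning (GSM8K). The dataset can also support automated regression tests in continuous integration: comparing the results of runs with different timestamps shows how performance changes after a model update or fine-tuning. Because the data is stored in a structured format, it can also feed visualization dashboards, so that non-specialist users can see the model's strengths and weaknesses on specific tasks.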
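The regression-test idea can be sketched as a comparison of two metric dictionaries shaped like the results JSON on this page. Everything here is illustrative: `find_regressions` is a hypothetical helper, and the rule (flag a drop larger than `z` times the combined standard error) is an assumption, not a procedure described by the card. Note that the two runs recorded in this dataset cover different tasks (ARC, HellaSwag, MMLU, and TruthfulQA in July; DROP, GSM8K, and Winogrande in September), so they share no per-task metric to compare directly.

```python
import math


def find_regressions(baseline: dict, current: dict,
                     metric: str = "acc", z: float = 1.96) -> dict:
    """Return {task: drop} for tasks whose metric fell by more than z * combined stderr.

    Both arguments map task keys (e.g. "harness|gsm8k|5") to metric dicts
    containing `metric` and, optionally, `<metric>_stderr`.
    """
    stderr_key = f"{metric}_stderr"
    flagged = {}
    for task, cur in current.items():
        base = baseline.get(task)
        # Skip the aggregate entry and tasks missing from either run.
        if task == "all" or base is None or metric not in cur or metric not in base:
            continue
        drop = base[metric] - cur[metric]
        # Standard errors of the two runs combined in quadrature.
        combined = math.sqrt(base.get(stderr_key, 0.0) ** 2 + cur.get(stderr_key, 0.0) ** 2)
        if drop > z * combined:
            flagged[task] = drop
    return flagged
```

Tasks present in only one run are skipped rather than reported, which keeps a partial re-evaluation from producing spurious alerts.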
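Recent research on the dataset
Latest research directions
As evaluation practice for large language models continues to evolve, this dataset supports fine-grained analysis of how a specific instruction-tuned model (koishi-instruct-3b) performs across several benchmark tasks. Current work uses the standardized Open LLM Leaderboard evaluation framework, with tasks covering commonsense reasoning (Winogrande), mathematical problem solving (GSM8K), and reading comprehension (DROP), to map the limits of small models in complex reasoning and knowledge application. This direction responds to the community's demand for reproducible evaluation and fair comparison. It provides empirical evidence for optimizing lightweight models, supports an open and transparent evaluation ecosystem, and helps researchers locate model weaknesses and directions for improvement.
The content above was collected and summarized by 遇见数据集.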