
open-llm-leaderboard-old/details_KoboldAI__GPT-NeoX-20B-Erebus

Hugging Face · Updated 2023-10-24 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_KoboldAI__GPT-NeoX-20B-Erebus
Resource description:
--- pretty_name: Evaluation run of KoboldAI/GPT-NeoX-20B-Erebus dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [KoboldAI/GPT-NeoX-20B-Erebus](https://huggingface.co/KoboldAI/GPT-NeoX-20B-Erebus)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_KoboldAI__GPT-NeoX-20B-Erebus\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-24T16:29:58.049517](https://huggingface.co/datasets/open-llm-leaderboard/details_KoboldAI__GPT-NeoX-20B-Erebus/blob/main/results_2023-10-24T16-29-58.049517.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each in the \"results\" configuration and in the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.0009437919463087249,\n\ \ \"em_stderr\": 0.0003144653119413213,\n \"f1\": 0.050781250000000264,\n\ \ \"f1_stderr\": 0.0012129008741175679,\n \"acc\": 0.3519405232133358,\n\ \ \"acc_stderr\": 0.00860227452891923\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.0009437919463087249,\n \"em_stderr\": 0.0003144653119413213,\n\ \ \"f1\": 0.050781250000000264,\n \"f1_stderr\": 0.0012129008741175679\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.022744503411675512,\n \ \ \"acc_stderr\": 0.004106620637749689\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.681136543014996,\n \"acc_stderr\": 0.013097928420088771\n\ \ }\n}\n```" repo_url: https://huggingface.co/KoboldAI/GPT-NeoX-20B-Erebus leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|arc:challenge|25_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-07-19T21:38:23.585493.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_24T16_29_58.049517 path: - '**/details_harness|drop|3_2023-10-24T16-29-58.049517.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-24T16-29-58.049517.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_24T16_29_58.049517 path: - '**/details_harness|gsm8k|5_2023-10-24T16-29-58.049517.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-24T16-29-58.049517.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hellaswag|10_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 
2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T21:38:23.585493.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T21:38:23.585493.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T21:38:23.585493.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T21:38:23.585493.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T21:38:23.585493.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-19T21:38:23.585493.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T21:38:23.585493.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-management|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-virology|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-07-19T21:38:23.585493.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_07_19T21_38_23.585493 path: - '**/details_harness|truthfulqa:mc|0_2023-07-19T21:38:23.585493.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-07-19T21:38:23.585493.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_24T16_29_58.049517 path: - '**/details_harness|winogrande|5_2023-10-24T16-29-58.049517.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-24T16-29-58.049517.parquet' - config_name: results data_files: - split: 2023_07_19T21_38_23.585493 path: - results_2023-07-19T21:38:23.585493.parquet - split: 2023_10_24T16_29_58.049517 path: - results_2023-10-24T16-29-58.049517.parquet - split: latest path: - results_2023-10-24T16-29-58.049517.parquet ---

# Dataset Card for Evaluation run of KoboldAI/GPT-NeoX-20B-Erebus

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/KoboldAI/GPT-NeoX-20B-Erebus
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [KoboldAI/GPT-NeoX-20B-Erebus](https://huggingface.co/KoboldAI/GPT-NeoX-20B-Erebus) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset
data = load_dataset("open-llm-leaderboard/details_KoboldAI__GPT-NeoX-20B-Erebus",
    "harness_winogrande_5",
    split="train")
```

## Latest results

These are the [latest results from run 2023-10-24T16:29:58.049517](https://huggingface.co/datasets/open-llm-leaderboard/details_KoboldAI__GPT-NeoX-20B-Erebus/blob/main/results_2023-10-24T16-29-58.049517.json) (note that there might be results for other tasks in the repos if successive evals didn't cover the same tasks. You find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.0009437919463087249,
        "em_stderr": 0.0003144653119413213,
        "f1": 0.050781250000000264,
        "f1_stderr": 0.0012129008741175679,
        "acc": 0.3519405232133358,
        "acc_stderr": 0.00860227452891923
    },
    "harness|drop|3": {
        "em": 0.0009437919463087249,
        "em_stderr": 0.0003144653119413213,
        "f1": 0.050781250000000264,
        "f1_stderr": 0.0012129008741175679
    },
    "harness|gsm8k|5": {
        "acc": 0.022744503411675512,
        "acc_stderr": 0.004106620637749689
    },
    "harness|winogrande|5": {
        "acc": 0.681136543014996,
        "acc_stderr": 0.013097928420088771
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
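Provider:
open-llm-leaderboard-old
Summary of Original Information

Dataset Overview

This dataset was created automatically during the evaluation run of the model KoboldAI/GPT-NeoX-20B-Erebus on the Open LLM Leaderboard.

Dataset Composition

  • The dataset contains 64 configurations, each corresponding to one evaluated task.
  • The dataset was created from 2 runs. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.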
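The split-naming rule above can be illustrated with a small sketch. It assumes, judging only from the split names and run timestamps shown on this page, that a split name is the run timestamp with `-` and `:` replaced by `_`. The helper names are hypothetical and are not part of the evaluation harness.

```python
import re
from datetime import datetime


def split_name_from_timestamp(run_timestamp: str) -> str:
    """Replace '-' and ':' with '_' (rule inferred from the names on this page)."""
    return re.sub(r"[-:]", "_", run_timestamp)


def timestamp_from_split_name(split_name: str) -> datetime:
    """Parse a split name back into a datetime so runs can be ordered."""
    return datetime.strptime(split_name, "%Y_%m_%dT%H_%M_%S.%f")


# The two run timestamps that appear in this dataset card.
runs = ["2023-07-19T21:38:23.585493", "2023-10-24T16:29:58.049517"]
split_names = [split_name_from_timestamp(r) for r in runs]

# The most recent run is the one the aggregated "results" config treats as latest.
latest_split = max(split_names, key=timestamp_from_split_name)

print(split_names)
print(latest_split)
```

Ordering by the parsed timestamp rather than by string comparison keeps the sketch correct even if the name format were to change its field widths.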

Data Loading Example

    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_KoboldAI__GPT-NeoX-20B-Erebus",
        "harness_winogrande_5",
        split="train")

Latest Results

Latest results from run 2023-10-24T16:29:58.049517:

    {
        "all": {
            "em": 0.0009437919463087249,
            "em_stderr": 0.0003144653119413213,
            "f1": 0.050781250000000264,
            "f1_stderr": 0.0012129008741175679,
            "acc": 0.3519405232133358,
            "acc_stderr": 0.00860227452891923
        },
        "harness|drop|3": {
            "em": 0.0009437919463087249,
            "em_stderr": 0.0003144653119413213,
            "f1": 0.050781250000000264,
            "f1_stderr": 0.0012129008741175679
        },
        "harness|gsm8k|5": {
            "acc": 0.022744503411675512,
            "acc_stderr": 0.004106620637749689
        },
        "harness|winogrande|5": {
            "acc": 0.681136543014996,
            "acc_stderr": 0.013097928420088771
        }
    }
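The numbers above suggest that each entry under "all" is the unweighted mean of that metric over the tasks that report it. This is an observation from the values on this page, not a documented rule of the leaderboard. The sketch below recomputes the "all" block from the three per-task blocks; the function name `aggregate` is hypothetical.

```python
import math
from statistics import mean

# Per-task results copied from the latest run shown above.
latest = {
    "harness|drop|3": {
        "em": 0.0009437919463087249,
        "em_stderr": 0.0003144653119413213,
        "f1": 0.050781250000000264,
        "f1_stderr": 0.0012129008741175679,
    },
    "harness|gsm8k|5": {
        "acc": 0.022744503411675512,
        "acc_stderr": 0.004106620637749689,
    },
    "harness|winogrande|5": {
        "acc": 0.681136543014996,
        "acc_stderr": 0.013097928420088771,
    },
}


def aggregate(per_task: dict) -> dict:
    """Average each metric over the tasks that report it (assumed rule)."""
    collected = {}
    for metrics in per_task.values():
        for name, value in metrics.items():
            collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}


all_metrics = aggregate(latest)

# Compare with the reported "all" accuracy, allowing for float rounding.
print(math.isclose(all_metrics["acc"], 0.3519405232133358, rel_tol=1e-12))
```

Because only GSM8K and WinoGrande report `acc`, the aggregate accuracy of about 0.352 averages a near-zero math score with a much higher commonsense score, so it is best read alongside the per-task values.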

Configuration Details

Some of the configurations and their data file paths:

  • harness_arc_challenge_25

    • Split: 2023_07_19T21_38_23.585493
      • Path: **/details_harness|arc:challenge|25_2023-07-19T21:38:23.585493.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-07-19T21:38:23.585493.parquet
  • harness_drop_3

    • Split: 2023_10_24T16_29_58.049517
      • Path: **/details_harness|drop|3_2023-10-24T16-29-58.049517.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-10-24T16-29-58.049517.parquet
  • harness_gsm8k_5

    • Split: 2023_10_24T16_29_58.049517
      • Path: **/details_harness|gsm8k|5_2023-10-24T16-29-58.049517.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-10-24T16-29-58.049517.parquet
  • harness_hellaswag_10

    • Split: 2023_07_19T21_38_23.585493
      • Path: **/details_harness|hellaswag|10_2023-07-19T21:38:23.585493.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-07-19T21:38:23.585493.parquet
  • harness_hendrycksTest_5

    • Split: 2023_07_19T21_38_23.585493
      • Path:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-global_facts|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_biology|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_european_history|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_geography|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_physics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_psychology|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_statistics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_us_history|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-high_school_world_history|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-human_aging|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-human_sexuality|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-international_law|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-jurisprudence|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-logical_fallacies|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-machine_learning|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-management|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-marketing|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-medical_genetics|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-miscellaneous|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-moral_disputes|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-moral_scenarios|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-nutrition|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-philosophy|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-prehistory|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-professional_accounting|5_2023-07-19T21:38:23.585493.parquet
        • **/details_harness|hendrycksTest-professional_law|5_2023-07-19T21
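Collected Summary
Dataset Introduction
Construction
In large language model evaluation, precise quantification of model performance is key to driving technical iteration. This dataset is a by-product of the automated evaluation of the KoboldAI/GPT-NeoX-20B-Erebus model within the Open LLM Leaderboard framework. It is organized around 64 task configurations, each corresponding to one independent evaluation task, such as ARC Challenge, DROP, and GSM8K. The data comes from two separate evaluation runs. Each run's results are stored in a split named after the run's timestamp, and the "train" split always points to the most recent results. The dataset also includes an additional configuration named "results", which aggregates the overall metrics of all runs and backs the aggregate metrics shown on the leaderboard. All data is stored in Parquet format, which supports efficient reads and writes and structured storage.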
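The relationship between a task key, its configuration name, and its Parquet path can be sketched as follows. The rule is inferred from the names listed on this page (every non-alphanumeric character in the task key becomes `_`), and both helper names are hypothetical. The file timestamp is passed through unchanged because the two runs on this page spell it differently (colons in July, hyphens in October).

```python
import re


def config_name_from_task(task_key: str) -> str:
    """Map e.g. 'harness|arc:challenge|25' to 'harness_arc_challenge_25' (inferred rule)."""
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def details_glob(task_key: str, file_timestamp: str) -> str:
    """Build the glob pattern used for a task's detail file, as listed in the configs."""
    return f"**/details_{task_key}_{file_timestamp}.parquet"


# Task keys and file timestamps taken from the configuration list on this page.
examples = [
    ("harness|arc:challenge|25", "2023-07-19T21:38:23.585493"),
    ("harness|truthfulqa:mc|0", "2023-07-19T21:38:23.585493"),
    ("harness|drop|3", "2023-10-24T16-29-58.049517"),
]

for task_key, file_timestamp in examples:
    print(config_name_from_task(task_key), "->", details_glob(task_key, file_timestamp))
```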
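Features
The dataset's most notable features are its fine task granularity and its traceability over time. Rather than a single collection of scores, it breaks the details of each evaluation down by task into separate configurations, so researchers can examine the model's behavior on a specific benchmark (such as Winogrande or HellaSwag). Within each configuration, the timestamped splits record the results of each run, which gives a natural basis for comparing runs over time. The "results" configuration serves as the aggregation point: it reports summary metrics such as accuracy (acc) and F1 together with their standard errors in JSON format, which makes overall performance easy to grasp and ties the dataset to the Open LLM Leaderboard.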
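One way to use the reported standard errors is to turn them into approximate confidence intervals. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval. That choice is an assumption of this example and is not something the dataset card specifies. The function name is hypothetical.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple:
    """Return (low, high) for value +/- z * stderr under a normal approximation."""
    half_width = z * stderr
    return value - half_width, value + half_width


# Accuracy and standard error for WinoGrande (5-shot) from the latest run on this page.
winogrande_acc = 0.681136543014996
winogrande_stderr = 0.013097928420088771

low, high = confidence_interval(winogrande_acc, winogrande_stderr)
print(f"winogrande acc: {winogrande_acc:.3f} (95% CI {low:.3f} to {high:.3f})")
```

When comparing two models on the same task, overlapping intervals are a hint that the difference may not be meaningful, although a proper test would be needed to say so with confidence.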
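Usage
To analyze the model with this dataset, researchers can load it with the Hugging Face datasets library. Call `load_dataset` with the dataset name, then choose the configuration for the target task (such as "harness_winogrande_5") and the desired split (such as "train" or a specific timestamp) to get the detailed evaluation records for that task. To get the aggregated results for all tasks, load the "results" configuration, whose JSON fields can be parsed directly into the final score for each benchmark. This modular design lets users focus on a single task in detail or compare across tasks using the aggregated data, supporting work from model tuning to performance reporting.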
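A minimal sketch of the loading steps described above follows. Two points are assumptions. First, the repository name pattern `details_<org>__<model>` is inferred from the single repository named on this page. Second, the default split is "latest" because that is the split name listed in the configurations, while the prose mentions "train"; which of the two resolves may depend on the repository state. The organization is a parameter because this page lists the provider as open-llm-leaderboard-old while the card's own snippet uses open-llm-leaderboard. Both helper names are hypothetical.

```python
def details_repo_id(model_id: str, org: str = "open-llm-leaderboard") -> str:
    """Build the details repository id, e.g. '<org>/details_KoboldAI__GPT-NeoX-20B-Erebus'."""
    return f"{org}/details_{model_id.replace('/', '__')}"


def load_details(model_id: str, config_name: str, split: str = "latest",
                 org: str = "open-llm-leaderboard"):
    """Load one configuration of the details dataset (requires network access)."""
    # Imported lazily so that details_repo_id can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(details_repo_id(model_id, org), config_name, split=split)


print(details_repo_id("KoboldAI/GPT-NeoX-20B-Erebus"))

# Example calls (commented out because they download data):
# per_task = load_details("KoboldAI/GPT-NeoX-20B-Erebus", "harness_winogrande_5")
# aggregated = load_details("KoboldAI/GPT-NeoX-20B-Erebus", "results")
```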
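Background and Challenges
Background Overview
With large language models advancing rapidly, how to evaluate a model's capabilities systematically and fairly has become a shared concern of academia and industry. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to provide a standardized evaluation platform for open-source large language models. Its core question is how a unified task set and evaluation pipeline can reveal a model's actual performance on key dimensions such as reasoning, commonsense understanding, and mathematical problem solving. This dataset records the results of two evaluation runs of the KoboldAI/GPT-NeoX-20B-Erebus model between July and October 2023, covering established benchmarks including ARC, DROP, GSM8K, HellaSwag, MMLU, and WinoGrande, for a total of 64 configurations. As a community-driven evaluation resource, it gives model developers a reproducible performance baseline and promotes transparent, standardized evaluation of open-source large models, which in turn informs later model iteration and comparative research.
Current Challenges
The domain problem this dataset addresses is that performance evaluation of large language models has long suffered from narrow task coverage, inconsistent scoring standards, and results that are hard to reproduce. The Open LLM Leaderboard builds a multi-dimensional evaluation suite by integrating diverse tasks that cover commonsense reasoning, mathematical reasoning, reading comprehension, and knowledge-based question answering. The challenges in building it include the following. First, data formats and evaluation metrics differ considerably across tasks, so a unified interface and parsing logic are needed to accommodate the different benchmarks. Second, evaluation results are affected by hyperparameters, random seeds, and the hardware environment, so multiple runs and statistical error analysis are needed to ensure reliable results. Finally, as new models and tasks appear, the dataset's configurations and evaluation pipeline must be updated continuously to stay current and representative, which places high demands on data maintenance and version management.
Common Scenarios
Classic Use Cases
Within the evaluation ecosystem for large language models, this dataset serves as a standardized evaluation component of the Open LLM Leaderboard and is used to measure how well a model generalizes across diverse tasks. Its classic use cases cover benchmarks for commonsense reasoning (such as Winogrande), mathematical reasoning (such as GSM8K), reading comprehension (such as DROP), and multi-subject knowledge (such as HendrycksTest). By loading the evaluation results of a specific configuration, researchers can analyze the model's performance on a task in detail, using fine-grained metrics such as accuracy and F1, and so obtain a quantitative basis for model improvement. The dataset's structure makes comparison across models and across evaluation rounds possible, which makes it a key piece of infrastructure for transparent and reproducible evaluation of open-source large models.
Practical Applications
In practice, the dataset provides a basis for decisions about model selection and deployment. When companies or research institutions choose a language model for a specific scenario (such as educational question answering, customer service, or knowledge retrieval), they can use the fine-grained results recorded here to compare how different models perform on the target task. For example, to deploy a dialogue system for math tutoring, one can consult the accuracy on the GSM8K task directly and select a model with stronger mathematical reasoning. In addition, the dataset's timestamped version records make it possible to track how performance changes across model iterations, which helps monitor regressions or improvements and supports continuous quality control in industrial applications.
Derived Related Work
The dataset has given rise to a body of work on standardized and interpretable evaluation of large models. On one hand, its data structure informed later extensions of the Open LLM Leaderboard and encouraged more models (such as the LLaMA and Falcon series) to publish evaluation details in the same format, forming an open evaluation ecosystem. On the other hand, researchers have used its fine-grained results to propose analysis methods that are sensitive to task difficulty, for example comparing different models' scores on HendrycksTest subtasks to reveal gaps in a model's knowledge of specific subjects. The timestamped split mechanism has also motivated research on detecting performance regressions, providing a methodological basis for assessing model stability after continued training or fine-tuning.
The above content was collected and summarized by 遇见数据集.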