
open-llm-leaderboard/details_golaxy__gogpt-7b-bloom

Source: Hugging Face | updated 2023-10-14 | indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_golaxy__gogpt-7b-bloom
Resource description:
--- pretty_name: Evaluation run of golaxy/gogpt-7b-bloom dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [golaxy/gogpt-7b-bloom](https://huggingface.co/golaxy/gogpt-7b-bloom) on the [Open\ \ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset was created from 3 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration, \"results\", stores all the aggregated results of the\ \ runs (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can, for instance, do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_golaxy__gogpt-7b-bloom\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-14T21:01:38.341280](https://huggingface.co/datasets/open-llm-leaderboard/details_golaxy__gogpt-7b-bloom/blob/main/results_2023-10-14T21-01-38.341280.json) (note\ \ that there might be results for other tasks in the repository if successive evals didn't\ \ cover the same tasks.
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.2214765100671141,\n\ \ \"em_stderr\": 0.004252451287967787,\n \"f1\": 0.25772336409395996,\n\ \ \"f1_stderr\": 0.00428261897007673,\n \"acc\": 0.31452249408050514,\n\ \ \"acc_stderr\": 0.006788199951115784\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.2214765100671141,\n \"em_stderr\": 0.004252451287967787,\n\ \ \"f1\": 0.25772336409395996,\n \"f1_stderr\": 0.00428261897007673\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \"acc_stderr\"\ : 0.0\n },\n \"harness|winogrande|5\": {\n \"acc\": 0.6290449881610103,\n\ \ \"acc_stderr\": 0.013576399902231568\n }\n}\n```" repo_url: https://huggingface.co/golaxy/gogpt-7b-bloom leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|arc:challenge|25_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-07-31T10:56:27.356745.parquet' - config_name: harness_drop_3 data_files: - split: 2023_09_17T07_35_20.075381 path: - '**/details_harness|drop|3_2023-09-17T07-35-20.075381.parquet' - split: 2023_10_14T21_01_38.341280 path: - '**/details_harness|drop|3_2023-10-14T21-01-38.341280.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-14T21-01-38.341280.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_09_17T07_35_20.075381 path: - '**/details_harness|gsm8k|5_2023-09-17T07-35-20.075381.parquet' - split: 2023_10_14T21_01_38.341280 path: - '**/details_harness|gsm8k|5_2023-10-14T21-01-38.341280.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-14T21-01-38.341280.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hellaswag|10_2023-07-31T10:56:27.356745.parquet' - split: latest 
path: - '**/details_harness|hellaswag|10_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-31T10:56:27.356745.parquet' - 
'**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-31T10:56:27.356745.parquet' - 
'**/details_harness|hendrycksTest-miscellaneous|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-07-31T10:56:27.356745.parquet' - 
'**/details_harness|hendrycksTest-college_chemistry|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-31T10:56:27.356745.parquet' - 
'**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-management|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-07-31T10:56:27.356745.parquet' - 
'**/details_harness|hendrycksTest-professional_accounting|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-07-31T10:56:27.356745.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - 
'**/details_harness|hendrycksTest-business_ethics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - 
'**/details_harness|hendrycksTest-college_medicine|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-college_physics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-31T10:56:27.356745.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - 
'**/details_harness|hendrycksTest-human_sexuality|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-international_law|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-management|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-marketing|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-marketing|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 
2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - 
'**/details_harness|hendrycksTest-security_studies|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-sociology|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-virology|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-07-31T10:56:27.356745.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_07_31T10_56_27.356745 path: - '**/details_harness|truthfulqa:mc|0_2023-07-31T10:56:27.356745.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-07-31T10:56:27.356745.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_09_17T07_35_20.075381 path: - '**/details_harness|winogrande|5_2023-09-17T07-35-20.075381.parquet' - split: 2023_10_14T21_01_38.341280 path: - '**/details_harness|winogrande|5_2023-10-14T21-01-38.341280.parquet' - split: latest path: - 
'**/details_harness|winogrande|5_2023-10-14T21-01-38.341280.parquet' - config_name: results data_files: - split: 2023_07_31T10_56_27.356745 path: - results_2023-07-31T10:56:27.356745.parquet - split: 2023_09_17T07_35_20.075381 path: - results_2023-09-17T07-35-20.075381.parquet - split: 2023_10_14T21_01_38.341280 path: - results_2023-10-14T21-01-38.341280.parquet - split: latest path: - results_2023-10-14T21-01-38.341280.parquet ---

# Dataset Card for Evaluation run of golaxy/gogpt-7b-bloom

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/golaxy/gogpt-7b-bloom
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [golaxy/gogpt-7b-bloom](https://huggingface.co/golaxy/gogpt-7b-bloom) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 3 run(s). Each run appears as a split in the configurations it covered, the split being named using the timestamp of the run. The "latest" split always points to the most recent results.

An additional configuration "results" stores all the aggregated results of the runs (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# "latest" is the split name defined in the configs above; it points to the most recent run.
data = load_dataset(
    "open-llm-leaderboard/details_golaxy__gogpt-7b-bloom",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-10-14T21:01:38.341280](https://huggingface.co/datasets/open-llm-leaderboard/details_golaxy__gogpt-7b-bloom/blob/main/results_2023-10-14T21-01-38.341280.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split of each eval):

```python
{
    "all": {
        "em": 0.2214765100671141,
        "em_stderr": 0.004252451287967787,
        "f1": 0.25772336409395996,
        "f1_stderr": 0.00428261897007673,
        "acc": 0.31452249408050514,
        "acc_stderr": 0.006788199951115784
    },
    "harness|drop|3": {
        "em": 0.2214765100671141,
        "em_stderr": 0.004252451287967787,
        "f1": 0.25772336409395996,
        "f1_stderr": 0.00428261897007673
    },
    "harness|gsm8k|5": {
        "acc": 0.0,
        "acc_stderr": 0.0
    },
    "harness|winogrande|5": {
        "acc": 0.6290449881610103,
        "acc_stderr": 0.013576399902231568
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
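Provider:
open-llm-leaderboard
Summary of the original information

Dataset overview

Background

This dataset was created automatically during the evaluation run of the model golaxy/gogpt-7b-bloom for the Open LLM Leaderboard.

Dataset structure
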
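  • The dataset consists of 64 configurations, each corresponding to one evaluation task.
  • The dataset was created from 3 runs. Each run appears as a split in the configurations it covered, and the split is named with the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Loading example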

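```python
from datasets import load_dataset

# "latest" is the split name listed for this configuration; it points to the most recent run.
data = load_dataset(
    "open-llm-leaderboard/details_golaxy__gogpt-7b-bloom",
    "harness_winogrande_5",
    split="latest",
)
```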

Latest results

The latest results, from run 2023-10-14T21:01:38.341280:

```python
{
    "all": {
        "em": 0.2214765100671141,
        "em_stderr": 0.004252451287967787,
        "f1": 0.25772336409395996,
        "f1_stderr": 0.00428261897007673,
        "acc": 0.31452249408050514,
        "acc_stderr": 0.006788199951115784
    },
    "harness|drop|3": {
        "em": 0.2214765100671141,
        "em_stderr": 0.004252451287967787,
        "f1": 0.25772336409395996,
        "f1_stderr": 0.00428261897007673
    },
    "harness|gsm8k|5": {
        "acc": 0.0,
        "acc_stderr": 0.0
    },
    "harness|winogrande|5": {
        "acc": 0.6290449881610103,
        "acc_stderr": 0.013576399902231568
    }
}
```
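The "all" entry in these results is numerically consistent with an unweighted mean of each metric over the tasks that report it. The following is a minimal sketch under that assumption; the averaging rule is inferred from the numbers shown here, not taken from the leaderboard's code, and the `aggregate` helper is hypothetical.

```python
import math

# Per-task values copied from the run shown above.
per_task = {
    "harness|drop|3": {"em": 0.2214765100671141, "f1": 0.25772336409395996},
    "harness|gsm8k|5": {"acc": 0.0, "acc_stderr": 0.0},
    "harness|winogrande|5": {
        "acc": 0.6290449881610103,
        "acc_stderr": 0.013576399902231568,
    },
}


def aggregate(results):
    """Average each metric over the tasks that report it (assumed rule)."""
    sums, counts = {}, {}
    for metrics in results.values():
        for name, value in metrics.items():
            sums[name] = sums.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


all_metrics = aggregate(per_task)

# Compare against the reported "all" entry.
print(math.isclose(all_metrics["acc"], 0.31452249408050514))
print(math.isclose(all_metrics["acc_stderr"], 0.006788199951115784))
```

Because only DROP reports `em` and `f1`, those aggregate values equal the DROP values, while `acc` is the mean of the GSM8K and Winogrande accuracies.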

Configuration details

  • harness_arc_challenge_25

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|arc:challenge|25_2023-07-31T10:56:27.356745.parquet
  • harness_drop_3

    • Splits: 2023_09_17T07_35_20.075381, 2023_10_14T21_01_38.341280, latest
    • Paths: **/details_harness|drop|3_2023-09-17T07-35-20.075381.parquet, **/details_harness|drop|3_2023-10-14T21-01-38.341280.parquet
  • harness_gsm8k_5

    • Splits: 2023_09_17T07_35_20.075381, 2023_10_14T21_01_38.341280, latest
    • Paths: **/details_harness|gsm8k|5_2023-09-17T07-35-20.075381.parquet, **/details_harness|gsm8k|5_2023-10-14T21-01-38.341280.parquet
  • harness_hellaswag_10

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hellaswag|10_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Paths: multiple paths; see the original dataset card
  • harness_hendrycksTest_abstract_algebra_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_anatomy_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_astronomy_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_business_ethics_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_clinical_knowledge_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_college_biology_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_college_chemistry_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_college_computer_science_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_college_mathematics_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_college_medicine_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_college_physics_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_computer_security_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_conceptual_physics_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_econometrics_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-07-31T10:56:27.356745.parquet
  • harness_hendrycksTest_electrical_engineering_5

    • Splits: 2023_07_31T10_56_27.356745, latest
    • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-07-31T10:56:27.356745.parquet

The above is an overview of the dataset, covering its background, structure, a loading example, the latest results, and the details of the configurations listed.
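The configuration names listed above follow a visible pattern: the task key used in the results (for example `harness|arc:challenge|25`) with the characters `|`, `:` and `-` replaced by underscores. A small sketch of that mapping follows; the `config_name` helper is hypothetical and the rule is inferred from the names on this page, not from the leaderboard's source.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|drop|3' to its configuration name.

    Assumed rule: '|', ':' and '-' each become '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


for key in (
    "harness|drop|3",
    "harness|arc:challenge|25",
    "harness|hendrycksTest-abstract_algebra|5",
):
    print(key, "->", config_name(key))
```

This is convenient when iterating over the keys of the aggregated results and loading the matching per-task details.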

Collected summary
Dataset introduction
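Construction
The Open LLM Leaderboard provides a standardized platform for quantifying the performance of large language models. This dataset was generated automatically while evaluating the golaxy/gogpt-7b-bloom model with Hugging Face's evaluation framework. It contains 64 configurations, each corresponding to one evaluation task, such as ARC Challenge, DROP, or GSM8K. The data come from three separate runs. Each run's results are stored as a split named with the run's timestamp, and the 'latest' split always points to the most recent evaluation. An additional configuration named 'results' collects the aggregated metrics of all runs and is the basis for computing the model's overall standing on the leaderboard.
Features
The dataset is organized around 64 fine-grained configurations that separate the model's performance by task, ranging from commonsense reasoning (such as HellaSwag) to mathematical problem solving (such as GSM8K). The data for each configuration are stored in Parquet format, which keeps loading efficient. The dataset also has a time-series aspect: because several runs are retained and a 'latest' split is provided, researchers can trace how the model's scores changed across runs and fetch the most recent results directly. This makes run-to-run comparison straightforward and gives model iteration a clear data trail.
Usage
To analyze model performance with this dataset, load the evaluation details of a specific task with the Hugging Face datasets library. For example, call load_dataset with the dataset name and a configuration name (such as 'harness_winogrande_5') and set split to 'latest' to get the most recent results for that task. To analyze an earlier run, set split to the corresponding timestamp string. Loading the 'results' configuration gives the aggregated metrics across all tasks, so the same interface serves both per-task inspection and an overall view of the model.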
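The access pattern described above can be sketched as follows. The helper names `split_name` and `load_run` are hypothetical; the split naming rule (a run timestamp with `-` and `:` replaced by `_`) and the default split name "latest" are taken from the configuration list on this page. `load_dataset` is imported lazily because calling it requires the `datasets` package and network access.

```python
from typing import Optional

REPO = "open-llm-leaderboard/details_golaxy__gogpt-7b-bloom"


def split_name(run_timestamp: str) -> str:
    """Convert a run timestamp into the split name used by the configs."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def load_run(config: str, run_timestamp: Optional[str] = None):
    """Load one configuration; pass a timestamp to select a historical run."""
    from datasets import load_dataset  # lazy import: needs network access

    split = split_name(run_timestamp) if run_timestamp else "latest"
    return load_dataset(REPO, config, split=split)


print(split_name("2023-10-14T21:01:38.341280"))
```

For instance, `load_run("harness_winogrande_5", "2023-09-17T07:35:20.075381")` would request the September run for Winogrande, and `load_run("results")` would request the most recent aggregated metrics.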
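Background and challenges
Background overview
Evaluating the performance and generalization of large language models (LLMs) is a central concern in the field. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to provide a standardized, reproducible evaluation platform, addressing the lack of a common baseline for comparing models. This dataset records the evaluation results of the golaxy/gogpt-7b-bloom model on the leaderboard, covering ARC, HellaSwag, MMLU, GSM8K, DROP, and WinoGrande, with accuracy, F1 score, and exact match as metrics. It gives model developers transparent feedback on performance, supports side-by-side comparison of LLMs on commonsense reasoning, mathematical calculation, and reading comprehension, and places Chinese-oriented models such as the GoGPT series on the same benchmarks as other open models.
Current challenges
The problem this dataset addresses is the standardized evaluation of large language models, where the core difficulty is designing a benchmark suite that covers a broad range of abilities and reflects a model's real performance on complex tasks. Building it involves several technical difficulties. First, heterogeneous tasks from different sources (such as GSM8K for mathematical reasoning and WinoGrande for commonsense reasoning) must be integrated while keeping the evaluation diverse and fair. Second, version control and reproducibility requirements are strict: each evaluation run produces its own timestamped split, and a "latest" pointer is maintained to the most recent results, which complicates metadata management. Finally, for models such as gogpt-7b-bloom, the lack of cross-lingual evaluation is a bottleneck: the existing benchmarks are mostly in English, which makes it hard to assess models aimed at Chinese.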
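Common scenarios
Typical use cases
As a standardized evaluation component of the Open LLM Leaderboard, this dataset holds the fine-grained results of the golaxy/gogpt-7b-bloom model on several established benchmarks. Its 64 configurations cover ARC Challenge, HellaSwag commonsense reasoning, GSM8K mathematical problem solving, Winogrande coreference resolution, and the MMLU knowledge test spanning 57 subjects, giving researchers quantitative results on reasoning, commonsense understanding, mathematical calculation, and multi-subject knowledge. The detailed results of each run are stored in Parquet format, and the timestamped splits make it possible to look back at earlier results and track how performance changed between runs.
Academic problems addressed
The dataset responds to the lack of a standardized, reproducible evaluation setup for large language models. By running several mainstream benchmarks in one harness, it avoids the problem of results that cannot be compared because different studies used different evaluation tools, and it provides a common yardstick for comparing models. The fine-grained task configurations and result records let researchers pinpoint where a model is strong or weak and so guide improvements. Because the data are public, the evaluation can also be verified independently, which supports discussion of topics such as model generalization and knowledge transfer.
Derived and related work
The Open LLM Leaderboard has become a widely recognized ranking benchmark in the community and has prompted discussion of evaluation methodology, such as task difficulty calibration and the effect of few-shot settings. Fine-grained results of the kind recorded in this dataset can support analyses of how model scale relates to performance on specific tasks. The multi-task configuration structure is likewise relevant to research on cross-task transfer and on training strategies that target specific weaknesses.
The content above was collected and summarized by 遇见数据集.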