
open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2

Hugging Face · Updated 2023-10-28 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2
Resource description:
--- pretty_name: Evaluation run of Weyaxi/Luban-Marcoroni-13B-v2 dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [Weyaxi/Luban-Marcoroni-13B-v2](https://huggingface.co/Weyaxi/Luban-Marcoroni-13B-v2)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can, for instance, do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-28T11:01:27.302979](https://huggingface.co/datasets/open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2/blob/main/results_2023-10-28T11-01-27.302979.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each in the \"results\" configuration and in the \"latest\" split of\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.00776006711409396,\n\ \ \"em_stderr\": 0.0008986296432392762,\n \"f1\": 0.10253880033557114,\n\ \ \"f1_stderr\": 0.001982157556823196,\n \"acc\": 0.4344259989839472,\n\ \ \"acc_stderr\": 0.010037121788760327\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.00776006711409396,\n \"em_stderr\": 0.0008986296432392762,\n\ \ \"f1\": 0.10253880033557114,\n \"f1_stderr\": 0.001982157556823196\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.09931766489764973,\n \ \ \"acc_stderr\": 0.008238371412683973\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7695343330702447,\n \"acc_stderr\": 0.011835872164836682\n\ \ }\n}\n```" repo_url: https://huggingface.co/Weyaxi/Luban-Marcoroni-13B-v2 leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|arc:challenge|25_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-09-13T20-54-44.969205.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_28T11_01_27.302979 path: - '**/details_harness|drop|3_2023-10-28T11-01-27.302979.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-28T11-01-27.302979.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_28T11_01_27.302979 path: - '**/details_harness|gsm8k|5_2023-10-28T11-01-27.302979.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-28T11-01-27.302979.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hellaswag|10_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_09_13T20_54_44.969205 
path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-13T20-54-44.969205.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-13T20-54-44.969205.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-13T20-54-44.969205.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-13T20-54-44.969205.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-13T20-54-44.969205.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-13T20-54-44.969205.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-13T20-54-44.969205.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-management|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-virology|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-13T20-54-44.969205.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_09_13T20_54_44.969205 path: - '**/details_harness|truthfulqa:mc|0_2023-09-13T20-54-44.969205.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-09-13T20-54-44.969205.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_28T11_01_27.302979 path: - '**/details_harness|winogrande|5_2023-10-28T11-01-27.302979.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-28T11-01-27.302979.parquet' - config_name: results data_files: - split: 2023_09_13T20_54_44.969205 path: - results_2023-09-13T20-54-44.969205.parquet - split: 2023_10_28T11_01_27.302979 path: - results_2023-10-28T11-01-27.302979.parquet - split: latest path: - results_2023-10-28T11-01-27.302979.parquet --- # Dataset Card for Evaluation run of Weyaxi/Luban-Marcoroni-13B-v2 ## Dataset Description - 
**Homepage:**
- **Repository:** https://huggingface.co/Weyaxi/Luban-Marcoroni-13B-v2
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [Weyaxi/Luban-Marcoroni-13B-v2](https://huggingface.co/Weyaxi/Luban-Marcoroni-13B-v2) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2023-10-28T11:01:27.302979](https://huggingface.co/datasets/open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2/blob/main/results_2023-10-28T11-01-27.302979.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.00776006711409396,
        "em_stderr": 0.0008986296432392762,
        "f1": 0.10253880033557114,
        "f1_stderr": 0.001982157556823196,
        "acc": 0.4344259989839472,
        "acc_stderr": 0.010037121788760327
    },
    "harness|drop|3": {
        "em": 0.00776006711409396,
        "em_stderr": 0.0008986296432392762,
        "f1": 0.10253880033557114,
        "f1_stderr": 0.001982157556823196
    },
    "harness|gsm8k|5": {
        "acc": 0.09931766489764973,
        "acc_stderr": 0.008238371412683973
    },
    "harness|winogrande|5": {
        "acc": 0.7695343330702447,
        "acc_stderr": 0.011835872164836682
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
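Provider:
open-llm-leaderboard
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model Weyaxi/Luban-Marcoroni-13B-v2 on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 64 configurations, each corresponding to one evaluation task.
  • The dataset was created from 2 runs. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.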
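The split-naming rule in the list above can be sketched as follows. The helper names `split_name` and `results_filename` are hypothetical, and the mapping is inferred from the run timestamp `2023-10-28T11:01:27.302979` and the split and file names listed on this page, not taken from a documented API.

```python
def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name seen in the configs (inferred rule)."""
    # "2023-10-28T11:01:27.302979" becomes "2023_10_28T11_01_27.302979":
    # dashes and colons turn into underscores, the fractional seconds stay as-is.
    return run_timestamp.replace("-", "_").replace(":", "_")


def results_filename(run_timestamp: str) -> str:
    """Map a run timestamp to the aggregated-results parquet name (inferred rule)."""
    # File names keep the dashes of the date and use dashes in place of colons.
    return "results_" + run_timestamp.replace(":", "-") + ".parquet"


run = "2023-10-28T11:01:27.302979"
print(split_name(run))
print(results_filename(run))
```

Both helpers are pure string transformations, so they can be checked without downloading anything.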

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2",
    "harness_winogrande_5",
    split="train",
)
```

Latest results

These are the latest results, from run 2023-10-28T11:01:27.302979:

```python
{
    "all": {
        "em": 0.00776006711409396,
        "em_stderr": 0.0008986296432392762,
        "f1": 0.10253880033557114,
        "f1_stderr": 0.001982157556823196,
        "acc": 0.4344259989839472,
        "acc_stderr": 0.010037121788760327
    },
    "harness|drop|3": {
        "em": 0.00776006711409396,
        "em_stderr": 0.0008986296432392762,
        "f1": 0.10253880033557114,
        "f1_stderr": 0.001982157556823196
    },
    "harness|gsm8k|5": {
        "acc": 0.09931766489764973,
        "acc_stderr": 0.008238371412683973
    },
    "harness|winogrande|5": {
        "acc": 0.7695343330702447,
        "acc_stderr": 0.011835872164836682
    }
}
```
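The figures in the "all" entry are consistent with an unweighted mean of each metric over the tasks that report it. This is an observation about these particular numbers, not a documented description of how the leaderboard aggregates; the `aggregate` helper below is a hypothetical sketch for checking it.

```python
from statistics import mean

# Per-task metrics copied from the results above.
results = {
    "harness|drop|3": {
        "em": 0.00776006711409396,
        "em_stderr": 0.0008986296432392762,
        "f1": 0.10253880033557114,
        "f1_stderr": 0.001982157556823196,
    },
    "harness|gsm8k|5": {
        "acc": 0.09931766489764973,
        "acc_stderr": 0.008238371412683973,
    },
    "harness|winogrande|5": {
        "acc": 0.7695343330702447,
        "acc_stderr": 0.011835872164836682,
    },
}


def aggregate(per_task: dict) -> dict:
    """Average each metric over the tasks that report it (unweighted)."""
    collected = {}
    for metrics in per_task.values():
        for name, value in metrics.items():
            collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}


overall = aggregate(results)
print(overall["acc"], overall["acc_stderr"])
```

Only GSM8K and Winogrande report `acc`, so the overall accuracy is the mean of those two values, while `em` and `f1` come from DROP alone.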

Configuration details

The dataset's configurations are detailed below:

Configuration list

  • harness_arc_challenge_25

    • Split: 2023_09_13T20_54_44.969205
    • Path: **/details_harness|arc:challenge|25_2023-09-13T20-54-44.969205.parquet
    • Split: latest
    • Path: **/details_harness|arc:challenge|25_2023-09-13T20-54-44.969205.parquet
  • harness_drop_3

    • Split: 2023_10_28T11_01_27.302979
    • Path: **/details_harness|drop|3_2023-10-28T11-01-27.302979.parquet
    • Split: latest
    • Path: **/details_harness|drop|3_2023-10-28T11-01-27.302979.parquet
  • harness_gsm8k_5

    • Split: 2023_10_28T11_01_27.302979
    • Path: **/details_harness|gsm8k|5_2023-10-28T11-01-27.302979.parquet
    • Split: latest
    • Path: **/details_harness|gsm8k|5_2023-10-28T11-01-27.302979.parquet
  • harness_hellaswag_10

    • Split: 2023_09_13T20_54_44.969205
    • Path: **/details_harness|hellaswag|10_2023-09-13T20-54-44.969205.parquet
    • Split: latest
    • Path: **/details_harness|hellaswag|10_2023-09-13T20-54-44.969205.parquet
  • harness_hendrycksTest_5

    • Split: 2023_09_13T20_54_44.969205
    • Path:
      • **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-anatomy|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-astronomy|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-business_ethics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-college_biology|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-college_chemistry|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-college_computer_science|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-college_mathematics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-college_medicine|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-college_physics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-computer_security|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-econometrics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-electrical_engineering|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-formal_logic|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-global_facts|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_biology|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_european_history|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_geography|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_physics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_psychology|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_statistics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_us_history|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-high_school_world_history|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-human_aging|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-human_sexuality|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-international_law|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-jurisprudence|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-logical_fallacies|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-machine_learning|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-management|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-marketing|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-medical_genetics|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-miscellaneous|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-moral_disputes|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-moral_scenarios|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-nutrition|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-philosophy|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-prehistory|5_2023-09-13T20-54-44.969205.parquet
      • **/details_harness|hendrycksTest-professional_accounting|5_2023-09-13T20-54-44.969205.parquet
      • `**/details_harness|hendrycksTest-professional_law|5_2023-09-
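Collected summary
Dataset introduction
Construction
In large language model evaluation, the Open LLM Leaderboard is a widely referenced benchmark platform, and the datasets holding its evaluation results are a useful reference. This dataset was generated automatically while evaluating the model Weyaxi/Luban-Marcoroni-13B-v2. It comprises 64 configurations, each corresponding to a specific evaluation task such as ARC-Challenge, DROP, or GSM8K. The evaluation was run twice; each run's results are stored as a separate split under the corresponding configuration, named after the run's timestamp. The "train" split always points to the results of the latest run. A separate configuration named "results" collects the aggregated metrics of all runs, which are used to compute and display the model's overall performance on the leaderboard. The data is stored in Parquet format, which supports efficient reading and processing.
Features
The dataset is clearly structured and kept current. Structurally, its multi-configuration design separates the results of different evaluation tasks, and each task is further divided into splits by run timestamp, so researchers can trace how the model's performance changed between runs. This hierarchical organization makes fine-grained, task-level analysis possible. For currency, the "train" split automatically points to the latest results, so users always get the most recent evaluation data. The "results" configuration provides global aggregated metrics, including accuracy (acc) and F1 score along with their standard errors, which gives a standardized basis for comparing models. The dataset also contains concrete results, such as 76.95% accuracy on the Winogrande commonsense reasoning task, that show the model's performance on specific tasks.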
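The standard errors reported next to each metric can be turned into an approximate uncertainty range. The sketch below assumes a normal approximation and the conventional 1.96 multiplier for a 95% interval; both are assumptions made here for illustration and are not stated in the source, and `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple:
    """Return (low, high) for value +/- z * stderr, assuming a normal approximation."""
    half_width = z * stderr
    return value - half_width, value + half_width


# Winogrande accuracy and its standard error, copied from the latest results.
low, high = confidence_interval(0.7695343330702447, 0.011835872164836682)
print(f"Winogrande acc, approximate 95% interval: {low:.4f} to {high:.4f}")
```

Read this way, the 76.95% Winogrande accuracy carries an uncertainty of roughly two percentage points in either direction, which matters when comparing models whose scores are close.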
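Usage
Researchers can load and use this dataset through the Hugging Face datasets library. Calling the load_dataset function with the dataset name, the target configuration name, and the desired split name returns the detailed evaluation results for a given task in a given run. For example, to load the latest results for the Winogrande task, set the configuration name to "harness_winogrande_5" and the split to "train". Once loaded, the data, which is stored as Parquet, converts easily to common structures such as a Pandas DataFrame for statistical analysis or visualization. To compare results across runs, specify a timestamp split name such as "2023_10_28T11_01_27.302979" to retrieve the data of that run. This flexible loading mechanism supports research needs ranging from fine-grained single-task analysis to cross-task evaluation.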
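The configuration names follow a visible pattern relative to the task keys in the results (for example `harness|winogrande|5` and `harness_winogrande_5`). The sketch below infers that pattern from the names listed on this page; `config_name` and `load_args` are hypothetical helpers, and the rule is an assumption rather than a documented convention.

```python
import re

REPO_ID = "open-llm-leaderboard/details_Weyaxi__Luban-Marcoroni-13B-v2"


def config_name(task_key: str) -> str:
    """Turn a task key such as 'harness|arc:challenge|25' into a config name."""
    # Every character that is not a letter, digit, or underscore becomes "_".
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def load_args(task_key: str, split: str = "train") -> dict:
    """Build keyword arguments for datasets.load_dataset (no download happens here)."""
    return {"path": REPO_ID, "name": config_name(task_key), "split": split}


print(load_args("harness|winogrande|5"))
print(load_args("harness|gsm8k|5", split="2023_10_28T11_01_27.302979"))
```

With the datasets library installed and network access available, the resulting dictionary could be passed as `load_dataset(**load_args("harness|winogrande|5"))`.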
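Background and challenges
Background overview
With the rapid development of large language model (LLM) technology, evaluating model performance fairly and comprehensively has become a core problem for the field. The Open LLM Leaderboard was created by the Hugging Face team in 2023 to give the community a standardized, transparent evaluation platform. This dataset was generated automatically from the evaluation results of the Weyaxi/Luban-Marcoroni-13B-v2 model and covers several benchmark tasks, including the ARC Challenge, DROP, GSM8K, HellaSwag, and MMLU, spanning commonsense reasoning, mathematical computation, reading comprehension, and multidisciplinary knowledge. Its core research question is to reveal, through a multi-dimensional, multi-task evaluation framework, the capabilities and limits of 13B-parameter models on diverse natural language processing tasks, providing a reliable basis for model optimization and comparison. The dataset has helped standardize LLM evaluation and serves as a reference for researchers measuring model progress.
Current challenges
The first challenge is the complexity of the domain problem: LLM evaluation must cover heterogeneous tasks such as reasoning, knowledge, mathematics, and language understanding at the same time, and no single metric fully reflects a model's real ability. For example, accuracy below 10% on GSM8K exposes a marked weakness in mathematical reasoning. Challenges in construction center on data standardization and reproducibility: evaluation results are generated automatically by multiple runs, and splits with different timestamps must be loaded in a specific way before they can be aligned, which complicates data integration. In addition, there are many task configurations (MMLU alone covers 57 subfields), and version differences in the evaluation pipeline can cause results to fluctuate, which places strict demands on keeping evaluations consistent.
Common scenarios
Typical use cases
In natural language processing and LLM evaluation research, the evaluation-run datasets on the Open LLM Leaderboard play a core benchmarking role. This dataset was built for the automated evaluation of the model Weyaxi/Luban-Marcoroni-13B-v2 and covers 64 task configurations, including commonsense reasoning (such as Winogrande), math problem solving (GSM8K), reading comprehension (DROP), and multi-domain knowledge (HendrycksTest). Researchers can load the fine-grained results of each task through the Hugging Face Datasets library and systematically examine the model's performance across capability dimensions, which makes the dataset a standardized tool for measuring overall model performance.
Derived and related work
Around this evaluation dataset, a body of work has emerged that advances evaluation paradigms for language models. Typical examples include analyses of capability transfer based on its multi-task configurations and studies that use the aggregated result metrics to improve leaderboard ranking. The dataset's fine-grained breakdown (such as HendrycksTest split by subject) has prompted specialized evaluation benchmarks for particular knowledge domains such as medicine and law. More broadly, its automated creation process has been widely adopted by later model evaluation frameworks, forming a reusable standard for evaluation pipelines and making it an integral part of the Open LLM Leaderboard ecosystem.
Recent research on the dataset
Latest research directions
As competition among large language models (LLMs) intensifies, the Open LLM Leaderboard has become a key benchmark platform for measuring overall model capability. This dataset focuses on the evaluation results of the Weyaxi/Luban-Marcoroni-13B-v2 model, and the research around it centers on standardized multi-task testing and quantitative analysis of model generalization. By combining diverse tasks such as ARC reasoning, DROP reading comprehension, GSM8K mathematical reasoning, and Winogrande commonsense reasoning, researchers can analyze the strengths and limits of 13B-parameter models within a single framework. Current attention is on the gap between performance on complex reasoning tasks (the DROP exact-match rate is only 0.78%) and on commonsense reasoning (Winogrande accuracy reaches 76.95%). This gap points to a bottleneck in fine-grained semantic understanding and has motivated work on mixed training strategies and knowledge distillation. As an official evaluation record of the Open LLM Leaderboard, the dataset gives the community a reproducible, standardized evaluation pattern. It makes comparisons between models more transparent, prompts researchers to re-examine the balance between reasoning ability and memorized knowledge, and offers guidance for developing the next generation of efficient LLMs.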
The above content was collected and summarized by 遇见数据集.