
open-llm-leaderboard-old/details_posicube__Llama2-chat-AYB-13B

Source: Hugging Face. Updated 2023-10-24. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_posicube__Llama2-chat-AYB-13B
Resource description:
--- pretty_name: Evaluation run of posicube/Llama2-chat-AYB-13B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [posicube/Llama2-chat-AYB-13B](https://huggingface.co/posicube/Llama2-chat-AYB-13B)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ runs (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_posicube__Llama2-chat-AYB-13B\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-24T15:23:04.071945](https://huggingface.co/datasets/open-llm-leaderboard/details_posicube__Llama2-chat-AYB-13B/blob/main/results_2023-10-24T15-23-04.071945.json)\ \ (note that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.10906040268456375,\n\ \ \"em_stderr\": 0.0031922531959087046,\n \"f1\": 0.20405201342281792,\n\ \ \"f1_stderr\": 0.003418767120803739,\n \"acc\": 0.4376976530855872,\n\ \ \"acc_stderr\": 0.010340318967318105\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.10906040268456375,\n \"em_stderr\": 0.0031922531959087046,\n\ \ \"f1\": 0.20405201342281792,\n \"f1_stderr\": 0.003418767120803739\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.11296436694465505,\n \ \ \"acc_stderr\": 0.008719339028833057\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7624309392265194,\n \"acc_stderr\": 0.011961298905803153\n\ \ }\n}\n```" repo_url: https://huggingface.co/posicube/Llama2-chat-AYB-13B leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|arc:challenge|25_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-10-04T07-48-01.042889.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_24T15_23_04.071945 path: - '**/details_harness|drop|3_2023-10-24T15-23-04.071945.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-24T15-23-04.071945.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_24T15_23_04.071945 path: - '**/details_harness|gsm8k|5_2023-10-24T15-23-04.071945.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-24T15-23-04.071945.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hellaswag|10_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_10_04T07_48_01.042889 
path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T07-48-01.042889.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T07-48-01.042889.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T07-48-01.042889.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T07-48-01.042889.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T07-48-01.042889.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-04T07-48-01.042889.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T07-48-01.042889.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-management|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-virology|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T07-48-01.042889.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_10_04T07_48_01.042889 path: - '**/details_harness|truthfulqa:mc|0_2023-10-04T07-48-01.042889.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-10-04T07-48-01.042889.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_24T15_23_04.071945 path: - '**/details_harness|winogrande|5_2023-10-24T15-23-04.071945.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-24T15-23-04.071945.parquet' - config_name: results data_files: - split: 2023_10_04T07_48_01.042889 path: - results_2023-10-04T07-48-01.042889.parquet - split: 2023_10_24T15_23_04.071945 path: - results_2023-10-24T15-23-04.071945.parquet - split: latest path: - results_2023-10-24T15-23-04.071945.parquet --- # Dataset Card for Evaluation run of posicube/Llama2-chat-AYB-13B ## Dataset Description - 
**Homepage:**
- **Repository:** https://huggingface.co/posicube/Llama2-chat-AYB-13B
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [posicube/Llama2-chat-AYB-13B](https://huggingface.co/posicube/Llama2-chat-AYB-13B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The configs listed in this card define the splits as a run timestamp and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_posicube__Llama2-chat-AYB-13B",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-10-24T15:23:04.071945](https://huggingface.co/datasets/open-llm-leaderboard/details_posicube__Llama2-chat-AYB-13B/blob/main/results_2023-10-24T15-23-04.071945.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.10906040268456375,
        "em_stderr": 0.0031922531959087046,
        "f1": 0.20405201342281792,
        "f1_stderr": 0.003418767120803739,
        "acc": 0.4376976530855872,
        "acc_stderr": 0.010340318967318105
    },
    "harness|drop|3": {
        "em": 0.10906040268456375,
        "em_stderr": 0.0031922531959087046,
        "f1": 0.20405201342281792,
        "f1_stderr": 0.003418767120803739
    },
    "harness|gsm8k|5": {
        "acc": 0.11296436694465505,
        "acc_stderr": 0.008719339028833057
    },
    "harness|winogrande|5": {
        "acc": 0.7624309392265194,
        "acc_stderr": 0.011961298905803153
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
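Provider:
open-llm-leaderboard-old
Summary of Original Information

Dataset Overview

This dataset was created automatically during the evaluation run of the model posicube/Llama2-chat-AYB-13B on the Open LLM Leaderboard.

Dataset Composition

  • The dataset contains 64 configurations, each corresponding to one evaluation task.
  • The dataset was created from 2 runs. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.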
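
The split and file names listed on this page appear to be derived from the run timestamp: split names replace `-` and `:` with `_`, while file names replace only `:` with `-`. The sketch below reproduces that pattern. The helper names `run_split_name` and `run_file_stamp` are hypothetical, and the rule is inferred from the names shown here, not taken from the leaderboard's own code.

```python
from datetime import datetime


def run_split_name(run_timestamp: str) -> str:
    """Return the split name for a run timestamp (inferred naming rule)."""
    # Fail early on anything that is not an ISO-8601 timestamp.
    datetime.fromisoformat(run_timestamp)
    return run_timestamp.replace("-", "_").replace(":", "_")


def run_file_stamp(run_timestamp: str) -> str:
    """Return the timestamp as it appears in the parquet and JSON file names."""
    datetime.fromisoformat(run_timestamp)
    return run_timestamp.replace(":", "-")


if __name__ == "__main__":
    ts = "2023-10-24T15:23:04.071945"
    print(run_split_name(ts))
    print(run_file_stamp(ts))
```

For the run `2023-10-24T15:23:04.071945`, the two helpers yield the split name and the file-name suffix that appear in the configuration list.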

Data Loading Example

    from datasets import load_dataset

    # The configurations listed below define the splits as a run timestamp and "latest".
    data = load_dataset(
        "open-llm-leaderboard/details_posicube__Llama2-chat-AYB-13B",
        "harness_winogrande_5",
        split="latest",
    )

Latest Results

The latest results below come from run 2023-10-24T15:23:04.071945:

    {
        "all": {
            "em": 0.10906040268456375,
            "em_stderr": 0.0031922531959087046,
            "f1": 0.20405201342281792,
            "f1_stderr": 0.003418767120803739,
            "acc": 0.4376976530855872,
            "acc_stderr": 0.010340318967318105
        },
        "harness|drop|3": {
            "em": 0.10906040268456375,
            "em_stderr": 0.0031922531959087046,
            "f1": 0.20405201342281792,
            "f1_stderr": 0.003418767120803739
        },
        "harness|gsm8k|5": {
            "acc": 0.11296436694465505,
            "acc_stderr": 0.008719339028833057
        },
        "harness|winogrande|5": {
            "acc": 0.7624309392265194,
            "acc_stderr": 0.011961298905803153
        }
    }
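
The figures above suggest that each entry under "all" is the unweighted mean of that metric over the tasks that report it. For example, "acc" matches the average of the GSM8K and WinoGrande accuracies. The sketch below checks this. The `aggregate` helper is hypothetical and reflects an observation about these numbers, not the leaderboard's documented aggregation code.

```python
import math
from statistics import mean

# Values copied from the results block above.
results = {
    "all": {
        "em": 0.10906040268456375,
        "em_stderr": 0.0031922531959087046,
        "f1": 0.20405201342281792,
        "f1_stderr": 0.003418767120803739,
        "acc": 0.4376976530855872,
        "acc_stderr": 0.010340318967318105,
    },
    "harness|drop|3": {
        "em": 0.10906040268456375,
        "em_stderr": 0.0031922531959087046,
        "f1": 0.20405201342281792,
        "f1_stderr": 0.003418767120803739,
    },
    "harness|gsm8k|5": {"acc": 0.11296436694465505, "acc_stderr": 0.008719339028833057},
    "harness|winogrande|5": {"acc": 0.7624309392265194, "acc_stderr": 0.011961298905803153},
}


def aggregate(task_results: dict) -> dict:
    """Average every metric over the tasks that report it (the "all" key is skipped)."""
    per_task = {k: v for k, v in task_results.items() if k != "all"}
    metrics = sorted({m for scores in per_task.values() for m in scores})
    return {m: mean(s[m] for s in per_task.values() if m in s) for m in metrics}


recomputed = aggregate(results)
matches = {m: math.isclose(recomputed[m], results["all"][m], rel_tol=1e-12) for m in recomputed}

if __name__ == "__main__":
    for metric, ok in matches.items():
        print(f"{metric}: recomputed={recomputed[metric]:.6f} matches_all={ok}")
```

Because "acc" mixes a math benchmark with a commonsense benchmark, the aggregated 0.4377 says little on its own. The per-task entries are the more informative numbers.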

Configuration Details

  • harness_arc_challenge_25

    • Split: 2023_10_04T07_48_01.042889
      • Path: **/details_harness|arc:challenge|25_2023-10-04T07-48-01.042889.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-10-04T07-48-01.042889.parquet
  • harness_drop_3

    • Split: 2023_10_24T15_23_04.071945
      • Path: **/details_harness|drop|3_2023-10-24T15-23-04.071945.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-10-24T15-23-04.071945.parquet
  • harness_gsm8k_5

    • Split: 2023_10_24T15_23_04.071945
      • Path: **/details_harness|gsm8k|5_2023-10-24T15-23-04.071945.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-10-24T15-23-04.071945.parquet
  • harness_hellaswag_10

    • Split: 2023_10_04T07_48_01.042889
      • Path: **/details_harness|hellaswag|10_2023-10-04T07-48-01.042889.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-10-04T07-48-01.042889.parquet
  • harness_hendrycksTest_5

    • Split: 2023_10_04T07_48_01.042889
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-global_facts|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-human_aging|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-international_law|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-machine_learning|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-management|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-marketing|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-nutrition|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-philosophy|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-prehistory|5_2023-10-04T07-48-01.042889.parquet
        • **/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T07-48-01.042889.parquet
        • `**/details_harness|hendrycksTest-professional_law|5_2023-10-04T07-48-01
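
The configuration names and path patterns above follow a regular scheme: a harness task key such as `harness|arc:challenge|25` becomes the configuration name `harness_arc_challenge_25`, and the parquet glob embeds the task key together with the run's file timestamp. A minimal sketch of that mapping follows. The function names `config_name` and `details_glob` are hypothetical, and the rule is inferred from the entries listed here.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a harness task key into a configuration name.

    Every run of characters other than ASCII letters and digits becomes "_".
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", task_key)


def details_glob(task_key: str, file_stamp: str) -> str:
    """Build the glob pattern under which a task's detail file is listed."""
    return f"**/details_{task_key}_{file_stamp}.parquet"


if __name__ == "__main__":
    key = "harness|arc:challenge|25"
    print(config_name(key))
    print(details_glob(key, "2023-10-04T07-48-01.042889"))
```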
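Collected Summary
Dataset Introduction
Construction Method
In the context of large language model evaluation, this dataset was generated automatically from the evaluation of the posicube/Llama2-chat-AYB-13B model on the Open LLM Leaderboard. It is built from two separate evaluation runs. The results of each run are organized as a specific split named by timestamp, while the 'latest' split always points to the most recent evaluation results. The dataset contains 64 configurations, each corresponding to one evaluation task, such as ARC Challenge, DROP, GSM8K, WinoGrande, and HendrycksTest, which covers 57 subjects. In addition, a configuration named 'results' stores the aggregated metrics of all runs, which are used to compute and display overall performance on the Leaderboard. All data is stored in Parquet format, which supports efficient reading and processing.
Usage
Researchers can load the dataset with the Hugging Face datasets library. For example, to load the latest evaluation details for a specific task such as WinoGrande, call load_dataset with the configuration name 'harness_winogrande_5' and the split 'latest'. To access data from an earlier run, replace the split argument with the corresponding timestamp string. For cross-task analysis, the 'results' configuration provides the aggregated results for all tasks, giving a quick view of the model's overall capability. Because the data is stored in Parquet format, it can be read directly into a DataFrame for statistical analysis and visualization. This design makes the dataset suitable both for fine-grained, task-level research and for high-level model benchmarking.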
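
The usage pattern described above can be sketched as follows. This is a minimal sketch under stated assumptions: the repository id is the one used in the dataset card's own example (the page lists the dataset under `open-llm-leaderboard-old`, and which id currently resolves has not been verified here), and the helper names `load_args` and `load_as_dataframe` are hypothetical.

```python
# Repository id taken from the card's example; treat it as an assumption.
REPO_ID = "open-llm-leaderboard/details_posicube__Llama2-chat-AYB-13B"


def load_args(config: str, run_split: str = "latest") -> dict:
    """Collect the keyword arguments for datasets.load_dataset."""
    return {"path": REPO_ID, "name": config, "split": run_split}


def load_as_dataframe(config: str, run_split: str = "latest"):
    """Download one configuration and split, and return it as a pandas DataFrame."""
    # Imported lazily so the helpers above can be used without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(**load_args(config, run_split))
    return dataset.to_pandas()


if __name__ == "__main__":
    # Latest WinoGrande details, then the aggregated results of the first run.
    winogrande = load_as_dataframe("harness_winogrande_5")
    first_run = load_as_dataframe("results", "2023_10_04T07_48_01.042889")
    print(winogrande.shape, first_run.shape)
```

Passing a timestamp split pins the analysis to one run, whereas 'latest' follows whichever run was uploaded most recently.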
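Background and Challenges
Background Overview
As large language models develop rapidly, how to evaluate a model's overall capability systematically has become a shared concern of academia and industry. The Open LLM Leaderboard project, launched by the Hugging Face team in 2023, aims to provide a standardized, reproducible evaluation benchmark for open-source large language models. This dataset records the leaderboard evaluation runs of the Llama2-chat-AYB-13B model developed by the posicube team. It was created under the lead of Clémentine Fourrier and others at Hugging Face and completed in October 2023. Its core research question is to quantify, through an automated multi-task, multi-configuration evaluation pipeline, the model's performance on tasks such as ARC Challenge, DROP, GSM8K, HellaSwag, and MMLU, which span reasoning, commonsense, mathematics, and knowledge understanding. The dataset gives model developers a fine-grained performance profile and supports the open-source community's push for transparent and comparable LLM evaluation, making it a reference point for later model iterations and comparisons.
Current Challenges
The domain problem this dataset addresses is providing multi-dimensional, standardized performance measurement for large language models, to overcome the one-sidedness of traditional single-task evaluation. The specific challenges are: 1) the breadth and depth of task coverage: difficulty and representativeness must be balanced across dozens of subtasks in ARC, DROP, GSM8K, HellaSwag, and MMLU so that the results reflect the model's real level in reasoning, question answering, mathematics, and knowledge; 2) the automation and reproducibility of the evaluation process: the pipeline must handle different task configurations (such as the number of few-shot examples), unified data formats, and result aggregation, and must avoid result deviations caused by environment differences; 3) version management and consistency in data construction: the splits produced by multiple runs must be strictly distinguished by timestamp while a "latest" split is maintained to point to the most recent results, which places strict demands on the design of the storage structure and the update logic.
Common Scenarios
Classic Use Cases
In large language model evaluation, this dataset serves as an evaluation record of the Open LLM Leaderboard and is widely used for standardized assessment of model performance across diverse tasks. Its classic use cases cover benchmarks such as ARC Challenge, HellaSwag, MMLU, GSM8K, and WinoGrande. By recording fine-grained performance metrics for reasoning, commonsense understanding, mathematical problem solving, and commonsense disambiguation, it gives researchers a reproducible baseline for side-by-side comparison and reveals a model's strengths and limitations across different cognitive abilities.
Academic Problems Addressed
The dataset addresses the lack of unified, transparent, and reproducible measurement standards in large language model evaluation. By storing the detailed logs and aggregated results of each evaluation run, it lets researchers track a model's performance across iterations and quantify the marginal gains from fine-tuning strategies, data augmentation, or architecture changes. Its significance lies in establishing a standardized frame of reference for mapping model capabilities, encouraging a shift from reliance on a single metric to multi-dimensional capability profiles, and providing a data foundation for understanding the limits of model generalization and robustness.
Practical Applications
In practice, model developers and platform operators use this dataset for automated model selection and quality monitoring. For example, if a new model version raises GSM8K accuracy from 11.3% to 15%, the detailed logs recorded in the dataset can help determine whether the gain comes from stronger mathematical reasoning or from overfitting to lexical patterns. In addition, companies can build performance prediction models from historical evaluation data to anticipate how well models of different parameter scales fit particular scenarios, and so optimize resource allocation and deployment decisions.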
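
Before reading much into a change such as 11.3% to 15% on GSM8K, it helps to compare the gap with the reported standard errors. The sketch below computes a simple z-score for the difference. It is an illustration under stated assumptions: the 15% run is hypothetical, its standard error is estimated from the binomial formula with an assumed test-set size of 1319 questions, and the two runs are treated as independent, which is only an approximation when both use the same questions.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Approximate standard error of an accuracy measured on n questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """z-score of the difference acc_b - acc_a, treating the two runs as independent."""
    return (acc_b - acc_a) / math.sqrt(se_a ** 2 + se_b ** 2)


# Reported GSM8K values from the latest results on this page.
old_acc, old_se = 0.11296436694465505, 0.008719339028833057

# Hypothetical improved run; the sample size is an assumption, not a value from this page.
new_acc = 0.15
ASSUMED_N = 1319
new_se = binomial_stderr(new_acc, ASSUMED_N)

z = z_score(old_acc, old_se, new_acc, new_se)

if __name__ == "__main__":
    # |z| above about 1.96 corresponds to significance at the 5% level (two-sided).
    print(f"z = {z:.2f}")
```

A large z-score only indicates that the gap is unlikely to be sampling noise. Deciding between genuine reasoning gains and pattern overfitting still requires inspecting the per-question details.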
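Recent Research on the Dataset
Latest Research Directions
In the large language model (LLM) field, standardized evaluation and comparison of model performance has become a central topic. Taking the Open LLM Leaderboard evaluation data of the posicube/Llama2-chat-AYB-13B model as a starting point, this dataset records the model's performance on tasks spanning commonsense reasoning (such as WinoGrande), math problem solving (GSM8K), and reading comprehension (DROP). Its metrics (such as acc and f1) are reported together with their standard errors, giving researchers a fine-grained view of performance. This direction is closely tied to the community's recent interest in model robustness, generalization, and transfer learning across tasks. For example, the large gap between 11.3% accuracy on GSM8K and 76.2% accuracy on WinoGrande points to an imbalance between the model's commonsense reasoning and its mathematical computation abilities. The dataset provides a reproducible baseline for model iteration and encourages targeted optimization of specific weaknesses such as mathematical reasoning, steering LLMs toward more balanced and more reliable practical use.
The content above was collected and summarized by 遇见数据集.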