
open-llm-leaderboard/details_NewstaR__Koss-7B-chat

Source: Hugging Face. Updated 2023-10-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_NewstaR__Koss-7B-chat
Resource description:
--- pretty_name: Evaluation run of NewstaR/Koss-7B-chat dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [NewstaR/Koss-7B-chat](https://huggingface.co/NewstaR/Koss-7B-chat) on the [Open\ \ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the most recent results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ runs (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_NewstaR__Koss-7B-chat\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-23T08:06:32.820862](https://huggingface.co/datasets/open-llm-leaderboard/details_NewstaR__Koss-7B-chat/blob/main/results_2023-10-23T08-06-32.820862.json)(note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.06333892617449664,\n\ \ \"em_stderr\": 0.002494400790190545,\n \"f1\": 0.12617449664429503,\n\ \ \"f1_stderr\": 0.002812859883562843,\n \"acc\": 0.39549166962367155,\n\ \ \"acc_stderr\": 0.009921949302668327\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.06333892617449664,\n \"em_stderr\": 0.002494400790190545,\n\ \ \"f1\": 0.12617449664429503,\n \"f1_stderr\": 0.002812859883562843\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.07354056103108415,\n \ \ \"acc_stderr\": 0.0071898357543652685\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.7174427782162589,\n \"acc_stderr\": 0.012654062850971384\n\ \ }\n}\n```" repo_url: https://huggingface.co/NewstaR/Koss-7B-chat leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|arc:challenge|25_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-10-04T03-19-48.694479.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_23T08_06_32.820862 path: - '**/details_harness|drop|3_2023-10-23T08-06-32.820862.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-23T08-06-32.820862.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_23T08_06_32.820862 path: - '**/details_harness|gsm8k|5_2023-10-23T08-06-32.820862.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-23T08-06-32.820862.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hellaswag|10_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - 
'**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T03-19-48.694479.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T03-19-48.694479.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T03-19-48.694479.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T03-19-48.694479.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T03-19-48.694479.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-04T03-19-48.694479.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T03-19-48.694479.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-management|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-virology|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T03-19-48.694479.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_10_04T03_19_48.694479 path: - '**/details_harness|truthfulqa:mc|0_2023-10-04T03-19-48.694479.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-10-04T03-19-48.694479.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_23T08_06_32.820862 path: - '**/details_harness|winogrande|5_2023-10-23T08-06-32.820862.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-23T08-06-32.820862.parquet' - config_name: results data_files: - split: 2023_10_04T03_19_48.694479 path: - results_2023-10-04T03-19-48.694479.parquet - split: 2023_10_23T08_06_32.820862 path: - results_2023-10-23T08-06-32.820862.parquet - split: latest path: - results_2023-10-23T08-06-32.820862.parquet --- # Dataset Card for Evaluation run of NewstaR/Koss-7B-chat ## Dataset Description - **Homepage:** 
- **Repository:** https://huggingface.co/NewstaR/Koss-7B-chat
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [NewstaR/Koss-7B-chat](https://huggingface.co/NewstaR/Koss-7B-chat) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_NewstaR__Koss-7B-chat",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2023-10-23T08:06:32.820862](https://huggingface.co/datasets/open-llm-leaderboard/details_NewstaR__Koss-7B-chat/blob/main/results_2023-10-23T08-06-32.820862.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.06333892617449664,
        "em_stderr": 0.002494400790190545,
        "f1": 0.12617449664429503,
        "f1_stderr": 0.002812859883562843,
        "acc": 0.39549166962367155,
        "acc_stderr": 0.009921949302668327
    },
    "harness|drop|3": {
        "em": 0.06333892617449664,
        "em_stderr": 0.002494400790190545,
        "f1": 0.12617449664429503,
        "f1_stderr": 0.002812859883562843
    },
    "harness|gsm8k|5": {
        "acc": 0.07354056103108415,
        "acc_stderr": 0.0071898357543652685
    },
    "harness|winogrande|5": {
        "acc": 0.7174427782162589,
        "acc_stderr": 0.012654062850971384
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
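Provided by:
open-llm-leaderboard
Summary of original information

Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation of the model NewstaR/Koss-7B-chat on the Open LLM Leaderboard.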

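Dataset structure

  • The dataset contains 64 configurations, each corresponding to one evaluated task.
  • The dataset was created from 2 runs. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.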
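
The timestamp-based naming can be sketched in a few lines. This is an illustration only: the helper names `split_name` and `file_stamp` are hypothetical, and the replacement rules are inferred from the split and file names listed on this page rather than taken from the leaderboard's own code.

```python
def split_name(run_timestamp: str) -> str:
    """Return the split name pattern seen in the configs (assumed rule:
    every '-' and ':' in the run timestamp becomes '_')."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_stamp(run_timestamp: str) -> str:
    """Return the timestamp form seen in file names (assumed rule:
    every ':' becomes '-')."""
    return run_timestamp.replace(":", "-")


run = "2023-10-23T08:06:32.820862"
print(split_name(run))  # 2023_10_23T08_06_32.820862
print(file_stamp(run))  # 2023-10-23T08-06-32.820862
```

Both outputs match the names listed for the 2023-10-23 run, for example the split `2023_10_23T08_06_32.820862` and the file `results_2023-10-23T08-06-32.820862.parquet`.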

Data loading example

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_NewstaR__Koss-7B-chat",
    "harness_winogrande_5",
    split="train",
)
```

Latest results

These are the latest results, from run 2023-10-23T08:06:32.820862:

```python
{
    "all": {
        "em": 0.06333892617449664,
        "em_stderr": 0.002494400790190545,
        "f1": 0.12617449664429503,
        "f1_stderr": 0.002812859883562843,
        "acc": 0.39549166962367155,
        "acc_stderr": 0.009921949302668327
    },
    "harness|drop|3": {
        "em": 0.06333892617449664,
        "em_stderr": 0.002494400790190545,
        "f1": 0.12617449664429503,
        "f1_stderr": 0.002812859883562843
    },
    "harness|gsm8k|5": {
        "acc": 0.07354056103108415,
        "acc_stderr": 0.0071898357543652685
    },
    "harness|winogrande|5": {
        "acc": 0.7174427782162589,
        "acc_stderr": 0.012654062850971384
    }
}
```
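
In these numbers, the "all" entry appears to be the unweighted mean of the per-task values for each metric. The page does not state this rule, so treat it as an observation about this run rather than a documented definition. A minimal check, using only the `acc` figures above and a hypothetical helper `mean_metric`:

```python
import math

# Values copied from the results of run 2023-10-23T08:06:32.820862.
results = {
    "all": {"acc": 0.39549166962367155, "acc_stderr": 0.009921949302668327},
    "harness|gsm8k|5": {"acc": 0.07354056103108415, "acc_stderr": 0.0071898357543652685},
    "harness|winogrande|5": {"acc": 0.7174427782162589, "acc_stderr": 0.012654062850971384},
}


def mean_metric(results: dict, metric: str) -> float:
    """Average one metric over every task entry that reports it (skipping 'all')."""
    values = [v[metric] for k, v in results.items() if k != "all" and metric in v]
    return sum(values) / len(values)


for metric in ("acc", "acc_stderr"):
    matches = math.isclose(mean_metric(results, metric), results["all"][metric], rel_tol=1e-9)
    print(metric, matches)
```

The same pattern holds trivially for `em` and `f1`, which only the DROP task reports in this run.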

Configuration details

  • harness_arc_challenge_25

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|arc:challenge|25_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|arc:challenge|25_2023-10-04T03-19-48.694479.parquet
  • harness_drop_3

    • Split: 2023_10_23T08_06_32.820862
    • Path: **/details_harness|drop|3_2023-10-23T08-06-32.820862.parquet
    • Split: latest
    • Path: **/details_harness|drop|3_2023-10-23T08-06-32.820862.parquet
  • harness_gsm8k_5

    • Split: 2023_10_23T08_06_32.820862
    • Path: **/details_harness|gsm8k|5_2023-10-23T08-06-32.820862.parquet
    • Split: latest
    • Path: **/details_harness|gsm8k|5_2023-10-23T08-06-32.820862.parquet
  • harness_hellaswag_10

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hellaswag|10_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hellaswag|10_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: multiple paths, for example **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: multiple paths, for example **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_abstract_algebra_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_anatomy_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-anatomy|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_astronomy_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-astronomy|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_business_ethics_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-business_ethics|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_clinical_knowledge_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_college_biology_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_biology|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_college_chemistry_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_college_computer_science_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_college_mathematics_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_college_medicine_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_medicine|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_college_physics_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-college_physics|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_computer_security_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-computer_security|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_conceptual_physics_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_econometrics_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-econometrics|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_electrical_engineering_5

    • Split: 2023_10_04T03_19_48.694479
    • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T03-19-48.694479.parquet
    • Split: latest
    • Path: **/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T03-19-48.694479.parquet
  • harness_hendrycksTest_elementary_mathematics_5

    • Split: 2023_10_04T03_19_48
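
The configuration names and parquet paths in this list follow a regular pattern relative to the harness task keys (such as `harness|arc:challenge|25`). The sketch below reproduces that pattern; the helper names `config_name` and `details_glob` are hypothetical, and the rules are inferred from the entries listed here, not from the leaderboard's own tooling.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to a configuration name (assumed rule:
    each '|', ':' and '-' becomes '_')."""
    return re.sub(r"[|:\-]", "_", task_key)


def details_glob(task_key: str, file_stamp: str) -> str:
    """Build the parquet glob pattern listed for a task and a run's file timestamp."""
    return f"**/details_{task_key}_{file_stamp}.parquet"


print(config_name("harness|arc:challenge|25"))
# harness_arc_challenge_25
print(config_name("harness|hendrycksTest-abstract_algebra|5"))
# harness_hendrycksTest_abstract_algebra_5
print(details_glob("harness|drop|3", "2023-10-23T08-06-32.820862"))
# **/details_harness|drop|3_2023-10-23T08-06-32.820862.parquet
```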
Collected summary
Dataset introduction
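Construction
This dataset comes from the Open LLM Leaderboard's evaluation pipeline for the NewstaR/Koss-7B-chat model and is generated automatically. It contains 64 configurations, each corresponding to one evaluated task, such as ARC Challenge, DROP, GSM8K, and Winogrande. The evaluation was run twice; each run's results are stored in every configuration as a separate split named after the run's timestamp, while the 'train' split always points to the results of the most recent evaluation. In addition, a configuration named 'results' collects the aggregated metrics of all runs, and these metrics feed directly into the computation and display of the aggregated scores on the Open LLM Leaderboard.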
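Features
The dataset's core feature is its fine-grained, task-oriented structure: each configuration holds the detailed records of one evaluation task on its own, so researchers can focus on the tasks they need. The data is stored in Parquet format, which balances efficient reads and writes with compression. The timestamp-based splits support tracing historical versions and let users compare model performance across runs. The 'results' configuration provides standardized metrics such as exact match (em), F1 score, and accuracy (acc), each with a standard error, which gives a rigorous statistical basis for quantitative comparison of model performance.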
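
As an illustration of how the reported standard errors can be used, the sketch below turns a metric and its standard error into an approximate 95% interval. It assumes a normal approximation with z = 1.96; the page does not say how the standard errors were computed, and the helper name `confidence_interval` is hypothetical.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) for value ± z * stderr under a normal approximation."""
    half_width = z * stderr
    return value - half_width, value + half_width


# Winogrande accuracy and standard error from the latest run.
low, high = confidence_interval(0.7174427782162589, 0.012654062850971384)
print(f"{low:.3f} to {high:.3f}")  # 0.693 to 0.742
```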
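Usage
The dataset can be loaded with the Hugging Face datasets library. For example, to load the latest results for the Winogrande task, specify the configuration name 'harness_winogrande_5' and the split 'train'. To load a specific historical run, replace the split with the corresponding timestamp identifier. To analyze overall performance, access the 'results' configuration directly to obtain the aggregated metrics. The interface is uniform across tasks and supports standardized cross-task queries, which suits reproducing and analyzing model evaluations in depth.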
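
A minimal sketch of these loading options, assuming the `datasets` library is installed and the repository is reachable. The helpers `load_kwargs` and `load_details` are hypothetical wrappers around `datasets.load_dataset`; the import is deferred so that building the arguments does not require the library.

```python
REPO = "open-llm-leaderboard/details_NewstaR__Koss-7B-chat"


def load_kwargs(config: str, split: str) -> dict:
    """Build the keyword arguments for datasets.load_dataset."""
    return {"path": REPO, "name": config, "split": split}


def load_details(config: str, split: str):
    """Load one configuration and split from the Hub (needs network access)."""
    from datasets import load_dataset  # deferred import

    return load_dataset(**load_kwargs(config, split))


print(load_kwargs("harness_winogrande_5", "2023_10_23T08_06_32.820862"))
# Example calls (commented out because they download data):
# details = load_details("harness_winogrande_5", "2023_10_23T08_06_32.820862")
# aggregated = load_details("results", "2023_10_23T08_06_32.820862")
```

The dataset description names the latest split 'train', while the configuration list on this page names it `latest`; which of the two resolves is worth checking against the repository before relying on it. A timestamp split avoids the ambiguity.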
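Background and challenges
Background overview
In the field of large language models (LLMs), standardized evaluation of model performance is a key driver of technical progress. The Open LLM Leaderboard was created by the Hugging Face team in 2023 to provide transparent, reproducible benchmarks for open-source LLMs. As part of the Leaderboard, this dataset records the evaluation results for the Koss-7B-chat model developed by the NewstaR team, covering tasks such as ARC, DROP, GSM8K, HellaSwag, MMLU, and WinoGrande. By collecting the results of multiple runs from October 2023, the dataset shows the limits of Koss-7B-chat's abilities in commonsense reasoning, mathematical problem solving, and knowledge understanding, and it gives the community a reference point for comparing models, which matters for standardized evaluation of 7B-parameter chat models.
Current challenges
The current challenges fall into two layers. At the domain level, LLM evaluation faces a tension between task diversity and evaluation consistency: Koss-7B-chat reaches 71.7% accuracy on WinoGrande but only 7.4% on GSM8K, which exposes a marked weakness on mathematical reasoning and calls for a finer-grained evaluation scheme to diagnose such capability gaps. At the construction level, the dataset has to integrate results from multiple evaluation runs: it must handle 64 configurations and run records with different timestamps, keep the 'train' split pointing to the latest results, and keep historical data traceable. In addition, managing the paths of the Parquet files and checking data consistency across tasks place strict demands on the robustness of the data pipeline.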
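Common scenarios
Classic use cases
In LLM evaluation, the evaluation datasets on the Open LLM Leaderboard provide a standardized benchmark for comparing models side by side. This dataset records the performance of the NewstaR/Koss-7B-chat model on several classic tasks, covering commonsense reasoning (such as Winogrande), mathematical reasoning (such as GSM8K), reading comprehension (such as DROP), and multi-subject knowledge tests (such as MMLU). By loading a specific configuration and split, researchers can reproduce the model's fine-grained results on each task and analyze the limits of its abilities along different cognitive dimensions. This fine-grained approach helps locate the model's strengths and weaknesses and gives a quantifiable basis for later optimization.
Practical applications
In practice, the evaluation results in this dataset serve model selection and deployment decisions. For example, when building a customer-service system, developers can use the model's mathematical reasoning results on GSM8K to judge its ability to handle complex numerical problems; when developing educational tools, the model's scores on the MMLU subject subtasks reflect the breadth and depth of its knowledge. The dataset also supports continuous integration and continuous delivery workflows, in which an automated evaluation pipeline monitors performance changes after model updates to keep the quality of deployed versions stable.
Derived related work
A number of influential works have grown around this evaluation dataset. The Open LLM Leaderboard itself, as an authoritative ranking of model performance in the open-source community, has given rise to many model variants optimized for specific tasks. For example, researchers have proposed improved pronoun resolution strategies based on Winogrande evaluation results, and performance on GSM8K has driven the development of chain-of-thought prompt engineering. The dataset has also inspired the design of multi-task joint evaluation frameworks such as HELM and BIG-bench, which further extend the dimensions and depth of language model evaluation.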
The content above was collected and summarized by 遇见数据集.