
open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat

Hugging Face, updated 2023-10-26, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat
Resource description:
--- pretty_name: Evaluation run of cmarkea/bloomz-7b1-mt-sft-chat dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [cmarkea/bloomz-7b1-mt-sft-chat](https://huggingface.co/cmarkea/bloomz-7b1-mt-sft-chat)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-26T06:59:23.411956](https://huggingface.co/datasets/open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat/blob/main/results_2023-10-26T06-59-23.411956.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You can find each in the \"results\" configuration and in the \"latest\" split of\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.25734060402684567,\n\ \ \"em_stderr\": 0.004477016029355352,\n \"f1\": 0.29325293624161086,\n\ \ \"f1_stderr\": 0.004472692066418403,\n \"acc\": 0.3191491844351243,\n\ \ \"acc_stderr\": 0.0077737951169338446\n },\n \"harness|drop|3\":\ \ {\n \"em\": 0.25734060402684567,\n \"em_stderr\": 0.004477016029355352,\n\ \ \"f1\": 0.29325293624161086,\n \"f1_stderr\": 0.004472692066418403\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.00530705079605762,\n \ \ \"acc_stderr\": 0.0020013057209480422\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.632991318074191,\n \"acc_stderr\": 0.013546284512919646\n\ \ }\n}\n```" repo_url: https://huggingface.co/cmarkea/bloomz-7b1-mt-sft-chat leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|arc:challenge|25_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-10-04T04-11-17.617298.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_26T06_59_23.411956 path: - '**/details_harness|drop|3_2023-10-26T06-59-23.411956.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-26T06-59-23.411956.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_26T06_59_23.411956 path: - '**/details_harness|gsm8k|5_2023-10-26T06-59-23.411956.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-26T06-59-23.411956.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hellaswag|10_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_10_04T04_11_17.617298 
path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T04-11-17.617298.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T04-11-17.617298.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T04-11-17.617298.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T04-11-17.617298.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T04-11-17.617298.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-04T04-11-17.617298.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T04-11-17.617298.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-management|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-virology|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-04T04-11-17.617298.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_10_04T04_11_17.617298 path: - '**/details_harness|truthfulqa:mc|0_2023-10-04T04-11-17.617298.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-10-04T04-11-17.617298.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_26T06_59_23.411956 path: - '**/details_harness|winogrande|5_2023-10-26T06-59-23.411956.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-26T06-59-23.411956.parquet' - config_name: results data_files: - split: 2023_10_04T04_11_17.617298 path: - results_2023-10-04T04-11-17.617298.parquet - split: 2023_10_26T06_59_23.411956 path: - results_2023-10-26T06-59-23.411956.parquet - split: latest path: - results_2023-10-26T06-59-23.411956.parquet --- # Dataset Card for Evaluation run of cmarkea/bloomz-7b1-mt-sft-chat ## Dataset Description - 
**Homepage:**
- **Repository:** https://huggingface.co/cmarkea/bloomz-7b1-mt-sft-chat
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [cmarkea/bloomz-7b1-mt-sft-chat](https://huggingface.co/cmarkea/bloomz-7b1-mt-sft-chat) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat",
    "harness_winogrande_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2023-10-26T06:59:23.411956](https://huggingface.co/datasets/open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat/blob/main/results_2023-10-26T06-59-23.411956.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.25734060402684567,
        "em_stderr": 0.004477016029355352,
        "f1": 0.29325293624161086,
        "f1_stderr": 0.004472692066418403,
        "acc": 0.3191491844351243,
        "acc_stderr": 0.0077737951169338446
    },
    "harness|drop|3": {
        "em": 0.25734060402684567,
        "em_stderr": 0.004477016029355352,
        "f1": 0.29325293624161086,
        "f1_stderr": 0.004472692066418403
    },
    "harness|gsm8k|5": {
        "acc": 0.00530705079605762,
        "acc_stderr": 0.0020013057209480422
    },
    "harness|winogrande|5": {
        "acc": 0.632991318074191,
        "acc_stderr": 0.013546284512919646
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
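Provider:
open-llm-leaderboard
Summary of original information

Dataset overview

This dataset was created automatically during the evaluation run of the model cmarkea/bloomz-7b1-mt-sft-chat on the Open LLM Leaderboard.

Dataset composition

  • The dataset contains 64 configurations, each corresponding to one evaluation task.
  • The dataset was created from 2 runs. Each run appears as a specific split in each configuration, and the split is named with the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.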
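The split names and file names listed on this page appear to be derived from the run timestamp by simple character substitution. The sketch below illustrates that mapping; the helper names `split_name` and `file_stamp` are hypothetical, and the rules are inferred from the entries shown here rather than taken from the leaderboard's own code.

```python
def split_name(run_timestamp: str) -> str:
    """Split-name form seen on this page: '-' and ':' both become '_'."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_stamp(run_timestamp: str) -> str:
    """File-name form seen on this page: ':' becomes '-'."""
    return run_timestamp.replace(":", "-")


run = "2023-10-26T06:59:23.411956"
print(split_name(run))  # 2023_10_26T06_59_23.411956
print(file_stamp(run))  # 2023-10-26T06-59-23.411956
```

Both outputs match the split and Parquet file names listed for the 2023-10-26 run.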

Data loading example

    from datasets import load_dataset

    data = load_dataset(
        "open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat",
        "harness_winogrande_5",
        split="train",
    )

Latest results

The latest results, from run 2023-10-26T06:59:23.411956:

    {
        "all": {
            "em": 0.25734060402684567,
            "em_stderr": 0.004477016029355352,
            "f1": 0.29325293624161086,
            "f1_stderr": 0.004472692066418403,
            "acc": 0.3191491844351243,
            "acc_stderr": 0.0077737951169338446
        },
        "harness|drop|3": {
            "em": 0.25734060402684567,
            "em_stderr": 0.004477016029355352,
            "f1": 0.29325293624161086,
            "f1_stderr": 0.004472692066418403
        },
        "harness|gsm8k|5": {
            "acc": 0.00530705079605762,
            "acc_stderr": 0.0020013057209480422
        },
        "harness|winogrande|5": {
            "acc": 0.632991318074191,
            "acc_stderr": 0.013546284512919646
        }
    }
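The numbers above are consistent with the "all" entry being an unweighted mean of each metric over the tasks that report it. The following sketch checks this for the figures on this page; the function name `mean_metric` is hypothetical, and the averaging rule is an inference from these values, not a documented guarantee.

```python
import math

# Per-task results copied from the latest run shown above.
results = {
    "all": {
        "em": 0.25734060402684567,
        "em_stderr": 0.004477016029355352,
        "f1": 0.29325293624161086,
        "f1_stderr": 0.004472692066418403,
        "acc": 0.3191491844351243,
        "acc_stderr": 0.0077737951169338446,
    },
    "harness|drop|3": {
        "em": 0.25734060402684567,
        "em_stderr": 0.004477016029355352,
        "f1": 0.29325293624161086,
        "f1_stderr": 0.004472692066418403,
    },
    "harness|gsm8k|5": {
        "acc": 0.00530705079605762,
        "acc_stderr": 0.0020013057209480422,
    },
    "harness|winogrande|5": {
        "acc": 0.632991318074191,
        "acc_stderr": 0.013546284512919646,
    },
}


def mean_metric(results: dict, metric: str) -> float:
    """Unweighted mean of `metric` over every task (except 'all') that reports it."""
    values = [m[metric] for task, m in results.items()
              if task != "all" and metric in m]
    return sum(values) / len(values)


for metric, reported in results["all"].items():
    agrees = math.isclose(mean_metric(results, metric), reported, rel_tol=1e-9)
    print(f"{metric}: matches unweighted mean = {agrees}")
```

Here "acc" averages GSM8K and Winogrande, while "em" and "f1" come from DROP alone, so the overall accuracy of about 0.319 mixes a near-zero task with a much stronger one.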

Configuration details

Details for some of the configurations:

  • harness_arc_challenge_25

    • Split: 2023_10_04T04_11_17.617298
      • Path: **/details_harness|arc:challenge|25_2023-10-04T04-11-17.617298.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-10-04T04-11-17.617298.parquet
  • harness_drop_3

    • Split: 2023_10_26T06_59_23.411956
      • Path: **/details_harness|drop|3_2023-10-26T06-59-23.411956.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-10-26T06-59-23.411956.parquet
  • harness_gsm8k_5

    • Split: 2023_10_26T06_59_23.411956
      • Path: **/details_harness|gsm8k|5_2023-10-26T06-59-23.411956.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-10-26T06-59-23.411956.parquet
  • harness_hellaswag_10

    • Split: 2023_10_04T04_11_17.617298
      • Path: **/details_harness|hellaswag|10_2023-10-04T04-11-17.617298.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-10-04T04-11-17.617298.parquet
  • harness_hendrycksTest_5

    • Split: 2023_10_04T04_11_17.617298
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T04-11-17.617298.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-10-04T04-11-17.617298.parquet
        • ... (other paths omitted)
    • Split: latest
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-04T04-11-17.617298.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-10-04T04-11-17.617298.parquet
        • ... (other paths omitted)
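For single-task configurations, the configuration name and the Parquet glob both seem to follow directly from the task identifier. The sketch below reproduces that pattern; `config_name` and `details_glob` are hypothetical helpers inferred from the entries listed here, and they do not cover the aggregate harness_hendrycksTest_5 configuration, which bundles many paths.

```python
import re


def config_name(task_id: str) -> str:
    """Replace '|', ':' and '-' with '_' to obtain the configuration name."""
    return re.sub(r"[|:\-]", "_", task_id)


def details_glob(task_id: str, file_stamp: str) -> str:
    """Glob pattern for one task's details file in one run."""
    return f"**/details_{task_id}_{file_stamp}.parquet"


task = "harness|arc:challenge|25"
print(config_name(task))
# harness_arc_challenge_25
print(details_glob(task, "2023-10-04T04-11-17.617298"))
# **/details_harness|arc:challenge|25_2023-10-04T04-11-17.617298.parquet
```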
Collected summary
Dataset introduction
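Construction
This dataset was generated automatically over multiple runs of the Open LLM Leaderboard evaluation framework on the cmarkea/bloomz-7b1-mt-sft-chat model. It contains 64 configurations, each corresponding to one evaluation task and named after the task identifier and the number of few-shot examples. Each evaluation run is stored as a separate split named with the run timestamp, while the 'train' split always points to the results of the most recent run. An additional configuration named 'results' collects the aggregated metrics of all runs, which are used to compute and display overall performance on the leaderboard.
Features
The dataset is organized by task: each configuration contains one split per run timestamp, which makes it straightforward to trace and compare historical results. The 'train' split is updated automatically to the latest evaluation results. The 'results' configuration provides aggregated metrics per task and overall, including accuracy, exact match, and F1 with their standard errors, which supports analysis of model performance along several dimensions.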
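Because each metric is reported with a standard error, a rough interval can be attached to any score. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval; that choice, and the helper name `confidence_interval`, are assumptions for illustration and not something the dataset itself specifies.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval: value plus or minus z standard errors."""
    half_width = z * stderr
    return value - half_width, value + half_width


# Winogrande accuracy and standard error from the latest run on this page.
low, high = confidence_interval(0.632991318074191, 0.013546284512919646)
print(f"winogrande acc: {low:.3f} to {high:.3f}")  # winogrande acc: 0.606 to 0.660
```

An interval like this helps judge whether a gap between two models on the same task is larger than the evaluation noise.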
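Usage
Users can load the dataset with the Hugging Face datasets library. For example, call load_dataset with the dataset name and a specific configuration name (such as 'harness_winogrande_5') and set split to 'train' to get the latest evaluation results. To access data from an earlier run, set split to the corresponding timestamp string. The underlying data is stored in Parquet format, which supports efficient processing and analysis.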
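The loading step described above can be wrapped in a small helper. In this sketch, `details_repo_id` and `load_details` are hypothetical names, and the rule that the repository name replaces "/" in the model id with "__" is inferred from this dataset's own name. The `datasets` import is kept inside the function so that the helper can be defined without the library installed; calling it requires network access.

```python
def details_repo_id(model_id: str) -> str:
    """Details repository name for a model, as inferred from this dataset's name."""
    return "open-llm-leaderboard/details_" + model_id.replace("/", "__")


def load_details(model_id: str, config_name: str, split: str = "train"):
    """Load one configuration; pass a timestamp split name for an earlier run."""
    from datasets import load_dataset  # lazy import: needs the `datasets` package

    return load_dataset(details_repo_id(model_id), config_name, split=split)


print(details_repo_id("cmarkea/bloomz-7b1-mt-sft-chat"))
# open-llm-leaderboard/details_cmarkea__bloomz-7b1-mt-sft-chat
```

A call such as `load_details("cmarkea/bloomz-7b1-mt-sft-chat", "harness_winogrande_5", split="2023_10_26T06_59_23.411956")` would request the split for that specific run.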
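Background and challenges
Background overview
In the evaluation of large language models (LLMs), building standardized benchmarks and presenting results transparently has become a key part of driving model development. The Open LLM Leaderboard was launched by the Hugging Face team in 2023 to give the community a unified, reproducible platform for comparing model performance. This dataset records the evaluation results of the cmarkea/bloomz-7b1-mt-sft-chat model on the leaderboard. Its construction was led by Clementine and other researchers, and its core research question concerns how well a multilingual instruction-tuned model generalizes across diverse tasks. By bringing together benchmarks such as ARC, HellaSwag, MMLU, and GSM8K, which cover reasoning, knowledge, and mathematical ability, the dataset provides a quantitative basis for evaluating multilingual models like bloomz-7b1 and supports the standardization of LLM evaluation within an open-science framework. Its influence extends to downstream uses such as model selection, fine-tuning strategy optimization, and fair comparison.
Current challenges
The domain problem this dataset addresses is that LLM evaluations are often hard to compare side by side because of limited task diversity and opaque evaluation procedures. Specific challenges include: 1) Evaluation consistency across heterogeneous tasks. A uniform evaluation paradigm has to hold across 64 configurations, with tasks ranging from commonsense reasoning (Winogrande) to mathematical problem solving (GSM8K); the model reaches an exact match of only 0.257 on the drop task, which highlights the weakness of multilingual models in complex reasoning. 2) Managing runs over time during construction. The two evaluation runs (2023-10-04 and 2023-10-26) are tracked as timestamped splits so that results remain reproducible, and data from runs that do not cover the same tasks has to be integrated, which places strict demands on keeping the data pipeline consistent.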
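When runs cover different tasks, one way to picture the "latest" split is as the most recent run that included each task. The sketch below illustrates that idea with the two runs listed on this page (only a few of the first run's tasks are included); the function `latest_run_per_task` is hypothetical and is not the leaderboard's actual pipeline.

```python
def latest_run_per_task(runs: dict[str, list[str]]) -> dict[str, str]:
    """Map each task to the most recent run timestamp that evaluated it."""
    latest: dict[str, str] = {}
    # ISO-style timestamps sort chronologically as plain strings.
    for stamp in sorted(runs):
        for task in runs[stamp]:
            latest[task] = stamp  # later runs overwrite earlier ones
    return latest


runs = {
    "2023-10-04T04:11:17.617298": [
        "harness|arc:challenge|25",
        "harness|hellaswag|10",
        "harness|truthfulqa:mc|0",
    ],
    "2023-10-26T06:59:23.411956": [
        "harness|drop|3",
        "harness|gsm8k|5",
        "harness|winogrande|5",
    ],
}

for task, stamp in latest_run_per_task(runs).items():
    print(f"{task}: {stamp}")
```

This matches the configuration details on this page, where the latest split for ARC still points to the 2023-10-04 files while DROP points to the 2023-10-26 files.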
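Common scenarios
Typical use cases
In LLM performance evaluation, this dataset serves as a standardized evaluation component of the Open LLM Leaderboard and is widely used to measure a model's overall ability across many kinds of tasks. It covers classic benchmarks including ARC Challenge, DROP reading comprehension, GSM8K mathematical reasoning, HellaSwag commonsense reasoning, Winogrande coreference resolution, and MMLU (HendrycksTest), which spans 57 subjects. By loading a specific configuration (such as harness_winogrande_5), researchers can conveniently reproduce the model's results on a given task for fair side-by-side comparison and capability diagnosis. This design lets model developers systematically track how a model evolves along dimensions such as reasoning, commonsense, mathematics, and specialized knowledge.
Practical applications
In practice, the dataset gives industry a key basis for decisions about deploying large language models. By analyzing the model's F1 score (29.3%) and EM score (25.7%) on complex reasoning tasks such as DROP, an organization can assess its reliability for document understanding and information extraction. Performance on GSM8K directly reflects the model's suitability for scenarios that need precise mathematical ability, such as financial calculation or educational tutoring. The 57-subject MMLU evaluation gives a quantitative reference for choosing a model for vertical-domain tasks such as legal consultation and medical question answering, helping ensure that the model performs as expected in critical business scenarios.
Derived and related work
A range of work and tools has grown out of this dataset and the Open LLM Leaderboard framework. For example, researchers developed an automated model evaluation pipeline (Evaluation Harness) that supports one-step standardized evaluation of any Hugging Face model. The dataset has also supported continuous updates to the model leaderboard and led to visual analysis tools such as model capability radar charts and task difficulty tiering analyses. In academic papers, it is frequently cited as authoritative evidence of model performance, and it has become an indispensable benchmark data source in studies comparing how different fine-tuning strategies (such as SFT and RLHF) affect a model's general capabilities.
The above content was collected and summarized by 遇见数据集.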