
open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b

Source: Hugging Face. Updated 2023-12-03. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b
Resource description:
--- pretty_name: Evaluation run of anas-awadalla/mpt-1b-redpajama-200b dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [anas-awadalla/mpt-1b-redpajama-200b](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 4 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b\"\ ,\n\t\"harness_gsm8k_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\nThese\ \ are the [latest results from run 2023-12-03T16:06:56.054386](https://huggingface.co/datasets/open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b/blob/main/results_2023-12-03T16-06-56.054386.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.0,\n \"\ acc_stderr\": 0.0\n },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \ \ \"acc_stderr\": 0.0\n }\n}\n```" repo_url: https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|arc:challenge|25_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-09-14T22-39-00.593372.parquet' - config_name: harness_drop_3 data_files: - split: 2023_11_04T22_34_26.464302 path: - '**/details_harness|drop|3_2023-11-04T22-34-26.464302.parquet' - split: 2023_11_06T15_58_19.397762 path: - '**/details_harness|drop|3_2023-11-06T15-58-19.397762.parquet' - split: latest path: - '**/details_harness|drop|3_2023-11-06T15-58-19.397762.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_11_04T22_34_26.464302 path: - '**/details_harness|gsm8k|5_2023-11-04T22-34-26.464302.parquet' - split: 2023_11_06T15_58_19.397762 path: - '**/details_harness|gsm8k|5_2023-11-06T15-58-19.397762.parquet' - split: 2023_12_03T16_06_56.054386 path: - '**/details_harness|gsm8k|5_2023-12-03T16-06-56.054386.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-12-03T16-06-56.054386.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hellaswag|10_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-14T22-39-00.593372.parquet' - 
'**/details_harness|hendrycksTest-anatomy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-14T22-39-00.593372.parquet' - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-14T22-39-00.593372.parquet' - 
'**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-14T22-39-00.593372.parquet' 
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-14T22-39-00.593372.parquet' - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-14T22-39-00.593372.parquet' - 
'**/details_harness|hendrycksTest-professional_psychology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-14T22-39-00.593372.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - 
'**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-14T22-39-00.593372.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-management|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-virology|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-14T22-39-00.593372.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_09_14T22_39_00.593372 path: - '**/details_harness|truthfulqa:mc|0_2023-09-14T22-39-00.593372.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-09-14T22-39-00.593372.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_11_04T22_34_26.464302 path: - '**/details_harness|winogrande|5_2023-11-04T22-34-26.464302.parquet' - split: 2023_11_06T15_58_19.397762 path: - '**/details_harness|winogrande|5_2023-11-06T15-58-19.397762.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-11-06T15-58-19.397762.parquet' - config_name: results data_files: - split: 2023_09_14T22_39_00.593372 path: - results_2023-09-14T22-39-00.593372.parquet - split: 2023_11_04T22_34_26.464302 path: - results_2023-11-04T22-34-26.464302.parquet - split: 2023_11_06T15_58_19.397762 path: - 
results_2023-11-06T15-58-19.397762.parquet - split: 2023_12_03T16_06_56.054386 path: - results_2023-12-03T16-06-56.054386.parquet - split: latest path: - results_2023-12-03T16-06-56.054386.parquet ---

# Dataset Card for Evaluation run of anas-awadalla/mpt-1b-redpajama-200b

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [anas-awadalla/mpt-1b-redpajama-200b](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 4 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# Load the per-example details of the GSM8K (5-shot) task.
# "latest" is the split name that the configs above declare for the most recent run.
data = load_dataset(
    "open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b",
    "harness_gsm8k_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-12-03T16:06:56.054386](https://huggingface.co/datasets/open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b/blob/main/results_2023-12-03T16-06-56.054386.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split of each eval):

```python
{
    "all": {
        "acc": 0.0,
        "acc_stderr": 0.0
    },
    "harness|gsm8k|5": {
        "acc": 0.0,
        "acc_stderr": 0.0
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
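Provided by:
open-llm-leaderboard
Summary of original information

Dataset overview

This dataset was created automatically during the evaluation of the model anas-awadalla/mpt-1b-redpajama-200b on the Open LLM Leaderboard.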

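Dataset composition

  • The dataset contains 64 configurations, each corresponding to one evaluated task.
  • The dataset was created from 4 runs. Each run appears as a separate split in each configuration, and the split is named after the run's timestamp.
  • The "latest" split always points to the most recent results.
  • An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.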
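
The split naming rule in the list above can be sketched in a few lines. This is only an illustration, assuming (from the configuration listing further down this page) that a split name replaces `-` and `:` in the run timestamp with `_`, while file names replace only `:` with `-`. The helper names `split_name`, `file_stamp`, and `latest_split` are hypothetical and not part of any library.

```python
def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name used in the configurations."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_stamp(run_timestamp: str) -> str:
    """Turn a run timestamp into the form embedded in parquet/JSON file names."""
    return run_timestamp.replace(":", "-")


def latest_split(split_names: list[str]) -> str:
    """Pick the most recent timestamped split.

    The names are zero-padded, so plain string ordering is chronological.
    """
    return max(name for name in split_names if name != "latest")


run = "2023-12-03T16:06:56.054386"
print(split_name(run))   # 2023_12_03T16_06_56.054386
print(file_stamp(run))   # 2023-12-03T16-06-56.054386
print(latest_split([
    "2023_11_04T22_34_26.464302",
    "2023_11_06T15_58_19.397762",
    "2023_12_03T16_06_56.054386",
    "latest",
]))                      # 2023_12_03T16_06_56.054386
```

The three split names passed to `latest_split` are the ones listed for `harness_gsm8k_5`, whose "latest" entry points at the 2023-12-03 file.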

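Data loading example

```python
from datasets import load_dataset

# "latest" is the split name that the configurations define for the most recent run.
data = load_dataset(
    "open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b",
    "harness_gsm8k_5",
    split="latest",
)
```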

Latest results

Latest results, from run 2023-12-03T16:06:56.054386:

```python
{
    "all": {"acc": 0.0, "acc_stderr": 0.0},
    "harness|gsm8k|5": {"acc": 0.0, "acc_stderr": 0.0}
}
```
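
The result keys above follow a `suite|task|n_shot` pattern, while configurations use names such as `harness_gsm8k_5`. A minimal sketch of going from one to the other, assuming the configuration name is obtained by replacing every non-alphanumeric character with `_`. That rule agrees with the pairs visible on this page (for example `harness|arc:challenge|25` and `harness_arc_challenge_25`) but is not stated by the source. `config_name_for` and `per_task_accuracy` are hypothetical helpers.

```python
import re

# The results dictionary exactly as shown on this page.
latest_results = {
    "all": {"acc": 0.0, "acc_stderr": 0.0},
    "harness|gsm8k|5": {"acc": 0.0, "acc_stderr": 0.0},
}


def config_name_for(task_key: str) -> str:
    """Map a result key to a configuration name (assumed rule, see lead-in)."""
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def per_task_accuracy(results: dict) -> dict:
    """Return {config_name: acc} for every task, skipping the "all" aggregate."""
    return {
        config_name_for(key): metrics["acc"]
        for key, metrics in results.items()
        if key != "all"
    }


print(per_task_accuracy(latest_results))  # {'harness_gsm8k_5': 0.0}
```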

Configuration details

  • config_name: harness_arc_challenge_25

    • split: 2023_09_14T22_39_00.593372
      • path: **/details_harness|arc:challenge|25_2023-09-14T22-39-00.593372.parquet
    • split: latest
      • path: **/details_harness|arc:challenge|25_2023-09-14T22-39-00.593372.parquet
  • config_name: harness_drop_3

    • split: 2023_11_04T22_34_26.464302
      • path: **/details_harness|drop|3_2023-11-04T22-34-26.464302.parquet
    • split: 2023_11_06T15_58_19.397762
      • path: **/details_harness|drop|3_2023-11-06T15-58-19.397762.parquet
    • split: latest
      • path: **/details_harness|drop|3_2023-11-06T15-58-19.397762.parquet
  • config_name: harness_gsm8k_5

    • split: 2023_11_04T22_34_26.464302
      • path: **/details_harness|gsm8k|5_2023-11-04T22-34-26.464302.parquet
    • split: 2023_11_06T15_58_19.397762
      • path: **/details_harness|gsm8k|5_2023-11-06T15-58-19.397762.parquet
    • split: 2023_12_03T16_06_56.054386
      • path: **/details_harness|gsm8k|5_2023-12-03T16-06-56.054386.parquet
    • split: latest
      • path: **/details_harness|gsm8k|5_2023-12-03T16-06-56.054386.parquet
  • config_name: harness_hellaswag_10

    • split: 2023_09_14T22_39_00.593372
      • path: **/details_harness|hellaswag|10_2023-09-14T22-39-00.593372.parquet
    • split: latest
      • path: **/details_harness|hellaswag|10_2023-09-14T22-39-00.593372.parquet
  • config_name: harness_hendrycksTest_5

    • split: 2023_09_14T22_39_00.593372
      • path:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-global_facts|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_biology|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_european_history|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_geography|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_physics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_psychology|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_statistics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_us_history|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-high_school_world_history|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-human_aging|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-human_sexuality|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-international_law|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-jurisprudence|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-logical_fallacies|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-machine_learning|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-management|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-marketing|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-medical_genetics|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-miscellaneous|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-moral_disputes|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-moral_scenarios|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-nutrition|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-philosophy|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-prehistory|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-professional_accounting|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-professional_law|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-professional_medicine|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-professional_psychology|5_2023-09-14T22-39-00.593372.parquet
        • **/details_harness|hendrycksTest-public_relations|5
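Collected summary
Dataset introduction
Construction method
In the context of large language model evaluation, this dataset is generated automatically as a by-product of the Open LLM Leaderboard evaluation pipeline and records the evaluation details of the model anas-awadalla/mpt-1b-redpajama-200b. It was built from 4 evaluation runs; the results of each run are stored as a separate split named after that run's timestamp. The dataset contains 64 configurations, each corresponding to one evaluated benchmark task, such as ARC-Challenge, GSM8K (mathematical reasoning), and HellaSwag (commonsense reasoning). An additional configuration named "results" aggregates the metrics of all runs and supplies the data from which the leaderboard's overall scores are computed and displayed.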
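Usage
Users can load the dataset with the Hugging Face datasets library. In Python, call load_dataset with the dataset name, the target configuration (for example "harness_gsm8k_5"), and the desired split (for example "latest") to obtain the evaluation details for a specific task. To analyze a single run, load the split named after that run's timestamp. To access the aggregated metrics across tasks, use the "results" configuration. This lets researchers extract only the data they need, whether for fine-grained diagnosis of model performance or for evaluation across tasks.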
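
The loading options described above can be wrapped in one small helper. This is a sketch, not an official recipe: `resolve_split` and `load_details` are hypothetical names, the timestamp-to-split rule is inferred from the configuration listing on this page, and the actual download needs network access plus the `datasets` package.

```python
REPO_ID = "open-llm-leaderboard/details_anas-awadalla__mpt-1b-redpajama-200b"


def resolve_split(run_timestamp=None):
    """Return "latest", or the split name for one specific run timestamp."""
    if run_timestamp is None:
        return "latest"
    # Assumed rule: "-" and ":" in the timestamp become "_" in the split name.
    return run_timestamp.replace("-", "_").replace(":", "_")


def load_details(config_name, run_timestamp=None):
    """Load one configuration for one run (downloads data from the Hub)."""
    # Imported lazily so that resolve_split stays usable without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=resolve_split(run_timestamp))


# Example calls (commented out because they download data):
# details = load_details("harness_gsm8k_5")                               # most recent run
# older = load_details("harness_gsm8k_5", "2023-11-06T15:58:19.397762")   # one specific run
# aggregated = load_details("results")                                    # aggregated metrics
```

Keeping the split resolution separate from the download makes it easy to check the naming rule offline before requesting any data.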
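Background and challenges
Background overview
The rapid development of large language models (LLMs) has created a pressing need for systematic evaluation of their performance. Against this background, Hugging Face launched the Open LLM Leaderboard in 2023 to provide an open, transparent, and reproducible evaluation benchmark. This dataset is the automatically generated record of the evaluation of anas-awadalla/mpt-1b-redpajama-200b; its dataset card lists clementine@hf.co as the point of contact. The core question is how to quantify, through a standardized multi-task evaluation framework (covering ARC-Challenge, GSM8K mathematical reasoning, HellaSwag commonsense reasoning, and the multi-subject MMLU test, among others), the overall capability of an MPT model that, as its name suggests, has roughly 1 billion parameters and was pretrained on 200B tokens of the RedPajama corpus. The dataset offers first-hand empirical material for studying how small open-source models perform across diverse natural language understanding tasks, and it supports the community's work on automated, transparent evaluation pipelines.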
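Current challenges
The domain-level challenge this dataset addresses is how to build a comprehensive, fair, and scalable LLM evaluation system. Tasks differ greatly in difficulty (for example, GSM8K mathematical reasoning versus DROP reading comprehension), and no single metric reflects a model's full profile, so a multi-configuration, multi-task evaluation matrix is needed to characterize capability along several dimensions. On the construction side, the technical challenges include handling the heterogeneous results produced by multiple evaluation runs (4 in total) and ensuring that runs with different timestamps can be indexed and loaded through a uniform split structure. Evaluation results (such as the GSM8K accuracy of 0.0) must also be interpreted carefully against the model's actual capability, so that data contamination or biased evaluation settings do not lead to misleading conclusions. Finally, integrating the detailed per-task results (in parquet format) with the aggregated metrics, in order to support a continuously updated leaderboard, is a key data-engineering difficulty.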
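
One detail worth checking when reading a score of 0.0 is why `acc_stderr` is also 0.0. The sketch below assumes the reported standard error is the sample standard error of the mean over per-example 0/1 scores; that is an assumption about the evaluation harness, not something this page states. The score lists are made-up placeholders, not data from this run.

```python
import math


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for a list of 0/1 scores."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (n - 1 in the denominator).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Placeholder scores: every example wrong, so there is no spread at all.
print(mean_and_stderr([0] * 8))  # (0.0, 0.0)

# Placeholder scores with mixed outcomes give a non-zero standard error.
print(mean_and_stderr([1, 0, 1, 0]))
```

Under this assumption, a standard error of exactly 0.0 only says that every example received the same score; it does not by itself indicate a precise measurement.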
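Common scenarios
Typical use cases
Within the evaluation ecosystem for large language models, the Open LLM Leaderboard evaluation datasets play an important role. For anas-awadalla/mpt-1b-redpajama-200b, this dataset's 64 configurations record the model's fine-grained performance on standard benchmarks such as ARC-Challenge, HellaSwag, GSM8K, and MMLU, which covers 57 subjects. Researchers can load the evaluation records for a specific task (for example GSM8K for mathematical reasoning) and analyze the limits of the model's commonsense reasoning, knowledge, and multi-step problem solving.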
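
The file patterns in the configuration listing on this page encode the task, the few-shot count, and the run timestamp. A rough sketch for pulling those fields apart, assuming every details path follows the `details_<suite>|<task>|<n_shot>_<timestamp>.parquet` shape seen in that listing. The regular expression and `parse_details_path` are illustrative, not part of any library.

```python
import re

# Assumed shape: details_<suite>|<task>|<n_shot>_<timestamp>.parquet
DETAILS_PATH = re.compile(
    r"details_(?P<suite>[^|]+)\|(?P<task>[^|]+)\|(?P<n_shot>\d+)_(?P<stamp>.+)\.parquet$"
)


def parse_details_path(path):
    """Split a details file pattern into suite, task, few-shot count, and timestamp."""
    match = DETAILS_PATH.search(path)
    if match is None:
        raise ValueError(f"unexpected details path: {path!r}")
    return {
        "suite": match["suite"],
        "task": match["task"],
        "n_shot": int(match["n_shot"]),
        "timestamp": match["stamp"],
    }


info = parse_details_path(
    "**/details_harness|hendrycksTest-anatomy|5_2023-09-14T22-39-00.593372.parquet"
)
print(info["task"], info["n_shot"], info["timestamp"])
# hendrycksTest-anatomy 5 2023-09-14T22-39-00.593372
```

Grouping parsed paths by `task` is one way to list which MMLU (`hendrycksTest-*`) subjects a run covered without downloading any parquet file.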
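Academic problems addressed
The main value of this dataset is that it addresses the lack of reproducibility and transparency that is common in large-model performance evaluation. With a standardized evaluation framework, model results from different points in time can be compared directly, which gives a solid basis for checking whether changes to a training strategy (such as a different data mix or an adjusted model architecture) actually help. The fine-grained results can also expose systematic weaknesses in particular knowledge domains or reasoning tasks, guiding researchers toward more robust training schemes and evaluation metrics.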
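Practical applications
In practice, the dataset provides quantitative support for model selection and deployment decisions. For example, when building a math-tutoring system, developers can use the GSM8K evaluation data to judge how reliable the model is at mathematical reasoning; when developing a general-purpose question-answering assistant, the cross-subject MMLU scores help weigh the breadth and depth of the model's knowledge. Organizations can use these results to pick the model version that performs best for a given business scenario, reducing trial-and-error cost and speeding up deployment.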
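Recent research on the dataset
Latest research directions
In the field of large language model (LLM) evaluation, the Open LLM Leaderboard has become a widely used reference platform for measuring model performance. This dataset records detailed evaluation results for the MPT-1B-RedPajama-200b model on several benchmark tasks, including ARC-Challenge, HellaSwag, GSM8K, and the 57-subject MMLU test. These tasks range from commonsense reasoning and mathematical problem solving to specialized knowledge, and together they probe the reasoning limits of a small model. Current research focuses on how fine-grained evaluation can reveal a model's real capability in zero-shot or few-shot settings; the GSM8K accuracy of 0 reported here is consistent with the view that parameter scale constrains complex reasoning. The dataset gives the community a traceable, reproducible evaluation record, supports a closer understanding of capability bottlenecks, and points to mathematical and logical reasoning as an area where lightweight models need improvement.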
The content above was collected and summarized by 遇见数据集.