open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b

Hugging Face, updated 2023-10-28, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b
Resource description:
--- pretty_name: Evaluation run of PY007/TinyLlama-1.1B-intermediate-step-240k-503b dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [PY007/TinyLlama-1.1B-intermediate-step-240k-503b](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b)\ \ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 2 run(s). Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ run (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b\"\ ,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\ These are the [latest results from run 2023-10-28T05:32:33.745725](https://huggingface.co/datasets/open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b/blob/main/results_2023-10-28T05-32-33.745725.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. You can find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"em\": 0.0019924496644295304,\n\ \ \"em_stderr\": 0.00045666764626669333,\n \"f1\": 0.04375419463087258,\n\ \ \"f1_stderr\": 0.0012232801051450955,\n \"acc\": 0.2844681550025042,\n\ \ \"acc_stderr\": 0.007722228058459302\n },\n \"harness|drop|3\": {\n\ \ \"em\": 0.0019924496644295304,\n \"em_stderr\": 0.00045666764626669333,\n\ \ \"f1\": 0.04375419463087258,\n \"f1_stderr\": 0.0012232801051450955\n\ \ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.003032600454890068,\n \ \ \"acc_stderr\": 0.0015145735612245499\n },\n \"harness|winogrande|5\"\ : {\n \"acc\": 0.5659037095501184,\n \"acc_stderr\": 0.013929882555694054\n\ \ }\n}\n```"
repo_url: https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|arc:challenge|25_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-09-18T13-31-42.519724.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_28T05_32_33.745725 path: - '**/details_harness|drop|3_2023-10-28T05-32-33.745725.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-28T05-32-33.745725.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_28T05_32_33.745725 path: - '**/details_harness|gsm8k|5_2023-10-28T05-32-33.745725.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-10-28T05-32-33.745725.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hellaswag|10_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_5 data_files: -
split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-18T13-31-42.519724.parquet' - 
'**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-18T13-31-42.519724.parquet' - 
'**/details_harness|hendrycksTest-moral_disputes|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-18T13-31-42.519724.parquet' - 
'**/details_harness|hendrycksTest-college_computer_science|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-18T13-31-42.519724.parquet' - 
'**/details_harness|hendrycksTest-high_school_physics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-management|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-09-18T13-31-42.519724.parquet' - 
'**/details_harness|hendrycksTest-professional_medicine|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-09-18T13-31-42.519724.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - 
split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_college_biology_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - 
'**/details_harness|hendrycksTest-college_physics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-18T13-31-42.519724.parquet' - 
split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-high_school_geography|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - 
'**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - 
'**/details_harness|hendrycksTest-international_law|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-management|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-medical_genetics|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 
2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - 
'**/details_harness|hendrycksTest-sociology|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-virology|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-09-18T13-31-42.519724.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_09_18T13_31_42.519724 path: - '**/details_harness|truthfulqa:mc|0_2023-09-18T13-31-42.519724.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-09-18T13-31-42.519724.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_28T05_32_33.745725 path: - '**/details_harness|winogrande|5_2023-10-28T05-32-33.745725.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-28T05-32-33.745725.parquet' - config_name: results data_files: - split: 2023_09_18T13_31_42.519724 path: - results_2023-09-18T13-31-42.519724.parquet - split: 2023_10_28T05_32_33.745725 path: - results_2023-10-28T05-32-33.745725.parquet - split: latest path: - results_2023-10-28T05-32-33.745725.parquet ---

# Dataset Card for Evaluation run of PY007/TinyLlama-1.1B-intermediate-step-240k-503b

## Dataset Description

- **Homepage:**
- **Repository:** https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b
- **Paper:**
- **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- **Point of Contact:** clementine@hf.co

### Dataset Summary

Dataset automatically created during the evaluation run of model [PY007/TinyLlama-1.1B-intermediate-step-240k-503b](https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

# The metadata above defines a "latest" split and one split per run timestamp.
data = load_dataset(
    "open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b",
    "harness_winogrande_5",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2023-10-28T05:32:33.745725](https://huggingface.co/datasets/open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b/blob/main/results_2023-10-28T05-32-33.745725.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the results and the "latest" split for each eval):

```python
{
    "all": {
        "em": 0.0019924496644295304,
        "em_stderr": 0.00045666764626669333,
        "f1": 0.04375419463087258,
        "f1_stderr": 0.0012232801051450955,
        "acc": 0.2844681550025042,
        "acc_stderr": 0.007722228058459302
    },
    "harness|drop|3": {
        "em": 0.0019924496644295304,
        "em_stderr": 0.00045666764626669333,
        "f1": 0.04375419463087258,
        "f1_stderr": 0.0012232801051450955
    },
    "harness|gsm8k|5": {
        "acc": 0.003032600454890068,
        "acc_stderr": 0.0015145735612245499
    },
    "harness|winogrande|5": {
        "acc": 0.5659037095501184,
        "acc_stderr": 0.013929882555694054
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
open-llm-leaderboard
Summary of the original information

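Dataset overview

Dataset introduction

This dataset was created automatically during the evaluation run of the model PY007/TinyLlama-1.1B-intermediate-step-240k-503b on the Open LLM Leaderboard.

Dataset structure

  • The dataset contains 64 configurations, each corresponding to one evaluated task.
  • The dataset was created from 2 runs. Each run appears as a separate split in every configuration it covers, and the split is named after the run's timestamp.
  • The "train" split always points to the latest results.
  • An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.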
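The split names and result file names that appear in this card both look like simple rewrites of the run timestamp. The sketch below reproduces that pattern. The helper names `split_name` and `results_filename` are hypothetical, and the mapping is inferred from the names visible on this page rather than taken from a documented specification.

```python
import re


def split_name(run_timestamp: str) -> str:
    """Rewrite a run timestamp into the split name used in the config listing.

    Assumption: '-' and ':' become '_', everything else is kept as-is.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


def results_filename(run_timestamp: str, ext: str = "json") -> str:
    """Build the aggregated-results file name for a run.

    Assumption: only ':' is replaced (by '-'), as in the file names on this card.
    """
    return f"results_{run_timestamp.replace(':', '-')}.{ext}"


if __name__ == "__main__":
    ts = "2023-10-28T05:32:33.745725"
    print(split_name(ts))
    print(results_filename(ts))
    print(results_filename(ts, ext="parquet"))
```

Keeping the two rewrites in separate functions makes it clear that the split name and the file name use different separators for the same timestamp.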

Data loading example

    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b",
        "harness_winogrande_5",
        split="train")

Latest results

The latest results, from run 2023-10-28T05:32:33.745725:

    {
        "all": {
            "em": 0.0019924496644295304,
            "em_stderr": 0.00045666764626669333,
            "f1": 0.04375419463087258,
            "f1_stderr": 0.0012232801051450955,
            "acc": 0.2844681550025042,
            "acc_stderr": 0.007722228058459302
        },
        "harness|drop|3": {
            "em": 0.0019924496644295304,
            "em_stderr": 0.00045666764626669333,
            "f1": 0.04375419463087258,
            "f1_stderr": 0.0012232801051450955
        },
        "harness|gsm8k|5": {
            "acc": 0.003032600454890068,
            "acc_stderr": 0.0015145735612245499
        },
        "harness|winogrande|5": {
            "acc": 0.5659037095501184,
            "acc_stderr": 0.013929882555694054
        }
    }
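The numbers above suggest that each entry under "all" is the plain arithmetic mean of that metric over the tasks that report it. A minimal sketch that checks this against the published values follows. The helper name `mean_over_tasks` is hypothetical, and the averaging rule is inferred from these numbers, not from leaderboard documentation.

```python
import math
from statistics import fmean

# Values copied from the latest-results block of this card.
latest = {
    "all": {
        "em": 0.0019924496644295304,
        "f1": 0.04375419463087258,
        "acc": 0.2844681550025042,
        "acc_stderr": 0.007722228058459302,
    },
    "harness|drop|3": {
        "em": 0.0019924496644295304,
        "f1": 0.04375419463087258,
    },
    "harness|gsm8k|5": {
        "acc": 0.003032600454890068,
        "acc_stderr": 0.0015145735612245499,
    },
    "harness|winogrande|5": {
        "acc": 0.5659037095501184,
        "acc_stderr": 0.013929882555694054,
    },
}


def mean_over_tasks(results: dict, metric: str) -> float:
    """Average a metric over every task entry (excluding 'all') that reports it."""
    values = [
        metrics[metric]
        for task, metrics in results.items()
        if task != "all" and metric in metrics
    ]
    return fmean(values)


if __name__ == "__main__":
    for metric in ("em", "f1", "acc", "acc_stderr"):
        recomputed = mean_over_tasks(latest, metric)
        matches = math.isclose(recomputed, latest["all"][metric], rel_tol=1e-9)
        print(metric, recomputed, "matches 'all':", matches)
```

One consequence worth keeping in mind: the aggregate "acc" of about 0.284 averages GSM8K and Winogrande only, so it is not comparable to an aggregate computed over a different task set.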

Configuration details

Some of the configurations and their data file paths:

  • harness_arc_challenge_25
    • Split: 2023_09_18T13_31_42.519724
      • Path: **/details_harness|arc:challenge|25_2023-09-18T13-31-42.519724.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-09-18T13-31-42.519724.parquet
  • harness_drop_3
    • Split: 2023_10_28T05_32_33.745725
      • Path: **/details_harness|drop|3_2023-10-28T05-32-33.745725.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-10-28T05-32-33.745725.parquet
  • harness_gsm8k_5
    • Split: 2023_10_28T05_32_33.745725
      • Path: **/details_harness|gsm8k|5_2023-10-28T05-32-33.745725.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-10-28T05-32-33.745725.parquet
  • harness_hellaswag_10
    • Split: 2023_09_18T13_31_42.519724
      • Path: **/details_harness|hellaswag|10_2023-09-18T13-31-42.519724.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-09-18T13-31-42.519724.parquet
  • harness_hendrycksTest_5
    • Split: 2023_09_18T13_31_42.519724
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-18T13-31-42.519724.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-09-18T13-31-42.519724.parquet
        • ...
    • Split: latest
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-09-18T13-31-42.519724.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-09-18T13-31-42.519724.parquet
        • ...
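The configuration names and path patterns listed above follow a regular scheme relative to the harness task keys (for example `harness|arc:challenge|25`). The sketch below reproduces that scheme. The helper names `config_name` and `details_glob` are hypothetical, and the rules are inferred from the entries on this page only.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to its configuration name.

    Assumption: every character other than letters, digits and '_'
    (here '|', ':' and '-') is replaced by '_'.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def details_glob(task_key: str, run_timestamp: str) -> str:
    """Build the parquet glob pattern for one task and one run.

    Assumption: the task key is kept verbatim and ':' in the timestamp becomes '-'.
    """
    return f"**/details_{task_key}_{run_timestamp.replace(':', '-')}.parquet"


if __name__ == "__main__":
    print(config_name("harness|arc:challenge|25"))
    print(details_glob("harness|drop|3", "2023-10-28T05:32:33.745725"))
```

The aggregate configuration `harness_hendrycksTest_5` does not fit this one-to-one mapping, since it collects the paths of many individual `hendrycksTest-*` tasks under a single name.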
Collected summary
Dataset introduction
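Construction method
As large language model evaluation has expanded, the Open LLM Leaderboard has become a widely used platform for measuring model performance, and this structured evaluation dataset is one of its by-products. The dataset is built by an automated pipeline and records how the model PY007/TinyLlama-1.1B-intermediate-step-240k-503b performs on multiple benchmark tasks. The results of each evaluation run (2 in total) are stored as a separate split named after the run's timestamp, and the 'train' split always points to the latest results. The dataset contains 64 configurations, each corresponding to one evaluation task (such as ARC Challenge, DROP, or GSM8K), plus a 'results' configuration that stores the aggregated metrics, so that model performance can be both tracked in detail and summarized.
Features
The most notable features of this dataset are its structured multi-task layout and its versioned history. The 64 configurations range from commonsense reasoning (such as HellaSwag) to math problems (GSM8K), and each task's evaluation details are stored in parquet format, with metrics such as accuracy and F1. Because splits are named by timestamp, researchers can go back to the results of a specific evaluation run, while the 'train' split is kept pointing at the most recent results. In addition, the 'results' configuration gathers the aggregated metrics across tasks, which gives a direct basis for comparing models and supports reproducibility.
Usage
The dataset can be loaded with the Hugging Face datasets library. For example, to load the evaluation details for the Winogrande task, specify the configuration name 'harness_winogrande_5' and select the 'train' split. To access the results of a specific run, use the split named after its timestamp (for example '2023_10_28T05_32_33.745725'). For aggregated metrics, load the 'results' configuration, which holds the per-task accuracy, standard error, and related values; the same figures are also published as a results JSON file in the repository. This layout supports both tracking one model across runs and comparing models with one another, which is useful for research and reporting on large-model evaluation.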
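The three access patterns described above (latest details, a specific run, aggregated results) can be sketched as follows. This is an illustration under assumptions, not a verified recipe: the helper names `load_kwargs` and `load_details` are hypothetical, and the default split is set to `latest` because that is the split name shown in the configuration listing, whereas the card text mentions "train". Adjust it if loading fails.

```python
REPO_ID = "open-llm-leaderboard/details_PY007__TinyLlama-1.1B-intermediate-step-240k-503b"


def load_kwargs(config: str, split: str = "latest") -> dict:
    """Build keyword arguments for datasets.load_dataset (path, name, split)."""
    return {"path": REPO_ID, "name": config, "split": split}


def load_details(config: str, split: str = "latest"):
    """Download one configuration/split from the Hugging Face Hub.

    The import is done lazily so that the helpers above can be used
    without the `datasets` package or network access.
    """
    from datasets import load_dataset

    return load_dataset(**load_kwargs(config, split))


if __name__ == "__main__":
    # Nothing is downloaded here; these are the arguments each pattern would use.
    print(load_kwargs("harness_winogrande_5"))                                # latest details
    print(load_kwargs("harness_winogrande_5", "2023_10_28T05_32_33.745725"))  # one specific run
    print(load_kwargs("results"))                                             # aggregated metrics
```

Calling `load_details("results")` would then perform the actual download, provided the `datasets` package is installed and the repository is reachable.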
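Background and challenges
Background overview
As large language models have spread across natural language processing, evaluating their capabilities systematically along many dimensions has become a shared concern of academia and industry. Against this background, the Hugging Face team launched the Open LLM Leaderboard in 2023 to compare open-source language models under a standardized evaluation framework. This dataset is the evaluation record of TinyLlama-1.1B-intermediate-step-240k-503b on that leaderboard; the dataset card lists clementine@hf.co as the point of contact. The underlying question is how far a small language model (1.1B parameters) can go across diverse tasks. The dataset records the model's detailed results on ARC, HellaSwag, MMLU, and dozens of other benchmark tasks, providing empirical evidence on the relationship between parameter scale and capability and informing the use of lightweight models in resource-constrained settings.
Current challenges
The first challenge this dataset reflects is the inherent limit of small models on complex reasoning tasks: TinyLlama's accuracy on the GSM8K math reasoning task is only about 0.3%, which highlights the gap between scale and reasoning ability. Second, building the evaluation means integrating heterogeneous tasks. The dataset has to cover very different capabilities, such as commonsense reasoning, knowledge question answering, and arithmetic, so the benchmark design must trade breadth against depth. Third, reproducibility is a technical challenge: results for the same model can fluctuate between runs. The dataset stores one split per run timestamp, which makes such comparisons possible in principle, although the two runs recorded here cover different task sets. How to standardize the evaluation environment so that hardware, framework versions, and other external factors do not affect the results remains an open problem.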
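To put the reported accuracies in context, the standard errors published with them can be turned into rough uncertainty ranges. The sketch below uses a normal approximation with z = 1.96 for an approximate 95% interval. That choice, and the helper name `normal_interval`, are assumptions made for illustration and are not part of the leaderboard's own methodology.

```python
def normal_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate interval acc ± z * stderr, clamped to the valid range [0, 1]."""
    low = max(0.0, acc - z * stderr)
    high = min(1.0, acc + z * stderr)
    return low, high


# Accuracy and standard error as reported in the latest results of this card.
reported = {
    "harness|gsm8k|5": (0.003032600454890068, 0.0015145735612245499),
    "harness|winogrande|5": (0.5659037095501184, 0.013929882555694054),
}

if __name__ == "__main__":
    for task, (acc, stderr) in reported.items():
        low, high = normal_interval(acc, stderr)
        print(f"{task}: acc={acc:.4f}, approx. 95% interval=({low:.4f}, {high:.4f})")
```

Under this approximation the GSM8K interval stays well below 1%, while the Winogrande interval lies above 0.5, the chance level for a two-option task. The approximation is crude for accuracies this close to zero, so the GSM8K bounds should be read as indicative only.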
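Common scenarios
Typical use cases
With large language models developing rapidly, assessing what a model can actually do has become a central concern for both academia and industry. This dataset is produced by the Open LLM Leaderboard evaluation framework and records the fine-grained results of an intermediate TinyLlama-1.1B checkpoint on standard benchmarks such as ARC Challenge, HellaSwag, MMLU, GSM8K, and Winogrande. Researchers can use it to analyze how the model's strengths and weaknesses are distributed across commonsense reasoning, math problem solving, reading comprehension, and knowledge question answering, and so identify where the model needs improvement. Its structured results and reproducible loading procedure provide a standardized reference for comparing models with different architectures and training strategies.
Practical applications
In engineering practice, the dataset is useful beyond academic comparison. Developers can use the detailed results to identify where TinyLlama-1.1B is strong or weak on a given task and let that guide model selection and fine-tuning for downstream applications. For example, when deploying to resource-constrained mobile devices, a good Winogrande score would indicate usable pronoun resolution ability and would favor the model for question answering and text understanding products; the Winogrande accuracy recorded here is about 0.566. In addition, the multi-task evaluation results provide a yardstick for validating efficiency techniques such as model compression and knowledge distillation, helping lightweight models move from the lab into production services.
Derived and related work
According to this summary, the dataset has prompted follow-up work on model evaluation and analysis since its release. Building on its structured results, researchers have developed automated performance analysis tools that visualize how a model evolves along different capability dimensions. Some work further aligns the evaluation data with training logs to relate training steps and data composition to the emergence of specific capabilities. In model comparison studies, the dataset is often used as a baseline, supporting work on parameter-efficient fine-tuning, multi-stage training strategies, and similar directions. Its open data format and loading interface also offer a reusable pattern for building larger cross-model evaluation knowledge bases.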
The above content was collected and summarized by 遇见数据集.