
open-llm-leaderboard/details_openbmb__UltraRM-13b

Source: Hugging Face | Updated: 2023-12-02 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_openbmb__UltraRM-13b
Resource description:
--- pretty_name: Evaluation run of openbmb/UltraRM-13b dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [openbmb/UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b) on the [Open\ \ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\ \nThe dataset is composed of 64 configurations, each one corresponding to one of the\ \ evaluated tasks.\n\nThe dataset has been created from 3 runs. Each run can be\ \ found as a specific split in each configuration, the split being named using the\ \ timestamp of the run. The \"latest\" split always points to the latest results.\n\ \nAn additional configuration \"results\" stores all the aggregated results of the\ \ runs (and is used to compute and display the aggregated metrics on the [Open LLM\ \ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\ \nTo load the details from a run, you can for instance do the following:\n```python\n\ from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_openbmb__UltraRM-13b\"\ ,\n\t\"harness_gsm8k_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\nThese\ \ are the [latest results from run 2023-12-02T13:26:56.823138](https://huggingface.co/datasets/open-llm-leaderboard/details_openbmb__UltraRM-13b/blob/main/results_2023-12-02T13-26-56.823138.json) (note\ \ that there might be results for other tasks in the repo if successive evals didn't\ \ cover the same tasks. 
You find each in the results and the \"latest\" split for\ \ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.0,\n \"\ acc_stderr\": 0.0\n },\n \"harness|gsm8k|5\": {\n \"acc\": 0.0,\n \ \ \"acc_stderr\": 0.0\n }\n}\n```" repo_url: https://huggingface.co/openbmb/UltraRM-13b leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard point_of_contact: clementine@hf.co configs: - config_name: harness_arc_challenge_25 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|arc:challenge|25_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|arc:challenge|25_2023-10-08T20-45-47.827028.parquet' - config_name: harness_drop_3 data_files: - split: 2023_10_24T08_13_56.124311 path: - '**/details_harness|drop|3_2023-10-24T08-13-56.124311.parquet' - split: latest path: - '**/details_harness|drop|3_2023-10-24T08-13-56.124311.parquet' - config_name: harness_gsm8k_5 data_files: - split: 2023_10_24T08_13_56.124311 path: - '**/details_harness|gsm8k|5_2023-10-24T08-13-56.124311.parquet' - split: 2023_12_02T13_26_56.823138 path: - '**/details_harness|gsm8k|5_2023-12-02T13-26-56.823138.parquet' - split: latest path: - '**/details_harness|gsm8k|5_2023-12-02T13-26-56.823138.parquet' - config_name: harness_hellaswag_10 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hellaswag|10_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hellaswag|10_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-08T20-45-47.827028.parquet' - 
'**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_physics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-08T20-45-47.827028.parquet' - 
'**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-08T20-45-47.827028.parquet' - 
'**/details_harness|hendrycksTest-prehistory|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-security_studies|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-anatomy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-astronomy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_biology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-08T20-45-47.827028.parquet' - 
'**/details_harness|hendrycksTest-college_physics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-computer_security|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-econometrics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-global_facts|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-08T20-45-47.827028.parquet' - 
'**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-human_aging|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-international_law|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-management|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-marketing|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-nutrition|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-philosophy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-prehistory|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_law|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-public_relations|5_2023-10-08T20-45-47.827028.parquet' - 
'**/details_harness|hendrycksTest-security_studies|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-sociology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-virology|5_2023-10-08T20-45-47.827028.parquet' - '**/details_harness|hendrycksTest-world_religions|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_abstract_algebra_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-abstract_algebra|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_anatomy_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-anatomy|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_astronomy_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-astronomy|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_business_ethics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-business_ethics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_clinical_knowledge_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-08T20-45-47.827028.parquet' - config_name: 
harness_hendrycksTest_college_biology_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_biology|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_college_chemistry_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_chemistry|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_college_computer_science_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_computer_science|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_college_mathematics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_mathematics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_college_medicine_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_medicine|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_college_physics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-college_physics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-college_physics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_computer_security_5 data_files: - split: 
2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-computer_security|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_conceptual_physics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-conceptual_physics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_econometrics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-econometrics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_electrical_engineering_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-electrical_engineering|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_elementary_mathematics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_formal_logic_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-formal_logic|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_global_facts_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - 
'**/details_harness|hendrycksTest-global_facts|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-global_facts|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_biology_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_biology|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_chemistry_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_computer_science_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_european_history_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_european_history|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_geography_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_geography|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_government_and_politics_5 data_files: - split: 2023_10_08T20_45_47.827028 
path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_macroeconomics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_mathematics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_microeconomics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_physics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_physics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_psychology_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_psychology|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_statistics_5 data_files: 
- split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_statistics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_us_history_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_us_history|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_high_school_world_history_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-high_school_world_history|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_human_aging_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_aging|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_human_sexuality_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-human_sexuality|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_international_law_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-international_law|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-international_law|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_jurisprudence_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - 
'**/details_harness|hendrycksTest-jurisprudence|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-jurisprudence|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_logical_fallacies_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-logical_fallacies|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_machine_learning_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-machine_learning|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_management_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-management|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-management|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_marketing_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-marketing|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_medical_genetics_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-medical_genetics|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_miscellaneous_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-miscellaneous|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - 
'**/details_harness|hendrycksTest-miscellaneous|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_moral_disputes_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_disputes|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_moral_scenarios_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-moral_scenarios|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_nutrition_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-nutrition|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_philosophy_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-philosophy|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_prehistory_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-prehistory|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_professional_accounting_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_accounting|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_professional_law_5 data_files: - 
split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_law|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_professional_medicine_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_medicine|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_professional_psychology_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-professional_psychology|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_public_relations_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-public_relations|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_security_studies_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-security_studies|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_sociology_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-sociology|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-sociology|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_us_foreign_policy_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - 
'**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_virology_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-virology|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-virology|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_hendrycksTest_world_religions_5 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|hendrycksTest-world_religions|5_2023-10-08T20-45-47.827028.parquet' - config_name: harness_truthfulqa_mc_0 data_files: - split: 2023_10_08T20_45_47.827028 path: - '**/details_harness|truthfulqa:mc|0_2023-10-08T20-45-47.827028.parquet' - split: latest path: - '**/details_harness|truthfulqa:mc|0_2023-10-08T20-45-47.827028.parquet' - config_name: harness_winogrande_5 data_files: - split: 2023_10_24T08_13_56.124311 path: - '**/details_harness|winogrande|5_2023-10-24T08-13-56.124311.parquet' - split: latest path: - '**/details_harness|winogrande|5_2023-10-24T08-13-56.124311.parquet' - config_name: results data_files: - split: 2023_10_08T20_45_47.827028 path: - results_2023-10-08T20-45-47.827028.parquet - split: 2023_10_24T08_13_56.124311 path: - results_2023-10-24T08-13-56.124311.parquet - split: 2023_12_02T13_26_56.823138 path: - results_2023-12-02T13-26-56.823138.parquet - split: latest path: - results_2023-12-02T13-26-56.823138.parquet --- # Dataset Card for Evaluation run of openbmb/UltraRM-13b ## Dataset Description - **Homepage:** - **Repository:** https://huggingface.co/openbmb/UltraRM-13b - **Paper:** - **Leaderboard:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard - **Point of Contact:** clementine@hf.co ### Dataset 
Summary

Dataset automatically created during the evaluation run of model [openbmb/UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The dataset is composed of 64 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 3 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.

An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_openbmb__UltraRM-13b",
    "harness_gsm8k_5",
    split="train",
)
```

## Latest results

These are the [latest results from run 2023-12-02T13:26:56.823138](https://huggingface.co/datasets/open-llm-leaderboard/details_openbmb__UltraRM-13b/blob/main/results_2023-12-02T13-26-56.823138.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split for each eval):

```python
{
    "all": {
        "acc": 0.0,
        "acc_stderr": 0.0
    },
    "harness|gsm8k|5": {
        "acc": 0.0,
        "acc_stderr": 0.0
    }
}
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
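Provider:
open-llm-leaderboard
Summary of Original Information

Dataset Overview

This dataset was created automatically during the evaluation run of the model openbmb/UltraRM-13b on the Open LLM Leaderboard.

Dataset Composition

The dataset consists of 64 configurations, each corresponding to one evaluation task. It was built from 3 runs; each run appears as a separate split in each configuration, and the split is named after the run's timestamp. The "train" split is described as always pointing to the latest results (the configuration list on this page labels that split "latest").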
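
The timestamp-based split naming can be sketched as follows. Comparing the run timestamp 2023-12-02T13:26:56.823138 with the split name 2023_12_02T13_26_56.823138 on this page suggests that hyphens and colons become underscores. That rule is inferred from the listed names rather than taken from the leaderboard's code, and `split_name_for_run` is a hypothetical helper.

```python
def split_name_for_run(run_timestamp: str) -> str:
    """Map a run timestamp to the split-name pattern seen in the configuration list."""
    # Assumption: '-' and ':' are replaced by '_'; the 'T' separator and the
    # fractional seconds are kept unchanged.
    return run_timestamp.replace("-", "_").replace(":", "_")


print(split_name_for_run("2023-12-02T13:26:56.823138"))
```

A helper like this is only useful for building split names programmatically; the authoritative names are the ones listed per configuration.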

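Additional Configuration

An additional configuration, "results", stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.

Data Loading Example

The following example code loads the details of a specific run:

    from datasets import load_dataset

    data = load_dataset(
        "open-llm-leaderboard/details_openbmb__UltraRM-13b",
        "harness_gsm8k_5",
        split="train",
    )

Latest Results

The latest results, from run 2023-12-02T13:26:56.823138, are:

    {
        "all": {
            "acc": 0.0,
            "acc_stderr": 0.0
        },
        "harness|gsm8k|5": {
            "acc": 0.0,
            "acc_stderr": 0.0
        }
    }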
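
An accuracy of 0.0 paired with a standard error of 0.0 is what one would expect when no item is scored as correct. A minimal sketch, assuming the standard error is the sample standard deviation of per-item 0/1 scores divided by the square root of the item count; the evaluation harness's exact formula is not stated on this page, and `accuracy_and_stderr` is a hypothetical helper:

```python
import math
from typing import Sequence, Tuple


def accuracy_and_stderr(correct: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, standard error of the mean) for per-item 0/1 scores."""
    n = len(correct)
    acc = sum(correct) / n
    if n < 2:
        # The sample variance is undefined for a single item; report 0.0.
        return acc, 0.0
    # Sample variance with the n - 1 denominator (an assumption, see lead-in).
    variance = sum((c - acc) ** 2 for c in correct) / (n - 1)
    return acc, math.sqrt(variance / n)


# Ten hypothetical items, none answered correctly.
print(accuracy_and_stderr([0] * 10))
```

With identical scores on every item the variance vanishes, so a zero standard error here reflects uniform failure, not a precise estimate of a nonzero rate.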

Configuration Details

Configuration List

  • harness_arc_challenge_25

    • Split: 2023_10_08T20_45_47.827028
      • Path: **/details_harness|arc:challenge|25_2023-10-08T20-45-47.827028.parquet
    • Split: latest
      • Path: **/details_harness|arc:challenge|25_2023-10-08T20-45-47.827028.parquet
  • harness_drop_3

    • Split: 2023_10_24T08_13_56.124311
      • Path: **/details_harness|drop|3_2023-10-24T08-13-56.124311.parquet
    • Split: latest
      • Path: **/details_harness|drop|3_2023-10-24T08-13-56.124311.parquet
  • harness_gsm8k_5

    • Split: 2023_10_24T08_13_56.124311
      • Path: **/details_harness|gsm8k|5_2023-10-24T08-13-56.124311.parquet
    • Split: 2023_12_02T13_26_56.823138
      • Path: **/details_harness|gsm8k|5_2023-12-02T13-26-56.823138.parquet
    • Split: latest
      • Path: **/details_harness|gsm8k|5_2023-12-02T13-26-56.823138.parquet
  • harness_hellaswag_10

    • Split: 2023_10_08T20_45_47.827028
      • Path: **/details_harness|hellaswag|10_2023-10-08T20-45-47.827028.parquet
    • Split: latest
      • Path: **/details_harness|hellaswag|10_2023-10-08T20-45-47.827028.parquet
  • harness_hendrycksTest_5

    • Split: 2023_10_08T20_45_47.827028
      • Paths:
        • **/details_harness|hendrycksTest-abstract_algebra|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-anatomy|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-astronomy|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-business_ethics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-clinical_knowledge|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-college_biology|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-college_chemistry|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-college_computer_science|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-college_mathematics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-college_medicine|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-college_physics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-computer_security|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-conceptual_physics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-econometrics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-electrical_engineering|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-elementary_mathematics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-formal_logic|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-global_facts|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_biology|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_chemistry|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_computer_science|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_european_history|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_geography|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_mathematics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_microeconomics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_physics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_psychology|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_statistics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_us_history|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-high_school_world_history|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-human_aging|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-human_sexuality|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-international_law|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-jurisprudence|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-logical_fallacies|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-machine_learning|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-management|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-marketing|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-medical_genetics|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-miscellaneous|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-moral_disputes|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-moral_scenarios|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-nutrition|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-philosophy|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-prehistory|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-professional_accounting|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-professional_law|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-professional_medicine|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-professional_psychology|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-public_relations|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-security_studies|5_2023-10-08T20-45-47.827028.parquet
        • **/details_harness|hendrycksTest-sociology|5_2023-10-08T20-45-47.827028.parquet
        • `**/details_harness|hendrycksTest-us_foreign_policy|5_2023-10-08
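
The configuration names and path patterns listed above follow a visible convention, which can be sketched as below. The rules are inferred from the listed entries (for example, task `arc:challenge` with 25 shots appears as `harness_arc_challenge_25`), not from the leaderboard's code, and both helper names are hypothetical.

```python
import re


def config_name(task: str, n_shot: int) -> str:
    """Build a configuration name such as 'harness_arc_challenge_25'."""
    # Assumption: ':' and '-' in the task name are replaced by '_'.
    return "harness_" + re.sub(r"[:\-]", "_", task) + f"_{n_shot}"


def details_glob(task: str, n_shot: int, file_stamp: str) -> str:
    """Build the parquet glob for one task and one run's file timestamp."""
    # file_stamp is the hyphenated form used in file names,
    # e.g. '2023-12-02T13-26-56.823138'.
    return f"**/details_harness|{task}|{n_shot}_{file_stamp}.parquet"


print(config_name("arc:challenge", 25))
print(details_glob("gsm8k", 5, "2023-12-02T13-26-56.823138"))
```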
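Collected Summary
Dataset Introduction
Construction
The Open LLM Leaderboard is a widely used platform that gives large language model evaluation a standardized framework for side-by-side comparison. This dataset was generated automatically to record the evaluation of openbmb/UltraRM-13b on the leaderboard. It is built from repeated evaluation runs, currently three; each run corresponds to a separate split named after its timestamp. The dataset comprises 64 configurations, each corresponding to one evaluation task, such as GSM8K or ARC-Challenge. An additional configuration named "results" stores the aggregated results of all runs and is used to compute and display the model's aggregate metrics on the leaderboard. The "train" split is described as pointing to the latest run, so users get the most recent evaluation results.
Usage
Researchers can load the dataset with the Hugging Face datasets library: call load_dataset with the dataset name "open-llm-leaderboard/details_openbmb__UltraRM-13b", a target configuration (for example "harness_gsm8k_5"), and a split (for example "train") to get the evaluation details for that task. Split names correspond directly to run timestamps, so either historical or latest data can be selected. To see how results on one task changed across runs, iterate over the timestamp splits of the same configuration. For an overall view, load the "results" configuration to get the aggregated metrics. This supports analysis from per-task details up to overall trends, and suits model iteration analysis, benchmark reproduction, and empirical validation in research.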
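
Iterating over the timestamp splits of one configuration could look like the sketch below. `load_runs` is a hypothetical helper; the split names are the ones listed for `harness_gsm8k_5` on this page. The loader is injectable so the function can be exercised without network access; by default it falls back to `datasets.load_dataset`, which needs the `datasets` package and a connection to the Hub.

```python
from typing import Callable, Dict, Iterable, Optional

REPO_ID = "open-llm-leaderboard/details_openbmb__UltraRM-13b"


def load_runs(
    config: str,
    splits: Iterable[str],
    loader: Optional[Callable] = None,
) -> Dict[str, object]:
    """Load several timestamp splits of one configuration, keyed by split name."""
    if loader is None:
        # Imported lazily so the helper can be tested with a stand-in loader.
        from datasets import load_dataset

        loader = load_dataset
    return {split: loader(REPO_ID, config, split=split) for split in splits}


# Example call (downloads data when run with the default loader):
# runs = load_runs(
#     "harness_gsm8k_5",
#     ["2023_10_24T08_13_56.124311", "2023_12_02T13_26_56.823138"],
# )
```

The same helper can be pointed at the "results" configuration to compare aggregated metrics across runs.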
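Background and Challenges
Background Overview
With the rapid development of large language models (LLMs) in natural language processing, evaluating them systematically and fairly has become a shared concern of academia and industry. The Open LLM Leaderboard, launched by Hugging Face in 2023, evaluates open-source models along several dimensions with standardized test sets (such as GSM8K, HellaSwag, and ARC) to promote transparency and comparability. openbmb/UltraRM-13b is a 13B-parameter reward model; its evaluation dataset was generated automatically by the Open LLM Leaderboard and covers 64 task configurations spanning mathematical reasoning, commonsense understanding, and professional knowledge. The dataset was built from multiple runs between October and December 2023 by the Hugging Face team (point of contact: clementine@hf.co). Its core purpose is to quantify how UltraRM-13b performs across diverse tasks and to give the community reproducible benchmark results, providing standardized data for fair comparison of reward models.
Current Challenges
The first challenge is a domain one: evaluating large language models is complex and requires test sets with broad coverage and a reasonable difficulty gradient, yet UltraRM-13b's accuracy on an existing task such as GSM8K is 0.0, which points to the reward model's weakness on core abilities such as mathematical reasoning and underlines how hard it is to match evaluation tasks to model capabilities. On the construction side, the dataset must merge results from multiple runs (for example, those with timestamps 2023-10-08 and 2023-12-02); the runs do not cover the same tasks, which fragments the data, and each configuration must maintain several timestamped splits, which complicates data management and version control. In addition, automatically generated evaluation results may fluctuate because of randomness, so ensuring statistical stability and comparability across runs remains a key challenge.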
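
The uneven task coverage across runs can be made concrete by inverting the configuration-to-split listing. The sketch below uses only the non-MMLU configurations and timestamp splits shown on this page (the "latest" alias and the hendrycksTest configurations are left out for brevity); `configs_by_run` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# Timestamp splits per configuration, copied from the configuration list.
CONFIG_SPLITS: Dict[str, List[str]] = {
    "harness_arc_challenge_25": ["2023_10_08T20_45_47.827028"],
    "harness_drop_3": ["2023_10_24T08_13_56.124311"],
    "harness_gsm8k_5": [
        "2023_10_24T08_13_56.124311",
        "2023_12_02T13_26_56.823138",
    ],
    "harness_hellaswag_10": ["2023_10_08T20_45_47.827028"],
    "harness_truthfulqa_mc_0": ["2023_10_08T20_45_47.827028"],
    "harness_winogrande_5": ["2023_10_24T08_13_56.124311"],
}


def configs_by_run(config_splits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group configuration names by the run (timestamp split) that produced them."""
    by_run = defaultdict(list)
    for config, splits in config_splits.items():
        for split in splits:
            by_run[split].append(config)
    return {run: sorted(names) for run, names in sorted(by_run.items())}


for run, names in configs_by_run(CONFIG_SPLITS).items():
    print(run, names)
```

In this subset only `harness_gsm8k_5` appears in more than one run, so it is the only listed task for which results can be compared across runs.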
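Common Scenarios
Typical Use Case
In academic LLM evaluation, this dataset serves as the Open LLM Leaderboard's evaluation record and holds UltraRM-13b's metrics on several standardized benchmark tasks. Its typical use is to reproduce and verify the model's performance on reasoning, knowledge question answering, and commonsense understanding: researchers load a specific configuration (such as harness_gsm8k_5) to get fine-grained results and use the timestamped splits to track how results change across evaluation runs.
Academic Problems Addressed
The dataset addresses the lack of reproducibility and of side-by-side comparison in large-model evaluation. By storing the evaluation logs of 64 task configurations (covering ARC, HellaSwag, MMLU, and others) in a structured form, it lets researchers systematically analyze how far UltraRM-13b generalizes across difficulty levels and domains. It also provides quantitative evidence for studying the bottlenecks of reward models in preference alignment, supporting a more empirical approach to alignment work.
Practical Applications
In practice, the dataset supports model selection and deployment decisions. Development teams can use the recorded accuracy and standard error (few-shot results, for example 5-shot for GSM8K as the configuration name indicates) to gauge UltraRM-13b's reliability before using it in dialogue systems or content generation. As a standardized benchmark record, it also lowers the engineering cost of comparing in-house models against open-source ones and helps validate reward models in vertical domains such as customer service and educational tutoring.
Recent Research on the Dataset
Latest Research Directions
In large language model (LLM) evaluation, the Open LLM Leaderboard has become an important benchmark platform. For UltraRM-13b, recent work focuses on standardized, multi-dimensional, reproducible evaluation: tasks such as ARC, HellaSwag, GSM8K, and MMLU (which covers 57 subjects) systematically test reasoning, commonsense, and professional knowledge. A current focus is using such fine-grained evaluation datasets to expose a model's weaknesses and strengths on specific subtasks, which in turn guides the optimization of reward models. This trend advances model alignment techniques and gives the community a transparent basis for side-by-side comparison, informing how LLMs are deployed and iterated in real-world settings.
The content above was collected and summarized by 遇见数据集.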