open-llm-leaderboard-old/details_upaya07__Birbal-7B-V1
Source: Hugging Face | Updated: 2023-12-19 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_upaya07__Birbal-7B-V1
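
The repository ID in the link above, combined with the run timestamps used throughout this card, appears to determine both the split names and the parquet file names listed under `configs`. The following is a minimal sketch of that naming convention, inferred only from the entries in this card. The helper names (`split_name`, `parquet_pattern`, `latest_split`) are hypothetical and not part of any library, and the commented-out `load_dataset` call assumes network access and that the `harness_gsm8k_5` configuration is still published under this ID.

```python
from datetime import datetime

# Repository ID taken from the download link above.
REPO_ID = "open-llm-leaderboard-old/details_upaya07__Birbal-7B-V1"


def split_name(run_timestamp: str) -> str:
    """Turn an ISO run timestamp into the split name used in this card."""
    datetime.fromisoformat(run_timestamp)  # raises ValueError if malformed
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_pattern(task: str, run_timestamp: str) -> str:
    """Build the glob pattern for one task's details file.

    `task` is the harness task key, for example 'harness|gsm8k|5'.
    """
    datetime.fromisoformat(run_timestamp)  # raises ValueError if malformed
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task}_{file_stamp}.parquet"


def latest_split(run_timestamps: list[str]) -> str:
    """Return the split name of the most recent run."""
    newest = max(run_timestamps, key=datetime.fromisoformat)
    return split_name(newest)


runs = ["2023-12-18T19:22:58.191113", "2023-12-19T05:40:57.697010"]
print(latest_split(runs))
print(parquet_pattern("harness|gsm8k|5", runs[0]))

# Possible usage with the `datasets` library (not executed here):
# from datasets import load_dataset
# data = load_dataset(REPO_ID, "harness_gsm8k_5", split=latest_split(runs))
```

Deriving the names from the timestamp, rather than hard-coding them, keeps a script usable if further runs are added to the repository. The `latest` split defined in each configuration serves the same purpose without any computation.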
Resource summary:
---
pretty_name: Evaluation run of upaya07/Birbal-7B-V1
dataset_summary: "Dataset automatically created during the evaluation run of model\
\ [upaya07/Birbal-7B-V1](https://huggingface.co/upaya07/Birbal-7B-V1) on the [Open\
\ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\
\nThe dataset is composed of 63 configurations, each one corresponding to one of the\
\ evaluated tasks.\n\nThe dataset has been created from 2 runs. Each run can be\
\ found as a specific split in each configuration, the split being named using the\
\ timestamp of the run. The \"latest\" split always points to the latest results.\n\
\nAn additional configuration \"results\" stores all the aggregated results of the\
\ run (and is used to compute and display the aggregated metrics on the [Open LLM\
\ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\
\nTo load the details from a run, you can for instance do the following:\n```python\n\
from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_upaya07__Birbal-7B-V1\"\
,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\
These are the [latest results from run 2023-12-19T05:40:57.697010](https://huggingface.co/datasets/open-llm-leaderboard/details_upaya07__Birbal-7B-V1/blob/main/results_2023-12-19T05-40-57.697010.json) (note\
\ that there might be results for other tasks in the repo if successive evals didn't\
\ cover the same tasks. You can find each in the results and the \"latest\" split for\
\ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6338717978820942,\n\
\ \"acc_stderr\": 0.032354410720897495,\n \"acc_norm\": 0.6393367450479324,\n\
\ \"acc_norm_stderr\": 0.033002421961828204,\n \"mc1\": 0.3047735618115055,\n\
\ \"mc1_stderr\": 0.016114124156882455,\n \"mc2\": 0.4534206690460975,\n\
\ \"mc2_stderr\": 0.014385152704042822\n },\n \"harness|arc:challenge|25\"\
: {\n \"acc\": 0.5802047781569966,\n \"acc_stderr\": 0.014422181226303028,\n\
\ \"acc_norm\": 0.6279863481228669,\n \"acc_norm_stderr\": 0.014124597881844465\n\
\ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6511651065524796,\n\
\ \"acc_stderr\": 0.004756275875018264,\n \"acc_norm\": 0.8483369846644094,\n\
\ \"acc_norm_stderr\": 0.003579608743506612\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\
: {\n \"acc\": 0.32,\n \"acc_stderr\": 0.04688261722621504,\n \
\ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.04688261722621504\n \
\ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.5925925925925926,\n\
\ \"acc_stderr\": 0.04244633238353227,\n \"acc_norm\": 0.5925925925925926,\n\
\ \"acc_norm_stderr\": 0.04244633238353227\n },\n \"harness|hendrycksTest-astronomy|5\"\
: {\n \"acc\": 0.7368421052631579,\n \"acc_stderr\": 0.03583496176361073,\n\
\ \"acc_norm\": 0.7368421052631579,\n \"acc_norm_stderr\": 0.03583496176361073\n\
\ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.57,\n\
\ \"acc_stderr\": 0.04975698519562428,\n \"acc_norm\": 0.57,\n \
\ \"acc_norm_stderr\": 0.04975698519562428\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\
: {\n \"acc\": 0.660377358490566,\n \"acc_stderr\": 0.02914690474779833,\n\
\ \"acc_norm\": 0.660377358490566,\n \"acc_norm_stderr\": 0.02914690474779833\n\
\ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7569444444444444,\n\
\ \"acc_stderr\": 0.0358687928008034,\n \"acc_norm\": 0.7569444444444444,\n\
\ \"acc_norm_stderr\": 0.0358687928008034\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.49,\n \"acc_stderr\": 0.05024183937956911,\n \
\ \"acc_norm\": 0.49,\n \"acc_norm_stderr\": 0.05024183937956911\n \
\ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\
: 0.52,\n \"acc_stderr\": 0.050211673156867795,\n \"acc_norm\": 0.52,\n\
\ \"acc_norm_stderr\": 0.050211673156867795\n },\n \"harness|hendrycksTest-college_mathematics|5\"\
: {\n \"acc\": 0.41,\n \"acc_stderr\": 0.04943110704237101,\n \
\ \"acc_norm\": 0.41,\n \"acc_norm_stderr\": 0.04943110704237101\n \
\ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.5953757225433526,\n\
\ \"acc_stderr\": 0.03742461193887248,\n \"acc_norm\": 0.5953757225433526,\n\
\ \"acc_norm_stderr\": 0.03742461193887248\n },\n \"harness|hendrycksTest-college_physics|5\"\
: {\n \"acc\": 0.39215686274509803,\n \"acc_stderr\": 0.04858083574266346,\n\
\ \"acc_norm\": 0.39215686274509803,\n \"acc_norm_stderr\": 0.04858083574266346\n\
\ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\
\ 0.72,\n \"acc_stderr\": 0.04512608598542128,\n \"acc_norm\": 0.72,\n\
\ \"acc_norm_stderr\": 0.04512608598542128\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\
: {\n \"acc\": 0.5702127659574469,\n \"acc_stderr\": 0.03236214467715564,\n\
\ \"acc_norm\": 0.5702127659574469,\n \"acc_norm_stderr\": 0.03236214467715564\n\
\ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5,\n\
\ \"acc_stderr\": 0.047036043419179864,\n \"acc_norm\": 0.5,\n \
\ \"acc_norm_stderr\": 0.047036043419179864\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\
: {\n \"acc\": 0.5241379310344828,\n \"acc_stderr\": 0.0416180850350153,\n\
\ \"acc_norm\": 0.5241379310344828,\n \"acc_norm_stderr\": 0.0416180850350153\n\
\ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\
: 0.3835978835978836,\n \"acc_stderr\": 0.025043757318520196,\n \"\
acc_norm\": 0.3835978835978836,\n \"acc_norm_stderr\": 0.025043757318520196\n\
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.3968253968253968,\n\
\ \"acc_stderr\": 0.0437588849272706,\n \"acc_norm\": 0.3968253968253968,\n\
\ \"acc_norm_stderr\": 0.0437588849272706\n },\n \"harness|hendrycksTest-global_facts|5\"\
: {\n \"acc\": 0.3,\n \"acc_stderr\": 0.046056618647183814,\n \
\ \"acc_norm\": 0.3,\n \"acc_norm_stderr\": 0.046056618647183814\n \
\ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7709677419354839,\n\
\ \"acc_stderr\": 0.023904914311782648,\n \"acc_norm\": 0.7709677419354839,\n\
\ \"acc_norm_stderr\": 0.023904914311782648\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\
: {\n \"acc\": 0.46798029556650245,\n \"acc_stderr\": 0.03510766597959215,\n\
\ \"acc_norm\": 0.46798029556650245,\n \"acc_norm_stderr\": 0.03510766597959215\n\
\ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \
\ \"acc\": 0.71,\n \"acc_stderr\": 0.045604802157206845,\n \"acc_norm\"\
: 0.71,\n \"acc_norm_stderr\": 0.045604802157206845\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\
: {\n \"acc\": 0.7575757575757576,\n \"acc_stderr\": 0.03346409881055953,\n\
\ \"acc_norm\": 0.7575757575757576,\n \"acc_norm_stderr\": 0.03346409881055953\n\
\ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\
: 0.7777777777777778,\n \"acc_stderr\": 0.029620227874790482,\n \"\
acc_norm\": 0.7777777777777778,\n \"acc_norm_stderr\": 0.029620227874790482\n\
\ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\
\ \"acc\": 0.8808290155440415,\n \"acc_stderr\": 0.02338193534812143,\n\
\ \"acc_norm\": 0.8808290155440415,\n \"acc_norm_stderr\": 0.02338193534812143\n\
\ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \
\ \"acc\": 0.6102564102564103,\n \"acc_stderr\": 0.024726967886647078,\n\
\ \"acc_norm\": 0.6102564102564103,\n \"acc_norm_stderr\": 0.024726967886647078\n\
\ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\
acc\": 0.32222222222222224,\n \"acc_stderr\": 0.028493465091028593,\n \
\ \"acc_norm\": 0.32222222222222224,\n \"acc_norm_stderr\": 0.028493465091028593\n\
\ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \
\ \"acc\": 0.6680672268907563,\n \"acc_stderr\": 0.03058869701378364,\n \
\ \"acc_norm\": 0.6680672268907563,\n \"acc_norm_stderr\": 0.03058869701378364\n\
\ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\
: 0.32450331125827814,\n \"acc_stderr\": 0.03822746937658752,\n \"\
acc_norm\": 0.32450331125827814,\n \"acc_norm_stderr\": 0.03822746937658752\n\
\ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\
: 0.8165137614678899,\n \"acc_stderr\": 0.016595259710399313,\n \"\
acc_norm\": 0.8165137614678899,\n \"acc_norm_stderr\": 0.016595259710399313\n\
\ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\
: 0.5787037037037037,\n \"acc_stderr\": 0.03367462138896078,\n \"\
acc_norm\": 0.5787037037037037,\n \"acc_norm_stderr\": 0.03367462138896078\n\
\ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\
: 0.7843137254901961,\n \"acc_stderr\": 0.028867431449849316,\n \"\
acc_norm\": 0.7843137254901961,\n \"acc_norm_stderr\": 0.028867431449849316\n\
\ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\
acc\": 0.7721518987341772,\n \"acc_stderr\": 0.02730348459906943,\n \
\ \"acc_norm\": 0.7721518987341772,\n \"acc_norm_stderr\": 0.02730348459906943\n\
\ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.726457399103139,\n\
\ \"acc_stderr\": 0.029918586707798827,\n \"acc_norm\": 0.726457399103139,\n\
\ \"acc_norm_stderr\": 0.029918586707798827\n },\n \"harness|hendrycksTest-human_sexuality|5\"\
: {\n \"acc\": 0.7480916030534351,\n \"acc_stderr\": 0.03807387116306085,\n\
\ \"acc_norm\": 0.7480916030534351,\n \"acc_norm_stderr\": 0.03807387116306085\n\
\ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\
\ 0.7933884297520661,\n \"acc_stderr\": 0.03695980128098825,\n \"\
acc_norm\": 0.7933884297520661,\n \"acc_norm_stderr\": 0.03695980128098825\n\
\ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7592592592592593,\n\
\ \"acc_stderr\": 0.04133119440243839,\n \"acc_norm\": 0.7592592592592593,\n\
\ \"acc_norm_stderr\": 0.04133119440243839\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\
: {\n \"acc\": 0.7484662576687117,\n \"acc_stderr\": 0.03408997886857529,\n\
\ \"acc_norm\": 0.7484662576687117,\n \"acc_norm_stderr\": 0.03408997886857529\n\
\ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.48214285714285715,\n\
\ \"acc_stderr\": 0.047427623612430116,\n \"acc_norm\": 0.48214285714285715,\n\
\ \"acc_norm_stderr\": 0.047427623612430116\n },\n \"harness|hendrycksTest-management|5\"\
: {\n \"acc\": 0.8058252427184466,\n \"acc_stderr\": 0.03916667762822584,\n\
\ \"acc_norm\": 0.8058252427184466,\n \"acc_norm_stderr\": 0.03916667762822584\n\
\ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8589743589743589,\n\
\ \"acc_stderr\": 0.02280138253459754,\n \"acc_norm\": 0.8589743589743589,\n\
\ \"acc_norm_stderr\": 0.02280138253459754\n },\n \"harness|hendrycksTest-medical_genetics|5\"\
: {\n \"acc\": 0.75,\n \"acc_stderr\": 0.04351941398892446,\n \
\ \"acc_norm\": 0.75,\n \"acc_norm_stderr\": 0.04351941398892446\n \
\ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.7931034482758621,\n\
\ \"acc_stderr\": 0.01448565604166918,\n \"acc_norm\": 0.7931034482758621,\n\
\ \"acc_norm_stderr\": 0.01448565604166918\n },\n \"harness|hendrycksTest-moral_disputes|5\"\
: {\n \"acc\": 0.7312138728323699,\n \"acc_stderr\": 0.023868003262500107,\n\
\ \"acc_norm\": 0.7312138728323699,\n \"acc_norm_stderr\": 0.023868003262500107\n\
\ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.37318435754189944,\n\
\ \"acc_stderr\": 0.016175692013381968,\n \"acc_norm\": 0.37318435754189944,\n\
\ \"acc_norm_stderr\": 0.016175692013381968\n },\n \"harness|hendrycksTest-nutrition|5\"\
: {\n \"acc\": 0.7254901960784313,\n \"acc_stderr\": 0.025553169991826514,\n\
\ \"acc_norm\": 0.7254901960784313,\n \"acc_norm_stderr\": 0.025553169991826514\n\
\ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7009646302250804,\n\
\ \"acc_stderr\": 0.02600330111788514,\n \"acc_norm\": 0.7009646302250804,\n\
\ \"acc_norm_stderr\": 0.02600330111788514\n },\n \"harness|hendrycksTest-prehistory|5\"\
: {\n \"acc\": 0.7098765432098766,\n \"acc_stderr\": 0.025251173936495036,\n\
\ \"acc_norm\": 0.7098765432098766,\n \"acc_norm_stderr\": 0.025251173936495036\n\
\ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\
acc\": 0.48226950354609927,\n \"acc_stderr\": 0.02980873964223777,\n \
\ \"acc_norm\": 0.48226950354609927,\n \"acc_norm_stderr\": 0.02980873964223777\n\
\ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.49022164276401564,\n\
\ \"acc_stderr\": 0.012767793787729336,\n \"acc_norm\": 0.49022164276401564,\n\
\ \"acc_norm_stderr\": 0.012767793787729336\n },\n \"harness|hendrycksTest-professional_medicine|5\"\
: {\n \"acc\": 0.6875,\n \"acc_stderr\": 0.02815637344037142,\n \
\ \"acc_norm\": 0.6875,\n \"acc_norm_stderr\": 0.02815637344037142\n\
\ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\
acc\": 0.6879084967320261,\n \"acc_stderr\": 0.018745011201277657,\n \
\ \"acc_norm\": 0.6879084967320261,\n \"acc_norm_stderr\": 0.018745011201277657\n\
\ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6363636363636364,\n\
\ \"acc_stderr\": 0.04607582090719976,\n \"acc_norm\": 0.6363636363636364,\n\
\ \"acc_norm_stderr\": 0.04607582090719976\n },\n \"harness|hendrycksTest-security_studies|5\"\
: {\n \"acc\": 0.7387755102040816,\n \"acc_stderr\": 0.028123429335142777,\n\
\ \"acc_norm\": 0.7387755102040816,\n \"acc_norm_stderr\": 0.028123429335142777\n\
\ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.845771144278607,\n\
\ \"acc_stderr\": 0.025538433368578337,\n \"acc_norm\": 0.845771144278607,\n\
\ \"acc_norm_stderr\": 0.025538433368578337\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\
: {\n \"acc\": 0.88,\n \"acc_stderr\": 0.03265986323710906,\n \
\ \"acc_norm\": 0.88,\n \"acc_norm_stderr\": 0.03265986323710906\n \
\ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5662650602409639,\n\
\ \"acc_stderr\": 0.03858158940685516,\n \"acc_norm\": 0.5662650602409639,\n\
\ \"acc_norm_stderr\": 0.03858158940685516\n },\n \"harness|hendrycksTest-world_religions|5\"\
: {\n \"acc\": 0.8421052631578947,\n \"acc_stderr\": 0.027966785859160872,\n\
\ \"acc_norm\": 0.8421052631578947,\n \"acc_norm_stderr\": 0.027966785859160872\n\
\ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3047735618115055,\n\
\ \"mc1_stderr\": 0.016114124156882455,\n \"mc2\": 0.4534206690460975,\n\
\ \"mc2_stderr\": 0.014385152704042822\n },\n \"harness|winogrande|5\"\
: {\n \"acc\": 0.7876874506708761,\n \"acc_stderr\": 0.011493384687249789\n\
\ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.4025777103866566,\n \
\ \"acc_stderr\": 0.013508523063663435\n }\n}\n```"
repo_url: https://huggingface.co/upaya07/Birbal-7B-V1
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|arc:challenge|25_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|arc:challenge|25_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|arc:challenge|25_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_gsm8k_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|gsm8k|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|gsm8k|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|gsm8k|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hellaswag_10
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hellaswag|10_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hellaswag|10_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hellaswag|10_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-management|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-virology|5_2023-12-18T19-22-58.191113.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-management|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-virology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-management|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-virology|5_2023-12-19T05-40-57.697010.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_abstract_algebra_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_anatomy_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-anatomy|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_astronomy_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-astronomy|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_business_ethics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_clinical_knowledge_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_college_biology_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_biology|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_college_chemistry_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_college_computer_science_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_college_mathematics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_college_medicine_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_college_physics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_physics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_computer_security_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-computer_security|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_conceptual_physics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_econometrics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-econometrics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_electrical_engineering_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_elementary_mathematics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_formal_logic_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_global_facts_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-global_facts|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_biology_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_chemistry_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_computer_science_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_european_history_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_geography_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_government_and_politics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_macroeconomics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_mathematics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_microeconomics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_physics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_psychology_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_statistics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_us_history_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_high_school_world_history_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_human_aging_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_aging|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_human_sexuality_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_international_law_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-international_law|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-international_law|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-international_law|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_jurisprudence_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_logical_fallacies_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_machine_learning_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_management_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-management|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-management|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-management|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_marketing_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-marketing|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-marketing|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-marketing|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_medical_genetics_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_miscellaneous_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_moral_disputes_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_moral_scenarios_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_nutrition_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-nutrition|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_philosophy_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-philosophy|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_prehistory_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-prehistory|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_professional_accounting_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_professional_law_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_law|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_public_relations_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-public_relations|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_security_studies_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-security_studies|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_sociology_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-sociology|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-sociology|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-sociology|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_virology_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-virology|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-virology|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-virology|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_hendrycksTest_world_religions_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-world_religions|5_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_truthfulqa_mc_0
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|truthfulqa:mc|0_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|truthfulqa:mc|0_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|truthfulqa:mc|0_2023-12-19T05-40-57.697010.parquet'
- config_name: harness_winogrande_5
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- '**/details_harness|winogrande|5_2023-12-18T19-22-58.191113.parquet'
- split: 2023_12_19T05_40_57.697010
path:
- '**/details_harness|winogrande|5_2023-12-19T05-40-57.697010.parquet'
- split: latest
path:
- '**/details_harness|winogrande|5_2023-12-19T05-40-57.697010.parquet'
- config_name: results
data_files:
- split: 2023_12_18T19_22_58.191113
path:
- results_2023-12-18T19-22-58.191113.parquet
- split: 2023_12_19T05_40_57.697010
path:
- results_2023-12-19T05-40-57.697010.parquet
- split: latest
path:
- results_2023-12-19T05-40-57.697010.parquet
---
# Dataset Card for Evaluation run of upaya07/Birbal-7B-V1
Dataset automatically created during the evaluation run of model [upaya07/Birbal-7B-V1](https://huggingface.co/upaya07/Birbal-7B-V1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.
An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset

# The splits declared in this card's metadata are the two run timestamps and "latest".
data = load_dataset(
    "open-llm-leaderboard/details_upaya07__Birbal-7B-V1",
    "harness_winogrande_5",
    split="latest",
)
```
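The configuration and split names follow a visible pattern in this card's metadata, so they can be derived from a task key and a run timestamp. The sketch below is inferred only from the entries listed in this card, not from the leaderboard's own code; the helper names `config_name`, `split_name` and `parquet_pattern` are hypothetical.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key such as 'harness|truthfulqa:mc|0' into a config name.

    Assumption: every character other than an ASCII letter or digit is
    replaced by an underscore, as the entries in this card suggest.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp into the split name used in the metadata.

    '-' and ':' become '_'; the 'T' and the fractional seconds are kept.
    """
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_pattern(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern listed under `path:` for one task and one run.

    In file names only ':' is replaced, by '-'.
    """
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2023-12-19T05:40:57.697010"
print(config_name("harness|hendrycksTest-abstract_algebra|5"))
print(split_name(run))
print(parquet_pattern("harness|winogrande|5", run))
```

The resulting config name and split name can be passed as the second argument and the `split` argument of `load_dataset`, respectively, to select an individual run instead of the latest one.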
## Latest results
These are the [latest results from run 2023-12-19T05:40:57.697010](https://huggingface.co/datasets/open-llm-leaderboard/details_upaya07__Birbal-7B-V1/blob/main/results_2023-12-19T05-40-57.697010.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks; you can find each one in the "results" configuration and in the "latest" split of each eval):
```python
{
"all": {
"acc": 0.6338717978820942,
"acc_stderr": 0.032354410720897495,
"acc_norm": 0.6393367450479324,
"acc_norm_stderr": 0.033002421961828204,
"mc1": 0.3047735618115055,
"mc1_stderr": 0.016114124156882455,
"mc2": 0.4534206690460975,
"mc2_stderr": 0.014385152704042822
},
"harness|arc:challenge|25": {
"acc": 0.5802047781569966,
"acc_stderr": 0.014422181226303028,
"acc_norm": 0.6279863481228669,
"acc_norm_stderr": 0.014124597881844465
},
"harness|hellaswag|10": {
"acc": 0.6511651065524796,
"acc_stderr": 0.004756275875018264,
"acc_norm": 0.8483369846644094,
"acc_norm_stderr": 0.003579608743506612
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.32,
"acc_stderr": 0.04688261722621504,
"acc_norm": 0.32,
"acc_norm_stderr": 0.04688261722621504
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.5925925925925926,
"acc_stderr": 0.04244633238353227,
"acc_norm": 0.5925925925925926,
"acc_norm_stderr": 0.04244633238353227
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.7368421052631579,
"acc_stderr": 0.03583496176361073,
"acc_norm": 0.7368421052631579,
"acc_norm_stderr": 0.03583496176361073
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.57,
"acc_stderr": 0.04975698519562428,
"acc_norm": 0.57,
"acc_norm_stderr": 0.04975698519562428
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.660377358490566,
"acc_stderr": 0.02914690474779833,
"acc_norm": 0.660377358490566,
"acc_norm_stderr": 0.02914690474779833
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.7569444444444444,
"acc_stderr": 0.0358687928008034,
"acc_norm": 0.7569444444444444,
"acc_norm_stderr": 0.0358687928008034
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.49,
"acc_stderr": 0.05024183937956911,
"acc_norm": 0.49,
"acc_norm_stderr": 0.05024183937956911
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.52,
"acc_stderr": 0.050211673156867795,
"acc_norm": 0.52,
"acc_norm_stderr": 0.050211673156867795
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.41,
"acc_stderr": 0.04943110704237101,
"acc_norm": 0.41,
"acc_norm_stderr": 0.04943110704237101
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.5953757225433526,
"acc_stderr": 0.03742461193887248,
"acc_norm": 0.5953757225433526,
"acc_norm_stderr": 0.03742461193887248
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.39215686274509803,
"acc_stderr": 0.04858083574266346,
"acc_norm": 0.39215686274509803,
"acc_norm_stderr": 0.04858083574266346
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.72,
"acc_stderr": 0.04512608598542128,
"acc_norm": 0.72,
"acc_norm_stderr": 0.04512608598542128
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5702127659574469,
"acc_stderr": 0.03236214467715564,
"acc_norm": 0.5702127659574469,
"acc_norm_stderr": 0.03236214467715564
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.5,
"acc_stderr": 0.047036043419179864,
"acc_norm": 0.5,
"acc_norm_stderr": 0.047036043419179864
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.5241379310344828,
"acc_stderr": 0.0416180850350153,
"acc_norm": 0.5241379310344828,
"acc_norm_stderr": 0.0416180850350153
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.3835978835978836,
"acc_stderr": 0.025043757318520196,
"acc_norm": 0.3835978835978836,
"acc_norm_stderr": 0.025043757318520196
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.3968253968253968,
"acc_stderr": 0.0437588849272706,
"acc_norm": 0.3968253968253968,
"acc_norm_stderr": 0.0437588849272706
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.3,
"acc_stderr": 0.046056618647183814,
"acc_norm": 0.3,
"acc_norm_stderr": 0.046056618647183814
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.7709677419354839,
"acc_stderr": 0.023904914311782648,
"acc_norm": 0.7709677419354839,
"acc_norm_stderr": 0.023904914311782648
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.46798029556650245,
"acc_stderr": 0.03510766597959215,
"acc_norm": 0.46798029556650245,
"acc_norm_stderr": 0.03510766597959215
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.71,
"acc_stderr": 0.045604802157206845,
"acc_norm": 0.71,
"acc_norm_stderr": 0.045604802157206845
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.7575757575757576,
"acc_stderr": 0.03346409881055953,
"acc_norm": 0.7575757575757576,
"acc_norm_stderr": 0.03346409881055953
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.7777777777777778,
"acc_stderr": 0.029620227874790482,
"acc_norm": 0.7777777777777778,
"acc_norm_stderr": 0.029620227874790482
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.8808290155440415,
"acc_stderr": 0.02338193534812143,
"acc_norm": 0.8808290155440415,
"acc_norm_stderr": 0.02338193534812143
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6102564102564103,
"acc_stderr": 0.024726967886647078,
"acc_norm": 0.6102564102564103,
"acc_norm_stderr": 0.024726967886647078
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.32222222222222224,
"acc_stderr": 0.028493465091028593,
"acc_norm": 0.32222222222222224,
"acc_norm_stderr": 0.028493465091028593
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.6680672268907563,
"acc_stderr": 0.03058869701378364,
"acc_norm": 0.6680672268907563,
"acc_norm_stderr": 0.03058869701378364
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.32450331125827814,
"acc_stderr": 0.03822746937658752,
"acc_norm": 0.32450331125827814,
"acc_norm_stderr": 0.03822746937658752
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.8165137614678899,
"acc_stderr": 0.016595259710399313,
"acc_norm": 0.8165137614678899,
"acc_norm_stderr": 0.016595259710399313
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.5787037037037037,
"acc_stderr": 0.03367462138896078,
"acc_norm": 0.5787037037037037,
"acc_norm_stderr": 0.03367462138896078
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.7843137254901961,
"acc_stderr": 0.028867431449849316,
"acc_norm": 0.7843137254901961,
"acc_norm_stderr": 0.028867431449849316
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.7721518987341772,
"acc_stderr": 0.02730348459906943,
"acc_norm": 0.7721518987341772,
"acc_norm_stderr": 0.02730348459906943
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.726457399103139,
"acc_stderr": 0.029918586707798827,
"acc_norm": 0.726457399103139,
"acc_norm_stderr": 0.029918586707798827
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.7480916030534351,
"acc_stderr": 0.03807387116306085,
"acc_norm": 0.7480916030534351,
"acc_norm_stderr": 0.03807387116306085
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.7933884297520661,
"acc_stderr": 0.03695980128098825,
"acc_norm": 0.7933884297520661,
"acc_norm_stderr": 0.03695980128098825
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.7592592592592593,
"acc_stderr": 0.04133119440243839,
"acc_norm": 0.7592592592592593,
"acc_norm_stderr": 0.04133119440243839
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.7484662576687117,
"acc_stderr": 0.03408997886857529,
"acc_norm": 0.7484662576687117,
"acc_norm_stderr": 0.03408997886857529
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.48214285714285715,
"acc_stderr": 0.047427623612430116,
"acc_norm": 0.48214285714285715,
"acc_norm_stderr": 0.047427623612430116
},
"harness|hendrycksTest-management|5": {
"acc": 0.8058252427184466,
"acc_stderr": 0.03916667762822584,
"acc_norm": 0.8058252427184466,
"acc_norm_stderr": 0.03916667762822584
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.8589743589743589,
"acc_stderr": 0.02280138253459754,
"acc_norm": 0.8589743589743589,
"acc_norm_stderr": 0.02280138253459754
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.75,
"acc_stderr": 0.04351941398892446,
"acc_norm": 0.75,
"acc_norm_stderr": 0.04351941398892446
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.7931034482758621,
"acc_stderr": 0.01448565604166918,
"acc_norm": 0.7931034482758621,
"acc_norm_stderr": 0.01448565604166918
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.7312138728323699,
"acc_stderr": 0.023868003262500107,
"acc_norm": 0.7312138728323699,
"acc_norm_stderr": 0.023868003262500107
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.37318435754189944,
"acc_stderr": 0.016175692013381968,
"acc_norm": 0.37318435754189944,
"acc_norm_stderr": 0.016175692013381968
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.7254901960784313,
"acc_stderr": 0.025553169991826514,
"acc_norm": 0.7254901960784313,
"acc_norm_stderr": 0.025553169991826514
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.7009646302250804,
"acc_stderr": 0.02600330111788514,
"acc_norm": 0.7009646302250804,
"acc_norm_stderr": 0.02600330111788514
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.7098765432098766,
"acc_stderr": 0.025251173936495036,
"acc_norm": 0.7098765432098766,
"acc_norm_stderr": 0.025251173936495036
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.48226950354609927,
"acc_stderr": 0.02980873964223777,
"acc_norm": 0.48226950354609927,
"acc_norm_stderr": 0.02980873964223777
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.49022164276401564,
"acc_stderr": 0.012767793787729336,
"acc_norm": 0.49022164276401564,
"acc_norm_stderr": 0.012767793787729336
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.6875,
"acc_stderr": 0.02815637344037142,
"acc_norm": 0.6875,
"acc_norm_stderr": 0.02815637344037142
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.6879084967320261,
"acc_stderr": 0.018745011201277657,
"acc_norm": 0.6879084967320261,
"acc_norm_stderr": 0.018745011201277657
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6363636363636364,
"acc_stderr": 0.04607582090719976,
"acc_norm": 0.6363636363636364,
"acc_norm_stderr": 0.04607582090719976
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.7387755102040816,
"acc_stderr": 0.028123429335142777,
"acc_norm": 0.7387755102040816,
"acc_norm_stderr": 0.028123429335142777
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.845771144278607,
"acc_stderr": 0.025538433368578337,
"acc_norm": 0.845771144278607,
"acc_norm_stderr": 0.025538433368578337
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.88,
"acc_stderr": 0.03265986323710906,
"acc_norm": 0.88,
"acc_norm_stderr": 0.03265986323710906
},
"harness|hendrycksTest-virology|5": {
"acc": 0.5662650602409639,
"acc_stderr": 0.03858158940685516,
"acc_norm": 0.5662650602409639,
"acc_norm_stderr": 0.03858158940685516
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.8421052631578947,
"acc_stderr": 0.027966785859160872,
"acc_norm": 0.8421052631578947,
"acc_norm_stderr": 0.027966785859160872
},
"harness|truthfulqa:mc|0": {
"mc1": 0.3047735618115055,
"mc1_stderr": 0.016114124156882455,
"mc2": 0.4534206690460975,
"mc2_stderr": 0.014385152704042822
},
"harness|winogrande|5": {
"acc": 0.7876874506708761,
"acc_stderr": 0.011493384687249789
},
"harness|gsm8k|5": {
"acc": 0.4025777103866566,
"acc_stderr": 0.013508523063663435
}
}
```
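The per-task entries above can be summarized further. As a minimal sketch, the following computes an unweighted mean of `acc` over the `hendrycksTest` (MMLU) subtasks. The function name `mmlu_mean_acc` and the choice of an unweighted mean are assumptions made for illustration, and the sample dictionary reuses only a few entries from the results above.

```python
from statistics import mean

MMLU_PREFIX = "harness|hendrycksTest-"


def mmlu_mean_acc(results: dict) -> float:
    """Unweighted mean of `acc` over all hendrycksTest (MMLU) subtasks."""
    scores = [
        metrics["acc"]
        for task, metrics in results.items()
        if task.startswith(MMLU_PREFIX)
    ]
    if not scores:
        raise ValueError("no hendrycksTest entries found")
    return mean(scores)


# A few entries copied from the results above; a real call would pass the full dict.
sample = {
    "harness|hendrycksTest-college_computer_science|5": {"acc": 0.52},
    "harness|hendrycksTest-college_mathematics|5": {"acc": 0.41},
    "harness|gsm8k|5": {"acc": 0.4025777103866566},  # ignored: not an MMLU task
}
print(mmlu_mean_acc(sample))
```

Filtering on the key prefix keeps the helper independent of how many subtasks a given run covered.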
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
[More Information Needed]
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
Provider:
open-llm-leaderboard-old
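Summary of original information
Dataset overview
Dataset creation
- Background: The dataset was created automatically during the evaluation run of the model upaya07/Birbal-7B-V1 on the Open LLM Leaderboard.
- Composition: 63 configurations, each corresponding to one evaluated task.
- Runs: The dataset was built from 2 runs. Each run appears as a separate split in every configuration, named after the run's timestamp. The "train" split always points to the latest results.
- Aggregated results: An additional configuration, "results", stores the aggregated results of the run and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.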
Data loading example
```python
from datasets import load_dataset

data = load_dataset(
    "open-llm-leaderboard/details_upaya07__Birbal-7B-V1",
    "harness_winogrande_5",
    split="train",
)
```
Latest results
- Source: the latest results come from run 2023-12-19T05:40:57.697010
- Example results:
```python
{
    "all": {
        "acc": 0.6338717978820942,
        "acc_stderr": 0.032354410720897495,
        "acc_norm": 0.6393367450479324,
        "acc_norm_stderr": 0.033002421961828204,
        "mc1": 0.3047735618115055,
        "mc1_stderr": 0.016114124156882455,
        "mc2": 0.4534206690460975,
        "mc2_stderr": 0.014385152704042822
    },
    "harness|arc:challenge|25": {
        "acc": 0.5802047781569966,
        "acc_stderr": 0.014422181226303028,
        "acc_norm": 0.6279863481228669,
        "acc_norm_stderr": 0.014124597881844465
    },
    "harness|hellaswag|10": {
        "acc": 0.6511651065524796,
        "acc_stderr": 0.004756275875018264,
        "acc_norm": 0.8483369846644094,
        "acc_norm_stderr": 0.003579608743506612
    },
    ...
}
```
Configuration details
- Configuration: harness_arc_challenge_25
  - Data files:
    - Split: 2023_12_18T19_22_58.191113
      - Path: `**/details_harness|arc:challenge|25_2023-12-18T19-22-58.191113.parquet`
    - Split: 2023_12_19T05_40_57.697010
      - Path: `**/details_harness|arc:challenge|25_2023-12-19T05-40-57.697010.parquet`
    - Split: latest
      - Path: `**/details_harness|arc:challenge|25_2023-12-19T05-40-57.697010.parquet`
- Configuration: harness_gsm8k_5
  - Data files:
    - Split: 2023_12_18T19_22_58.191113
      - Path: `**/details_harness|gsm8k|5_2023-12-18T19-22-58.191113.parquet`
    - Split: 2023_12_19T05_40_57.697010
      - Path: `**/details_harness|gsm8k|5_2023-12-19T05-40-57.697010.parquet`
    - Split: latest
      - Path: `**/details_harness|gsm8k|5_2023-12-19T05-40-57.697010.parquet`
- Configuration: harness_hellaswag_10
  - Data files:
    - Split: 2023_12_18T19_22_58.191113
      - Path: `**/details_harness|hellaswag|10_2023-12-18T19-22-58.191113.parquet`
    - Split: 2023_12_19T05_40_57.697010
      - Path: `**/details_harness|hellaswag|10_2023-12-19T05-40-57.697010.parquet`
    - Split: latest
      - Path: `**/details_harness|hellaswag|10_2023-12-19T05-40-57.697010.parquet`
- Configuration: harness_hendrycksTest_5
  - Data files:
    - Split: 2023_12_18T19_22_58.191113
      - Paths:
        - `**/details_harness|hendrycksTest-abstract_algebra|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-anatomy|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-astronomy|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-business_ethics|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-clinical_knowledge|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-college_biology|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-college_chemistry|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-college_computer_science|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-college_mathematics|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-college_medicine|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-college_physics|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-computer_security|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-conceptual_physics|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-econometrics|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-electrical_engineering|5_2023-12-18T19-22-58.191113.parquet`
        - `**/details_harness|hendrycksTest-elementary_mathematics|5_2023-12-18T19-22-58.191113.parquet`
        - ...
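The configuration names, split names, and file suffixes listed above appear to follow a regular pattern derived from the task key and the run timestamp. The sketch below reproduces that pattern for the entries shown on this page. The helper names (`config_name`, `split_name`, `file_suffix`) are hypothetical, and the substitution rules are inferred from the listed examples rather than taken from the leaderboard's code.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name."""
    # Assumption: every non-alphanumeric character becomes an underscore,
    # which matches the configuration names listed above.
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name used inside each configuration."""
    # Observed pattern: '-' and ':' are replaced by '_'; the '.' is kept.
    return run_timestamp.replace("-", "_").replace(":", "_")


def file_suffix(run_timestamp: str) -> str:
    """Map a run timestamp to the suffix used in the Parquet file names."""
    # Observed pattern: only ':' is replaced, by '-'.
    return run_timestamp.replace(":", "-")


run = "2023-12-19T05:40:57.697010"
print(config_name("harness|arc:challenge|25"))
print(split_name(run))
print(file_suffix(run))
```

Keeping the three mappings separate mirrors the fact that the split name and the file name encode the same timestamp differently.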
Collected summary
Dataset introduction

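Construction
This dataset was generated during the automated evaluation of the model upaya07/Birbal-7B-V1 on the Open LLM Leaderboard. It contains 63 configurations, each corresponding to one evaluated task, spanning commonsense reasoning, math problem solving, and multi-subject knowledge question answering. The data comes from two separate evaluation runs. Each run's results are stored in every configuration as a split named after the run's timestamp, and the "train" split always points to the latest run. A separate configuration named "results" stores the aggregated metrics for all tasks, which the leaderboard uses to display the model's overall performance.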
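Features
The dataset's defining feature is its structured record of evaluation results. Each task configuration is stored independently, so researchers can access fine-grained results for a specific task as needed. The split mechanism makes it possible to trace and compare historical runs, while the "latest" split gives direct access to the most recent evaluation data. The aggregated results configuration collects the key metrics for each task, such as accuracy (acc), normalized accuracy (acc_norm), and their standard errors, providing a quantitative basis for assessing the model.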
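Because each accuracy is reported together with a standard error, the two can be combined into an approximate confidence interval. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval. That choice and the helper name `confidence_interval` are illustrative assumptions, not part of the leaderboard's reporting. The inputs are the GSM8K values from the results.

```python
from typing import Tuple


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval: acc +/- z * stderr (normal approximation)."""
    half_width = z * stderr
    return acc - half_width, acc + half_width


# GSM8K values taken from the results ("harness|gsm8k|5").
gsm8k_acc = 0.4025777103866566
gsm8k_stderr = 0.013508523063663435

low, high = confidence_interval(gsm8k_acc, gsm8k_stderr)
print(f"GSM8K acc {gsm8k_acc:.2%}, approx. 95% CI {low:.2%} to {high:.2%}")
```

Intervals like this help when comparing two runs or two models: a difference smaller than the interval width is hard to distinguish from sampling noise.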
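Usage
Researchers can load the dataset with the Hugging Face datasets library: call load_dataset with the dataset name, the target task configuration (for example "harness_winogrande_5"), and the desired split (for example "train") to obtain the detailed evaluation data for that task. To analyze the model's overall performance, read the aggregated metrics from the "results" configuration directly instead of processing each task configuration one by one.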
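The loading steps described here can be wrapped in a small helper. The following is a minimal sketch: `details_repo` and `load_details` are hypothetical names, and the repository naming (`details_` plus the model id with `/` replaced by `__`) is inferred from the dataset name on this page. The default organization follows the card's own loading example; this page lists the provider as open-llm-leaderboard-old, so the organization is left as a parameter.

```python
def details_repo(model_id: str, org: str = "open-llm-leaderboard") -> str:
    """Build the details repository id for a model, e.g. 'upaya07/Birbal-7B-V1'."""
    # Assumption: '/' in the model id is encoded as a double underscore.
    return f"{org}/details_{model_id.replace('/', '__')}"


def load_details(model_id: str, config: str, split: str = "latest",
                 org: str = "open-llm-leaderboard"):
    """Load one configuration of the details dataset (requires network access)."""
    # Imported lazily so details_repo works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(details_repo(model_id, org), config, split=split)


print(details_repo("upaya07/Birbal-7B-V1"))
```

A call such as `load_details("upaya07/Birbal-7B-V1", "results")` would then fetch the aggregated metrics, provided the repository is reachable under the chosen organization.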
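Background and challenges
Background overview
As large language models (LLMs) have advanced rapidly in natural language processing, systematically evaluating their capabilities across multiple dimensions has become a central concern for both academia and industry. The Open LLM Leaderboard, launched by the Hugging Face team in 2023, aims to provide a standardized, transparent platform for comparing model performance and to promote reproducible, fair comparison in LLM research. This dataset was generated automatically from the December 2023 evaluation runs of upaya07/Birbal-7B-V1 and records the model's fine-grained performance on benchmarks including ARC Challenge, HellaSwag, GSM8K, MMLU (covering 57 subjects), TruthfulQA, and Winogrande. It consists of 63 configurations, each corresponding to one evaluation task, and keeps a timestamped split for each run, which lets researchers track differences between runs. The dataset gives an empirical basis for profiling Birbal-7B-V1 and adds to the body of results behind the Open LLM Leaderboard as a community evaluation benchmark.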
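Current challenges
The challenges facing this dataset stem first from a problem inherent to LLM evaluation: how to design an evaluation suite that fully reflects a model's real capabilities. Current benchmarks cover commonsense reasoning, math problem solving, knowledge question answering, and factual consistency, but differences in task difficulty and the variety of metrics (such as acc, mc1, and mc2) mean that a single score rarely exposes a model's overall weaknesses. Building the dataset also poses technical challenges: it must integrate Parquet files from 63 different configurations and ensure that results from multiple timestamped runs are correctly grouped and versioned. The heterogeneous storage layout (for example, each MMLU subtask is stored separately) and the loading logic (a specific run must be selected through the split parameter) raise the bar for reuse and downstream analysis, and call for some engineering work on the user's side.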
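Common scenarios
Typical use cases
This dataset is the record of evaluation details generated automatically when the Open LLM Leaderboard evaluated upaya07/Birbal-7B-V1. It covers 63 task configurations, each corresponding to an independent evaluation task. Its main use is to give researchers fine-grained data on the model's performance on standard benchmarks such as ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K. By loading the splits for different timestamps, users can retrieve the model's recorded outputs from a specific run and analyze its strengths and weaknesses across capabilities such as commonsense reasoning, mathematical computation, and knowledge question answering. The dataset provides a standardized, traceable basis for comparing open-source LLMs and is useful for validating model performance and guiding iteration.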
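Derived and related work
This dataset belongs to a wider body of work on standardized, automated evaluation of large language models. The Open LLM Leaderboard evaluation pipeline behind it is widely used in the Hugging Face ecosystem for comparing models, and it is built on EleutherAI's lm-evaluation-harness, a unified evaluation library. Related benchmarks in the same space include MMLU-Pro, a harder variant of MMLU that probes the limits of model knowledge, and AdvGLUE, which emphasizes adversarial robustness. Fine-grained logs in this format can also support follow-up analyses, such as tracking how a model's performance on a given task changes across training stages (forgetting curves) or training meta-learners that predict overall capability from evaluation results. Together, such work connects data collection to capability diagnosis.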
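Recent research on the dataset
Latest research directions
In LLM evaluation, the Open LLM Leaderboard has become a widely used benchmark platform for measuring overall model performance. For the evaluation dataset of upaya07/Birbal-7B-V1, current research focuses on multi-dimensional, fine-grained analysis that separates individual capabilities. The dataset covers established reasoning and commonsense tasks such as ARC, HellaSwag, and GSM8K, and also includes the MMLU (hendrycksTest) suite spanning 57 subject areas, from abstract algebra to medical genetics, to probe the limits of the model's specialized knowledge and logical reasoning. A current topic of interest is using a standardized evaluation pipeline to expose the imbalance in 7B-parameter models between mathematical reasoning (GSM8K accuracy 40.26%) and commonsense reasoning (HellaSwag normalized accuracy 84.83%). This approach supports reproducible evaluation of open-source models, helps guide fine-tuning strategies, and provides a data foundation for more reliable and transparent evaluation of language models.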
The content above was collected and summarized by 遇见数据集.



