open-llm-leaderboard-old/details_dreamgen__opus-v1.2-llama-3-8b
Source: Hugging Face | Updated: 2024-04-21 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_dreamgen__opus-v1.2-llama-3-8b
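The card below names each configuration, split, and parquet file after a harness task key and the run timestamp. The following is a minimal sketch of that naming pattern, inferred only from the entries visible in this card. The helper names `config_name`, `split_name`, and `parquet_pattern` are hypothetical and are not part of the `datasets` library or the leaderboard tooling.

```python
import re


def config_name(task_key: str) -> str:
    """Turn a results key such as 'harness|arc:challenge|25' into a config name."""
    # Assumption: every character that is not an ASCII letter or digit becomes '_'.
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Turn a run timestamp such as '2024-04-21T05:44:21.504574' into a split name."""
    # Assumption: only '-' and ':' are rewritten; the fractional-second '.' is kept.
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_pattern(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern listed under 'path' for one task and one run."""
    # Assumption: file names keep the task key verbatim and swap ':' for '-' in the timestamp.
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2024-04-21T05:44:21.504574"
print(config_name("harness|arc:challenge|25"))
print(split_name(run))
print(parquet_pattern("harness|arc:challenge|25", run))
```

The names produced this way can be compared against the `configs` entries in the card. Any task key not listed there should be checked against the repository rather than assumed.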
Resource description:
---
pretty_name: Evaluation run of dreamgen/opus-v1.2-llama-3-8b
dataset_summary: "Dataset automatically created during the evaluation run of model\
\ [dreamgen/opus-v1.2-llama-3-8b](https://huggingface.co/dreamgen/opus-v1.2-llama-3-8b)\
\ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\
\nThe dataset is composed of 63 configurations, each one corresponding to one of the\
\ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\
\ found as a specific split in each configuration, the split being named using the\
\ timestamp of the run. The \"latest\" split always points to the latest results.\n\
\nAn additional configuration \"results\" stores all the aggregated results of the\
\ run (and is used to compute and display the aggregated metrics on the [Open LLM\
\ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\
\nTo load the details from a run, you can for instance do the following:\n```python\n\
from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b\"\
,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\
These are the [latest results from run 2024-04-21T05:44:21.504574](https://huggingface.co/datasets/open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b/blob/main/results_2024-04-21T05-44-21.504574.json) (note\
\ that there might be results for other tasks in the repo if successive evals didn't\
\ cover the same tasks. You can find each in the results and the \"latest\" split for\
\ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6661707252030249,\n\
\ \"acc_stderr\": 0.03182337903045933,\n \"acc_norm\": 0.6688656231798138,\n\
\ \"acc_norm_stderr\": 0.03245840656330124,\n \"mc1\": 0.33047735618115054,\n\
\ \"mc1_stderr\": 0.0164667696136983,\n \"mc2\": 0.5051117071923402,\n\
\ \"mc2_stderr\": 0.014825504829788363\n },\n \"harness|arc:challenge|25\"\
: {\n \"acc\": 0.5656996587030717,\n \"acc_stderr\": 0.014484703048857355,\n\
\ \"acc_norm\": 0.6092150170648464,\n \"acc_norm_stderr\": 0.014258563880513778\n\
\ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.5916152160924119,\n\
\ \"acc_stderr\": 0.004905304371090865,\n \"acc_norm\": 0.7927703644692292,\n\
\ \"acc_norm_stderr\": 0.004044931315182668\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\
: {\n \"acc\": 0.33,\n \"acc_stderr\": 0.047258156262526045,\n \
\ \"acc_norm\": 0.33,\n \"acc_norm_stderr\": 0.047258156262526045\n \
\ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6444444444444445,\n\
\ \"acc_stderr\": 0.04135176749720385,\n \"acc_norm\": 0.6444444444444445,\n\
\ \"acc_norm_stderr\": 0.04135176749720385\n },\n \"harness|hendrycksTest-astronomy|5\"\
: {\n \"acc\": 0.6973684210526315,\n \"acc_stderr\": 0.03738520676119667,\n\
\ \"acc_norm\": 0.6973684210526315,\n \"acc_norm_stderr\": 0.03738520676119667\n\
\ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.69,\n\
\ \"acc_stderr\": 0.04648231987117316,\n \"acc_norm\": 0.69,\n \
\ \"acc_norm_stderr\": 0.04648231987117316\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\
: {\n \"acc\": 0.7433962264150943,\n \"acc_stderr\": 0.026880647889051992,\n\
\ \"acc_norm\": 0.7433962264150943,\n \"acc_norm_stderr\": 0.026880647889051992\n\
\ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7708333333333334,\n\
\ \"acc_stderr\": 0.035146974678623884,\n \"acc_norm\": 0.7708333333333334,\n\
\ \"acc_norm_stderr\": 0.035146974678623884\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.49,\n \"acc_stderr\": 0.05024183937956912,\n \
\ \"acc_norm\": 0.49,\n \"acc_norm_stderr\": 0.05024183937956912\n \
\ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\
: 0.56,\n \"acc_stderr\": 0.04988876515698589,\n \"acc_norm\": 0.56,\n\
\ \"acc_norm_stderr\": 0.04988876515698589\n },\n \"harness|hendrycksTest-college_mathematics|5\"\
: {\n \"acc\": 0.39,\n \"acc_stderr\": 0.04902071300001975,\n \
\ \"acc_norm\": 0.39,\n \"acc_norm_stderr\": 0.04902071300001975\n \
\ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.653179190751445,\n\
\ \"acc_stderr\": 0.036291466701596636,\n \"acc_norm\": 0.653179190751445,\n\
\ \"acc_norm_stderr\": 0.036291466701596636\n },\n \"harness|hendrycksTest-college_physics|5\"\
: {\n \"acc\": 0.46078431372549017,\n \"acc_stderr\": 0.04959859966384181,\n\
\ \"acc_norm\": 0.46078431372549017,\n \"acc_norm_stderr\": 0.04959859966384181\n\
\ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\
\ 0.77,\n \"acc_stderr\": 0.04229525846816506,\n \"acc_norm\": 0.77,\n\
\ \"acc_norm_stderr\": 0.04229525846816506\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\
: {\n \"acc\": 0.5787234042553191,\n \"acc_stderr\": 0.03227834510146268,\n\
\ \"acc_norm\": 0.5787234042553191,\n \"acc_norm_stderr\": 0.03227834510146268\n\
\ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5701754385964912,\n\
\ \"acc_stderr\": 0.04657047260594964,\n \"acc_norm\": 0.5701754385964912,\n\
\ \"acc_norm_stderr\": 0.04657047260594964\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\
: {\n \"acc\": 0.6620689655172414,\n \"acc_stderr\": 0.039417076320648906,\n\
\ \"acc_norm\": 0.6620689655172414,\n \"acc_norm_stderr\": 0.039417076320648906\n\
\ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\
: 0.4126984126984127,\n \"acc_stderr\": 0.02535574126305527,\n \"\
acc_norm\": 0.4126984126984127,\n \"acc_norm_stderr\": 0.02535574126305527\n\
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.48412698412698413,\n\
\ \"acc_stderr\": 0.04469881854072606,\n \"acc_norm\": 0.48412698412698413,\n\
\ \"acc_norm_stderr\": 0.04469881854072606\n },\n \"harness|hendrycksTest-global_facts|5\"\
: {\n \"acc\": 0.4,\n \"acc_stderr\": 0.04923659639173309,\n \
\ \"acc_norm\": 0.4,\n \"acc_norm_stderr\": 0.04923659639173309\n },\n\
\ \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7935483870967742,\n\
\ \"acc_stderr\": 0.023025899617188695,\n \"acc_norm\": 0.7935483870967742,\n\
\ \"acc_norm_stderr\": 0.023025899617188695\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\
: {\n \"acc\": 0.4975369458128079,\n \"acc_stderr\": 0.03517945038691063,\n\
\ \"acc_norm\": 0.4975369458128079,\n \"acc_norm_stderr\": 0.03517945038691063\n\
\ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \
\ \"acc\": 0.76,\n \"acc_stderr\": 0.042923469599092816,\n \"acc_norm\"\
: 0.76,\n \"acc_norm_stderr\": 0.042923469599092816\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\
: {\n \"acc\": 0.7575757575757576,\n \"acc_stderr\": 0.03346409881055953,\n\
\ \"acc_norm\": 0.7575757575757576,\n \"acc_norm_stderr\": 0.03346409881055953\n\
\ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\
: 0.8434343434343434,\n \"acc_stderr\": 0.025890520358141454,\n \"\
acc_norm\": 0.8434343434343434,\n \"acc_norm_stderr\": 0.025890520358141454\n\
\ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\
\ \"acc\": 0.9067357512953368,\n \"acc_stderr\": 0.020986854593289736,\n\
\ \"acc_norm\": 0.9067357512953368,\n \"acc_norm_stderr\": 0.020986854593289736\n\
\ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \
\ \"acc\": 0.6564102564102564,\n \"acc_stderr\": 0.02407869658063547,\n \
\ \"acc_norm\": 0.6564102564102564,\n \"acc_norm_stderr\": 0.02407869658063547\n\
\ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\
acc\": 0.4111111111111111,\n \"acc_stderr\": 0.029999923508706686,\n \
\ \"acc_norm\": 0.4111111111111111,\n \"acc_norm_stderr\": 0.029999923508706686\n\
\ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \
\ \"acc\": 0.7478991596638656,\n \"acc_stderr\": 0.028205545033277723,\n\
\ \"acc_norm\": 0.7478991596638656,\n \"acc_norm_stderr\": 0.028205545033277723\n\
\ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\
: 0.4304635761589404,\n \"acc_stderr\": 0.04042809961395634,\n \"\
acc_norm\": 0.4304635761589404,\n \"acc_norm_stderr\": 0.04042809961395634\n\
\ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\
: 0.8293577981651377,\n \"acc_stderr\": 0.016129271025099867,\n \"\
acc_norm\": 0.8293577981651377,\n \"acc_norm_stderr\": 0.016129271025099867\n\
\ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\
: 0.5277777777777778,\n \"acc_stderr\": 0.0340470532865388,\n \"acc_norm\"\
: 0.5277777777777778,\n \"acc_norm_stderr\": 0.0340470532865388\n },\n\
\ \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\": 0.8627450980392157,\n\
\ \"acc_stderr\": 0.024152225962801588,\n \"acc_norm\": 0.8627450980392157,\n\
\ \"acc_norm_stderr\": 0.024152225962801588\n },\n \"harness|hendrycksTest-high_school_world_history|5\"\
: {\n \"acc\": 0.8270042194092827,\n \"acc_stderr\": 0.02462156286676842,\n\
\ \"acc_norm\": 0.8270042194092827,\n \"acc_norm_stderr\": 0.02462156286676842\n\
\ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.7219730941704036,\n\
\ \"acc_stderr\": 0.03006958487449405,\n \"acc_norm\": 0.7219730941704036,\n\
\ \"acc_norm_stderr\": 0.03006958487449405\n },\n \"harness|hendrycksTest-human_sexuality|5\"\
: {\n \"acc\": 0.7862595419847328,\n \"acc_stderr\": 0.0359546161177469,\n\
\ \"acc_norm\": 0.7862595419847328,\n \"acc_norm_stderr\": 0.0359546161177469\n\
\ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\
\ 0.8016528925619835,\n \"acc_stderr\": 0.03640118271990946,\n \"\
acc_norm\": 0.8016528925619835,\n \"acc_norm_stderr\": 0.03640118271990946\n\
\ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7685185185185185,\n\
\ \"acc_stderr\": 0.04077494709252626,\n \"acc_norm\": 0.7685185185185185,\n\
\ \"acc_norm_stderr\": 0.04077494709252626\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\
: {\n \"acc\": 0.7791411042944786,\n \"acc_stderr\": 0.03259177392742178,\n\
\ \"acc_norm\": 0.7791411042944786,\n \"acc_norm_stderr\": 0.03259177392742178\n\
\ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5892857142857143,\n\
\ \"acc_stderr\": 0.04669510663875191,\n \"acc_norm\": 0.5892857142857143,\n\
\ \"acc_norm_stderr\": 0.04669510663875191\n },\n \"harness|hendrycksTest-management|5\"\
: {\n \"acc\": 0.7864077669902912,\n \"acc_stderr\": 0.040580420156460344,\n\
\ \"acc_norm\": 0.7864077669902912,\n \"acc_norm_stderr\": 0.040580420156460344\n\
\ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.905982905982906,\n\
\ \"acc_stderr\": 0.019119892798924978,\n \"acc_norm\": 0.905982905982906,\n\
\ \"acc_norm_stderr\": 0.019119892798924978\n },\n \"harness|hendrycksTest-medical_genetics|5\"\
: {\n \"acc\": 0.79,\n \"acc_stderr\": 0.040936018074033256,\n \
\ \"acc_norm\": 0.79,\n \"acc_norm_stderr\": 0.040936018074033256\n \
\ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8122605363984674,\n\
\ \"acc_stderr\": 0.013964393769899133,\n \"acc_norm\": 0.8122605363984674,\n\
\ \"acc_norm_stderr\": 0.013964393769899133\n },\n \"harness|hendrycksTest-moral_disputes|5\"\
: {\n \"acc\": 0.7485549132947977,\n \"acc_stderr\": 0.023357365785874037,\n\
\ \"acc_norm\": 0.7485549132947977,\n \"acc_norm_stderr\": 0.023357365785874037\n\
\ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4346368715083799,\n\
\ \"acc_stderr\": 0.016578997435496706,\n \"acc_norm\": 0.4346368715083799,\n\
\ \"acc_norm_stderr\": 0.016578997435496706\n },\n \"harness|hendrycksTest-nutrition|5\"\
: {\n \"acc\": 0.7483660130718954,\n \"acc_stderr\": 0.0248480182638752,\n\
\ \"acc_norm\": 0.7483660130718954,\n \"acc_norm_stderr\": 0.0248480182638752\n\
\ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7331189710610932,\n\
\ \"acc_stderr\": 0.02512263760881666,\n \"acc_norm\": 0.7331189710610932,\n\
\ \"acc_norm_stderr\": 0.02512263760881666\n },\n \"harness|hendrycksTest-prehistory|5\"\
: {\n \"acc\": 0.7160493827160493,\n \"acc_stderr\": 0.025089478523765134,\n\
\ \"acc_norm\": 0.7160493827160493,\n \"acc_norm_stderr\": 0.025089478523765134\n\
\ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\
acc\": 0.4929078014184397,\n \"acc_stderr\": 0.02982449855912901,\n \
\ \"acc_norm\": 0.4929078014184397,\n \"acc_norm_stderr\": 0.02982449855912901\n\
\ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.46936114732724904,\n\
\ \"acc_stderr\": 0.012746237711716634,\n \"acc_norm\": 0.46936114732724904,\n\
\ \"acc_norm_stderr\": 0.012746237711716634\n },\n \"harness|hendrycksTest-professional_medicine|5\"\
: {\n \"acc\": 0.7022058823529411,\n \"acc_stderr\": 0.02777829870154544,\n\
\ \"acc_norm\": 0.7022058823529411,\n \"acc_norm_stderr\": 0.02777829870154544\n\
\ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\
acc\": 0.7189542483660131,\n \"acc_stderr\": 0.01818521895431808,\n \
\ \"acc_norm\": 0.7189542483660131,\n \"acc_norm_stderr\": 0.01818521895431808\n\
\ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6818181818181818,\n\
\ \"acc_stderr\": 0.04461272175910509,\n \"acc_norm\": 0.6818181818181818,\n\
\ \"acc_norm_stderr\": 0.04461272175910509\n },\n \"harness|hendrycksTest-security_studies|5\"\
: {\n \"acc\": 0.726530612244898,\n \"acc_stderr\": 0.028535560337128438,\n\
\ \"acc_norm\": 0.726530612244898,\n \"acc_norm_stderr\": 0.028535560337128438\n\
\ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.845771144278607,\n\
\ \"acc_stderr\": 0.025538433368578327,\n \"acc_norm\": 0.845771144278607,\n\
\ \"acc_norm_stderr\": 0.025538433368578327\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\
: {\n \"acc\": 0.87,\n \"acc_stderr\": 0.033799766898963086,\n \
\ \"acc_norm\": 0.87,\n \"acc_norm_stderr\": 0.033799766898963086\n \
\ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5,\n \
\ \"acc_stderr\": 0.03892494720807614,\n \"acc_norm\": 0.5,\n \
\ \"acc_norm_stderr\": 0.03892494720807614\n },\n \"harness|hendrycksTest-world_religions|5\"\
: {\n \"acc\": 0.7719298245614035,\n \"acc_stderr\": 0.032180937956023566,\n\
\ \"acc_norm\": 0.7719298245614035,\n \"acc_norm_stderr\": 0.032180937956023566\n\
\ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.33047735618115054,\n\
\ \"mc1_stderr\": 0.0164667696136983,\n \"mc2\": 0.5051117071923402,\n\
\ \"mc2_stderr\": 0.014825504829788363\n },\n \"harness|winogrande|5\"\
: {\n \"acc\": 0.7569060773480663,\n \"acc_stderr\": 0.012055665630431037\n\
\ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6611068991660348,\n \
\ \"acc_stderr\": 0.01303795576856251\n }\n}\n```"
repo_url: https://huggingface.co/dreamgen/opus-v1.2-llama-3-8b
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|arc:challenge|25_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|arc:challenge|25_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_gsm8k_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|gsm8k|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|gsm8k|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hellaswag_10
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hellaswag|10_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hellaswag|10_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-21T05-44-21.504574.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_abstract_algebra_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_anatomy_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_astronomy_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_business_ethics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_clinical_knowledge_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_college_biology_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_college_chemistry_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_college_computer_science_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_college_mathematics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_college_medicine_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_college_physics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_computer_security_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_conceptual_physics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_econometrics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_electrical_engineering_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_elementary_mathematics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_formal_logic_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_global_facts_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_biology_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_chemistry_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_computer_science_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_european_history_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_geography_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_government_and_politics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_macroeconomics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_mathematics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_microeconomics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_physics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_psychology_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_statistics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_us_history_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_high_school_world_history_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_human_aging_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_human_sexuality_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_international_law_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_jurisprudence_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_logical_fallacies_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_machine_learning_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_management_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_marketing_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_medical_genetics_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_miscellaneous_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_moral_disputes_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_moral_scenarios_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_nutrition_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_philosophy_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_prehistory_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_professional_accounting_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_professional_law_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_public_relations_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_security_studies_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_sociology_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_virology_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_hendrycksTest_world_religions_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_truthfulqa_mc_0
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-21T05-44-21.504574.parquet'
- config_name: harness_winogrande_5
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- '**/details_harness|winogrande|5_2024-04-21T05-44-21.504574.parquet'
- split: latest
path:
- '**/details_harness|winogrande|5_2024-04-21T05-44-21.504574.parquet'
- config_name: results
data_files:
- split: 2024_04_21T05_44_21.504574
path:
- results_2024-04-21T05-44-21.504574.parquet
- split: latest
path:
- results_2024-04-21T05-44-21.504574.parquet
---
# Dataset Card for Evaluation run of dreamgen/opus-v1.2-llama-3-8b
<!-- Provide a quick summary of the dataset. -->
Dataset automatically created during the evaluation run of model [dreamgen/opus-v1.2-llama-3-8b](https://huggingface.co/dreamgen/opus-v1.2-llama-3-8b) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.
An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset

# Each configuration defines a timestamped split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b",
    "harness_winogrande_5",
    split="latest",
)
```
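The configuration names, split names, and Parquet file patterns in this card appear to be derived mechanically from the harness task key and the run timestamp. The sketch below reproduces that pattern as observed in the YAML header of this card; the helper names (`config_name`, `split_name`, `details_pattern`) are hypothetical and not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a harness task key to the configuration name used in this card.

    Assumption: '|', ':' and '-' are each replaced by '_', as observed in
    pairs such as 'harness|truthfulqa:mc|0' -> 'harness_truthfulqa_mc_0'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the timestamped split name ('-' and ':' become '_')."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def details_pattern(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern of the Parquet file holding one task's details."""
    file_stamp = run_timestamp.replace(":", "-")  # file names use '-' instead of ':'
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2024-04-21T05:44:21.504574"
print(config_name("harness|hendrycksTest-formal_logic|5"))
print(split_name(run))
print(details_pattern("harness|winogrande|5", run))
```

These helpers only mirror the naming seen in this card; other runs or harness versions may follow a different scheme.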
## Latest results
These are the [latest results from run 2024-04-21T05:44:21.504574](https://huggingface.co/datasets/open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b/blob/main/results_2024-04-21T05-44-21.504574.json) (note that there might be results for other tasks in the repository if successive evals didn't cover the same tasks. You can find each one in the "results" configuration and in the "latest" split of each eval):
```python
{
"all": {
"acc": 0.6661707252030249,
"acc_stderr": 0.03182337903045933,
"acc_norm": 0.6688656231798138,
"acc_norm_stderr": 0.03245840656330124,
"mc1": 0.33047735618115054,
"mc1_stderr": 0.0164667696136983,
"mc2": 0.5051117071923402,
"mc2_stderr": 0.014825504829788363
},
"harness|arc:challenge|25": {
"acc": 0.5656996587030717,
"acc_stderr": 0.014484703048857355,
"acc_norm": 0.6092150170648464,
"acc_norm_stderr": 0.014258563880513778
},
"harness|hellaswag|10": {
"acc": 0.5916152160924119,
"acc_stderr": 0.004905304371090865,
"acc_norm": 0.7927703644692292,
"acc_norm_stderr": 0.004044931315182668
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.33,
"acc_stderr": 0.047258156262526045,
"acc_norm": 0.33,
"acc_norm_stderr": 0.047258156262526045
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.6444444444444445,
"acc_stderr": 0.04135176749720385,
"acc_norm": 0.6444444444444445,
"acc_norm_stderr": 0.04135176749720385
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.6973684210526315,
"acc_stderr": 0.03738520676119667,
"acc_norm": 0.6973684210526315,
"acc_norm_stderr": 0.03738520676119667
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.69,
"acc_stderr": 0.04648231987117316,
"acc_norm": 0.69,
"acc_norm_stderr": 0.04648231987117316
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.7433962264150943,
"acc_stderr": 0.026880647889051992,
"acc_norm": 0.7433962264150943,
"acc_norm_stderr": 0.026880647889051992
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.7708333333333334,
"acc_stderr": 0.035146974678623884,
"acc_norm": 0.7708333333333334,
"acc_norm_stderr": 0.035146974678623884
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.49,
"acc_stderr": 0.05024183937956912,
"acc_norm": 0.49,
"acc_norm_stderr": 0.05024183937956912
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.56,
"acc_stderr": 0.04988876515698589,
"acc_norm": 0.56,
"acc_norm_stderr": 0.04988876515698589
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.39,
"acc_stderr": 0.04902071300001975,
"acc_norm": 0.39,
"acc_norm_stderr": 0.04902071300001975
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.653179190751445,
"acc_stderr": 0.036291466701596636,
"acc_norm": 0.653179190751445,
"acc_norm_stderr": 0.036291466701596636
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.46078431372549017,
"acc_stderr": 0.04959859966384181,
"acc_norm": 0.46078431372549017,
"acc_norm_stderr": 0.04959859966384181
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.77,
"acc_stderr": 0.04229525846816506,
"acc_norm": 0.77,
"acc_norm_stderr": 0.04229525846816506
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5787234042553191,
"acc_stderr": 0.03227834510146268,
"acc_norm": 0.5787234042553191,
"acc_norm_stderr": 0.03227834510146268
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.5701754385964912,
"acc_stderr": 0.04657047260594964,
"acc_norm": 0.5701754385964912,
"acc_norm_stderr": 0.04657047260594964
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.6620689655172414,
"acc_stderr": 0.039417076320648906,
"acc_norm": 0.6620689655172414,
"acc_norm_stderr": 0.039417076320648906
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.4126984126984127,
"acc_stderr": 0.02535574126305527,
"acc_norm": 0.4126984126984127,
"acc_norm_stderr": 0.02535574126305527
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.48412698412698413,
"acc_stderr": 0.04469881854072606,
"acc_norm": 0.48412698412698413,
"acc_norm_stderr": 0.04469881854072606
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.4,
"acc_stderr": 0.04923659639173309,
"acc_norm": 0.4,
"acc_norm_stderr": 0.04923659639173309
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.7935483870967742,
"acc_stderr": 0.023025899617188695,
"acc_norm": 0.7935483870967742,
"acc_norm_stderr": 0.023025899617188695
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.4975369458128079,
"acc_stderr": 0.03517945038691063,
"acc_norm": 0.4975369458128079,
"acc_norm_stderr": 0.03517945038691063
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.76,
"acc_stderr": 0.042923469599092816,
"acc_norm": 0.76,
"acc_norm_stderr": 0.042923469599092816
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.7575757575757576,
"acc_stderr": 0.03346409881055953,
"acc_norm": 0.7575757575757576,
"acc_norm_stderr": 0.03346409881055953
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.8434343434343434,
"acc_stderr": 0.025890520358141454,
"acc_norm": 0.8434343434343434,
"acc_norm_stderr": 0.025890520358141454
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.9067357512953368,
"acc_stderr": 0.020986854593289736,
"acc_norm": 0.9067357512953368,
"acc_norm_stderr": 0.020986854593289736
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6564102564102564,
"acc_stderr": 0.02407869658063547,
"acc_norm": 0.6564102564102564,
"acc_norm_stderr": 0.02407869658063547
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.4111111111111111,
"acc_stderr": 0.029999923508706686,
"acc_norm": 0.4111111111111111,
"acc_norm_stderr": 0.029999923508706686
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.7478991596638656,
"acc_stderr": 0.028205545033277723,
"acc_norm": 0.7478991596638656,
"acc_norm_stderr": 0.028205545033277723
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.4304635761589404,
"acc_stderr": 0.04042809961395634,
"acc_norm": 0.4304635761589404,
"acc_norm_stderr": 0.04042809961395634
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.8293577981651377,
"acc_stderr": 0.016129271025099867,
"acc_norm": 0.8293577981651377,
"acc_norm_stderr": 0.016129271025099867
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.5277777777777778,
"acc_stderr": 0.0340470532865388,
"acc_norm": 0.5277777777777778,
"acc_norm_stderr": 0.0340470532865388
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.8627450980392157,
"acc_stderr": 0.024152225962801588,
"acc_norm": 0.8627450980392157,
"acc_norm_stderr": 0.024152225962801588
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.8270042194092827,
"acc_stderr": 0.02462156286676842,
"acc_norm": 0.8270042194092827,
"acc_norm_stderr": 0.02462156286676842
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.7219730941704036,
"acc_stderr": 0.03006958487449405,
"acc_norm": 0.7219730941704036,
"acc_norm_stderr": 0.03006958487449405
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.7862595419847328,
"acc_stderr": 0.0359546161177469,
"acc_norm": 0.7862595419847328,
"acc_norm_stderr": 0.0359546161177469
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.8016528925619835,
"acc_stderr": 0.03640118271990946,
"acc_norm": 0.8016528925619835,
"acc_norm_stderr": 0.03640118271990946
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.7685185185185185,
"acc_stderr": 0.04077494709252626,
"acc_norm": 0.7685185185185185,
"acc_norm_stderr": 0.04077494709252626
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.7791411042944786,
"acc_stderr": 0.03259177392742178,
"acc_norm": 0.7791411042944786,
"acc_norm_stderr": 0.03259177392742178
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.5892857142857143,
"acc_stderr": 0.04669510663875191,
"acc_norm": 0.5892857142857143,
"acc_norm_stderr": 0.04669510663875191
},
"harness|hendrycksTest-management|5": {
"acc": 0.7864077669902912,
"acc_stderr": 0.040580420156460344,
"acc_norm": 0.7864077669902912,
"acc_norm_stderr": 0.040580420156460344
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.905982905982906,
"acc_stderr": 0.019119892798924978,
"acc_norm": 0.905982905982906,
"acc_norm_stderr": 0.019119892798924978
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.79,
"acc_stderr": 0.040936018074033256,
"acc_norm": 0.79,
"acc_norm_stderr": 0.040936018074033256
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.8122605363984674,
"acc_stderr": 0.013964393769899133,
"acc_norm": 0.8122605363984674,
"acc_norm_stderr": 0.013964393769899133
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.7485549132947977,
"acc_stderr": 0.023357365785874037,
"acc_norm": 0.7485549132947977,
"acc_norm_stderr": 0.023357365785874037
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.4346368715083799,
"acc_stderr": 0.016578997435496706,
"acc_norm": 0.4346368715083799,
"acc_norm_stderr": 0.016578997435496706
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.7483660130718954,
"acc_stderr": 0.0248480182638752,
"acc_norm": 0.7483660130718954,
"acc_norm_stderr": 0.0248480182638752
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.7331189710610932,
"acc_stderr": 0.02512263760881666,
"acc_norm": 0.7331189710610932,
"acc_norm_stderr": 0.02512263760881666
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.7160493827160493,
"acc_stderr": 0.025089478523765134,
"acc_norm": 0.7160493827160493,
"acc_norm_stderr": 0.025089478523765134
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.4929078014184397,
"acc_stderr": 0.02982449855912901,
"acc_norm": 0.4929078014184397,
"acc_norm_stderr": 0.02982449855912901
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.46936114732724904,
"acc_stderr": 0.012746237711716634,
"acc_norm": 0.46936114732724904,
"acc_norm_stderr": 0.012746237711716634
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.7022058823529411,
"acc_stderr": 0.02777829870154544,
"acc_norm": 0.7022058823529411,
"acc_norm_stderr": 0.02777829870154544
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.7189542483660131,
"acc_stderr": 0.01818521895431808,
"acc_norm": 0.7189542483660131,
"acc_norm_stderr": 0.01818521895431808
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6818181818181818,
"acc_stderr": 0.04461272175910509,
"acc_norm": 0.6818181818181818,
"acc_norm_stderr": 0.04461272175910509
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.726530612244898,
"acc_stderr": 0.028535560337128438,
"acc_norm": 0.726530612244898,
"acc_norm_stderr": 0.028535560337128438
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.845771144278607,
"acc_stderr": 0.025538433368578327,
"acc_norm": 0.845771144278607,
"acc_norm_stderr": 0.025538433368578327
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.87,
"acc_stderr": 0.033799766898963086,
"acc_norm": 0.87,
"acc_norm_stderr": 0.033799766898963086
},
"harness|hendrycksTest-virology|5": {
"acc": 0.5,
"acc_stderr": 0.03892494720807614,
"acc_norm": 0.5,
"acc_norm_stderr": 0.03892494720807614
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.7719298245614035,
"acc_stderr": 0.032180937956023566,
"acc_norm": 0.7719298245614035,
"acc_norm_stderr": 0.032180937956023566
},
"harness|truthfulqa:mc|0": {
"mc1": 0.33047735618115054,
"mc1_stderr": 0.0164667696136983,
"mc2": 0.5051117071923402,
"mc2_stderr": 0.014825504829788363
},
"harness|winogrande|5": {
"acc": 0.7569060773480663,
"acc_stderr": 0.012055665630431037
},
"harness|gsm8k|5": {
"acc": 0.6611068991660348,
"acc_stderr": 0.01303795576856251
}
}
```
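As a rough illustration of how the figures above can be summarized, the sketch below averages `acc` over the `hendrycksTest` entries and derives an approximate 95% interval from `acc` and `acc_stderr`. The `results` dictionary holds only a few entries copied from the JSON above, the function names are hypothetical, and the interval assumes a normal approximation (acc ± 1.96 × stderr) that the card itself does not state.

```python
from statistics import mean

# A small excerpt of the results shown above (not the full set of tasks).
results = {
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.33, "acc_stderr": 0.047258156262526045},
    "harness|hendrycksTest-anatomy|5": {
        "acc": 0.6444444444444445, "acc_stderr": 0.04135176749720385},
    "harness|hendrycksTest-astronomy|5": {
        "acc": 0.6973684210526315, "acc_stderr": 0.03738520676119667},
    "harness|winogrande|5": {
        "acc": 0.7569060773480663, "acc_stderr": 0.012055665630431037},
}


def mean_metric(results, task_prefix, metric="acc"):
    """Average `metric` over all tasks whose key starts with `task_prefix`.

    Raises statistics.StatisticsError if no task matches.
    """
    values = [
        scores[metric]
        for task, scores in results.items()
        if task.startswith(task_prefix) and metric in scores
    ]
    return mean(values)


def normal_interval(acc, stderr, z=1.96):
    """Return (low, high) for acc +/- z * stderr (normal approximation, assumed)."""
    return acc - z * stderr, acc + z * stderr


mmlu_excerpt_mean = mean_metric(results, "harness|hendrycksTest-")
wino = results["harness|winogrande|5"]
low, high = normal_interval(wino["acc"], wino["acc_stderr"])
print(f"mean acc over excerpted hendrycksTest tasks: {mmlu_excerpt_mean:.4f}")
print(f"winogrande acc interval: [{low:.4f}, {high:.4f}]")
```

The same two functions can be applied to the full dictionary from the results file; the excerpt is kept short here only for readability.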
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
[More Information Needed]
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
Provided by:
open-llm-leaderboard-old
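Summary of Original Information
Dataset Overview
Dataset Creation
- Dataset name: Evaluation run of dreamgen/opus-v1.2-llama-3-8b
- Purpose: Created automatically during the evaluation run of the model dreamgen/opus-v1.2-llama-3-8b on the Open LLM Leaderboard.
Dataset Structure
- Number of configurations: 63, each corresponding to one evaluated task.
- Data source: Created from 1 run. Each run appears as a specific split in each configuration, and the split is named after the run's timestamp.
- Latest results: The "latest" split always points to the latest results.
- Aggregated results: An additional "results" configuration stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.
Data Loading Example
`from datasets import load_dataset; data = load_dataset("open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b", "harness_winogrande_5", split="latest")`
Latest Results
- Result source: The latest results come from run 2024-04-21T05:44:21.504574.
- Task results: Evaluation results for multiple tasks, such as "harness|arc:challenge|25" and "harness|hellaswag|10".
Configuration Details
- Configuration names: For example, "harness_arc_challenge_25" and "harness_gsm8k_5".
- Data files: Each configuration contains multiple data files, with path patterns such as "**/details_harness|arc:challenge|25_2024-04-21T05-44-21.504574.parquet".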
Collected Summary
Dataset Introduction

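Construction
This dataset is an automatically generated by-product of the Open LLM Leaderboard's evaluation of the model dreamgen/opus-v1.2-llama-3-8b. It comes from a single evaluation run and covers 63 separate configurations, each corresponding to one evaluated task. The records are stored in Parquet format and divided into splits by run timestamp, with the 'latest' split always pointing to the most recent evaluation results. In addition, an extra configuration named 'results' aggregates the metrics across all evaluation runs and supports the computation and display of the overall scores on the leaderboard.
Usage
Researchers can load the dataset with the Hugging Face `datasets` library. For example, to load the detailed winogrande results from the latest evaluation, call `load_dataset` with the dataset name and the corresponding configuration (such as 'harness_winogrande_5') and select the 'latest' split. The code is: `from datasets import load_dataset; data = load_dataset('open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b', 'harness_winogrande_5', split='latest')`. The same approach gives access to the evaluation details of any task for reuse and analysis.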
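The loading step described above could be wrapped so that several task configurations are fetched in one pass. This is only a sketch: `load_task_details` and its `loader` parameter are hypothetical names introduced here, the repository ID is copied from the card's own example, and the default split "latest" follows the split names listed in the card's YAML header. The real call needs the `datasets` package and network access, so the demonstration uses a stand-in loader.

```python
REPO_ID = "open-llm-leaderboard/details_dreamgen__opus-v1.2-llama-3-8b"


def load_task_details(config_names, split="latest", repo_id=REPO_ID, loader=None):
    """Load one split for each configuration and return them keyed by name.

    `loader` defaults to `datasets.load_dataset`; it is exposed as a
    parameter (an assumption of this sketch) so the function can be tried
    without network access.
    """
    if loader is None:
        from datasets import load_dataset  # imported lazily, only when needed
        loader = load_dataset
    return {name: loader(repo_id, name, split=split) for name in config_names}


def echo_loader(repo_id, name, split):
    """Stand-in loader that returns its arguments instead of downloading data."""
    return (repo_id, name, split)


preview = load_task_details(["harness_winogrande_5", "results"], loader=echo_loader)
print(preview["results"])
```

Dropping the `loader=echo_loader` argument would make the function call `datasets.load_dataset` for each configuration name.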
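Background and Challenges
Background Overview
The rapid development of large language models (LLMs) has created an urgent need to evaluate their performance systematically, and the Open LLM Leaderboard emerged to give the community a standardized, transparent platform for model evaluation. The dataset was created by the HuggingFace team in 2024, with core researchers including Clémentine and others. Its central research question is how to measure objectively, through multi-dimensional, multi-task benchmarks, the real capabilities of models such as dreamgen/opus-v1.2-llama-3-8b in areas like commonsense reasoning, mathematical computation, and knowledge question answering. As an important piece of infrastructure for LLM evaluation, the dataset integrates classic benchmarks such as ARC, HellaSwag, MMLU, and GSM8K, provides researchers with a reproducible evaluation framework, and has significantly advanced side-by-side comparison and iterative improvement of models, with far-reaching influence on understanding the limits of model generalization and the breadth of model knowledge.
Current Challenges
The challenges facing this dataset stem mainly from the complexity of LLM evaluation itself. At the level of the problem domain, designing evaluation tasks that fully reflect a model's reasoning, factuality, and robustness is difficult: existing benchmarks such as TruthfulQA remain limited in detecting model hallucination, and the way mathematical tasks such as GSM8K assess step-by-step reasoning also needs continued refinement. During construction, the dataset has to standardize and integrate heterogeneous tasks from multiple sources, keep scoring consistent and reproducible across tasks, and cope with result fluctuations caused by random seeds, decoding parameters, and similar factors. In addition, as model capabilities evolve quickly, updating the task set in time to cover newly emerging capabilities (such as multimodal understanding and long-text processing), so that the evaluation system does not lag behind model development, is a challenge that still needs to be addressed.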
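Common Scenarios
Classic Use Cases
This dataset is designed for the standardized evaluation of large language models; its core purpose is to provide fine-grained performance tracking for model evaluation on the Open LLM Leaderboard. It covers classic benchmark tasks including the ARC Challenge, HellaSwag, MMLU (57 subjects ranging from abstract algebra to virology), GSM8K, TruthfulQA, and Winogrande. Researchers can load a specific configuration (such as harness_arc_challenge_25) to reproduce or analyze a model's performance on a single task.
Academic Problems Addressed
The dataset systematically addresses the academic problems of insufficient reproducibility and fragmented results in LLM evaluation. By storing the results of each evaluation run as a separate data split and preserving the original evaluation timestamp, it lets researchers trace precisely how a model performed on a given task at a given time, which supports work on key questions such as analysis of model performance degradation, comparison of training strategies, and cross-model generalization.
Practical Applications
In practice, the dataset gives model developers an automated, standardized tool for monitoring model iterations. When a new model version is released, developers can read the latest results split directly to see quickly how the model's capabilities have changed in reasoning, commonsense understanding, mathematical problem solving, and factual consistency, which guides adjustments to fine-tuning strategy and significantly improves the efficiency of moving a model from the lab to production.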
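Recent Research on the Dataset
Latest Research Directions
LLM performance evaluation is currently moving from single benchmark tests toward multi-dimensional, fine-grained capability maps. Around the Open LLM Leaderboard evaluation data for dreamgen/opus-v1.2-llama-3-8b, research attention focuses on building standardized, reproducible evaluation pipelines that systematically reveal the limits of model capability in areas such as commonsense reasoning (HellaSwag, ARC-Challenge), mathematical problem solving (GSM8K), knowledge question answering (MMLU, covering 57 subjects), and factual consistency (TruthfulQA). With 63 task configurations and timestamped records of each run, the dataset is a valuable resource for tracking performance fluctuations across model iterations. Its significance lies in moving LLM evaluation from coarse comparisons of total scores toward precise diagnosis of specific cognitive deficits, providing empirical evidence for model optimization and safe deployment.
The above content was collected and summarized by 遇见数据集.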



