open-llm-leaderboard-old/details_chanwit__flux-7b-v0.2
Source: Hugging Face | Updated: 2024-01-18 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_chanwit__flux-7b-v0.2
Resource description:
---
pretty_name: Evaluation run of chanwit/flux-7b-v0.2
dataset_summary: "Dataset automatically created during the evaluation run of model\
\ [chanwit/flux-7b-v0.2](https://huggingface.co/chanwit/flux-7b-v0.2) on the [Open\
\ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\
\nThe dataset is composed of 63 configurations, each one corresponding to one of the\
\ evaluated tasks.\n\nThe dataset has been created from 1 run(s). Each run can be\
\ found as a specific split in each configuration, the split being named using the\
\ timestamp of the run. The \"latest\" split always points to the latest results.\n\
\nAn additional configuration \"results\" stores all the aggregated results of the\
\ run (and is used to compute and display the aggregated metrics on the [Open LLM\
\ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\
\nTo load the details from a run, you can for instance do the following:\n```python\n\
from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_chanwit__flux-7b-v0.2\"\
,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\
These are the [latest results from run 2024-01-18T00:42:00.685036](https://huggingface.co/datasets/open-llm-leaderboard/details_chanwit__flux-7b-v0.2/blob/main/results_2024-01-18T00-42-00.685036.json) (note\
\ that there might be results for other tasks in the repo if successive evals didn't\
\ cover the same tasks. You can find each in the results and the \"latest\" split for\
\ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6573592496311628,\n\
\ \"acc_stderr\": 0.031783424194298984,\n \"acc_norm\": 0.6574655421284195,\n\
\ \"acc_norm_stderr\": 0.03243446995426543,\n \"mc1\": 0.3537331701346389,\n\
\ \"mc1_stderr\": 0.016737814358846147,\n \"mc2\": 0.5180401965777761,\n\
\ \"mc2_stderr\": 0.015565981129474472\n },\n \"harness|arc:challenge|25\"\
: {\n \"acc\": 0.6331058020477816,\n \"acc_stderr\": 0.014084133118104294,\n\
\ \"acc_norm\": 0.6655290102389079,\n \"acc_norm_stderr\": 0.013787460322441374\n\
\ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6825333598884684,\n\
\ \"acc_stderr\": 0.004645393477680678,\n \"acc_norm\": 0.8611830312686716,\n\
\ \"acc_norm_stderr\": 0.003450488042965005\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\
: {\n \"acc\": 0.36,\n \"acc_stderr\": 0.04824181513244218,\n \
\ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.04824181513244218\n \
\ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6074074074074074,\n\
\ \"acc_stderr\": 0.0421850621536888,\n \"acc_norm\": 0.6074074074074074,\n\
\ \"acc_norm_stderr\": 0.0421850621536888\n },\n \"harness|hendrycksTest-astronomy|5\"\
: {\n \"acc\": 0.7039473684210527,\n \"acc_stderr\": 0.03715062154998904,\n\
\ \"acc_norm\": 0.7039473684210527,\n \"acc_norm_stderr\": 0.03715062154998904\n\
\ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.63,\n\
\ \"acc_stderr\": 0.04852365870939099,\n \"acc_norm\": 0.63,\n \
\ \"acc_norm_stderr\": 0.04852365870939099\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\
: {\n \"acc\": 0.6981132075471698,\n \"acc_stderr\": 0.02825420034443866,\n\
\ \"acc_norm\": 0.6981132075471698,\n \"acc_norm_stderr\": 0.02825420034443866\n\
\ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7777777777777778,\n\
\ \"acc_stderr\": 0.03476590104304134,\n \"acc_norm\": 0.7777777777777778,\n\
\ \"acc_norm_stderr\": 0.03476590104304134\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.49,\n \"acc_stderr\": 0.05024183937956912,\n \
\ \"acc_norm\": 0.49,\n \"acc_norm_stderr\": 0.05024183937956912\n \
\ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\
: 0.56,\n \"acc_stderr\": 0.049888765156985884,\n \"acc_norm\": 0.56,\n\
\ \"acc_norm_stderr\": 0.049888765156985884\n },\n \"harness|hendrycksTest-college_mathematics|5\"\
: {\n \"acc\": 0.32,\n \"acc_stderr\": 0.04688261722621504,\n \
\ \"acc_norm\": 0.32,\n \"acc_norm_stderr\": 0.04688261722621504\n \
\ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6936416184971098,\n\
\ \"acc_stderr\": 0.03514942551267438,\n \"acc_norm\": 0.6936416184971098,\n\
\ \"acc_norm_stderr\": 0.03514942551267438\n },\n \"harness|hendrycksTest-college_physics|5\"\
: {\n \"acc\": 0.4215686274509804,\n \"acc_stderr\": 0.04913595201274498,\n\
\ \"acc_norm\": 0.4215686274509804,\n \"acc_norm_stderr\": 0.04913595201274498\n\
\ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\
\ 0.77,\n \"acc_stderr\": 0.04229525846816507,\n \"acc_norm\": 0.77,\n\
\ \"acc_norm_stderr\": 0.04229525846816507\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\
: {\n \"acc\": 0.5957446808510638,\n \"acc_stderr\": 0.032081157507886836,\n\
\ \"acc_norm\": 0.5957446808510638,\n \"acc_norm_stderr\": 0.032081157507886836\n\
\ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.5,\n\
\ \"acc_stderr\": 0.047036043419179864,\n \"acc_norm\": 0.5,\n \
\ \"acc_norm_stderr\": 0.047036043419179864\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\
: {\n \"acc\": 0.5448275862068965,\n \"acc_stderr\": 0.04149886942192117,\n\
\ \"acc_norm\": 0.5448275862068965,\n \"acc_norm_stderr\": 0.04149886942192117\n\
\ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\
: 0.3994708994708995,\n \"acc_stderr\": 0.02522545028406788,\n \"\
acc_norm\": 0.3994708994708995,\n \"acc_norm_stderr\": 0.02522545028406788\n\
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4523809523809524,\n\
\ \"acc_stderr\": 0.044518079590553275,\n \"acc_norm\": 0.4523809523809524,\n\
\ \"acc_norm_stderr\": 0.044518079590553275\n },\n \"harness|hendrycksTest-global_facts|5\"\
: {\n \"acc\": 0.36,\n \"acc_stderr\": 0.048241815132442176,\n \
\ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.048241815132442176\n \
\ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\"\
: 0.7870967741935484,\n \"acc_stderr\": 0.02328766512726854,\n \"\
acc_norm\": 0.7870967741935484,\n \"acc_norm_stderr\": 0.02328766512726854\n\
\ },\n \"harness|hendrycksTest-high_school_chemistry|5\": {\n \"acc\"\
: 0.4975369458128079,\n \"acc_stderr\": 0.03517945038691063,\n \"\
acc_norm\": 0.4975369458128079,\n \"acc_norm_stderr\": 0.03517945038691063\n\
\ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \
\ \"acc\": 0.72,\n \"acc_stderr\": 0.045126085985421276,\n \"acc_norm\"\
: 0.72,\n \"acc_norm_stderr\": 0.045126085985421276\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\
: {\n \"acc\": 0.7818181818181819,\n \"acc_stderr\": 0.03225078108306289,\n\
\ \"acc_norm\": 0.7818181818181819,\n \"acc_norm_stderr\": 0.03225078108306289\n\
\ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\
: 0.7777777777777778,\n \"acc_stderr\": 0.029620227874790482,\n \"\
acc_norm\": 0.7777777777777778,\n \"acc_norm_stderr\": 0.029620227874790482\n\
\ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\
\ \"acc\": 0.9119170984455959,\n \"acc_stderr\": 0.02045374660160103,\n\
\ \"acc_norm\": 0.9119170984455959,\n \"acc_norm_stderr\": 0.02045374660160103\n\
\ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \
\ \"acc\": 0.6666666666666666,\n \"acc_stderr\": 0.023901157979402534,\n\
\ \"acc_norm\": 0.6666666666666666,\n \"acc_norm_stderr\": 0.023901157979402534\n\
\ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\
acc\": 0.34444444444444444,\n \"acc_stderr\": 0.02897264888484427,\n \
\ \"acc_norm\": 0.34444444444444444,\n \"acc_norm_stderr\": 0.02897264888484427\n\
\ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \
\ \"acc\": 0.7058823529411765,\n \"acc_stderr\": 0.029597329730978086,\n\
\ \"acc_norm\": 0.7058823529411765,\n \"acc_norm_stderr\": 0.029597329730978086\n\
\ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\
: 0.33112582781456956,\n \"acc_stderr\": 0.038425817186598696,\n \"\
acc_norm\": 0.33112582781456956,\n \"acc_norm_stderr\": 0.038425817186598696\n\
\ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\
: 0.8495412844036697,\n \"acc_stderr\": 0.015328563932669237,\n \"\
acc_norm\": 0.8495412844036697,\n \"acc_norm_stderr\": 0.015328563932669237\n\
\ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\
: 0.5092592592592593,\n \"acc_stderr\": 0.034093869469927006,\n \"\
acc_norm\": 0.5092592592592593,\n \"acc_norm_stderr\": 0.034093869469927006\n\
\ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\
: 0.8431372549019608,\n \"acc_stderr\": 0.025524722324553353,\n \"\
acc_norm\": 0.8431372549019608,\n \"acc_norm_stderr\": 0.025524722324553353\n\
\ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\
acc\": 0.810126582278481,\n \"acc_stderr\": 0.025530100460233497,\n \
\ \"acc_norm\": 0.810126582278481,\n \"acc_norm_stderr\": 0.025530100460233497\n\
\ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.7085201793721974,\n\
\ \"acc_stderr\": 0.03050028317654585,\n \"acc_norm\": 0.7085201793721974,\n\
\ \"acc_norm_stderr\": 0.03050028317654585\n },\n \"harness|hendrycksTest-human_sexuality|5\"\
: {\n \"acc\": 0.8015267175572519,\n \"acc_stderr\": 0.03498149385462472,\n\
\ \"acc_norm\": 0.8015267175572519,\n \"acc_norm_stderr\": 0.03498149385462472\n\
\ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\
\ 0.8347107438016529,\n \"acc_stderr\": 0.03390780612972776,\n \"\
acc_norm\": 0.8347107438016529,\n \"acc_norm_stderr\": 0.03390780612972776\n\
\ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.8148148148148148,\n\
\ \"acc_stderr\": 0.03755265865037182,\n \"acc_norm\": 0.8148148148148148,\n\
\ \"acc_norm_stderr\": 0.03755265865037182\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\
: {\n \"acc\": 0.7914110429447853,\n \"acc_stderr\": 0.03192193448934724,\n\
\ \"acc_norm\": 0.7914110429447853,\n \"acc_norm_stderr\": 0.03192193448934724\n\
\ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5089285714285714,\n\
\ \"acc_stderr\": 0.04745033255489123,\n \"acc_norm\": 0.5089285714285714,\n\
\ \"acc_norm_stderr\": 0.04745033255489123\n },\n \"harness|hendrycksTest-management|5\"\
: {\n \"acc\": 0.8543689320388349,\n \"acc_stderr\": 0.0349260647662379,\n\
\ \"acc_norm\": 0.8543689320388349,\n \"acc_norm_stderr\": 0.0349260647662379\n\
\ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8504273504273504,\n\
\ \"acc_stderr\": 0.023365051491753715,\n \"acc_norm\": 0.8504273504273504,\n\
\ \"acc_norm_stderr\": 0.023365051491753715\n },\n \"harness|hendrycksTest-medical_genetics|5\"\
: {\n \"acc\": 0.7,\n \"acc_stderr\": 0.046056618647183814,\n \
\ \"acc_norm\": 0.7,\n \"acc_norm_stderr\": 0.046056618647183814\n \
\ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8301404853128991,\n\
\ \"acc_stderr\": 0.013428186370608306,\n \"acc_norm\": 0.8301404853128991,\n\
\ \"acc_norm_stderr\": 0.013428186370608306\n },\n \"harness|hendrycksTest-moral_disputes|5\"\
: {\n \"acc\": 0.7427745664739884,\n \"acc_stderr\": 0.023532925431044283,\n\
\ \"acc_norm\": 0.7427745664739884,\n \"acc_norm_stderr\": 0.023532925431044283\n\
\ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3877094972067039,\n\
\ \"acc_stderr\": 0.016295332328155818,\n \"acc_norm\": 0.3877094972067039,\n\
\ \"acc_norm_stderr\": 0.016295332328155818\n },\n \"harness|hendrycksTest-nutrition|5\"\
: {\n \"acc\": 0.7320261437908496,\n \"acc_stderr\": 0.025360603796242553,\n\
\ \"acc_norm\": 0.7320261437908496,\n \"acc_norm_stderr\": 0.025360603796242553\n\
\ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7202572347266881,\n\
\ \"acc_stderr\": 0.02549425935069491,\n \"acc_norm\": 0.7202572347266881,\n\
\ \"acc_norm_stderr\": 0.02549425935069491\n },\n \"harness|hendrycksTest-prehistory|5\"\
: {\n \"acc\": 0.7283950617283951,\n \"acc_stderr\": 0.02474862449053737,\n\
\ \"acc_norm\": 0.7283950617283951,\n \"acc_norm_stderr\": 0.02474862449053737\n\
\ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\
acc\": 0.4645390070921986,\n \"acc_stderr\": 0.02975238965742705,\n \
\ \"acc_norm\": 0.4645390070921986,\n \"acc_norm_stderr\": 0.02975238965742705\n\
\ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4791395045632334,\n\
\ \"acc_stderr\": 0.012759117066518012,\n \"acc_norm\": 0.4791395045632334,\n\
\ \"acc_norm_stderr\": 0.012759117066518012\n },\n \"harness|hendrycksTest-professional_medicine|5\"\
: {\n \"acc\": 0.7279411764705882,\n \"acc_stderr\": 0.02703304115168146,\n\
\ \"acc_norm\": 0.7279411764705882,\n \"acc_norm_stderr\": 0.02703304115168146\n\
\ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\
acc\": 0.6830065359477124,\n \"acc_stderr\": 0.018824219512706207,\n \
\ \"acc_norm\": 0.6830065359477124,\n \"acc_norm_stderr\": 0.018824219512706207\n\
\ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6727272727272727,\n\
\ \"acc_stderr\": 0.0449429086625209,\n \"acc_norm\": 0.6727272727272727,\n\
\ \"acc_norm_stderr\": 0.0449429086625209\n },\n \"harness|hendrycksTest-security_studies|5\"\
: {\n \"acc\": 0.7142857142857143,\n \"acc_stderr\": 0.028920583220675602,\n\
\ \"acc_norm\": 0.7142857142857143,\n \"acc_norm_stderr\": 0.028920583220675602\n\
\ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8407960199004975,\n\
\ \"acc_stderr\": 0.025870646766169136,\n \"acc_norm\": 0.8407960199004975,\n\
\ \"acc_norm_stderr\": 0.025870646766169136\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\
: {\n \"acc\": 0.87,\n \"acc_stderr\": 0.03379976689896308,\n \
\ \"acc_norm\": 0.87,\n \"acc_norm_stderr\": 0.03379976689896308\n \
\ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.536144578313253,\n\
\ \"acc_stderr\": 0.038823108508905954,\n \"acc_norm\": 0.536144578313253,\n\
\ \"acc_norm_stderr\": 0.038823108508905954\n },\n \"harness|hendrycksTest-world_religions|5\"\
: {\n \"acc\": 0.847953216374269,\n \"acc_stderr\": 0.027539122889061463,\n\
\ \"acc_norm\": 0.847953216374269,\n \"acc_norm_stderr\": 0.027539122889061463\n\
\ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3537331701346389,\n\
\ \"mc1_stderr\": 0.016737814358846147,\n \"mc2\": 0.5180401965777761,\n\
\ \"mc2_stderr\": 0.015565981129474472\n },\n \"harness|winogrande|5\"\
: {\n \"acc\": 0.7932123125493291,\n \"acc_stderr\": 0.011382566829235798\n\
\ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.7263078089461713,\n \
\ \"acc_stderr\": 0.012281003490963449\n }\n}\n```"
repo_url: https://huggingface.co/chanwit/flux-7b-v0.2
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|arc:challenge|25_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|arc:challenge|25_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_gsm8k_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|gsm8k|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|gsm8k|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hellaswag_10
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hellaswag|10_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hellaswag|10_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-01-18T00-42-00.685036.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_abstract_algebra_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_anatomy_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_astronomy_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_business_ethics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_clinical_knowledge_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_college_biology_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_college_chemistry_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_college_computer_science_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_college_mathematics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_college_medicine_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_college_physics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_computer_security_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_conceptual_physics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_econometrics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_electrical_engineering_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_elementary_mathematics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_formal_logic_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_global_facts_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_biology_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_chemistry_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_computer_science_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_european_history_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_geography_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_government_and_politics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_macroeconomics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_mathematics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_microeconomics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_physics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_psychology_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_statistics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_us_history_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_high_school_world_history_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_human_aging_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_human_sexuality_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_international_law_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_jurisprudence_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_logical_fallacies_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_machine_learning_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_management_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-management|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-management|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_marketing_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_medical_genetics_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_miscellaneous_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_moral_disputes_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_moral_scenarios_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_nutrition_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_philosophy_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_prehistory_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_professional_accounting_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_professional_law_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_public_relations_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_security_studies_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_sociology_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_virology_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-virology|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-virology|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_hendrycksTest_world_religions_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_truthfulqa_mc_0
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|truthfulqa:mc|0_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|truthfulqa:mc|0_2024-01-18T00-42-00.685036.parquet'
- config_name: harness_winogrande_5
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- '**/details_harness|winogrande|5_2024-01-18T00-42-00.685036.parquet'
- split: latest
path:
- '**/details_harness|winogrande|5_2024-01-18T00-42-00.685036.parquet'
- config_name: results
data_files:
- split: 2024_01_18T00_42_00.685036
path:
- results_2024-01-18T00-42-00.685036.parquet
- split: latest
path:
- results_2024-01-18T00-42-00.685036.parquet
---
# Dataset Card for Evaluation run of chanwit/flux-7b-v0.2
<!-- Provide a quick summary of the dataset. -->
Dataset automatically created during the evaluation run of model [chanwit/flux-7b-v0.2](https://huggingface.co/chanwit/flux-7b-v0.2) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.
An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset
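data = load_dataset(
    "open-llm-leaderboard/details_chanwit__flux-7b-v0.2",
    "harness_winogrande_5",  # configuration name of the task to load
    split="latest",  # the configuration metadata defines only the run timestamp and "latest" as splits
)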
```
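Configuration names appear to be derived from the harness task keys used in the results file: in the configuration metadata, the task `harness|hendrycksTest-conceptual_physics|5` corresponds to the configuration `harness_hendrycksTest_conceptual_physics_5`. The following is a minimal sketch of that mapping, assuming every non-alphanumeric character becomes an underscore. The helper names `task_to_config` and `load_task_details` are hypothetical and not part of the `datasets` library, and the default split name `latest` is taken from the configuration metadata.

```python
import re


def task_to_config(task_key: str) -> str:
    """Map a harness task key to its configuration name.

    Assumption: each non-alphanumeric character ("|", "-", ":") is replaced
    by an underscore, which matches the configuration names in this card.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", task_key)


def load_task_details(task_key: str, split: str = "latest"):
    """Load the per-example details for one task (requires network access)."""
    # Imported lazily so that task_to_config has no third-party dependency.
    from datasets import load_dataset

    return load_dataset(
        "open-llm-leaderboard/details_chanwit__flux-7b-v0.2",
        task_to_config(task_key),
        split=split,
    )


if __name__ == "__main__":
    # Only the name mapping is exercised here; no download is triggered.
    print(task_to_config("harness|winogrande|5"))
```

Calling `load_task_details("harness|truthfulqa:mc|0")` would then request the `harness_truthfulqa_mc_0` configuration without the caller having to spell out the name by hand.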
## Latest results
These are the [latest results from run 2024-01-18T00:42:00.685036](https://huggingface.co/datasets/open-llm-leaderboard/details_chanwit__flux-7b-v0.2/blob/main/results_2024-01-18T00-42-00.685036.json) (note that there might be results for other tasks in the repository if successive evals didn't cover the same tasks. You can find each of them in the "results" configuration and in the "latest" split of each eval):
```python
{
"all": {
"acc": 0.6573592496311628,
"acc_stderr": 0.031783424194298984,
"acc_norm": 0.6574655421284195,
"acc_norm_stderr": 0.03243446995426543,
"mc1": 0.3537331701346389,
"mc1_stderr": 0.016737814358846147,
"mc2": 0.5180401965777761,
"mc2_stderr": 0.015565981129474472
},
"harness|arc:challenge|25": {
"acc": 0.6331058020477816,
"acc_stderr": 0.014084133118104294,
"acc_norm": 0.6655290102389079,
"acc_norm_stderr": 0.013787460322441374
},
"harness|hellaswag|10": {
"acc": 0.6825333598884684,
"acc_stderr": 0.004645393477680678,
"acc_norm": 0.8611830312686716,
"acc_norm_stderr": 0.003450488042965005
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.36,
"acc_stderr": 0.04824181513244218,
"acc_norm": 0.36,
"acc_norm_stderr": 0.04824181513244218
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.6074074074074074,
"acc_stderr": 0.0421850621536888,
"acc_norm": 0.6074074074074074,
"acc_norm_stderr": 0.0421850621536888
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.7039473684210527,
"acc_stderr": 0.03715062154998904,
"acc_norm": 0.7039473684210527,
"acc_norm_stderr": 0.03715062154998904
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.63,
"acc_stderr": 0.04852365870939099,
"acc_norm": 0.63,
"acc_norm_stderr": 0.04852365870939099
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.6981132075471698,
"acc_stderr": 0.02825420034443866,
"acc_norm": 0.6981132075471698,
"acc_norm_stderr": 0.02825420034443866
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.7777777777777778,
"acc_stderr": 0.03476590104304134,
"acc_norm": 0.7777777777777778,
"acc_norm_stderr": 0.03476590104304134
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.49,
"acc_stderr": 0.05024183937956912,
"acc_norm": 0.49,
"acc_norm_stderr": 0.05024183937956912
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.56,
"acc_stderr": 0.049888765156985884,
"acc_norm": 0.56,
"acc_norm_stderr": 0.049888765156985884
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.32,
"acc_stderr": 0.04688261722621504,
"acc_norm": 0.32,
"acc_norm_stderr": 0.04688261722621504
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.6936416184971098,
"acc_stderr": 0.03514942551267438,
"acc_norm": 0.6936416184971098,
"acc_norm_stderr": 0.03514942551267438
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.4215686274509804,
"acc_stderr": 0.04913595201274498,
"acc_norm": 0.4215686274509804,
"acc_norm_stderr": 0.04913595201274498
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.77,
"acc_stderr": 0.04229525846816507,
"acc_norm": 0.77,
"acc_norm_stderr": 0.04229525846816507
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5957446808510638,
"acc_stderr": 0.032081157507886836,
"acc_norm": 0.5957446808510638,
"acc_norm_stderr": 0.032081157507886836
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.5,
"acc_stderr": 0.047036043419179864,
"acc_norm": 0.5,
"acc_norm_stderr": 0.047036043419179864
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.5448275862068965,
"acc_stderr": 0.04149886942192117,
"acc_norm": 0.5448275862068965,
"acc_norm_stderr": 0.04149886942192117
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.3994708994708995,
"acc_stderr": 0.02522545028406788,
"acc_norm": 0.3994708994708995,
"acc_norm_stderr": 0.02522545028406788
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.4523809523809524,
"acc_stderr": 0.044518079590553275,
"acc_norm": 0.4523809523809524,
"acc_norm_stderr": 0.044518079590553275
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.36,
"acc_stderr": 0.048241815132442176,
"acc_norm": 0.36,
"acc_norm_stderr": 0.048241815132442176
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.7870967741935484,
"acc_stderr": 0.02328766512726854,
"acc_norm": 0.7870967741935484,
"acc_norm_stderr": 0.02328766512726854
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.4975369458128079,
"acc_stderr": 0.03517945038691063,
"acc_norm": 0.4975369458128079,
"acc_norm_stderr": 0.03517945038691063
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.72,
"acc_stderr": 0.045126085985421276,
"acc_norm": 0.72,
"acc_norm_stderr": 0.045126085985421276
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.7818181818181819,
"acc_stderr": 0.03225078108306289,
"acc_norm": 0.7818181818181819,
"acc_norm_stderr": 0.03225078108306289
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.7777777777777778,
"acc_stderr": 0.029620227874790482,
"acc_norm": 0.7777777777777778,
"acc_norm_stderr": 0.029620227874790482
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.9119170984455959,
"acc_stderr": 0.02045374660160103,
"acc_norm": 0.9119170984455959,
"acc_norm_stderr": 0.02045374660160103
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6666666666666666,
"acc_stderr": 0.023901157979402534,
"acc_norm": 0.6666666666666666,
"acc_norm_stderr": 0.023901157979402534
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.34444444444444444,
"acc_stderr": 0.02897264888484427,
"acc_norm": 0.34444444444444444,
"acc_norm_stderr": 0.02897264888484427
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.7058823529411765,
"acc_stderr": 0.029597329730978086,
"acc_norm": 0.7058823529411765,
"acc_norm_stderr": 0.029597329730978086
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.33112582781456956,
"acc_stderr": 0.038425817186598696,
"acc_norm": 0.33112582781456956,
"acc_norm_stderr": 0.038425817186598696
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.8495412844036697,
"acc_stderr": 0.015328563932669237,
"acc_norm": 0.8495412844036697,
"acc_norm_stderr": 0.015328563932669237
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.5092592592592593,
"acc_stderr": 0.034093869469927006,
"acc_norm": 0.5092592592592593,
"acc_norm_stderr": 0.034093869469927006
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.8431372549019608,
"acc_stderr": 0.025524722324553353,
"acc_norm": 0.8431372549019608,
"acc_norm_stderr": 0.025524722324553353
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.810126582278481,
"acc_stderr": 0.025530100460233497,
"acc_norm": 0.810126582278481,
"acc_norm_stderr": 0.025530100460233497
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.7085201793721974,
"acc_stderr": 0.03050028317654585,
"acc_norm": 0.7085201793721974,
"acc_norm_stderr": 0.03050028317654585
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.8015267175572519,
"acc_stderr": 0.03498149385462472,
"acc_norm": 0.8015267175572519,
"acc_norm_stderr": 0.03498149385462472
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.8347107438016529,
"acc_stderr": 0.03390780612972776,
"acc_norm": 0.8347107438016529,
"acc_norm_stderr": 0.03390780612972776
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.8148148148148148,
"acc_stderr": 0.03755265865037182,
"acc_norm": 0.8148148148148148,
"acc_norm_stderr": 0.03755265865037182
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.7914110429447853,
"acc_stderr": 0.03192193448934724,
"acc_norm": 0.7914110429447853,
"acc_norm_stderr": 0.03192193448934724
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.5089285714285714,
"acc_stderr": 0.04745033255489123,
"acc_norm": 0.5089285714285714,
"acc_norm_stderr": 0.04745033255489123
},
"harness|hendrycksTest-management|5": {
"acc": 0.8543689320388349,
"acc_stderr": 0.0349260647662379,
"acc_norm": 0.8543689320388349,
"acc_norm_stderr": 0.0349260647662379
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.8504273504273504,
"acc_stderr": 0.023365051491753715,
"acc_norm": 0.8504273504273504,
"acc_norm_stderr": 0.023365051491753715
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.7,
"acc_stderr": 0.046056618647183814,
"acc_norm": 0.7,
"acc_norm_stderr": 0.046056618647183814
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.8301404853128991,
"acc_stderr": 0.013428186370608306,
"acc_norm": 0.8301404853128991,
"acc_norm_stderr": 0.013428186370608306
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.7427745664739884,
"acc_stderr": 0.023532925431044283,
"acc_norm": 0.7427745664739884,
"acc_norm_stderr": 0.023532925431044283
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.3877094972067039,
"acc_stderr": 0.016295332328155818,
"acc_norm": 0.3877094972067039,
"acc_norm_stderr": 0.016295332328155818
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.7320261437908496,
"acc_stderr": 0.025360603796242553,
"acc_norm": 0.7320261437908496,
"acc_norm_stderr": 0.025360603796242553
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.7202572347266881,
"acc_stderr": 0.02549425935069491,
"acc_norm": 0.7202572347266881,
"acc_norm_stderr": 0.02549425935069491
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.7283950617283951,
"acc_stderr": 0.02474862449053737,
"acc_norm": 0.7283950617283951,
"acc_norm_stderr": 0.02474862449053737
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.4645390070921986,
"acc_stderr": 0.02975238965742705,
"acc_norm": 0.4645390070921986,
"acc_norm_stderr": 0.02975238965742705
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.4791395045632334,
"acc_stderr": 0.012759117066518012,
"acc_norm": 0.4791395045632334,
"acc_norm_stderr": 0.012759117066518012
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.7279411764705882,
"acc_stderr": 0.02703304115168146,
"acc_norm": 0.7279411764705882,
"acc_norm_stderr": 0.02703304115168146
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.6830065359477124,
"acc_stderr": 0.018824219512706207,
"acc_norm": 0.6830065359477124,
"acc_norm_stderr": 0.018824219512706207
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6727272727272727,
"acc_stderr": 0.0449429086625209,
"acc_norm": 0.6727272727272727,
"acc_norm_stderr": 0.0449429086625209
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.7142857142857143,
"acc_stderr": 0.028920583220675602,
"acc_norm": 0.7142857142857143,
"acc_norm_stderr": 0.028920583220675602
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.8407960199004975,
"acc_stderr": 0.025870646766169136,
"acc_norm": 0.8407960199004975,
"acc_norm_stderr": 0.025870646766169136
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.87,
"acc_stderr": 0.03379976689896308,
"acc_norm": 0.87,
"acc_norm_stderr": 0.03379976689896308
},
"harness|hendrycksTest-virology|5": {
"acc": 0.536144578313253,
"acc_stderr": 0.038823108508905954,
"acc_norm": 0.536144578313253,
"acc_norm_stderr": 0.038823108508905954
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.847953216374269,
"acc_stderr": 0.027539122889061463,
"acc_norm": 0.847953216374269,
"acc_norm_stderr": 0.027539122889061463
},
"harness|truthfulqa:mc|0": {
"mc1": 0.3537331701346389,
"mc1_stderr": 0.016737814358846147,
"mc2": 0.5180401965777761,
"mc2_stderr": 0.015565981129474472
},
"harness|winogrande|5": {
"acc": 0.7932123125493291,
"acc_stderr": 0.011382566829235798
},
"harness|gsm8k|5": {
"acc": 0.7263078089461713,
"acc_stderr": 0.012281003490963449
}
}
```
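The per-task entries can be summarised with a few lines of Python. The sketch below copies three MMLU (`hendrycksTest`) entries and the Winogrande entry from the results above, averages the MMLU accuracies, and derives an approximate 95% interval as `acc ± 1.96 × acc_stderr`. The normal approximation and the helper names `mean_acc` and `approx_ci95` are assumptions made for illustration; the leaderboard's own aggregation may differ.

```python
from statistics import mean
from typing import Dict, Tuple

# Excerpt copied from the results above; the full file has one entry per task.
results: Dict[str, Dict[str, float]] = {
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.36, "acc_stderr": 0.04824181513244218},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6074074074074074, "acc_stderr": 0.0421850621536888},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.7039473684210527, "acc_stderr": 0.03715062154998904},
    "harness|winogrande|5": {"acc": 0.7932123125493291, "acc_stderr": 0.011382566829235798},
}


def mean_acc(entries: Dict[str, Dict[str, float]], prefix: str) -> float:
    """Unweighted mean of "acc" over the tasks whose key starts with `prefix`."""
    return mean(v["acc"] for k, v in entries.items() if k.startswith(prefix))


def approx_ci95(entry: Dict[str, float]) -> Tuple[float, float]:
    """Approximate 95% interval, assuming a normal sampling distribution."""
    half_width = 1.96 * entry["acc_stderr"]
    return entry["acc"] - half_width, entry["acc"] + half_width


mmlu_mean = mean_acc(results, "harness|hendrycksTest-")
low, high = approx_ci95(results["harness|winogrande|5"])
print(f"MMLU excerpt mean acc: {mmlu_mean:.4f}")
print(f"Winogrande acc, approximate 95% interval: [{low:.4f}, {high:.4f}]")
```

An unweighted mean treats every subject equally regardless of how many questions it contains, which is why subjects with few questions (and correspondingly large standard errors) can move the average noticeably.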
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
[More Information Needed]
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
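Provided by:
open-llm-leaderboard-old
Summary of original information
Dataset overview
Dataset name
- Name: Evaluation run of chanwit/flux-7b-v0.2
Dataset composition
- Number of configurations: 63
- Each configuration corresponds to: one evaluated task
- Source: created automatically from 1 run
- Split naming: the timestamp of the run
- Latest results: the "latest" split points to the latest results
Additional configuration
- "results" configuration: stores all aggregated results, used to compute and display the aggregated metrics on the Open LLM Leaderboard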
Data loading example
- Import: `from datasets import load_dataset`
- Call: `data = load_dataset("open-llm-leaderboard/details_chanwit__flux-7b-v0.2", "harness_winogrande_5", split="latest")`
Latest results
- Source: the latest results come from run 2024-01-18T00:42:00.685036
- Details: accuracy (acc), normalized accuracy (acc_norm), standard error (stderr), and other metrics for multiple tasks
Configuration details
- Config name: harness_arc_challenge_25
  - Data files:
    - Split: 2024_01_18T00_42_00.685036
      - Path: **/details_harness|arc:challenge|25_2024-01-18T00-42-00.685036.parquet
    - Split: latest
      - Path: **/details_harness|arc:challenge|25_2024-01-18T00-42-00.685036.parquet
- Config name: harness_gsm8k_5
  - Data files:
    - Split: 2024_01_18T00_42_00.685036
      - Path: **/details_harness|gsm8k|5_2024-01-18T00-42-00.685036.parquet
    - Split: latest
      - Path: **/details_harness|gsm8k|5_2024-01-18T00-42-00.685036.parquet
- Config name: harness_hellaswag_10
  - Data files:
    - Split: 2024_01_18T00_42_00.685036
      - Path: **/details_harness|hellaswag|10_2024-01-18T00-42-00.685036.parquet
    - Split: latest
      - Path: **/details_harness|hellaswag|10_2024-01-18T00-42-00.685036.parquet
- Config name: harness_hendrycksTest_5
  - Data files:
    - Split: 2024_01_18T00_42_00.685036
      - Paths:
        - **/details_harness|hendrycksTest-abstract_algebra|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-anatomy|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-astronomy|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-business_ethics|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-clinical_knowledge|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-college_biology|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-college_chemistry|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-college_computer_science|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-college_mathematics|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-college_medicine|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-college_physics|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-computer_security|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-conceptual_physics|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-econometrics|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-electrical_engineering|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-elementary_mathematics|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-formal_logic|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-global_facts|5_2024-01-18T00-42-00.685036.parquet
        - **/details_harness|hendrycksTest-high_school_biology|5_2024-01-18T00-42-00.685036.parquet
        - ...
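The listing above suggests a regular naming scheme: the split name is the run timestamp with underscores as separators, and each Parquet glob embeds the task name, the few-shot count, and the same timestamp with hyphens. The sketch below reproduces that scheme. The function names `split_name` and `details_glob` are hypothetical, and the pattern is inferred only from the entries shown here, so it may not hold for every configuration.

```python
from datetime import datetime


def split_name(run_time: datetime) -> str:
    """Format a run timestamp the way the timestamped splits are named."""
    return run_time.strftime("%Y_%m_%dT%H_%M_%S.%f")


def details_glob(task: str, num_fewshot: int, run_time: datetime) -> str:
    """Build the Parquet glob for one task, few-shot count, and run timestamp."""
    stamp = run_time.strftime("%Y-%m-%dT%H-%M-%S.%f")
    return f"**/details_harness|{task}|{num_fewshot}_{stamp}.parquet"


# Run timestamp taken from the "Latest results" entry.
run_time = datetime.fromisoformat("2024-01-18T00:42:00.685036")

print(split_name(run_time))
print(details_glob("arc:challenge", 25, run_time))
print(details_glob("hendrycksTest-anatomy", 5, run_time))
```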
Collected summary
Dataset introduction

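Construction
This dataset is a by-product of the automated evaluation of the chanwit/flux-7b-v0.2 model under the Open LLM Leaderboard evaluation framework. It consists of 63 configurations, each corresponding to one evaluated task, such as ARC-Challenge, HellaSwag, GSM8K, and the multi-subject Hendrycks test set. The data comes from a single run. Each run's results are stored as a split named after the run's timestamp, and the 'train' split always points to the latest run. An additional configuration named 'results' aggregates the overall metrics of the run; these aggregates are used to compute and display the model's overall performance on the Open LLM Leaderboard.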
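Features
The dataset's defining feature is its multi-configuration layout, which records the model's capabilities along many dimensions. Each task configuration is stored independently, so researchers can focus on one capability at a time. The data is stored in Parquet format with timestamped splits, which makes it possible to track model performance across runs. The 'results' configuration gives an aggregated view of key metrics such as accuracy and its standard error. It covers evaluation dimensions that range from commonsense reasoning to specialized subject knowledge, and it provides standardized data for studying the limits of the model's capabilities.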
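Because timestamped splits sit alongside aliases such as 'latest' and 'train', tracking runs over time starts with telling the two apart. The following sketch picks the most recent timestamped split from a list of split names. The helper `newest_timestamped_split` is hypothetical, and the timestamp format is inferred from the single split name that appears in this dataset.

```python
from datetime import datetime
from typing import Iterable, Optional

# Format inferred from the split name "2024_01_18T00_42_00.685036".
SPLIT_FORMAT = "%Y_%m_%dT%H_%M_%S.%f"


def newest_timestamped_split(split_names: Iterable[str]) -> Optional[str]:
    """Return the most recent timestamped split name, or None if there is none."""
    dated = []
    for name in split_names:
        try:
            dated.append((datetime.strptime(name, SPLIT_FORMAT), name))
        except ValueError:
            continue  # skip aliases such as "latest" or "train"
    return max(dated)[1] if dated else None


# Split names as listed for this dataset (one run plus the "latest" alias).
print(newest_timestamped_split(["2024_01_18T00_42_00.685036", "latest"]))
```

With only one run recorded, the function simply returns that run's split. It becomes useful once a repository accumulates several runs.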
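Usage
Researchers can load the dataset with the Hugging Face datasets library. For example, `load_dataset("open-llm-leaderboard/details_chanwit__flux-7b-v0.2", "harness_winogrande_5", split="train")` returns the latest evaluation details for the Winogrande task. The dataset can be queried by task configuration and by run timestamp, so users can extract either the raw evaluation data for a specific task or the aggregated results. This supports reproducing leaderboard metrics, comparing model capabilities across runs, and carrying out detailed error analysis.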
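The call above can be wrapped in a small helper that derives the repository name from a model identifier. This is a minimal sketch: `details_repo`, `details_request`, and `load_details` are hypothetical names, and the rule that "/" in the model id becomes "__" is inferred from this one repository name. The page header lists the repository under `open-llm-leaderboard-old`, so the `org` argument may need adjusting. Running `load_details` needs the `datasets` package and network access.

```python
from typing import Dict


def details_repo(model_id: str, org: str = "open-llm-leaderboard") -> str:
    """Map a model id such as "chanwit/flux-7b-v0.2" to its details repository (assumed rule)."""
    return f"{org}/details_{model_id.replace('/', '__')}"


def details_request(model_id: str, config: str, split: str = "train") -> Dict[str, str]:
    """Collect the keyword arguments passed on to datasets.load_dataset."""
    return {"path": details_repo(model_id), "name": config, "split": split}


def load_details(model_id: str, config: str, split: str = "train"):
    """Load one task configuration; imports `datasets` lazily and needs network access."""
    from datasets import load_dataset

    return load_dataset(**details_request(model_id, config, split))


request = details_request("chanwit/flux-7b-v0.2", "harness_winogrande_5")
print(request)
# To fetch the data: load_details("chanwit/flux-7b-v0.2", "harness_winogrande_5")
```

Passing a timestamped split name such as `"2024_01_18T00_42_00.685036"` instead of `"train"` selects a specific run rather than the latest one.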
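Background and challenges
Background
With the rapid development of large language models (LLMs), evaluating their performance systematically and fairly has become a central concern for both academia and industry. Against this background, the Hugging Face team launched the Open LLM Leaderboard in 2023 to give the community transparent model comparisons based on standardized benchmarks. This dataset records one evaluation run of the chanwit/flux-7b-v0.2 model, created in January 2024 by Clementine Fourrier et al. It covers 63 task configurations, including ARC-Challenge, HellaSwag, GSM8K, TruthfulQA, and the MMLU test spanning 57 subjects. These tasks range from commonsense reasoning and mathematical logic to multi-domain knowledge and together probe the model's ability to generalize. The dataset records the model's exact scores on each metric (for example, an average accuracy of about 65.7%), and its reproducible evaluation pipeline gives later research a baseline to compare against, which helps standardize how open-source LLMs are evaluated.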
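Current challenges
The challenges facing this dataset fall into two groups. The first concerns the problem domain: making LLM evaluation both comprehensive and fair remains difficult. Existing benchmarks cover a lot of ground but cannot fully reflect a model's depth of reasoning and robustness in real-world settings. In MMLU, for example, the model's low scores on college mathematics (32%) and formal logic (45.2%) expose a clear weakness in complex abstract reasoning. The second concerns construction: the evaluation pipeline has to be standardized and reproducible. Different tasks require different numbers of few-shot examples (for example, 5-shot or 25-shot), and each run's results must be managed as a timestamped split to keep the data consistent. In addition, with thousands of models on Hugging Face, the automated evaluation system must handle a large volume of requests efficiently without wasting compute, which places heavy demands on the stability and scalability of the infrastructure.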
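The few-shot setting mentioned above is visible in the configuration names themselves, where the trailing number appears to be the shot count (for example, `harness_arc_challenge_25`). The sketch below splits a configuration name into its task and few-shot count. `parse_config_name` is a hypothetical helper, and the `harness_<task>_<n>` pattern is inferred from the configuration names listed for this dataset.

```python
from typing import Tuple

PREFIX = "harness_"


def parse_config_name(config: str) -> Tuple[str, int]:
    """Split a name like "harness_arc_challenge_25" into ("arc_challenge", 25)."""
    if not config.startswith(PREFIX):
        raise ValueError(f"not a harness task configuration: {config!r}")
    # The few-shot count is assumed to be the part after the last underscore.
    task, _, shots = config[len(PREFIX):].rpartition("_")
    if not task or not shots.isdigit():
        raise ValueError(f"no few-shot count found in: {config!r}")
    return task, int(shots)


for name in ("harness_arc_challenge_25", "harness_gsm8k_5",
             "harness_hellaswag_10", "harness_hendrycksTest_5"):
    task, shots = parse_config_name(name)
    print(f"{task}: {shots}-shot")
```

Names that do not follow the pattern, such as the aggregated 'results' configuration, raise a `ValueError` rather than being silently misparsed.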
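Common scenarios
Typical use cases
In LLM performance evaluation, this dataset serves as an Open LLM Leaderboard evaluation record for measuring a model's performance across a range of natural language understanding and reasoning tasks. Its core use cases cover commonsense reasoning (such as HellaSwag), knowledge question answering (such as ARC-Challenge), mathematical reasoning (such as GSM8K), and multi-subject knowledge testing (such as the 57 subjects of MMLU). By loading a specific task configuration, researchers can reproduce the model's exact scores on standardized benchmarks and use them for cross-model comparison and capability diagnosis.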
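Practical applications
In industry, the dataset can inform model selection and deployment decisions. A company can look up a model's performance on tasks such as Winogrande (coreference resolution) or TruthfulQA (factuality) to judge whether it suits scenarios such as customer-service assistants, tutoring, or knowledge retrieval. For example, the GSM8K mathematical reasoning score bears directly on a model's reliability in financial calculation or engineering assistance, while scores on the medical subsets of MMLU feed into compliance assessments for medical question-answering systems.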
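Derived and related work
Work building on this kind of evaluation data includes studies of model scaling laws based on benchmark results, optimization of fine-tuning strategies for specific tasks (such as mathematical reasoning), and analyses of capability transfer in multi-task learning. Details datasets like this one are produced by an automated evaluation pipeline and feed a visual leaderboard maintained by the Hugging Face community, alongside tooling such as the evaluation harness and the standardized Open LLM Leaderboard framework. Together, this work has strengthened the consensus in the research community around model transparency and reproducibility.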
The above content was collected and summarized by 遇见数据集.



