open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1
Source: Hugging Face. Updated 2024-04-19; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1
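The configuration and split names in this card's `configs` list appear to follow a simple pattern relative to the task keys in the results JSON (for example, `harness|arc:challenge|25` corresponds to `harness_arc_challenge_25`). The sketch below reproduces that pattern as inferred from the entries visible in the card; the helper names `config_name`, `split_name`, and `parquet_glob` are hypothetical and not part of any library.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key to a config name by replacing '|', ':' and '-' with '_'.

    Pattern inferred from the card, e.g. 'harness|arc:challenge|25'
    -> 'harness_arc_challenge_25'. This is an assumption, not a documented rule.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp to the split name seen in the card's configs list."""
    return run_timestamp.replace("-", "_").replace(":", "_")


def parquet_glob(task_key: str, run_timestamp: str) -> str:
    """Build the glob pattern used for a task's detail file in the configs list."""
    file_stamp = run_timestamp.replace(":", "-")  # file names use '-' instead of ':'
    return f"**/details_{task_key}_{file_stamp}.parquet"


run = "2024-04-19T01:45:07.088638"
print(config_name("harness|arc:challenge|25"))
print(split_name(run))
print(parquet_glob("harness|gsm8k|5", run))
```

The resulting strings can be compared against the `config_name`, `split`, and `path` entries in the card to check whether the inferred pattern holds for a given task.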
Resource description:
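The `acc_stderr` values reported in this card look consistent with the sample standard error of a proportion, sqrt(p(1 - p) / (n - 1)). The sketch below checks this for two of the reported pairs. The sample size n = 100 is an assumption inferred from the numbers, not something the card states, and the function names are hypothetical.

```python
import math


def sample_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the n - 1 sample variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def implied_sample_size(acc: float, stderr: float) -> float:
    """Invert the formula above to estimate n from a reported (acc, stderr) pair."""
    return acc * (1.0 - acc) / stderr**2 + 1.0


# Values copied from the card for hendrycksTest-abstract_algebra.
reported_acc = 0.29
reported_stderr = 0.045604802157206845
assumed_n = 100  # assumption: inferred from the numbers, not stated in the card

print(sample_stderr(reported_acc, assumed_n))
print(implied_sample_size(reported_acc, reported_stderr))
```

If the assumption holds, the first printed value should agree closely with the reported `acc_stderr`, and the second should be close to the assumed sample size.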
---
pretty_name: Evaluation run of vicgalle/Configurable-Llama-3-8B-v0.1
dataset_summary: "Dataset automatically created during the evaluation run of model\
\ [vicgalle/Configurable-Llama-3-8B-v0.1](https://huggingface.co/vicgalle/Configurable-Llama-3-8B-v0.1)\
\ on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\
\nThe dataset is composed of 63 configurations, each one corresponding to one of the\
\ evaluated tasks.\n\nThe dataset has been created from 1 run. Each run can be\
\ found as a specific split in each configuration, the split being named using the\
\ timestamp of the run. The \"latest\" split always points to the latest results.\n\
\nAn additional configuration \"results\" stores all the aggregated results of the\
\ run (and is used to compute and display the aggregated metrics on the [Open LLM\
\ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\
\nTo load the details from a run, you can for instance do the following:\n```python\n\
from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1\"\
,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\
These are the [latest results from run 2024-04-19T01:45:07.088638](https://huggingface.co/datasets/open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1/blob/main/results_2024-04-19T01-45-07.088638.json) (note\
\ that there might be results for other tasks in the repo if successive evals didn't\
\ cover the same tasks. You can find each in the results and the \"latest\" split for\
\ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6707407824486872,\n\
\ \"acc_stderr\": 0.03171956437518023,\n \"acc_norm\": 0.6730867118870119,\n\
\ \"acc_norm_stderr\": 0.03235451513706868,\n \"mc1\": 0.3708690330477356,\n\
\ \"mc1_stderr\": 0.01690969358024882,\n \"mc2\": 0.5615878106808432,\n\
\ \"mc2_stderr\": 0.015165049093519522\n },\n \"harness|arc:challenge|25\"\
: {\n \"acc\": 0.5844709897610921,\n \"acc_stderr\": 0.01440136664121639,\n\
\ \"acc_norm\": 0.6245733788395904,\n \"acc_norm_stderr\": 0.014150631435111728\n\
\ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.5932085241983669,\n\
\ \"acc_stderr\": 0.004902314055725598,\n \"acc_norm\": 0.7950607448715395,\n\
\ \"acc_norm_stderr\": 0.004028322654852751\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\
: {\n \"acc\": 0.29,\n \"acc_stderr\": 0.045604802157206845,\n \
\ \"acc_norm\": 0.29,\n \"acc_norm_stderr\": 0.045604802157206845\n \
\ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6370370370370371,\n\
\ \"acc_stderr\": 0.04153948404742399,\n \"acc_norm\": 0.6370370370370371,\n\
\ \"acc_norm_stderr\": 0.04153948404742399\n },\n \"harness|hendrycksTest-astronomy|5\"\
: {\n \"acc\": 0.7368421052631579,\n \"acc_stderr\": 0.035834961763610736,\n\
\ \"acc_norm\": 0.7368421052631579,\n \"acc_norm_stderr\": 0.035834961763610736\n\
\ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.71,\n\
\ \"acc_stderr\": 0.045604802157206845,\n \"acc_norm\": 0.71,\n \
\ \"acc_norm_stderr\": 0.045604802157206845\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\
: {\n \"acc\": 0.7509433962264151,\n \"acc_stderr\": 0.026616482980501704,\n\
\ \"acc_norm\": 0.7509433962264151,\n \"acc_norm_stderr\": 0.026616482980501704\n\
\ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.8055555555555556,\n\
\ \"acc_stderr\": 0.03309615177059006,\n \"acc_norm\": 0.8055555555555556,\n\
\ \"acc_norm_stderr\": 0.03309615177059006\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.44,\n \"acc_stderr\": 0.049888765156985884,\n \
\ \"acc_norm\": 0.44,\n \"acc_norm_stderr\": 0.049888765156985884\n \
\ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"\
acc\": 0.59,\n \"acc_stderr\": 0.04943110704237102,\n \"acc_norm\"\
: 0.59,\n \"acc_norm_stderr\": 0.04943110704237102\n },\n \"harness|hendrycksTest-college_mathematics|5\"\
: {\n \"acc\": 0.36,\n \"acc_stderr\": 0.048241815132442176,\n \
\ \"acc_norm\": 0.36,\n \"acc_norm_stderr\": 0.048241815132442176\n \
\ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.653179190751445,\n\
\ \"acc_stderr\": 0.036291466701596636,\n \"acc_norm\": 0.653179190751445,\n\
\ \"acc_norm_stderr\": 0.036291466701596636\n },\n \"harness|hendrycksTest-college_physics|5\"\
: {\n \"acc\": 0.5,\n \"acc_stderr\": 0.04975185951049946,\n \
\ \"acc_norm\": 0.5,\n \"acc_norm_stderr\": 0.04975185951049946\n },\n\
\ \"harness|hendrycksTest-computer_security|5\": {\n \"acc\": 0.76,\n\
\ \"acc_stderr\": 0.04292346959909281,\n \"acc_norm\": 0.76,\n \
\ \"acc_norm_stderr\": 0.04292346959909281\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\
: {\n \"acc\": 0.5957446808510638,\n \"acc_stderr\": 0.032081157507886836,\n\
\ \"acc_norm\": 0.5957446808510638,\n \"acc_norm_stderr\": 0.032081157507886836\n\
\ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.631578947368421,\n\
\ \"acc_stderr\": 0.04537815354939391,\n \"acc_norm\": 0.631578947368421,\n\
\ \"acc_norm_stderr\": 0.04537815354939391\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\
: {\n \"acc\": 0.6482758620689655,\n \"acc_stderr\": 0.0397923663749741,\n\
\ \"acc_norm\": 0.6482758620689655,\n \"acc_norm_stderr\": 0.0397923663749741\n\
\ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\
: 0.43915343915343913,\n \"acc_stderr\": 0.025559920550531003,\n \"\
acc_norm\": 0.43915343915343913,\n \"acc_norm_stderr\": 0.025559920550531003\n\
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.49206349206349204,\n\
\ \"acc_stderr\": 0.044715725362943486,\n \"acc_norm\": 0.49206349206349204,\n\
\ \"acc_norm_stderr\": 0.044715725362943486\n },\n \"harness|hendrycksTest-global_facts|5\"\
: {\n \"acc\": 0.43,\n \"acc_stderr\": 0.04975698519562428,\n \
\ \"acc_norm\": 0.43,\n \"acc_norm_stderr\": 0.04975698519562428\n \
\ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7935483870967742,\n\
\ \"acc_stderr\": 0.023025899617188695,\n \"acc_norm\": 0.7935483870967742,\n\
\ \"acc_norm_stderr\": 0.023025899617188695\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\
: {\n \"acc\": 0.5221674876847291,\n \"acc_stderr\": 0.03514528562175008,\n\
\ \"acc_norm\": 0.5221674876847291,\n \"acc_norm_stderr\": 0.03514528562175008\n\
\ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \
\ \"acc\": 0.74,\n \"acc_stderr\": 0.0440844002276808,\n \"acc_norm\"\
: 0.74,\n \"acc_norm_stderr\": 0.0440844002276808\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\
: {\n \"acc\": 0.7515151515151515,\n \"acc_stderr\": 0.03374402644139404,\n\
\ \"acc_norm\": 0.7515151515151515,\n \"acc_norm_stderr\": 0.03374402644139404\n\
\ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\
: 0.8434343434343434,\n \"acc_stderr\": 0.025890520358141454,\n \"\
acc_norm\": 0.8434343434343434,\n \"acc_norm_stderr\": 0.025890520358141454\n\
\ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\
\ \"acc\": 0.9067357512953368,\n \"acc_stderr\": 0.020986854593289733,\n\
\ \"acc_norm\": 0.9067357512953368,\n \"acc_norm_stderr\": 0.020986854593289733\n\
\ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \
\ \"acc\": 0.6435897435897436,\n \"acc_stderr\": 0.0242831405294673,\n \
\ \"acc_norm\": 0.6435897435897436,\n \"acc_norm_stderr\": 0.0242831405294673\n\
\ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\
acc\": 0.3888888888888889,\n \"acc_stderr\": 0.029723278961476664,\n \
\ \"acc_norm\": 0.3888888888888889,\n \"acc_norm_stderr\": 0.029723278961476664\n\
\ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \
\ \"acc\": 0.7521008403361344,\n \"acc_stderr\": 0.028047967224176892,\n\
\ \"acc_norm\": 0.7521008403361344,\n \"acc_norm_stderr\": 0.028047967224176892\n\
\ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\
: 0.4503311258278146,\n \"acc_stderr\": 0.04062290018683775,\n \"\
acc_norm\": 0.4503311258278146,\n \"acc_norm_stderr\": 0.04062290018683775\n\
\ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\
: 0.8513761467889909,\n \"acc_stderr\": 0.015251253773660834,\n \"\
acc_norm\": 0.8513761467889909,\n \"acc_norm_stderr\": 0.015251253773660834\n\
\ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\
: 0.5555555555555556,\n \"acc_stderr\": 0.03388857118502325,\n \"\
acc_norm\": 0.5555555555555556,\n \"acc_norm_stderr\": 0.03388857118502325\n\
\ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\
: 0.8431372549019608,\n \"acc_stderr\": 0.025524722324553353,\n \"\
acc_norm\": 0.8431372549019608,\n \"acc_norm_stderr\": 0.025524722324553353\n\
\ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\
acc\": 0.8523206751054853,\n \"acc_stderr\": 0.023094329582595698,\n \
\ \"acc_norm\": 0.8523206751054853,\n \"acc_norm_stderr\": 0.023094329582595698\n\
\ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.7130044843049327,\n\
\ \"acc_stderr\": 0.03036037971029195,\n \"acc_norm\": 0.7130044843049327,\n\
\ \"acc_norm_stderr\": 0.03036037971029195\n },\n \"harness|hendrycksTest-human_sexuality|5\"\
: {\n \"acc\": 0.7786259541984732,\n \"acc_stderr\": 0.0364129708131373,\n\
\ \"acc_norm\": 0.7786259541984732,\n \"acc_norm_stderr\": 0.0364129708131373\n\
\ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\
\ 0.8099173553719008,\n \"acc_stderr\": 0.03581796951709282,\n \"\
acc_norm\": 0.8099173553719008,\n \"acc_norm_stderr\": 0.03581796951709282\n\
\ },\n \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7777777777777778,\n\
\ \"acc_stderr\": 0.0401910747255735,\n \"acc_norm\": 0.7777777777777778,\n\
\ \"acc_norm_stderr\": 0.0401910747255735\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\
: {\n \"acc\": 0.7791411042944786,\n \"acc_stderr\": 0.03259177392742179,\n\
\ \"acc_norm\": 0.7791411042944786,\n \"acc_norm_stderr\": 0.03259177392742179\n\
\ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.5446428571428571,\n\
\ \"acc_stderr\": 0.04726835553719098,\n \"acc_norm\": 0.5446428571428571,\n\
\ \"acc_norm_stderr\": 0.04726835553719098\n },\n \"harness|hendrycksTest-management|5\"\
: {\n \"acc\": 0.7766990291262136,\n \"acc_stderr\": 0.04123553189891431,\n\
\ \"acc_norm\": 0.7766990291262136,\n \"acc_norm_stderr\": 0.04123553189891431\n\
\ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8974358974358975,\n\
\ \"acc_stderr\": 0.01987565502786745,\n \"acc_norm\": 0.8974358974358975,\n\
\ \"acc_norm_stderr\": 0.01987565502786745\n },\n \"harness|hendrycksTest-medical_genetics|5\"\
: {\n \"acc\": 0.81,\n \"acc_stderr\": 0.03942772444036623,\n \
\ \"acc_norm\": 0.81,\n \"acc_norm_stderr\": 0.03942772444036623\n \
\ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8007662835249042,\n\
\ \"acc_stderr\": 0.01428337804429641,\n \"acc_norm\": 0.8007662835249042,\n\
\ \"acc_norm_stderr\": 0.01428337804429641\n },\n \"harness|hendrycksTest-moral_disputes|5\"\
: {\n \"acc\": 0.7514450867052023,\n \"acc_stderr\": 0.023267528432100174,\n\
\ \"acc_norm\": 0.7514450867052023,\n \"acc_norm_stderr\": 0.023267528432100174\n\
\ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.4324022346368715,\n\
\ \"acc_stderr\": 0.016568971233548613,\n \"acc_norm\": 0.4324022346368715,\n\
\ \"acc_norm_stderr\": 0.016568971233548613\n },\n \"harness|hendrycksTest-nutrition|5\"\
: {\n \"acc\": 0.761437908496732,\n \"acc_stderr\": 0.024404394928087877,\n\
\ \"acc_norm\": 0.761437908496732,\n \"acc_norm_stderr\": 0.024404394928087877\n\
\ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7266881028938906,\n\
\ \"acc_stderr\": 0.025311765975426122,\n \"acc_norm\": 0.7266881028938906,\n\
\ \"acc_norm_stderr\": 0.025311765975426122\n },\n \"harness|hendrycksTest-prehistory|5\"\
: {\n \"acc\": 0.75,\n \"acc_stderr\": 0.02409347123262133,\n \
\ \"acc_norm\": 0.75,\n \"acc_norm_stderr\": 0.02409347123262133\n \
\ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"acc\"\
: 0.5390070921985816,\n \"acc_stderr\": 0.02973659252642444,\n \"\
acc_norm\": 0.5390070921985816,\n \"acc_norm_stderr\": 0.02973659252642444\n\
\ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4921773142112125,\n\
\ \"acc_stderr\": 0.0127686730761119,\n \"acc_norm\": 0.4921773142112125,\n\
\ \"acc_norm_stderr\": 0.0127686730761119\n },\n \"harness|hendrycksTest-professional_medicine|5\"\
: {\n \"acc\": 0.7169117647058824,\n \"acc_stderr\": 0.02736586113151381,\n\
\ \"acc_norm\": 0.7169117647058824,\n \"acc_norm_stderr\": 0.02736586113151381\n\
\ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\
acc\": 0.6977124183006536,\n \"acc_stderr\": 0.018579232711113877,\n \
\ \"acc_norm\": 0.6977124183006536,\n \"acc_norm_stderr\": 0.018579232711113877\n\
\ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6454545454545455,\n\
\ \"acc_stderr\": 0.045820048415054174,\n \"acc_norm\": 0.6454545454545455,\n\
\ \"acc_norm_stderr\": 0.045820048415054174\n },\n \"harness|hendrycksTest-security_studies|5\"\
: {\n \"acc\": 0.7306122448979592,\n \"acc_stderr\": 0.02840125202902294,\n\
\ \"acc_norm\": 0.7306122448979592,\n \"acc_norm_stderr\": 0.02840125202902294\n\
\ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8557213930348259,\n\
\ \"acc_stderr\": 0.02484575321230604,\n \"acc_norm\": 0.8557213930348259,\n\
\ \"acc_norm_stderr\": 0.02484575321230604\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\
: {\n \"acc\": 0.85,\n \"acc_stderr\": 0.0358870281282637,\n \
\ \"acc_norm\": 0.85,\n \"acc_norm_stderr\": 0.0358870281282637\n },\n\
\ \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5120481927710844,\n\
\ \"acc_stderr\": 0.03891364495835816,\n \"acc_norm\": 0.5120481927710844,\n\
\ \"acc_norm_stderr\": 0.03891364495835816\n },\n \"harness|hendrycksTest-world_religions|5\"\
: {\n \"acc\": 0.7777777777777778,\n \"acc_stderr\": 0.03188578017686398,\n\
\ \"acc_norm\": 0.7777777777777778,\n \"acc_norm_stderr\": 0.03188578017686398\n\
\ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3708690330477356,\n\
\ \"mc1_stderr\": 0.01690969358024882,\n \"mc2\": 0.5615878106808432,\n\
\ \"mc2_stderr\": 0.015165049093519522\n },\n \"harness|winogrande|5\"\
: {\n \"acc\": 0.749802683504341,\n \"acc_stderr\": 0.012173009642449138\n\
\ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.6952236542835482,\n \
\ \"acc_stderr\": 0.012679297549515434\n }\n}\n```"
repo_url: https://huggingface.co/vicgalle/Configurable-Llama-3-8B-v0.1
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|arc:challenge|25_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|arc:challenge|25_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_gsm8k_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|gsm8k|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|gsm8k|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hellaswag_10
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hellaswag|10_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hellaswag|10_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-19T01-45-07.088638.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_abstract_algebra_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_anatomy_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_astronomy_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_business_ethics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_clinical_knowledge_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_college_biology_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_college_chemistry_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_college_computer_science_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_college_mathematics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_college_medicine_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_college_physics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_computer_security_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_conceptual_physics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_econometrics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_electrical_engineering_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_elementary_mathematics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_formal_logic_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_global_facts_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_biology_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_chemistry_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_computer_science_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_european_history_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_geography_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_government_and_politics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_macroeconomics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_mathematics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_microeconomics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_physics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_psychology_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_statistics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_us_history_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_high_school_world_history_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_human_aging_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_human_sexuality_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_international_law_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_jurisprudence_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_logical_fallacies_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_machine_learning_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_management_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_marketing_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_medical_genetics_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_miscellaneous_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_moral_disputes_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_moral_scenarios_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_nutrition_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_philosophy_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_prehistory_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_professional_accounting_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_professional_law_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_public_relations_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_security_studies_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_sociology_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_virology_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_hendrycksTest_world_religions_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_truthfulqa_mc_0
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-19T01-45-07.088638.parquet'
- config_name: harness_winogrande_5
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- '**/details_harness|winogrande|5_2024-04-19T01-45-07.088638.parquet'
- split: latest
path:
- '**/details_harness|winogrande|5_2024-04-19T01-45-07.088638.parquet'
- config_name: results
data_files:
- split: 2024_04_19T01_45_07.088638
path:
- results_2024-04-19T01-45-07.088638.parquet
- split: latest
path:
- results_2024-04-19T01-45-07.088638.parquet
---
# Dataset Card for Evaluation run of vicgalle/Configurable-Llama-3-8B-v0.1
<!-- Provide a quick summary of the dataset. -->
Dataset automatically created during the evaluation run of model [vicgalle/Configurable-Llama-3-8B-v0.1](https://huggingface.co/vicgalle/Configurable-Llama-3-8B-v0.1) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the latest results.
An additional configuration, "results", stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset
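data = load_dataset(
    "open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1",
    "harness_winogrande_5",  # configuration name, one per evaluated task
    split="latest",  # the configurations define a timestamped split and "latest"
)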
```
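The configuration and split names in this card appear to follow a simple pattern derived from the harness task keys and the run timestamp (for instance, `harness|truthfulqa:mc|0` corresponds to `harness_truthfulqa_mc_0`). The sketch below reproduces that pattern. The helper names `config_name_for` and `split_name_for` are hypothetical, and the rule is inferred from the names listed in this card rather than taken from the leaderboard's own code.

```python
import re


def config_name_for(task_key: str) -> str:
    """Map a harness task key to a configuration name.

    Inferred rule (an assumption): '|', ':' and '-' all become '_'.
    """
    return re.sub(r"[|:\-]", "_", task_key)


def split_name_for(timestamp: str) -> str:
    """Map a run timestamp to a split name: '-' and ':' become '_'."""
    return re.sub(r"[:\-]", "_", timestamp)


print(config_name_for("harness|hendrycksTest-world_religions|5"))
print(split_name_for("2024-04-19T01:45:07.088638"))
```

The two printed names can then be passed as the configuration name and the `split` argument of `load_dataset`.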
## Latest results
These are the [latest results from run 2024-04-19T01:45:07.088638](https://huggingface.co/datasets/open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1/blob/main/results_2024-04-19T01-45-07.088638.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks. You can find each in the "results" configuration and in the "latest" split of each eval):
```python
{
"all": {
"acc": 0.6707407824486872,
"acc_stderr": 0.03171956437518023,
"acc_norm": 0.6730867118870119,
"acc_norm_stderr": 0.03235451513706868,
"mc1": 0.3708690330477356,
"mc1_stderr": 0.01690969358024882,
"mc2": 0.5615878106808432,
"mc2_stderr": 0.015165049093519522
},
"harness|arc:challenge|25": {
"acc": 0.5844709897610921,
"acc_stderr": 0.01440136664121639,
"acc_norm": 0.6245733788395904,
"acc_norm_stderr": 0.014150631435111728
},
"harness|hellaswag|10": {
"acc": 0.5932085241983669,
"acc_stderr": 0.004902314055725598,
"acc_norm": 0.7950607448715395,
"acc_norm_stderr": 0.004028322654852751
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.29,
"acc_stderr": 0.045604802157206845,
"acc_norm": 0.29,
"acc_norm_stderr": 0.045604802157206845
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.6370370370370371,
"acc_stderr": 0.04153948404742399,
"acc_norm": 0.6370370370370371,
"acc_norm_stderr": 0.04153948404742399
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.7368421052631579,
"acc_stderr": 0.035834961763610736,
"acc_norm": 0.7368421052631579,
"acc_norm_stderr": 0.035834961763610736
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.71,
"acc_stderr": 0.045604802157206845,
"acc_norm": 0.71,
"acc_norm_stderr": 0.045604802157206845
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.7509433962264151,
"acc_stderr": 0.026616482980501704,
"acc_norm": 0.7509433962264151,
"acc_norm_stderr": 0.026616482980501704
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.8055555555555556,
"acc_stderr": 0.03309615177059006,
"acc_norm": 0.8055555555555556,
"acc_norm_stderr": 0.03309615177059006
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.44,
"acc_stderr": 0.049888765156985884,
"acc_norm": 0.44,
"acc_norm_stderr": 0.049888765156985884
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.59,
"acc_stderr": 0.04943110704237102,
"acc_norm": 0.59,
"acc_norm_stderr": 0.04943110704237102
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.36,
"acc_stderr": 0.048241815132442176,
"acc_norm": 0.36,
"acc_norm_stderr": 0.048241815132442176
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.653179190751445,
"acc_stderr": 0.036291466701596636,
"acc_norm": 0.653179190751445,
"acc_norm_stderr": 0.036291466701596636
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.5,
"acc_stderr": 0.04975185951049946,
"acc_norm": 0.5,
"acc_norm_stderr": 0.04975185951049946
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.76,
"acc_stderr": 0.04292346959909281,
"acc_norm": 0.76,
"acc_norm_stderr": 0.04292346959909281
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5957446808510638,
"acc_stderr": 0.032081157507886836,
"acc_norm": 0.5957446808510638,
"acc_norm_stderr": 0.032081157507886836
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.631578947368421,
"acc_stderr": 0.04537815354939391,
"acc_norm": 0.631578947368421,
"acc_norm_stderr": 0.04537815354939391
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.6482758620689655,
"acc_stderr": 0.0397923663749741,
"acc_norm": 0.6482758620689655,
"acc_norm_stderr": 0.0397923663749741
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.43915343915343913,
"acc_stderr": 0.025559920550531003,
"acc_norm": 0.43915343915343913,
"acc_norm_stderr": 0.025559920550531003
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.49206349206349204,
"acc_stderr": 0.044715725362943486,
"acc_norm": 0.49206349206349204,
"acc_norm_stderr": 0.044715725362943486
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.43,
"acc_stderr": 0.04975698519562428,
"acc_norm": 0.43,
"acc_norm_stderr": 0.04975698519562428
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.7935483870967742,
"acc_stderr": 0.023025899617188695,
"acc_norm": 0.7935483870967742,
"acc_norm_stderr": 0.023025899617188695
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.5221674876847291,
"acc_stderr": 0.03514528562175008,
"acc_norm": 0.5221674876847291,
"acc_norm_stderr": 0.03514528562175008
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.74,
"acc_stderr": 0.0440844002276808,
"acc_norm": 0.74,
"acc_norm_stderr": 0.0440844002276808
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.7515151515151515,
"acc_stderr": 0.03374402644139404,
"acc_norm": 0.7515151515151515,
"acc_norm_stderr": 0.03374402644139404
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.8434343434343434,
"acc_stderr": 0.025890520358141454,
"acc_norm": 0.8434343434343434,
"acc_norm_stderr": 0.025890520358141454
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.9067357512953368,
"acc_stderr": 0.020986854593289733,
"acc_norm": 0.9067357512953368,
"acc_norm_stderr": 0.020986854593289733
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6435897435897436,
"acc_stderr": 0.0242831405294673,
"acc_norm": 0.6435897435897436,
"acc_norm_stderr": 0.0242831405294673
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.3888888888888889,
"acc_stderr": 0.029723278961476664,
"acc_norm": 0.3888888888888889,
"acc_norm_stderr": 0.029723278961476664
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.7521008403361344,
"acc_stderr": 0.028047967224176892,
"acc_norm": 0.7521008403361344,
"acc_norm_stderr": 0.028047967224176892
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.4503311258278146,
"acc_stderr": 0.04062290018683775,
"acc_norm": 0.4503311258278146,
"acc_norm_stderr": 0.04062290018683775
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.8513761467889909,
"acc_stderr": 0.015251253773660834,
"acc_norm": 0.8513761467889909,
"acc_norm_stderr": 0.015251253773660834
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.5555555555555556,
"acc_stderr": 0.03388857118502325,
"acc_norm": 0.5555555555555556,
"acc_norm_stderr": 0.03388857118502325
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.8431372549019608,
"acc_stderr": 0.025524722324553353,
"acc_norm": 0.8431372549019608,
"acc_norm_stderr": 0.025524722324553353
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.8523206751054853,
"acc_stderr": 0.023094329582595698,
"acc_norm": 0.8523206751054853,
"acc_norm_stderr": 0.023094329582595698
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.7130044843049327,
"acc_stderr": 0.03036037971029195,
"acc_norm": 0.7130044843049327,
"acc_norm_stderr": 0.03036037971029195
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.7786259541984732,
"acc_stderr": 0.0364129708131373,
"acc_norm": 0.7786259541984732,
"acc_norm_stderr": 0.0364129708131373
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.8099173553719008,
"acc_stderr": 0.03581796951709282,
"acc_norm": 0.8099173553719008,
"acc_norm_stderr": 0.03581796951709282
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.7777777777777778,
"acc_stderr": 0.0401910747255735,
"acc_norm": 0.7777777777777778,
"acc_norm_stderr": 0.0401910747255735
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.7791411042944786,
"acc_stderr": 0.03259177392742179,
"acc_norm": 0.7791411042944786,
"acc_norm_stderr": 0.03259177392742179
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.5446428571428571,
"acc_stderr": 0.04726835553719098,
"acc_norm": 0.5446428571428571,
"acc_norm_stderr": 0.04726835553719098
},
"harness|hendrycksTest-management|5": {
"acc": 0.7766990291262136,
"acc_stderr": 0.04123553189891431,
"acc_norm": 0.7766990291262136,
"acc_norm_stderr": 0.04123553189891431
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.8974358974358975,
"acc_stderr": 0.01987565502786745,
"acc_norm": 0.8974358974358975,
"acc_norm_stderr": 0.01987565502786745
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.81,
"acc_stderr": 0.03942772444036623,
"acc_norm": 0.81,
"acc_norm_stderr": 0.03942772444036623
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.8007662835249042,
"acc_stderr": 0.01428337804429641,
"acc_norm": 0.8007662835249042,
"acc_norm_stderr": 0.01428337804429641
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.7514450867052023,
"acc_stderr": 0.023267528432100174,
"acc_norm": 0.7514450867052023,
"acc_norm_stderr": 0.023267528432100174
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.4324022346368715,
"acc_stderr": 0.016568971233548613,
"acc_norm": 0.4324022346368715,
"acc_norm_stderr": 0.016568971233548613
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.761437908496732,
"acc_stderr": 0.024404394928087877,
"acc_norm": 0.761437908496732,
"acc_norm_stderr": 0.024404394928087877
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.7266881028938906,
"acc_stderr": 0.025311765975426122,
"acc_norm": 0.7266881028938906,
"acc_norm_stderr": 0.025311765975426122
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.75,
"acc_stderr": 0.02409347123262133,
"acc_norm": 0.75,
"acc_norm_stderr": 0.02409347123262133
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.5390070921985816,
"acc_stderr": 0.02973659252642444,
"acc_norm": 0.5390070921985816,
"acc_norm_stderr": 0.02973659252642444
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.4921773142112125,
"acc_stderr": 0.0127686730761119,
"acc_norm": 0.4921773142112125,
"acc_norm_stderr": 0.0127686730761119
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.7169117647058824,
"acc_stderr": 0.02736586113151381,
"acc_norm": 0.7169117647058824,
"acc_norm_stderr": 0.02736586113151381
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.6977124183006536,
"acc_stderr": 0.018579232711113877,
"acc_norm": 0.6977124183006536,
"acc_norm_stderr": 0.018579232711113877
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6454545454545455,
"acc_stderr": 0.045820048415054174,
"acc_norm": 0.6454545454545455,
"acc_norm_stderr": 0.045820048415054174
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.7306122448979592,
"acc_stderr": 0.02840125202902294,
"acc_norm": 0.7306122448979592,
"acc_norm_stderr": 0.02840125202902294
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.8557213930348259,
"acc_stderr": 0.02484575321230604,
"acc_norm": 0.8557213930348259,
"acc_norm_stderr": 0.02484575321230604
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.85,
"acc_stderr": 0.0358870281282637,
"acc_norm": 0.85,
"acc_norm_stderr": 0.0358870281282637
},
"harness|hendrycksTest-virology|5": {
"acc": 0.5120481927710844,
"acc_stderr": 0.03891364495835816,
"acc_norm": 0.5120481927710844,
"acc_norm_stderr": 0.03891364495835816
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.7777777777777778,
"acc_stderr": 0.03188578017686398,
"acc_norm": 0.7777777777777778,
"acc_norm_stderr": 0.03188578017686398
},
"harness|truthfulqa:mc|0": {
"mc1": 0.3708690330477356,
"mc1_stderr": 0.01690969358024882,
"mc2": 0.5615878106808432,
"mc2_stderr": 0.015165049093519522
},
"harness|winogrande|5": {
"acc": 0.749802683504341,
"acc_stderr": 0.012173009642449138
},
"harness|gsm8k|5": {
"acc": 0.6952236542835482,
"acc_stderr": 0.012679297549515434
}
}
```
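Each entry above pairs a score with its standard error, so a rough uncertainty range can be attached to any task. The following is a minimal sketch, assuming a normal approximation (z = 1.96 for roughly 95% coverage). The `results` dictionary is a small excerpt copied from the JSON above, and the helper names `confidence_interval` and `macro_average` are hypothetical. The unweighted mean shown here is for illustration only and is not necessarily how the leaderboard aggregates subtasks.

```python
from statistics import mean

# Excerpt of the results shown above (values copied verbatim).
results = {
    "harness|hendrycksTest-abstract_algebra|5": {
        "acc": 0.29, "acc_stderr": 0.045604802157206845},
    "harness|hendrycksTest-anatomy|5": {
        "acc": 0.6370370370370371, "acc_stderr": 0.04153948404742399},
    "harness|winogrande|5": {
        "acc": 0.749802683504341, "acc_stderr": 0.012173009642449138},
    "harness|gsm8k|5": {
        "acc": 0.6952236542835482, "acc_stderr": 0.012679297549515434},
}


def confidence_interval(entry, metric="acc", z=1.96):
    """Return (low, high) as value -/+ z * standard error."""
    value = entry[metric]
    half_width = z * entry[f"{metric}_stderr"]
    return value - half_width, value + half_width


def macro_average(all_results, prefix, metric="acc"):
    """Unweighted mean of `metric` over the tasks whose key starts with `prefix`."""
    scores = [v[metric] for k, v in all_results.items() if k.startswith(prefix)]
    return mean(scores)


low, high = confidence_interval(results["harness|winogrande|5"])
print(f"winogrande acc, approx. 95% CI: [{low:.3f}, {high:.3f}]")
avg = macro_average(results, "harness|hendrycksTest-")
print(f"mean acc over the hendrycksTest entries in this excerpt: {avg:.3f}")
```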
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
[More Information Needed]
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
Provided by:
open-llm-leaderboard
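Summary of original information
Dataset overview
Dataset name
- pretty_name: Evaluation run of vicgalle/Configurable-Llama-3-8B-v0.1
Dataset description
- dataset_summary: This dataset was created automatically during the evaluation run of the model vicgalle/Configurable-Llama-3-8B-v0.1 on the Open LLM Leaderboard.
Dataset composition
- Data structure: 63 configurations, each corresponding to one evaluated task.
- Data source: The dataset was created from 1 run. Each run is stored as a specific split named after the run's timestamp.
- Special configuration: An additional configuration named "results" stores the aggregated results of the run and is used to compute and display the aggregated metrics.
Data loading example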
```python
from datasets import load_dataset
data = load_dataset(
    "open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1",
    "harness_winogrande_5",
    split="latest",
)
```
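Latest results
- Result source: The latest results come from run 2024-04-19T01:45:07.088638.
- Result contents: Evaluation results for multiple tasks, such as accuracy (acc) and its standard error (acc_stderr).
Dataset configuration details
Configuration list
- harness_arc_challenge_25
- Data files: Splits for the specific timestamp and for the latest results.
- harness_gsm8k_5
- Data files: Splits for the specific timestamp and for the latest results.
- harness_hellaswag_10
- Data files: Splits for the specific timestamp and for the latest results.
- harness_hendrycksTest_5
- Data files: Splits for the specific timestamp and for the latest results, covering multiple subtasks.
The information above summarizes the dataset's basic facts and configuration details: its structure, source, loading method, and the data files of specific configurations.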
Collected summary
Dataset introduction

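Construction
In large language model evaluation, this dataset is generated automatically by the Open LLM Leaderboard evaluation framework to track model performance systematically. It is built from a single evaluation run of the model vicgalle/Configurable-Llama-3-8B-v0.1 and comprises 63 separate configurations, each corresponding to one evaluation task. The raw evaluation results for each task are stored in Parquet format in a split named after the run's timestamp, and a 'latest' split always points to the most recent results. An additional 'results' configuration collects the aggregated metrics for all tasks, which are used to compute and display the overall scores on the leaderboard.
Characteristics
The dataset has a clear hierarchical structure. The 63 evaluation tasks are organized as independent configurations, each containing run splits distinguished by timestamp, so researchers can trace model performance at different points in time. The 'latest' split always gives the newest evaluation data, and the 'results' configuration provides an aggregated performance overview in one place. The tasks range from commonsense reasoning (e.g. ARC, HellaSwag) to specialized subjects (e.g. medicine, law). The reported metrics are accuracy and its standard error, which gives a quantitative view of model capability along several dimensions.
Usage
The dataset can be loaded with the Hugging Face Datasets library. To load the latest Winogrande results, for example, call load_dataset with the configuration name 'harness_winogrande_5' and the split 'latest'. To get the results of a specific past run, replace the split argument with the corresponding timestamp identifier. For aggregated results, access the 'results' configuration directly, which stores the performance metrics for all tasks. This layout supports both longitudinal comparison of a model and cross-task analysis.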
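The loading pattern described above can be sketched as follows. This is a minimal illustration, not the leaderboard's own tooling: `load_args` and `load_details` are hypothetical helper names, the repository ID, configuration names, and split names are copied from this card, and the actual download requires the `datasets` package and network access, so the import is deferred until `load_details` is called.

```python
REPO_ID = "open-llm-leaderboard/details_vicgalle__Configurable-Llama-3-8B-v0.1"
RUN_SPLIT = "2024_04_19T01_45_07.088638"  # timestamped split listed in the card's configs


def load_args(config_name, split="latest"):
    """Build keyword arguments for datasets.load_dataset (no network access)."""
    return {"path": REPO_ID, "name": config_name, "split": split}


def load_details(config_name, split="latest"):
    """Load one configuration; needs the `datasets` package and network access."""
    from datasets import load_dataset  # deferred so the sketch imports cleanly offline
    return load_dataset(**load_args(config_name, split))


# Arguments for three typical requests: latest per-task details,
# the same task pinned to a specific run, and the aggregated results.
for args in (
    load_args("harness_winogrande_5"),
    load_args("harness_winogrande_5", split=RUN_SPLIT),
    load_args("results"),
):
    print(args)
```

Calling `load_details("results")` would then fetch the aggregated metrics, while passing `split=RUN_SPLIT` pins a request to the single run recorded in this card.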
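Background and challenges
Background overview
With the rapid development of large language models (LLMs), systematically evaluating their capabilities across many dimensions has become a core challenge. Against this background, the Hugging Face team created the Open LLM Leaderboard in 2023 to give the community a standardized, reproducible benchmark for model evaluation. This dataset was built automatically from the evaluation results of vicgalle/Configurable-Llama-3-8B-v0.1 on the leaderboard. It consists of 63 configurations covering ARC Challenge, HellaSwag, GSM8K, and MMLU with its 57 subjects, among other tasks. A unified evaluation pipeline records the model's performance on commonsense reasoning, math problem solving, and knowledge question answering, which provides a reference for later model optimization and comparison and contributes to more standardized and transparent LLM evaluation.
Current challenges
The problem this dataset addresses is the fragmentation and inconsistency of LLM evaluation: different studies often use different datasets and evaluation protocols, which makes results hard to compare. Challenges in building it include the following. First, the model's raw outputs on heterogeneous tasks (such as generative reasoning and multiple choice) must be converted into evaluation metrics in a uniform format, which involves nontrivial post-processing. Second, results must be kept up to date over time, so the dataset maintains timestamped versions (such as 2024-04-19T01-45-07.088638) to keep each evaluation traceable while the latest results remain immediately available. Third, fine-grained subject splits such as MMLU require configuring and aggregating 57 separate subtasks, which adds to the complexity of data management.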
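Common scenarios
Typical use cases
As natural language processing and large language models advance rapidly, accurate evaluation of model performance has become a key part of technical progress. This dataset was created for model evaluation on the Open LLM Leaderboard, and its typical use is to record systematically the evaluation results of vicgalle/Configurable-Llama-3-8B-v0.1 across 63 configurations. By loading the configuration for each task (such as ARC Challenge, HellaSwag, or GSM8K), researchers can analyze the model's performance on commonsense reasoning, math problem solving, and knowledge question answering, and obtain quantitative evidence for model iteration.
Academic problems addressed
Academic work has long struggled with fragmented, hard-to-reproduce evaluation of large language models. By storing the model's fine-grained results on several benchmarks (including MMLU, TruthfulQA, and WinoGrande) in a standardized form, this dataset helps make evaluation results traceable and experimental comparisons consistent. Its structured layout lets researchers compare different models on the same task, which provides a data basis for probing the limits of model capability and diagnosing model weaknesses, and supports more standardized and transparent language model evaluation.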
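Related derived work
Work that datasets of this kind can support includes automated evaluation pipelines for continuous monitoring and comparison of model performance. The fine-grained results can be used to analyze how a model's ability is distributed across knowledge domains, for example in studies of knowledge blind spots and bias detection. They may also inform the exploration of model ensembling strategies that combine per-task strengths, with possible relevance to multi-task learning and transfer learning experiments.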
The content above was collected and summarized by 遇见数据集.



