open-llm-leaderboard-old/details_0-hero__Matter-0.2-7B
Source: Hugging Face | Updated: 2024-04-03 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/open-llm-leaderboard-old/details_0-hero__Matter-0.2-7B
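The card below names each configuration after the corresponding results key, and each run-specific split after the run timestamp. The following is a minimal sketch of that naming, inferred only from the entries listed in this card (for example `harness|arc:challenge|25` and `harness_arc_challenge_25`); it is not taken from the leaderboard's own code. The helper names `config_name`, `split_name`, and `load_details` are hypothetical, and the default repository ID is the one used in the card's own snippet, which may differ from the `open-llm-leaderboard-old` path in the page title.

```python
import re


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to its config name.

    Assumption: '|', ':' and '-' all become '_', as in the configs listed in this card.
    """
    return re.sub(r"[|:-]", "_", task_key)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-04-03T01:45:27.142042' to its split name.

    Assumption: '-' and ':' become '_' and the fractional seconds are kept unchanged.
    """
    return re.sub(r"[-:]", "_", run_timestamp)


def load_details(task_key: str,
                 split: str = "latest",
                 repo: str = "open-llm-leaderboard/details_0-hero__Matter-0.2-7B"):
    """Load the per-sample details for one task (requires network access).

    The `datasets` import is deferred so the naming helpers work without it installed.
    """
    from datasets import load_dataset
    return load_dataset(repo, config_name(task_key), split=split)


print(config_name("harness|hendrycksTest-abstract_algebra|5"))
print(split_name("2024-04-03T01:45:27.142042"))
```

Calling `load_details("harness|gsm8k|5")` would then request the `harness_gsm8k_5` configuration with the `latest` split, assuming the repository is reachable under the given ID.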
Resource description:
---
pretty_name: Evaluation run of 0-hero/Matter-0.2-7B
dataset_summary: "Dataset automatically created during the evaluation run of model\
\ [0-hero/Matter-0.2-7B](https://huggingface.co/0-hero/Matter-0.2-7B) on the [Open\
\ LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n\
\nThe dataset is composed of 63 configurations, each one corresponding to one of the\
\ evaluated tasks.\n\nThe dataset has been created from 1 run. Each run can be\
\ found as a specific split in each configuration, the split being named using the\
\ timestamp of the run. The \"latest\" split always points to the latest results.\n\
\nAn additional configuration \"results\" stores all the aggregated results of the\
\ run (and is used to compute and display the aggregated metrics on the [Open LLM\
\ Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).\n\
\nTo load the details from a run, you can for instance do the following:\n```python\n\
from datasets import load_dataset\ndata = load_dataset(\"open-llm-leaderboard/details_0-hero__Matter-0.2-7B\"\
,\n\t\"harness_winogrande_5\",\n\tsplit=\"latest\")\n```\n\n## Latest results\n\n\
These are the [latest results from run 2024-04-03T01:45:27.142042](https://huggingface.co/datasets/open-llm-leaderboard/details_0-hero__Matter-0.2-7B/blob/main/results_2024-04-03T01-45-27.142042.json) (note\
\ that there might be results for other tasks in the repo if successive evals didn't\
\ cover the same tasks. You can find each in the results and the \"latest\" split for\
\ each eval):\n\n```python\n{\n \"all\": {\n \"acc\": 0.6258435998641223,\n\
\ \"acc_stderr\": 0.03245288877827044,\n \"acc_norm\": 0.6283412392485703,\n\
\ \"acc_norm_stderr\": 0.03310716675283535,\n \"mc1\": 0.3353733170134639,\n\
\ \"mc1_stderr\": 0.01652753403966899,\n \"mc2\": 0.481088597087512,\n\
\ \"mc2_stderr\": 0.015055232875750942\n },\n \"harness|arc:challenge|25\"\
: {\n \"acc\": 0.5819112627986348,\n \"acc_stderr\": 0.014413988396996076,\n\
\ \"acc_norm\": 0.6160409556313993,\n \"acc_norm_stderr\": 0.01421244498065189\n\
\ },\n \"harness|hellaswag|10\": {\n \"acc\": 0.6285600477992431,\n\
\ \"acc_stderr\": 0.004822022254886021,\n \"acc_norm\": 0.8239394542919737,\n\
\ \"acc_norm_stderr\": 0.003800932770597754\n },\n \"harness|hendrycksTest-abstract_algebra|5\"\
: {\n \"acc\": 0.31,\n \"acc_stderr\": 0.04648231987117316,\n \
\ \"acc_norm\": 0.31,\n \"acc_norm_stderr\": 0.04648231987117316\n \
\ },\n \"harness|hendrycksTest-anatomy|5\": {\n \"acc\": 0.6370370370370371,\n\
\ \"acc_stderr\": 0.041539484047423976,\n \"acc_norm\": 0.6370370370370371,\n\
\ \"acc_norm_stderr\": 0.041539484047423976\n },\n \"harness|hendrycksTest-astronomy|5\"\
: {\n \"acc\": 0.6842105263157895,\n \"acc_stderr\": 0.03782728980865469,\n\
\ \"acc_norm\": 0.6842105263157895,\n \"acc_norm_stderr\": 0.03782728980865469\n\
\ },\n \"harness|hendrycksTest-business_ethics|5\": {\n \"acc\": 0.59,\n\
\ \"acc_stderr\": 0.049431107042371025,\n \"acc_norm\": 0.59,\n \
\ \"acc_norm_stderr\": 0.049431107042371025\n },\n \"harness|hendrycksTest-clinical_knowledge|5\"\
: {\n \"acc\": 0.6566037735849056,\n \"acc_stderr\": 0.02922452646912479,\n\
\ \"acc_norm\": 0.6566037735849056,\n \"acc_norm_stderr\": 0.02922452646912479\n\
\ },\n \"harness|hendrycksTest-college_biology|5\": {\n \"acc\": 0.7291666666666666,\n\
\ \"acc_stderr\": 0.03716177437566017,\n \"acc_norm\": 0.7291666666666666,\n\
\ \"acc_norm_stderr\": 0.03716177437566017\n },\n \"harness|hendrycksTest-college_chemistry|5\"\
: {\n \"acc\": 0.43,\n \"acc_stderr\": 0.04975698519562428,\n \
\ \"acc_norm\": 0.43,\n \"acc_norm_stderr\": 0.04975698519562428\n \
\ },\n \"harness|hendrycksTest-college_computer_science|5\": {\n \"acc\"\
: 0.49,\n \"acc_stderr\": 0.05024183937956912,\n \"acc_norm\": 0.49,\n\
\ \"acc_norm_stderr\": 0.05024183937956912\n },\n \"harness|hendrycksTest-college_mathematics|5\"\
: {\n \"acc\": 0.34,\n \"acc_stderr\": 0.04760952285695235,\n \
\ \"acc_norm\": 0.34,\n \"acc_norm_stderr\": 0.04760952285695235\n \
\ },\n \"harness|hendrycksTest-college_medicine|5\": {\n \"acc\": 0.6127167630057804,\n\
\ \"acc_stderr\": 0.037143259063020656,\n \"acc_norm\": 0.6127167630057804,\n\
\ \"acc_norm_stderr\": 0.037143259063020656\n },\n \"harness|hendrycksTest-college_physics|5\"\
: {\n \"acc\": 0.3333333333333333,\n \"acc_stderr\": 0.04690650298201942,\n\
\ \"acc_norm\": 0.3333333333333333,\n \"acc_norm_stderr\": 0.04690650298201942\n\
\ },\n \"harness|hendrycksTest-computer_security|5\": {\n \"acc\":\
\ 0.73,\n \"acc_stderr\": 0.044619604333847415,\n \"acc_norm\": 0.73,\n\
\ \"acc_norm_stderr\": 0.044619604333847415\n },\n \"harness|hendrycksTest-conceptual_physics|5\"\
: {\n \"acc\": 0.5574468085106383,\n \"acc_stderr\": 0.03246956919789958,\n\
\ \"acc_norm\": 0.5574468085106383,\n \"acc_norm_stderr\": 0.03246956919789958\n\
\ },\n \"harness|hendrycksTest-econometrics|5\": {\n \"acc\": 0.47368421052631576,\n\
\ \"acc_stderr\": 0.046970851366478626,\n \"acc_norm\": 0.47368421052631576,\n\
\ \"acc_norm_stderr\": 0.046970851366478626\n },\n \"harness|hendrycksTest-electrical_engineering|5\"\
: {\n \"acc\": 0.5448275862068965,\n \"acc_stderr\": 0.04149886942192117,\n\
\ \"acc_norm\": 0.5448275862068965,\n \"acc_norm_stderr\": 0.04149886942192117\n\
\ },\n \"harness|hendrycksTest-elementary_mathematics|5\": {\n \"acc\"\
: 0.41798941798941797,\n \"acc_stderr\": 0.025402555503260912,\n \"\
acc_norm\": 0.41798941798941797,\n \"acc_norm_stderr\": 0.025402555503260912\n\
\ },\n \"harness|hendrycksTest-formal_logic|5\": {\n \"acc\": 0.4126984126984127,\n\
\ \"acc_stderr\": 0.04403438954768176,\n \"acc_norm\": 0.4126984126984127,\n\
\ \"acc_norm_stderr\": 0.04403438954768176\n },\n \"harness|hendrycksTest-global_facts|5\"\
: {\n \"acc\": 0.37,\n \"acc_stderr\": 0.04852365870939099,\n \
\ \"acc_norm\": 0.37,\n \"acc_norm_stderr\": 0.04852365870939099\n \
\ },\n \"harness|hendrycksTest-high_school_biology|5\": {\n \"acc\": 0.7354838709677419,\n\
\ \"acc_stderr\": 0.025091892378859275,\n \"acc_norm\": 0.7354838709677419,\n\
\ \"acc_norm_stderr\": 0.025091892378859275\n },\n \"harness|hendrycksTest-high_school_chemistry|5\"\
: {\n \"acc\": 0.47783251231527096,\n \"acc_stderr\": 0.035145285621750094,\n\
\ \"acc_norm\": 0.47783251231527096,\n \"acc_norm_stderr\": 0.035145285621750094\n\
\ },\n \"harness|hendrycksTest-high_school_computer_science|5\": {\n \
\ \"acc\": 0.73,\n \"acc_stderr\": 0.044619604333847394,\n \"acc_norm\"\
: 0.73,\n \"acc_norm_stderr\": 0.044619604333847394\n },\n \"harness|hendrycksTest-high_school_european_history|5\"\
: {\n \"acc\": 0.7393939393939394,\n \"acc_stderr\": 0.034277431758165236,\n\
\ \"acc_norm\": 0.7393939393939394,\n \"acc_norm_stderr\": 0.034277431758165236\n\
\ },\n \"harness|hendrycksTest-high_school_geography|5\": {\n \"acc\"\
: 0.8181818181818182,\n \"acc_stderr\": 0.027479603010538808,\n \"\
acc_norm\": 0.8181818181818182,\n \"acc_norm_stderr\": 0.027479603010538808\n\
\ },\n \"harness|hendrycksTest-high_school_government_and_politics|5\": {\n\
\ \"acc\": 0.8497409326424871,\n \"acc_stderr\": 0.025787723180723886,\n\
\ \"acc_norm\": 0.8497409326424871,\n \"acc_norm_stderr\": 0.025787723180723886\n\
\ },\n \"harness|hendrycksTest-high_school_macroeconomics|5\": {\n \
\ \"acc\": 0.6153846153846154,\n \"acc_stderr\": 0.024666744915187208,\n\
\ \"acc_norm\": 0.6153846153846154,\n \"acc_norm_stderr\": 0.024666744915187208\n\
\ },\n \"harness|hendrycksTest-high_school_mathematics|5\": {\n \"\
acc\": 0.3037037037037037,\n \"acc_stderr\": 0.02803792996911499,\n \
\ \"acc_norm\": 0.3037037037037037,\n \"acc_norm_stderr\": 0.02803792996911499\n\
\ },\n \"harness|hendrycksTest-high_school_microeconomics|5\": {\n \
\ \"acc\": 0.6428571428571429,\n \"acc_stderr\": 0.031124619309328177,\n\
\ \"acc_norm\": 0.6428571428571429,\n \"acc_norm_stderr\": 0.031124619309328177\n\
\ },\n \"harness|hendrycksTest-high_school_physics|5\": {\n \"acc\"\
: 0.32450331125827814,\n \"acc_stderr\": 0.038227469376587525,\n \"\
acc_norm\": 0.32450331125827814,\n \"acc_norm_stderr\": 0.038227469376587525\n\
\ },\n \"harness|hendrycksTest-high_school_psychology|5\": {\n \"acc\"\
: 0.8220183486238533,\n \"acc_stderr\": 0.016399436366612896,\n \"\
acc_norm\": 0.8220183486238533,\n \"acc_norm_stderr\": 0.016399436366612896\n\
\ },\n \"harness|hendrycksTest-high_school_statistics|5\": {\n \"acc\"\
: 0.4444444444444444,\n \"acc_stderr\": 0.03388857118502326,\n \"\
acc_norm\": 0.4444444444444444,\n \"acc_norm_stderr\": 0.03388857118502326\n\
\ },\n \"harness|hendrycksTest-high_school_us_history|5\": {\n \"acc\"\
: 0.8186274509803921,\n \"acc_stderr\": 0.027044621719474082,\n \"\
acc_norm\": 0.8186274509803921,\n \"acc_norm_stderr\": 0.027044621719474082\n\
\ },\n \"harness|hendrycksTest-high_school_world_history|5\": {\n \"\
acc\": 0.7974683544303798,\n \"acc_stderr\": 0.026160568246601443,\n \
\ \"acc_norm\": 0.7974683544303798,\n \"acc_norm_stderr\": 0.026160568246601443\n\
\ },\n \"harness|hendrycksTest-human_aging|5\": {\n \"acc\": 0.672645739910314,\n\
\ \"acc_stderr\": 0.03149384670994131,\n \"acc_norm\": 0.672645739910314,\n\
\ \"acc_norm_stderr\": 0.03149384670994131\n },\n \"harness|hendrycksTest-human_sexuality|5\"\
: {\n \"acc\": 0.7786259541984732,\n \"acc_stderr\": 0.03641297081313732,\n\
\ \"acc_norm\": 0.7786259541984732,\n \"acc_norm_stderr\": 0.03641297081313732\n\
\ },\n \"harness|hendrycksTest-international_law|5\": {\n \"acc\":\
\ 0.743801652892562,\n \"acc_stderr\": 0.03984979653302872,\n \"acc_norm\"\
: 0.743801652892562,\n \"acc_norm_stderr\": 0.03984979653302872\n },\n\
\ \"harness|hendrycksTest-jurisprudence|5\": {\n \"acc\": 0.7962962962962963,\n\
\ \"acc_stderr\": 0.03893542518824847,\n \"acc_norm\": 0.7962962962962963,\n\
\ \"acc_norm_stderr\": 0.03893542518824847\n },\n \"harness|hendrycksTest-logical_fallacies|5\"\
: {\n \"acc\": 0.7423312883435583,\n \"acc_stderr\": 0.03436150827846917,\n\
\ \"acc_norm\": 0.7423312883435583,\n \"acc_norm_stderr\": 0.03436150827846917\n\
\ },\n \"harness|hendrycksTest-machine_learning|5\": {\n \"acc\": 0.45535714285714285,\n\
\ \"acc_stderr\": 0.047268355537191,\n \"acc_norm\": 0.45535714285714285,\n\
\ \"acc_norm_stderr\": 0.047268355537191\n },\n \"harness|hendrycksTest-management|5\"\
: {\n \"acc\": 0.8058252427184466,\n \"acc_stderr\": 0.03916667762822583,\n\
\ \"acc_norm\": 0.8058252427184466,\n \"acc_norm_stderr\": 0.03916667762822583\n\
\ },\n \"harness|hendrycksTest-marketing|5\": {\n \"acc\": 0.8632478632478633,\n\
\ \"acc_stderr\": 0.022509033937077802,\n \"acc_norm\": 0.8632478632478633,\n\
\ \"acc_norm_stderr\": 0.022509033937077802\n },\n \"harness|hendrycksTest-medical_genetics|5\"\
: {\n \"acc\": 0.72,\n \"acc_stderr\": 0.045126085985421276,\n \
\ \"acc_norm\": 0.72,\n \"acc_norm_stderr\": 0.045126085985421276\n \
\ },\n \"harness|hendrycksTest-miscellaneous|5\": {\n \"acc\": 0.8007662835249042,\n\
\ \"acc_stderr\": 0.01428337804429642,\n \"acc_norm\": 0.8007662835249042,\n\
\ \"acc_norm_stderr\": 0.01428337804429642\n },\n \"harness|hendrycksTest-moral_disputes|5\"\
: {\n \"acc\": 0.708092485549133,\n \"acc_stderr\": 0.024476994076247337,\n\
\ \"acc_norm\": 0.708092485549133,\n \"acc_norm_stderr\": 0.024476994076247337\n\
\ },\n \"harness|hendrycksTest-moral_scenarios|5\": {\n \"acc\": 0.3865921787709497,\n\
\ \"acc_stderr\": 0.016286674879101022,\n \"acc_norm\": 0.3865921787709497,\n\
\ \"acc_norm_stderr\": 0.016286674879101022\n },\n \"harness|hendrycksTest-nutrition|5\"\
: {\n \"acc\": 0.7058823529411765,\n \"acc_stderr\": 0.026090162504279056,\n\
\ \"acc_norm\": 0.7058823529411765,\n \"acc_norm_stderr\": 0.026090162504279056\n\
\ },\n \"harness|hendrycksTest-philosophy|5\": {\n \"acc\": 0.7009646302250804,\n\
\ \"acc_stderr\": 0.02600330111788513,\n \"acc_norm\": 0.7009646302250804,\n\
\ \"acc_norm_stderr\": 0.02600330111788513\n },\n \"harness|hendrycksTest-prehistory|5\"\
: {\n \"acc\": 0.7006172839506173,\n \"acc_stderr\": 0.025483115601195448,\n\
\ \"acc_norm\": 0.7006172839506173,\n \"acc_norm_stderr\": 0.025483115601195448\n\
\ },\n \"harness|hendrycksTest-professional_accounting|5\": {\n \"\
acc\": 0.48226950354609927,\n \"acc_stderr\": 0.02980873964223777,\n \
\ \"acc_norm\": 0.48226950354609927,\n \"acc_norm_stderr\": 0.02980873964223777\n\
\ },\n \"harness|hendrycksTest-professional_law|5\": {\n \"acc\": 0.4621903520208605,\n\
\ \"acc_stderr\": 0.012733671880342506,\n \"acc_norm\": 0.4621903520208605,\n\
\ \"acc_norm_stderr\": 0.012733671880342506\n },\n \"harness|hendrycksTest-professional_medicine|5\"\
: {\n \"acc\": 0.5992647058823529,\n \"acc_stderr\": 0.029768263528933105,\n\
\ \"acc_norm\": 0.5992647058823529,\n \"acc_norm_stderr\": 0.029768263528933105\n\
\ },\n \"harness|hendrycksTest-professional_psychology|5\": {\n \"\
acc\": 0.6666666666666666,\n \"acc_stderr\": 0.019070985589687495,\n \
\ \"acc_norm\": 0.6666666666666666,\n \"acc_norm_stderr\": 0.019070985589687495\n\
\ },\n \"harness|hendrycksTest-public_relations|5\": {\n \"acc\": 0.6636363636363637,\n\
\ \"acc_stderr\": 0.04525393596302506,\n \"acc_norm\": 0.6636363636363637,\n\
\ \"acc_norm_stderr\": 0.04525393596302506\n },\n \"harness|hendrycksTest-security_studies|5\"\
: {\n \"acc\": 0.6857142857142857,\n \"acc_stderr\": 0.02971932942241748,\n\
\ \"acc_norm\": 0.6857142857142857,\n \"acc_norm_stderr\": 0.02971932942241748\n\
\ },\n \"harness|hendrycksTest-sociology|5\": {\n \"acc\": 0.8208955223880597,\n\
\ \"acc_stderr\": 0.027113286753111837,\n \"acc_norm\": 0.8208955223880597,\n\
\ \"acc_norm_stderr\": 0.027113286753111837\n },\n \"harness|hendrycksTest-us_foreign_policy|5\"\
: {\n \"acc\": 0.9,\n \"acc_stderr\": 0.030151134457776348,\n \
\ \"acc_norm\": 0.9,\n \"acc_norm_stderr\": 0.030151134457776348\n \
\ },\n \"harness|hendrycksTest-virology|5\": {\n \"acc\": 0.5240963855421686,\n\
\ \"acc_stderr\": 0.03887971849597264,\n \"acc_norm\": 0.5240963855421686,\n\
\ \"acc_norm_stderr\": 0.03887971849597264\n },\n \"harness|hendrycksTest-world_religions|5\"\
: {\n \"acc\": 0.8070175438596491,\n \"acc_stderr\": 0.030267457554898458,\n\
\ \"acc_norm\": 0.8070175438596491,\n \"acc_norm_stderr\": 0.030267457554898458\n\
\ },\n \"harness|truthfulqa:mc|0\": {\n \"mc1\": 0.3353733170134639,\n\
\ \"mc1_stderr\": 0.01652753403966899,\n \"mc2\": 0.481088597087512,\n\
\ \"mc2_stderr\": 0.015055232875750942\n },\n \"harness|winogrande|5\"\
: {\n \"acc\": 0.7947908445146015,\n \"acc_stderr\": 0.011350315707462059\n\
\ },\n \"harness|gsm8k|5\": {\n \"acc\": 0.5390447308567097,\n \
\ \"acc_stderr\": 0.01373042844911634\n }\n}\n```"
repo_url: https://huggingface.co/0-hero/Matter-0.2-7B
leaderboard_url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
point_of_contact: clementine@hf.co
configs:
- config_name: harness_arc_challenge_25
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|arc:challenge|25_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|arc:challenge|25_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_gsm8k_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|gsm8k|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|gsm8k|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hellaswag_10
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hellaswag|10_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hellaswag|10_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-international_law|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-management|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-marketing|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-sociology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-virology|5_2024-04-03T01-45-27.142042.parquet'
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_abstract_algebra_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-abstract_algebra|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_anatomy_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-anatomy|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_astronomy_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-astronomy|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_business_ethics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-business_ethics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_clinical_knowledge_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_college_biology_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_biology|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_college_chemistry_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_college_computer_science_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_college_mathematics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_college_medicine_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_medicine|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_college_physics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-college_physics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_computer_security_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-computer_security|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_conceptual_physics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-conceptual_physics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_econometrics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-econometrics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_electrical_engineering_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-electrical_engineering|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_elementary_mathematics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_formal_logic_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-formal_logic|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_global_facts_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-global_facts|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_biology_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_biology|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_chemistry_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_chemistry|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_computer_science_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_computer_science|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_european_history_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_european_history|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_geography_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_geography|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_government_and_politics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_government_and_politics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_macroeconomics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_macroeconomics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_mathematics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_mathematics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_microeconomics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_microeconomics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_physics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_physics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_psychology_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_psychology|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_statistics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_statistics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_us_history_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_us_history|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_high_school_world_history_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-high_school_world_history|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_human_aging_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_aging|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_human_sexuality_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-human_sexuality|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_international_law_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-international_law|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_jurisprudence_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-jurisprudence|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_logical_fallacies_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-logical_fallacies|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_machine_learning_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-machine_learning|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_management_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-management|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_marketing_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-marketing|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_medical_genetics_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-medical_genetics|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_miscellaneous_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-miscellaneous|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_moral_disputes_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_disputes|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_moral_scenarios_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-moral_scenarios|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_nutrition_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-nutrition|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_philosophy_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-philosophy|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_prehistory_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-prehistory|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_professional_accounting_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_accounting|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_professional_law_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_law|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_professional_medicine_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_medicine|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_professional_psychology_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-professional_psychology|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_public_relations_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-public_relations|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_security_studies_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-security_studies|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_sociology_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-sociology|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_us_foreign_policy_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-us_foreign_policy|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_virology_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-virology|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_hendrycksTest_world_religions_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|hendrycksTest-world_religions|5_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_truthfulqa_mc_0
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|truthfulqa:mc|0_2024-04-03T01-45-27.142042.parquet'
- config_name: harness_winogrande_5
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- '**/details_harness|winogrande|5_2024-04-03T01-45-27.142042.parquet'
- split: latest
path:
- '**/details_harness|winogrande|5_2024-04-03T01-45-27.142042.parquet'
- config_name: results
data_files:
- split: 2024_04_03T01_45_27.142042
path:
- results_2024-04-03T01-45-27.142042.parquet
- split: latest
path:
- results_2024-04-03T01-45-27.142042.parquet
---
# Dataset Card for Evaluation run of 0-hero/Matter-0.2-7B
<!-- Provide a quick summary of the dataset. -->
Dataset automatically created during the evaluation run of model [0-hero/Matter-0.2-7B](https://huggingface.co/0-hero/Matter-0.2-7B) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the most recent results.
An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
To load the details from a run, you can for instance do the following:
```python
from datasets import load_dataset
# Every configuration declares a timestamped split and a "latest" split.
data = load_dataset(
    "open-llm-leaderboard/details_0-hero__Matter-0.2-7B",
    "harness_winogrande_5",
    split="latest",
)
```
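The configuration names listed in the YAML header appear to follow directly from the task keys used in the results (for example `harness|arc:challenge|25` becomes `harness_arc_challenge_25`). The sketch below encodes that observed pattern; the helper names `config_name`, `load_task_details`, and `load_aggregated_results` are hypothetical, and the substitution rule is an assumption inferred from this card rather than a documented API. The split defaults to `latest`, one of the two splits declared for every configuration in the header.

```python
import re

REPO_ID = "open-llm-leaderboard/details_0-hero__Matter-0.2-7B"


def config_name(task_key: str) -> str:
    """Map a results key such as 'harness|arc:challenge|25' to a config name.

    Assumption: every character outside [0-9A-Za-z_] becomes an underscore,
    which matches the config names visible in this card.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", task_key)


def load_task_details(task_key: str, split: str = "latest"):
    """Load the per-sample details of one task (needs network access)."""
    # Imported lazily so that config_name() stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task_key), split=split)


def load_aggregated_results(split: str = "latest"):
    """Load the aggregated metrics stored in the "results" configuration."""
    from datasets import load_dataset

    return load_dataset(REPO_ID, "results", split=split)


print(config_name("harness|hendrycksTest-computer_security|5"))
```

Calling `load_task_details("harness|winogrande|5")` would then be equivalent to the loading snippet above, provided the repository is reachable under this name.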
## Latest results
These are the [latest results from run 2024-04-03T01:45:27.142042](https://huggingface.co/datasets/open-llm-leaderboard/details_0-hero__Matter-0.2-7B/blob/main/results_2024-04-03T01-45-27.142042.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks; you can find each of them in the "results" configuration and in the "latest" split of each eval):
```python
{
"all": {
"acc": 0.6258435998641223,
"acc_stderr": 0.03245288877827044,
"acc_norm": 0.6283412392485703,
"acc_norm_stderr": 0.03310716675283535,
"mc1": 0.3353733170134639,
"mc1_stderr": 0.01652753403966899,
"mc2": 0.481088597087512,
"mc2_stderr": 0.015055232875750942
},
"harness|arc:challenge|25": {
"acc": 0.5819112627986348,
"acc_stderr": 0.014413988396996076,
"acc_norm": 0.6160409556313993,
"acc_norm_stderr": 0.01421244498065189
},
"harness|hellaswag|10": {
"acc": 0.6285600477992431,
"acc_stderr": 0.004822022254886021,
"acc_norm": 0.8239394542919737,
"acc_norm_stderr": 0.003800932770597754
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.31,
"acc_stderr": 0.04648231987117316,
"acc_norm": 0.31,
"acc_norm_stderr": 0.04648231987117316
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.6370370370370371,
"acc_stderr": 0.041539484047423976,
"acc_norm": 0.6370370370370371,
"acc_norm_stderr": 0.041539484047423976
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.6842105263157895,
"acc_stderr": 0.03782728980865469,
"acc_norm": 0.6842105263157895,
"acc_norm_stderr": 0.03782728980865469
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.59,
"acc_stderr": 0.049431107042371025,
"acc_norm": 0.59,
"acc_norm_stderr": 0.049431107042371025
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.6566037735849056,
"acc_stderr": 0.02922452646912479,
"acc_norm": 0.6566037735849056,
"acc_norm_stderr": 0.02922452646912479
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.7291666666666666,
"acc_stderr": 0.03716177437566017,
"acc_norm": 0.7291666666666666,
"acc_norm_stderr": 0.03716177437566017
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.43,
"acc_stderr": 0.04975698519562428,
"acc_norm": 0.43,
"acc_norm_stderr": 0.04975698519562428
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.49,
"acc_stderr": 0.05024183937956912,
"acc_norm": 0.49,
"acc_norm_stderr": 0.05024183937956912
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.34,
"acc_stderr": 0.04760952285695235,
"acc_norm": 0.34,
"acc_norm_stderr": 0.04760952285695235
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.6127167630057804,
"acc_stderr": 0.037143259063020656,
"acc_norm": 0.6127167630057804,
"acc_norm_stderr": 0.037143259063020656
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.3333333333333333,
"acc_stderr": 0.04690650298201942,
"acc_norm": 0.3333333333333333,
"acc_norm_stderr": 0.04690650298201942
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.73,
"acc_stderr": 0.044619604333847415,
"acc_norm": 0.73,
"acc_norm_stderr": 0.044619604333847415
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5574468085106383,
"acc_stderr": 0.03246956919789958,
"acc_norm": 0.5574468085106383,
"acc_norm_stderr": 0.03246956919789958
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.47368421052631576,
"acc_stderr": 0.046970851366478626,
"acc_norm": 0.47368421052631576,
"acc_norm_stderr": 0.046970851366478626
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.5448275862068965,
"acc_stderr": 0.04149886942192117,
"acc_norm": 0.5448275862068965,
"acc_norm_stderr": 0.04149886942192117
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.41798941798941797,
"acc_stderr": 0.025402555503260912,
"acc_norm": 0.41798941798941797,
"acc_norm_stderr": 0.025402555503260912
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.4126984126984127,
"acc_stderr": 0.04403438954768176,
"acc_norm": 0.4126984126984127,
"acc_norm_stderr": 0.04403438954768176
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.37,
"acc_stderr": 0.04852365870939099,
"acc_norm": 0.37,
"acc_norm_stderr": 0.04852365870939099
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.7354838709677419,
"acc_stderr": 0.025091892378859275,
"acc_norm": 0.7354838709677419,
"acc_norm_stderr": 0.025091892378859275
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.47783251231527096,
"acc_stderr": 0.035145285621750094,
"acc_norm": 0.47783251231527096,
"acc_norm_stderr": 0.035145285621750094
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.73,
"acc_stderr": 0.044619604333847394,
"acc_norm": 0.73,
"acc_norm_stderr": 0.044619604333847394
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.7393939393939394,
"acc_stderr": 0.034277431758165236,
"acc_norm": 0.7393939393939394,
"acc_norm_stderr": 0.034277431758165236
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.8181818181818182,
"acc_stderr": 0.027479603010538808,
"acc_norm": 0.8181818181818182,
"acc_norm_stderr": 0.027479603010538808
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.8497409326424871,
"acc_stderr": 0.025787723180723886,
"acc_norm": 0.8497409326424871,
"acc_norm_stderr": 0.025787723180723886
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6153846153846154,
"acc_stderr": 0.024666744915187208,
"acc_norm": 0.6153846153846154,
"acc_norm_stderr": 0.024666744915187208
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.3037037037037037,
"acc_stderr": 0.02803792996911499,
"acc_norm": 0.3037037037037037,
"acc_norm_stderr": 0.02803792996911499
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.6428571428571429,
"acc_stderr": 0.031124619309328177,
"acc_norm": 0.6428571428571429,
"acc_norm_stderr": 0.031124619309328177
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.32450331125827814,
"acc_stderr": 0.038227469376587525,
"acc_norm": 0.32450331125827814,
"acc_norm_stderr": 0.038227469376587525
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.8220183486238533,
"acc_stderr": 0.016399436366612896,
"acc_norm": 0.8220183486238533,
"acc_norm_stderr": 0.016399436366612896
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.4444444444444444,
"acc_stderr": 0.03388857118502326,
"acc_norm": 0.4444444444444444,
"acc_norm_stderr": 0.03388857118502326
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.8186274509803921,
"acc_stderr": 0.027044621719474082,
"acc_norm": 0.8186274509803921,
"acc_norm_stderr": 0.027044621719474082
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.7974683544303798,
"acc_stderr": 0.026160568246601443,
"acc_norm": 0.7974683544303798,
"acc_norm_stderr": 0.026160568246601443
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.672645739910314,
"acc_stderr": 0.03149384670994131,
"acc_norm": 0.672645739910314,
"acc_norm_stderr": 0.03149384670994131
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.7786259541984732,
"acc_stderr": 0.03641297081313732,
"acc_norm": 0.7786259541984732,
"acc_norm_stderr": 0.03641297081313732
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.743801652892562,
"acc_stderr": 0.03984979653302872,
"acc_norm": 0.743801652892562,
"acc_norm_stderr": 0.03984979653302872
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.7962962962962963,
"acc_stderr": 0.03893542518824847,
"acc_norm": 0.7962962962962963,
"acc_norm_stderr": 0.03893542518824847
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.7423312883435583,
"acc_stderr": 0.03436150827846917,
"acc_norm": 0.7423312883435583,
"acc_norm_stderr": 0.03436150827846917
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.45535714285714285,
"acc_stderr": 0.047268355537191,
"acc_norm": 0.45535714285714285,
"acc_norm_stderr": 0.047268355537191
},
"harness|hendrycksTest-management|5": {
"acc": 0.8058252427184466,
"acc_stderr": 0.03916667762822583,
"acc_norm": 0.8058252427184466,
"acc_norm_stderr": 0.03916667762822583
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.8632478632478633,
"acc_stderr": 0.022509033937077802,
"acc_norm": 0.8632478632478633,
"acc_norm_stderr": 0.022509033937077802
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.72,
"acc_stderr": 0.045126085985421276,
"acc_norm": 0.72,
"acc_norm_stderr": 0.045126085985421276
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.8007662835249042,
"acc_stderr": 0.01428337804429642,
"acc_norm": 0.8007662835249042,
"acc_norm_stderr": 0.01428337804429642
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.708092485549133,
"acc_stderr": 0.024476994076247337,
"acc_norm": 0.708092485549133,
"acc_norm_stderr": 0.024476994076247337
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.3865921787709497,
"acc_stderr": 0.016286674879101022,
"acc_norm": 0.3865921787709497,
"acc_norm_stderr": 0.016286674879101022
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.7058823529411765,
"acc_stderr": 0.026090162504279056,
"acc_norm": 0.7058823529411765,
"acc_norm_stderr": 0.026090162504279056
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.7009646302250804,
"acc_stderr": 0.02600330111788513,
"acc_norm": 0.7009646302250804,
"acc_norm_stderr": 0.02600330111788513
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.7006172839506173,
"acc_stderr": 0.025483115601195448,
"acc_norm": 0.7006172839506173,
"acc_norm_stderr": 0.025483115601195448
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.48226950354609927,
"acc_stderr": 0.02980873964223777,
"acc_norm": 0.48226950354609927,
"acc_norm_stderr": 0.02980873964223777
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.4621903520208605,
"acc_stderr": 0.012733671880342506,
"acc_norm": 0.4621903520208605,
"acc_norm_stderr": 0.012733671880342506
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.5992647058823529,
"acc_stderr": 0.029768263528933105,
"acc_norm": 0.5992647058823529,
"acc_norm_stderr": 0.029768263528933105
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.6666666666666666,
"acc_stderr": 0.019070985589687495,
"acc_norm": 0.6666666666666666,
"acc_norm_stderr": 0.019070985589687495
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6636363636363637,
"acc_stderr": 0.04525393596302506,
"acc_norm": 0.6636363636363637,
"acc_norm_stderr": 0.04525393596302506
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.6857142857142857,
"acc_stderr": 0.02971932942241748,
"acc_norm": 0.6857142857142857,
"acc_norm_stderr": 0.02971932942241748
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.8208955223880597,
"acc_stderr": 0.027113286753111837,
"acc_norm": 0.8208955223880597,
"acc_norm_stderr": 0.027113286753111837
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.9,
"acc_stderr": 0.030151134457776348,
"acc_norm": 0.9,
"acc_norm_stderr": 0.030151134457776348
},
"harness|hendrycksTest-virology|5": {
"acc": 0.5240963855421686,
"acc_stderr": 0.03887971849597264,
"acc_norm": 0.5240963855421686,
"acc_norm_stderr": 0.03887971849597264
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.8070175438596491,
"acc_stderr": 0.030267457554898458,
"acc_norm": 0.8070175438596491,
"acc_norm_stderr": 0.030267457554898458
},
"harness|truthfulqa:mc|0": {
"mc1": 0.3353733170134639,
"mc1_stderr": 0.01652753403966899,
"mc2": 0.481088597087512,
"mc2_stderr": 0.015055232875750942
},
"harness|winogrande|5": {
"acc": 0.7947908445146015,
"acc_stderr": 0.011350315707462059
},
"harness|gsm8k|5": {
"acc": 0.5390447308567097,
"acc_stderr": 0.01373042844911634
}
}
```
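As an illustration of how the per-task entries can be combined, the following sketch averages the `acc` values of all keys sharing a prefix. It uses only a handful of entries copied from the results above; the function name `macro_average` is hypothetical, and an unweighted mean is an assumption made for this example, not necessarily the aggregation used by the leaderboard.

```python
from statistics import mean

# A small excerpt of the results shown above (values copied verbatim).
results = {
    "harness|arc:challenge|25": {"acc": 0.5819112627986348, "acc_norm": 0.6160409556313993},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.31},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6370370370370371},
    "harness|hendrycksTest-astronomy|5": {"acc": 0.6842105263157895},
}


def macro_average(results: dict, prefix: str, metric: str = "acc") -> float:
    """Unweighted mean of `metric` over all tasks whose key starts with `prefix`."""
    scores = [
        metrics[metric]
        for key, metrics in results.items()
        if key.startswith(prefix) and metric in metrics
    ]
    if not scores:
        raise ValueError(f"no task with prefix {prefix!r} reports {metric!r}")
    return mean(scores)


# Average over the three hendrycksTest (MMLU) subtasks in the excerpt only.
mmlu_subset_acc = macro_average(results, "harness|hendrycksTest-")
print(f"{mmlu_subset_acc:.4f}")
```

Applied to the full results dictionary instead of the excerpt, the same call would average over all hendrycksTest subtasks.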
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
[More Information Needed]
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
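This dataset was created automatically during the evaluation of the model 0-hero/Matter-0.2-7B on the Open LLM Leaderboard. It contains 63 configurations, each corresponding to one evaluation task. The dataset was created from 1 run; each run appears as a specific split in each configuration, named with the run's timestamp. The "latest" split always points to the most recent results. In addition, a configuration named "results" stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.
Provided by:
open-llm-leaderboard-old
Summary of original information
Dataset overview
Dataset introduction
This dataset was created automatically during the evaluation run of the model 0-hero/Matter-0.2-7B for the Open LLM Leaderboard.
Dataset structure
- Number of configurations: 63, each corresponding to one evaluation task.
- Number of runs: the dataset comes from 1 run. Each run exists as a specific split in each configuration, and the split name uses the run's timestamp.
- Split type: each configuration contains a split named "latest" that points to the most recent results.
- Aggregated results: an additional configuration "results" stores the aggregated results of all runs and is used to compute and display the aggregated metrics on the Open LLM Leaderboard.
Data loading example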

    from datasets import load_dataset
    data = load_dataset("open-llm-leaderboard/details_0-hero__Matter-0.2-7B", "harness_winogrande_5", split="latest")

Latest results
The latest results come from the run at 2024-04-03T01:45:27.142042 and cover multiple tasks, for example:
- harness|arc:challenge|25:
  acc: 0.5819112627986348, acc_stderr: 0.014413988396996076, acc_norm: 0.6160409556313993, acc_norm_stderr: 0.01421244498065189
- harness|hellaswag|10:
  acc: 0.6285600477992431, acc_stderr: 0.004822022254886021, acc_norm: 0.8239394542919737, acc_norm_stderr: 0.003800932770597754
- harness|hendrycksTest-abstract_algebra|5:
  acc: 0.31, acc_stderr: 0.04648231987117316, acc_norm: 0.31, acc_norm_stderr: 0.04648231987117316
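The reported `acc_stderr` values are consistent with the standard error of a mean over binary (correct or incorrect) outcomes. The sketch below checks this for the abstract_algebra entry. Two things are assumptions and are not stated in this card: the sample size of 100 questions, and the use of the sample standard deviation (dividing by n - 1). The function name `binomial_stderr` is hypothetical.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary outcomes with mean `acc`.

    Uses the sample variance (n - 1 in the denominator), an assumption
    about how the evaluation harness computes the value.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


reported = 0.04648231987117316          # acc_stderr listed for abstract_algebra
estimate = binomial_stderr(0.31, 100)   # n = 100 is assumed, not given in the card

print(f"reported={reported:.6f} estimate={estimate:.6f}")
```

If the two numbers agree, the assumed sample size and formula are plausible for that task; a mismatch would only mean that one of the assumptions does not hold.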
配置详情
- config_name: harness_arc_challenge_25
  data_files:
  - split: 2024_04_03T01_45_27.142042
    path: **/details_harness|arc:challenge|25_2024-04-03T01-45-27.142042.parquet
  - split: latest
    path: **/details_harness|arc:challenge|25_2024-04-03T01-45-27.142042.parquet
- config_name: harness_gsm8k_5
  data_files:
  - split: 2024_04_03T01_45_27.142042
    path: **/details_harness|gsm8k|5_2024-04-03T01-45-27.142042.parquet
  - split: latest
    path: **/details_harness|gsm8k|5_2024-04-03T01-45-27.142042.parquet
- config_name: harness_hellaswag_10
  data_files:
  - split: 2024_04_03T01_45_27.142042
    path: **/details_harness|hellaswag|10_2024-04-03T01-45-27.142042.parquet
  - split: latest
    path: **/details_harness|hellaswag|10_2024-04-03T01-45-27.142042.parquet
- config_name: harness_hendrycksTest_5
  data_files:
  - split: 2024_04_03T01_45_27.142042
    path:
    - **/details_harness|hendrycksTest-abstract_algebra|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-anatomy|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-astronomy|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-business_ethics|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-clinical_knowledge|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-college_biology|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-college_chemistry|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-college_computer_science|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-college_mathematics|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-college_medicine|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-college_physics|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-computer_security|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-conceptual_physics|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-econometrics|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-electrical_engineering|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-elementary_mathematics|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-formal_logic|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-global_facts|5_2024-04-03T01-45-27.142042.parquet
    - **/details_harness|hendrycksTest-high_school_biology|5_2024-04-03T01-45-27.142042.parquet
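The configuration entries above follow a regular naming pattern: the config name is the task name with `|` and `:` replaced by `_`, the timestamp split replaces `-` and `:` with `_`, and the parquet path uses the timestamp with `:` replaced by `-`. The sketch below reproduces that pattern as inferred from the entries shown here. The helper names are hypothetical, and the rule has only been checked against the configs listed above (it does not cover grouped configs such as harness_hendrycksTest_5, which bundle several task files).

```python
import re


def config_name(task: str) -> str:
    """Map a task name such as 'harness|arc:challenge|25' to its config name."""
    return re.sub(r"[|:]", "_", task)


def split_name(run_timestamp: str) -> str:
    """Map a run timestamp such as '2024-04-03T01:45:27.142042' to its split name."""
    return re.sub(r"[-:]", "_", run_timestamp)


def parquet_glob(task: str, run_timestamp: str) -> str:
    """Build the data-file glob for one task and one run."""
    file_stamp = run_timestamp.replace(":", "-")
    return f"**/details_{task}_{file_stamp}.parquet"


run = "2024-04-03T01:45:27.142042"
for task in ("harness|arc:challenge|25", "harness|gsm8k|5", "harness|hellaswag|10"):
    print(config_name(task), split_name(run), parquet_glob(task, run))
```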
Collected summary
Dataset introduction

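Construction
This dataset was generated during the automated evaluation of the 0-hero/Matter-0.2-7B model under the Open LLM Leaderboard evaluation framework. It contains 63 configurations, each corresponding to a specific evaluation task, such as ARC-Challenge, HellaSwag, GSM8K, WinoGrande, TruthfulQA, and the multi-subject MMLU benchmark. The results of each run are stored as a separate split named after the run's timestamp, while the "train" split always points to the most recent evaluation. The dataset also includes an additional configuration named "results", which collects the aggregated metrics of all runs and supports the computation and display of overall scores on the leaderboard.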
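Features
The dataset is organized into multiple configurations and multiple splits, recording a single model's performance across a diverse set of tasks. The splits under each configuration keep the timestamp of the original evaluation, and a "latest" split tracks the most recent run, so researchers can look up historical results or fetch the newest data. The dataset covers a broad range of evaluations, from commonsense reasoning and math problems to specialized subject knowledge, and reports fine-grained metrics such as accuracy, normalized accuracy, and their standard errors, which supports analysis of how the model's abilities differ across dimensions.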
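Usage
Researchers can load the dataset with the Hugging Face datasets library. For example, calling load_dataset with a configuration name (such as "harness_winogrande_5") and a split (such as "train") returns the latest evaluation details for that task. To access data from a historical run, replace the split argument with the corresponding timestamp string. This lets researchers extract data as needed, either to compare model performance over time or to evaluate it across tasks.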
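A minimal sketch of the two loading modes described above follows. The repository id is copied from the loading example in this page; since the page lists the dataset under open-llm-leaderboard-old, the id may need adjusting, which is an open assumption here. The helper names `build_request` and `load_run` are hypothetical, and the timestamp split name is taken from the configuration listing.

```python
# Repository id as written in the page's loading example (assumption: still valid).
REPO = "open-llm-leaderboard/details_0-hero__Matter-0.2-7B"
# Timestamp split name as it appears in the configuration listing.
RUN_SPLIT = "2024_04_03T01_45_27.142042"


def build_request(config, split="train", repo=REPO):
    """Assemble keyword arguments for datasets.load_dataset."""
    return {"path": repo, "name": config, "split": split}


def load_run(config, split="train", repo=REPO):
    """Load one configuration; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(**build_request(config, split, repo))


# Usage (not executed here because it downloads data):
#   latest = load_run("harness_winogrande_5")                       # latest results
#   historical = load_run("harness_winogrande_5", split=RUN_SPLIT)  # a specific run
print(build_request("harness_winogrande_5", RUN_SPLIT))
```

With only one run recorded, both calls would be expected to return the same rows; the distinction matters once a repository holds several runs.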
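Background and challenges
Background overview
The rapid development of large language models (LLMs) has created a pressing need for systematic performance evaluation, and the Open LLM Leaderboard emerged as an important benchmark platform for measuring models on diverse natural language understanding and reasoning tasks. This dataset records the complete evaluation of the model 0-hero/Matter-0.2-7B and was created by the HuggingFace team in 2024. Its core research question is how a standardized, reproducible evaluation framework can objectively reveal a model's strengths and weaknesses in commonsense reasoning, mathematical problem solving, knowledge question answering, and multi-subject understanding. The evaluation covers ARC-Challenge, HellaSwag, GSM8K, TruthfulQA, WinoGrande, and the MMLU test set spanning 57 subjects, for a total of 63 configurations. The results provide detailed data for profiling the Matter-0.2-7B model and a basis for later model optimization and community comparison studies, and they contribute to the transparent evaluation of open language models.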
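Current challenges
The core challenge this dataset addresses is building an LLM evaluation system that is both comprehensive and fair. At the domain level, large language models perform unevenly in reasoning ability, breadth of knowledge, and factual consistency, and a single-task evaluation cannot reflect their real capability. Multiple tasks therefore need to be combined, such as ARC-Challenge (scientific reasoning), HellaSwag (commonsense reasoning), GSM8K (mathematical reasoning), and MMLU (multi-subject knowledge), to capture a model's weaknesses and strengths in complex cognitive settings. In construction, the challenge lies in standardizing the evaluation process and keeping results reproducible: every run must use the same prompt format, sampling parameters, and scoring rules, and the metrics used by different tasks (such as accuracy, normalized accuracy, and MC1/MC2 scores) must be made compatible. In addition, because repeated evaluations can produce conflicting data versions, a clear versioning scheme (such as separating runs by timestamp) is needed to keep the latest results and historical data consistent and traceable.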
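Common scenarios
Typical use cases
In LLM performance evaluation, this dataset serves as the standardized evaluation record of the Open LLM Leaderboard, tracking the fine-grained performance of the 0-hero/Matter-0.2-7B model across 63 configurations. Its typical use is to load a specific configuration (such as harness_winogrande_5) together with a timestamp split in order to reproduce the model's exact results in commonsense reasoning, mathematical problem solving, and knowledge question answering, giving researchers a verifiable basis for benchmarking.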
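Academic problems addressed
The dataset addresses the fragmentation and irreproducibility of results in LLM evaluation. By storing the raw outputs and aggregated metrics of mainstream benchmarks such as ARC-Challenge, HellaSwag, and GSM8K in one place, it lets researchers pinpoint a model's strengths and weaknesses in specific areas (such as abstract algebra or medical knowledge). This helps reveal how architecture design and training strategy affect generalization, and it makes model comparison studies more standardized and transparent.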
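Derived and related work
The dataset has given rise to a body of work on LLM evaluation methodology and model diagnosis. For example, researchers have used its multi-task configurations to analyze biases in how a model's knowledge is distributed across MMLU subdomains, which led to curriculum learning strategies based on hard-example mining. The timestamp-split design inspired dynamic analyses of performance degradation, used to track regressions after model updates. In addition, its standardized parquet storage format has been widely adopted by later Leaderboard datasets.
The content above was collected and summarized by 遇见数据集.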



