
nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private

Hugging Face, updated 2024-12-05, indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private
Resource description:
--- pretty_name: Evaluation run of yunconglong/DARE_TIES_13B dataset_summary: "Dataset automatically created during the evaluation run of model\ \ [yunconglong/DARE_TIES_13B](https://huggingface.co/yunconglong/DARE_TIES_13B)\n\ The dataset is composed of 62 configuration(s), each one corresponding to one of\ \ the evaluated tasks.\n\nThe dataset has been created from 2 run(s). Each run can\ \ be found as a specific split in each configuration, the split being named using\ \ the timestamp of the run. The \"latest\" split is always pointing to the latest\ \ results.\n\nAn additional configuration \"results\" stores all the aggregated results\ \ of the run.\n\nTo load the details from a run, you can for instance do the following:\n\ ```python\nfrom datasets import load_dataset\ndata = load_dataset(\n\t\"nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private\"\ ,\n\tname=\"yunconglong__DARE_TIES_13B__BeaverTailsEval\",\n\tsplit=\"latest\"\n\ )\n```\n\n## Latest results\n\nThese are the [latest results from run 2024-12-04T20-37-46.218361](https://huggingface.co/datasets/nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private/blob/main/yunconglong/DARE_TIES_13B/results_2024-12-04T20-37-46.218361.json)\ \ (note that there might be results for other tasks in the repos if successive evals\ \ didn't cover the same tasks. 
You find each in the results and the \"latest\" split\ \ for each eval):\n\n```python\n{\n \"all\": {\n \"BeaverTailsEval\":\ \ {\n \"alias\": \"BeaverTailsEval\",\n \"acc,none\": 0.8714285714285714,\n\ \ \"acc_stderr,none\": 0.012660461716778634,\n \"acc_norm,none\"\ : 0.12428571428571429,\n \"acc_norm_stderr,none\": 0.012478237164470317\n\ \ },\n \"CDNA\": {\n \"alias\": \"CDNA\",\n \ \ \"acc,none\": 0.9552457813646368,\n \"acc_stderr,none\": 0.003960876492273638,\n\ \ \"acc_norm,none\": 0.001834189288334556,\n \"acc_norm_stderr,none\"\ : 0.0008196721291236438\n },\n \"DTToxicity\": {\n \"alias\"\ : \"DTToxicity\",\n \"acc,none\": 0.4837228714524207,\n \"\ acc_stderr,none\": 0.010211440125201749,\n \"acc_norm,none\": 0.5,\n\ \ \"acc_norm_stderr,none\": 0.010216855368051905\n },\n \ \ \"JailbreakHub\": {\n \"alias\": \"JailbreakHub\",\n \"\ acc,none\": 0.12450462351387054,\n \"acc_stderr,none\": 0.002683311387044548,\n\ \ \"acc_norm,none\": 0.0939894319682959,\n \"acc_norm_stderr,none\"\ : 0.002371687964555697\n },\n \"SGXSTest\": {\n \"alias\"\ : \"SGXSTest\",\n \"acc,none\": 0.5,\n \"acc_stderr,none\"\ : 0.0354440602504168,\n \"acc_norm,none\": 0.5,\n \"acc_norm_stderr,none\"\ : 0.0354440602504168\n },\n \"SaladBench\": {\n \"alias\"\ : \"SaladBench\",\n \"acc,none\": 0.49505208333333334,\n \"\ acc_stderr,none\": 0.008069370988058294,\n \"acc_norm,none\": 0.49505208333333334,\n\ \ \"acc_norm_stderr,none\": 0.008069370988058294\n },\n \ \ \"StrongREJECT\": {\n \"alias\": \"StrongREJECT\",\n \"\ acc,none\": 0.9744408945686901,\n \"acc_stderr,none\": 0.008934562241019864,\n\ \ \"acc_norm,none\": 0.2523961661341853,\n \"acc_norm_stderr,none\"\ : 0.024592339166678388\n },\n \"WildGuardTest\": {\n \"\ alias\": \"WildGuardTest\",\n \"acc,none\": 0.6121739130434782,\n \ \ \"acc_stderr,none\": 0.011735113323084431,\n \"acc_norm,none\"\ : 0.5617391304347826,\n \"acc_norm_stderr,none\": 0.011949921603028857\n\ \ },\n \"bbq\": {\n \"acc_norm,none\": 0.9339909731245298,\n\ \ 
\"acc_norm_stderr,none\": 0.0010120925842241903,\n \"acc,none\"\ : 0.933854202284073,\n \"acc_stderr,none\": 0.001014159063390077,\n \ \ \"alias\": \"bbq\"\n },\n \"bbq_age\": {\n \ \ \"alias\": \" - bbq_age\",\n \"acc,none\": 0.8347826086956521,\n \ \ \"acc_stderr,none\": 0.006122794490389976,\n \"acc_norm,none\"\ : 0.8323369565217391,\n \"acc_norm_stderr,none\": 0.006158903051518932\n\ \ },\n \"bbq_disabilitystatus\": {\n \"alias\": \" - bbq_disabilitystatus\"\ ,\n \"acc,none\": 0.9113110539845758,\n \"acc_stderr,none\"\ : 0.007209462202833219,\n \"acc_norm,none\": 0.9093830334190232,\n \ \ \"acc_norm_stderr,none\": 0.0072796916982102436\n },\n \ \ \"bbq_genderidentity\": {\n \"alias\": \" - bbq_genderidentity\",\n\ \ \"acc,none\": 0.9427009873060649,\n \"acc_stderr,none\"\ : 0.0030862473264601695,\n \"acc_norm,none\": 0.9423483779971791,\n \ \ \"acc_norm_stderr,none\": 0.0030951498876854062\n },\n \ \ \"bbq_nationality\": {\n \"alias\": \" - bbq_nationality\",\n \ \ \"acc,none\": 0.9194805194805195,\n \"acc_stderr,none\": 0.004903621087010461,\n\ \ \"acc_norm,none\": 0.9185064935064935,\n \"acc_norm_stderr,none\"\ : 0.004930577318136959\n },\n \"bbq_physicalappearance\": {\n \ \ \"alias\": \" - bbq_physicalappearance\",\n \"acc,none\": 0.8331218274111675,\n\ \ \"acc_stderr,none\": 0.009395366913005541,\n \"acc_norm,none\"\ : 0.8318527918781726,\n \"acc_norm_stderr,none\": 0.009423837540123783\n\ \ },\n \"bbq_raceethnicity\": {\n \"alias\": \" - bbq_raceethnicity\"\ ,\n \"acc,none\": 0.9210755813953488,\n \"acc_stderr,none\"\ : 0.0032508031761094938,\n \"acc_norm,none\": 0.9207848837209303,\n \ \ \"acc_norm_stderr,none\": 0.0032562704476255767\n },\n \ \ \"bbq_racexgender\": {\n \"alias\": \" - bbq_racexgender\",\n \ \ \"acc,none\": 0.9611528822055138,\n \"acc_stderr,none\": 0.0015295821266427165,\n\ \ \"acc_norm,none\": 0.9608395989974937,\n \"acc_norm_stderr,none\"\ : 0.0015354871080304484\n },\n \"bbq_racexses\": {\n \"\ alias\": \" - bbq_racexses\",\n 
\"acc,none\": 0.9707885304659498,\n \ \ \"acc_stderr,none\": 0.0015941397176377286,\n \"acc_norm,none\"\ : 0.9756272401433692,\n \"acc_norm_stderr,none\": 0.0014597607249481903\n\ \ },\n \"bbq_religion\": {\n \"alias\": \" - bbq_religion\"\ ,\n \"acc,none\": 0.8375,\n \"acc_stderr,none\": 0.01065392165850614,\n\ \ \"acc_norm,none\": 0.835,\n \"acc_norm_stderr,none\": 0.01071952689631095\n\ \ },\n \"bbq_ses\": {\n \"alias\": \" - bbq_ses\",\n \ \ \"acc,none\": 0.9245337995337995,\n \"acc_stderr,none\": 0.003188457551106306,\n\ \ \"acc_norm,none\": 0.9220571095571095,\n \"acc_norm_stderr,none\"\ : 0.00323601230652936\n },\n \"bbq_sexualorientation\": {\n \ \ \"alias\": \" - bbq_sexualorientation\",\n \"acc,none\": 0.9016203703703703,\n\ \ \"acc_stderr,none\": 0.01013815790835306,\n \"acc_norm,none\"\ : 0.9016203703703703,\n \"acc_norm_stderr,none\": 0.01013815790835306\n\ \ },\n \"leaderboard\": {\n \" \": \" \",\n \ \ \"alias\": \"leaderboard\"\n },\n \"leaderboard_bbh\": {\n \ \ \" \": \" \",\n \"alias\": \" - leaderboard_bbh\"\n },\n\ \ \"leaderboard_bbh_boolean_expressions\": {\n \"alias\": \" \ \ - leaderboard_bbh_boolean_expressions\",\n \"acc_norm,none\": 0.8,\n\ \ \"acc_norm_stderr,none\": 0.02534897002097908\n },\n \ \ \"leaderboard_bbh_causal_judgement\": {\n \"alias\": \" - leaderboard_bbh_causal_judgement\"\ ,\n \"acc_norm,none\": 0.6470588235294118,\n \"acc_norm_stderr,none\"\ : 0.03504019983419236\n },\n \"leaderboard_bbh_date_understanding\"\ : {\n \"alias\": \" - leaderboard_bbh_date_understanding\",\n \ \ \"acc_norm,none\": 0.472,\n \"acc_norm_stderr,none\": 0.031636489531544396\n\ \ },\n \"leaderboard_bbh_disambiguation_qa\": {\n \"alias\"\ : \" - leaderboard_bbh_disambiguation_qa\",\n \"acc_norm,none\": 0.68,\n\ \ \"acc_norm_stderr,none\": 0.02956172495524105\n },\n \ \ \"leaderboard_bbh_formal_fallacies\": {\n \"alias\": \" - leaderboard_bbh_formal_fallacies\"\ ,\n \"acc_norm,none\": 0.6,\n \"acc_norm_stderr,none\": 0.03104602102825324\n\ \ },\n 
\"leaderboard_bbh_geometric_shapes\": {\n \"alias\"\ : \" - leaderboard_bbh_geometric_shapes\",\n \"acc_norm,none\": 0.36,\n\ \ \"acc_norm_stderr,none\": 0.030418764025174988\n },\n \ \ \"leaderboard_bbh_hyperbaton\": {\n \"alias\": \" - leaderboard_bbh_hyperbaton\"\ ,\n \"acc_norm,none\": 0.688,\n \"acc_norm_stderr,none\":\ \ 0.029361067575219817\n },\n \"leaderboard_bbh_logical_deduction_five_objects\"\ : {\n \"alias\": \" - leaderboard_bbh_logical_deduction_five_objects\"\ ,\n \"acc_norm,none\": 0.48,\n \"acc_norm_stderr,none\": 0.031660853408495185\n\ \ },\n \"leaderboard_bbh_logical_deduction_seven_objects\": {\n \ \ \"alias\": \" - leaderboard_bbh_logical_deduction_seven_objects\",\n\ \ \"acc_norm,none\": 0.432,\n \"acc_norm_stderr,none\": 0.03139181076542941\n\ \ },\n \"leaderboard_bbh_logical_deduction_three_objects\": {\n \ \ \"alias\": \" - leaderboard_bbh_logical_deduction_three_objects\",\n\ \ \"acc_norm,none\": 0.692,\n \"acc_norm_stderr,none\": 0.029256928606501868\n\ \ },\n \"leaderboard_bbh_movie_recommendation\": {\n \"\ alias\": \" - leaderboard_bbh_movie_recommendation\",\n \"acc_norm,none\"\ : 0.688,\n \"acc_norm_stderr,none\": 0.029361067575219817\n },\n\ \ \"leaderboard_bbh_navigate\": {\n \"alias\": \" - leaderboard_bbh_navigate\"\ ,\n \"acc_norm,none\": 0.604,\n \"acc_norm_stderr,none\":\ \ 0.030993197854577853\n },\n \"leaderboard_bbh_object_counting\"\ : {\n \"alias\": \" - leaderboard_bbh_object_counting\",\n \ \ \"acc_norm,none\": 0.336,\n \"acc_norm_stderr,none\": 0.029933259094191516\n\ \ },\n \"leaderboard_bbh_penguins_in_a_table\": {\n \"\ alias\": \" - leaderboard_bbh_penguins_in_a_table\",\n \"acc_norm,none\"\ : 0.4315068493150685,\n \"acc_norm_stderr,none\": 0.04113130264537192\n\ \ },\n \"leaderboard_bbh_reasoning_about_colored_objects\": {\n \ \ \"alias\": \" - leaderboard_bbh_reasoning_about_colored_objects\",\n\ \ \"acc_norm,none\": 0.548,\n \"acc_norm_stderr,none\": 0.03153986449255663\n\ \ },\n 
\"leaderboard_bbh_ruin_names\": {\n \"alias\": \"\ \ - leaderboard_bbh_ruin_names\",\n \"acc_norm,none\": 0.644,\n \ \ \"acc_norm_stderr,none\": 0.03034368065715322\n },\n \"leaderboard_bbh_salient_translation_error_detection\"\ : {\n \"alias\": \" - leaderboard_bbh_salient_translation_error_detection\"\ ,\n \"acc_norm,none\": 0.468,\n \"acc_norm_stderr,none\":\ \ 0.031621252575725504\n },\n \"leaderboard_bbh_snarks\": {\n \ \ \"alias\": \" - leaderboard_bbh_snarks\",\n \"acc_norm,none\"\ : 0.7247191011235955,\n \"acc_norm_stderr,none\": 0.03357269922538226\n\ \ },\n \"leaderboard_bbh_sports_understanding\": {\n \"\ alias\": \" - leaderboard_bbh_sports_understanding\",\n \"acc_norm,none\"\ : 0.736,\n \"acc_norm_stderr,none\": 0.02793451895769091\n },\n\ \ \"leaderboard_bbh_temporal_sequences\": {\n \"alias\": \" -\ \ leaderboard_bbh_temporal_sequences\",\n \"acc_norm,none\": 0.272,\n\ \ \"acc_norm_stderr,none\": 0.02820008829631\n },\n \"\ leaderboard_bbh_tracking_shuffled_objects_five_objects\": {\n \"alias\"\ : \" - leaderboard_bbh_tracking_shuffled_objects_five_objects\",\n \"\ acc_norm,none\": 0.196,\n \"acc_norm_stderr,none\": 0.02515685731325592\n\ \ },\n \"leaderboard_bbh_tracking_shuffled_objects_seven_objects\"\ : {\n \"alias\": \" - leaderboard_bbh_tracking_shuffled_objects_seven_objects\"\ ,\n \"acc_norm,none\": 0.14,\n \"acc_norm_stderr,none\": 0.021989409645240272\n\ \ },\n \"leaderboard_bbh_tracking_shuffled_objects_three_objects\"\ : {\n \"alias\": \" - leaderboard_bbh_tracking_shuffled_objects_three_objects\"\ ,\n \"acc_norm,none\": 0.268,\n \"acc_norm_stderr,none\":\ \ 0.02806876238252669\n },\n \"leaderboard_bbh_web_of_lies\": {\n\ \ \"alias\": \" - leaderboard_bbh_web_of_lies\",\n \"acc_norm,none\"\ : 0.476,\n \"acc_norm_stderr,none\": 0.03164968895968782\n },\n\ \ \"leaderboard_gpqa\": {\n \" \": \" \",\n \"alias\"\ : \" - leaderboard_gpqa\"\n },\n \"leaderboard_gpqa_diamond\": {\n\ \ \"alias\": \" - leaderboard_gpqa_diamond\",\n 
\"acc_norm,none\"\ : 0.2777777777777778,\n \"acc_norm_stderr,none\": 0.03191178226713547\n\ \ },\n \"leaderboard_gpqa_extended\": {\n \"alias\": \"\ \ - leaderboard_gpqa_extended\",\n \"acc_norm,none\": 0.2948717948717949,\n\ \ \"acc_norm_stderr,none\": 0.01953225605335248\n },\n \ \ \"leaderboard_gpqa_main\": {\n \"alias\": \" - leaderboard_gpqa_main\"\ ,\n \"acc_norm,none\": 0.27901785714285715,\n \"acc_norm_stderr,none\"\ : 0.021214094157265967\n },\n \"leaderboard_ifeval\": {\n \ \ \"alias\": \" - leaderboard_ifeval\",\n \"prompt_level_strict_acc,none\"\ : 0.36414048059149723,\n \"prompt_level_strict_acc_stderr,none\": 0.02070704795859199,\n\ \ \"inst_level_strict_acc,none\": 0.5,\n \"inst_level_strict_acc_stderr,none\"\ : \"N/A\",\n \"prompt_level_loose_acc,none\": 0.4343807763401109,\n \ \ \"prompt_level_loose_acc_stderr,none\": 0.021330473657564727,\n \ \ \"inst_level_loose_acc,none\": 0.5671462829736211,\n \"inst_level_loose_acc_stderr,none\"\ : \"N/A\"\n },\n \"leaderboard_math_hard\": {\n \" \":\ \ \" \",\n \"alias\": \" - leaderboard_math_hard\"\n },\n \ \ \"leaderboard_math_algebra_hard\": {\n \"alias\": \" - leaderboard_math_algebra_hard\"\ ,\n \"exact_match,none\": 0.08143322475570032,\n \"exact_match_stderr,none\"\ : 0.015634913029180096\n },\n \"leaderboard_math_counting_and_prob_hard\"\ : {\n \"alias\": \" - leaderboard_math_counting_and_prob_hard\",\n \ \ \"exact_match,none\": 0.016260162601626018,\n \"exact_match_stderr,none\"\ : 0.011450452676925665\n },\n \"leaderboard_math_geometry_hard\":\ \ {\n \"alias\": \" - leaderboard_math_geometry_hard\",\n \ \ \"exact_match,none\": 0.007575757575757576,\n \"exact_match_stderr,none\"\ : 0.0075757575757575656\n },\n \"leaderboard_math_intermediate_algebra_hard\"\ : {\n \"alias\": \" - leaderboard_math_intermediate_algebra_hard\",\n\ \ \"exact_match,none\": 0.014285714285714285,\n \"exact_match_stderr,none\"\ : 0.007104350893915322\n },\n \"leaderboard_math_num_theory_hard\"\ : {\n \"alias\": \" - 
leaderboard_math_num_theory_hard\",\n \ \ \"exact_match,none\": 0.05844155844155844,\n \"exact_match_stderr,none\"\ : 0.01896438745195783\n },\n \"leaderboard_math_prealgebra_hard\"\ : {\n \"alias\": \" - leaderboard_math_prealgebra_hard\",\n \ \ \"exact_match,none\": 0.11917098445595854,\n \"exact_match_stderr,none\"\ : 0.02338193534812143\n },\n \"leaderboard_math_precalculus_hard\"\ : {\n \"alias\": \" - leaderboard_math_precalculus_hard\",\n \ \ \"exact_match,none\": 0.014814814814814815,\n \"exact_match_stderr,none\"\ : 0.01043649454959436\n },\n \"leaderboard_mmlu_pro\": {\n \ \ \"alias\": \" - leaderboard_mmlu_pro\",\n \"acc,none\": 0.3048537234042553,\n\ \ \"acc_stderr,none\": 0.004196942207232523\n },\n \"leaderboard_musr\"\ : {\n \" \": \" \",\n \"alias\": \" - leaderboard_musr\"\n\ \ },\n \"leaderboard_musr_murder_mysteries\": {\n \"alias\"\ : \" - leaderboard_musr_murder_mysteries\",\n \"acc_norm,none\": 0.568,\n\ \ \"acc_norm_stderr,none\": 0.0313918107654294\n },\n \"\ leaderboard_musr_object_placements\": {\n \"alias\": \" - leaderboard_musr_object_placements\"\ ,\n \"acc_norm,none\": 0.328125,\n \"acc_norm_stderr,none\"\ : 0.029403146715355242\n },\n \"leaderboard_musr_team_allocation\"\ : {\n \"alias\": \" - leaderboard_musr_team_allocation\",\n \ \ \"acc_norm,none\": 0.364,\n \"acc_norm_stderr,none\": 0.030491555220405555\n\ \ },\n \"toxigen\": {\n \"alias\": \"toxigen\",\n \ \ \"acc,none\": 0.5702127659574469,\n \"acc_stderr,none\": 0.016155203301509467,\n\ \ \"acc_norm,none\": 0.5446808510638298,\n \"acc_norm_stderr,none\"\ : 0.016251603395892635\n },\n \"wmdp\": {\n \"acc,none\"\ : 0.5288985823336968,\n \"acc_stderr,none\": 0.008100262166921585,\n\ \ \"alias\": \"wmdp\"\n },\n \"wmdp_bio\": {\n \ \ \"alias\": \" - wmdp_bio\",\n \"acc,none\": 0.6559308719560094,\n\ \ \"acc_stderr,none\": 0.01332012602079775\n },\n \"wmdp_chem\"\ : {\n \"alias\": \" - wmdp_chem\",\n \"acc,none\": 0.49019607843137253,\n\ \ \"acc_stderr,none\": 
0.024779315060043515\n },\n \"wmdp_cyber\"\ : {\n \"alias\": \" - wmdp_cyber\",\n \"acc,none\": 0.4554604932058379,\n\ \ \"acc_stderr,none\": 0.011175074595399846\n },\n \"xstest\"\ : {\n \"alias\": \"xstest\",\n \"acc,none\": 0.4488888888888889,\n\ \ \"acc_stderr,none\": 0.023472850939482037,\n \"acc_norm,none\"\ : 0.4444444444444444,\n \"acc_norm_stderr,none\": 0.023450349399618212\n\ \ }\n },\n \"BeaverTailsEval\": {\n \"alias\": \"BeaverTailsEval\"\ ,\n \"acc,none\": 0.8714285714285714,\n \"acc_stderr,none\": 0.012660461716778634,\n\ \ \"acc_norm,none\": 0.12428571428571429,\n \"acc_norm_stderr,none\"\ : 0.012478237164470317\n },\n \"CDNA\": {\n \"alias\": \"CDNA\",\n\ \ \"acc,none\": 0.9552457813646368,\n \"acc_stderr,none\": 0.003960876492273638,\n\ \ \"acc_norm,none\": 0.001834189288334556,\n \"acc_norm_stderr,none\"\ : 0.0008196721291236438\n },\n \"DTToxicity\": {\n \"alias\": \"DTToxicity\"\ ,\n \"acc,none\": 0.4837228714524207,\n \"acc_stderr,none\": 0.010211440125201749,\n\ \ \"acc_norm,none\": 0.5,\n \"acc_norm_stderr,none\": 0.010216855368051905\n\ \ },\n \"JailbreakHub\": {\n \"alias\": \"JailbreakHub\",\n \ \ \"acc,none\": 0.12450462351387054,\n \"acc_stderr,none\": 0.002683311387044548,\n\ \ \"acc_norm,none\": 0.0939894319682959,\n \"acc_norm_stderr,none\"\ : 0.002371687964555697\n },\n \"SGXSTest\": {\n \"alias\": \"SGXSTest\"\ ,\n \"acc,none\": 0.5,\n \"acc_stderr,none\": 0.0354440602504168,\n\ \ \"acc_norm,none\": 0.5,\n \"acc_norm_stderr,none\": 0.0354440602504168\n\ \ },\n \"SaladBench\": {\n \"alias\": \"SaladBench\",\n \"acc,none\"\ : 0.49505208333333334,\n \"acc_stderr,none\": 0.008069370988058294,\n \ \ \"acc_norm,none\": 0.49505208333333334,\n \"acc_norm_stderr,none\"\ : 0.008069370988058294\n },\n \"StrongREJECT\": {\n \"alias\": \"StrongREJECT\"\ ,\n \"acc,none\": 0.9744408945686901,\n \"acc_stderr,none\": 0.008934562241019864,\n\ \ \"acc_norm,none\": 0.2523961661341853,\n \"acc_norm_stderr,none\"\ : 0.024592339166678388\n },\n 
\"WildGuardTest\": {\n \"alias\": \"\ WildGuardTest\",\n \"acc,none\": 0.6121739130434782,\n \"acc_stderr,none\"\ : 0.011735113323084431,\n \"acc_norm,none\": 0.5617391304347826,\n \ \ \"acc_norm_stderr,none\": 0.011949921603028857\n },\n \"bbq\": {\n \ \ \"acc_norm,none\": 0.9339909731245298,\n \"acc_norm_stderr,none\": 0.0010120925842241903,\n\ \ \"acc,none\": 0.933854202284073,\n \"acc_stderr,none\": 0.001014159063390077,\n\ \ \"alias\": \"bbq\"\n },\n \"bbq_age\": {\n \"alias\": \" -\ \ bbq_age\",\n \"acc,none\": 0.8347826086956521,\n \"acc_stderr,none\"\ : 0.006122794490389976,\n \"acc_norm,none\": 0.8323369565217391,\n \ \ \"acc_norm_stderr,none\": 0.006158903051518932\n },\n \"bbq_disabilitystatus\"\ : {\n \"alias\": \" - bbq_disabilitystatus\",\n \"acc,none\": 0.9113110539845758,\n\ \ \"acc_stderr,none\": 0.007209462202833219,\n \"acc_norm,none\":\ \ 0.9093830334190232,\n \"acc_norm_stderr,none\": 0.0072796916982102436\n\ \ },\n \"bbq_genderidentity\": {\n \"alias\": \" - bbq_genderidentity\"\ ,\n \"acc,none\": 0.9427009873060649,\n \"acc_stderr,none\": 0.0030862473264601695,\n\ \ \"acc_norm,none\": 0.9423483779971791,\n \"acc_norm_stderr,none\"\ : 0.0030951498876854062\n },\n \"bbq_nationality\": {\n \"alias\":\ \ \" - bbq_nationality\",\n \"acc,none\": 0.9194805194805195,\n \"\ acc_stderr,none\": 0.004903621087010461,\n \"acc_norm,none\": 0.9185064935064935,\n\ \ \"acc_norm_stderr,none\": 0.004930577318136959\n },\n \"bbq_physicalappearance\"\ : {\n \"alias\": \" - bbq_physicalappearance\",\n \"acc,none\": 0.8331218274111675,\n\ \ \"acc_stderr,none\": 0.009395366913005541,\n \"acc_norm,none\":\ \ 0.8318527918781726,\n \"acc_norm_stderr,none\": 0.009423837540123783\n\ \ },\n \"bbq_raceethnicity\": {\n \"alias\": \" - bbq_raceethnicity\"\ ,\n \"acc,none\": 0.9210755813953488,\n \"acc_stderr,none\": 0.0032508031761094938,\n\ \ \"acc_norm,none\": 0.9207848837209303,\n \"acc_norm_stderr,none\"\ : 0.0032562704476255767\n },\n \"bbq_racexgender\": {\n 
\"alias\":\ \ \" - bbq_racexgender\",\n \"acc,none\": 0.9611528822055138,\n \"\ acc_stderr,none\": 0.0015295821266427165,\n \"acc_norm,none\": 0.9608395989974937,\n\ \ \"acc_norm_stderr,none\": 0.0015354871080304484\n },\n \"bbq_racexses\"\ : {\n \"alias\": \" - bbq_racexses\",\n \"acc,none\": 0.9707885304659498,\n\ \ \"acc_stderr,none\": 0.0015941397176377286,\n \"acc_norm,none\"\ : 0.9756272401433692,\n \"acc_norm_stderr,none\": 0.0014597607249481903\n\ \ },\n \"bbq_religion\": {\n \"alias\": \" - bbq_religion\",\n \ \ \"acc,none\": 0.8375,\n \"acc_stderr,none\": 0.01065392165850614,\n\ \ \"acc_norm,none\": 0.835,\n \"acc_norm_stderr,none\": 0.01071952689631095\n\ \ },\n \"bbq_ses\": {\n \"alias\": \" - bbq_ses\",\n \"acc,none\"\ : 0.9245337995337995,\n \"acc_stderr,none\": 0.003188457551106306,\n \ \ \"acc_norm,none\": 0.9220571095571095,\n \"acc_norm_stderr,none\": 0.00323601230652936\n\ \ },\n \"bbq_sexualorientation\": {\n \"alias\": \" - bbq_sexualorientation\"\ ,\n \"acc,none\": 0.9016203703703703,\n \"acc_stderr,none\": 0.01013815790835306,\n\ \ \"acc_norm,none\": 0.9016203703703703,\n \"acc_norm_stderr,none\"\ : 0.01013815790835306\n },\n \"leaderboard\": {\n \" \": \" \",\n \ \ \"alias\": \"leaderboard\"\n },\n \"leaderboard_bbh\": {\n \ \ \" \": \" \",\n \"alias\": \" - leaderboard_bbh\"\n },\n \"leaderboard_bbh_boolean_expressions\"\ : {\n \"alias\": \" - leaderboard_bbh_boolean_expressions\",\n \"\ acc_norm,none\": 0.8,\n \"acc_norm_stderr,none\": 0.02534897002097908\n \ \ },\n \"leaderboard_bbh_causal_judgement\": {\n \"alias\": \" - leaderboard_bbh_causal_judgement\"\ ,\n \"acc_norm,none\": 0.6470588235294118,\n \"acc_norm_stderr,none\"\ : 0.03504019983419236\n },\n \"leaderboard_bbh_date_understanding\": {\n \ \ \"alias\": \" - leaderboard_bbh_date_understanding\",\n \"acc_norm,none\"\ : 0.472,\n \"acc_norm_stderr,none\": 0.031636489531544396\n },\n \"\ leaderboard_bbh_disambiguation_qa\": {\n \"alias\": \" - leaderboard_bbh_disambiguation_qa\"\ 
,\n \"acc_norm,none\": 0.68,\n \"acc_norm_stderr,none\": 0.02956172495524105\n\ \ },\n \"leaderboard_bbh_formal_fallacies\": {\n \"alias\": \" - leaderboard_bbh_formal_fallacies\"\ ,\n \"acc_norm,none\": 0.6,\n \"acc_norm_stderr,none\": 0.03104602102825324\n\ \ },\n \"leaderboard_bbh_geometric_shapes\": {\n \"alias\": \" - leaderboard_bbh_geometric_shapes\"\ ,\n \"acc_norm,none\": 0.36,\n \"acc_norm_stderr,none\": 0.030418764025174988\n\ \ },\n \"leaderboard_bbh_hyperbaton\": {\n \"alias\": \" - leaderboard_bbh_hyperbaton\"\ ,\n \"acc_norm,none\": 0.688,\n \"acc_norm_stderr,none\": 0.029361067575219817\n\ \ },\n \"leaderboard_bbh_logical_deduction_five_objects\": {\n \"alias\"\ : \" - leaderboard_bbh_logical_deduction_five_objects\",\n \"acc_norm,none\"\ : 0.48,\n \"acc_norm_stderr,none\": 0.031660853408495185\n },\n \"\ leaderboard_bbh_logical_deduction_seven_objects\": {\n \"alias\": \" - leaderboard_bbh_logical_deduction_seven_objects\"\ ,\n \"acc_norm,none\": 0.432,\n \"acc_norm_stderr,none\": 0.03139181076542941\n\ \ },\n \"leaderboard_bbh_logical_deduction_three_objects\": {\n \"\ alias\": \" - leaderboard_bbh_logical_deduction_three_objects\",\n \"acc_norm,none\"\ : 0.692,\n \"acc_norm_stderr,none\": 0.029256928606501868\n },\n \"\ leaderboard_bbh_movie_recommendation\": {\n \"alias\": \" - leaderboard_bbh_movie_recommendation\"\ ,\n \"acc_norm,none\": 0.688,\n \"acc_norm_stderr,none\": 0.029361067575219817\n\ \ },\n \"leaderboard_bbh_navigate\": {\n \"alias\": \" - leaderboard_bbh_navigate\"\ ,\n \"acc_norm,none\": 0.604,\n \"acc_norm_stderr,none\": 0.030993197854577853\n\ \ },\n \"leaderboard_bbh_object_counting\": {\n \"alias\": \" - leaderboard_bbh_object_counting\"\ ,\n \"acc_norm,none\": 0.336,\n \"acc_norm_stderr,none\": 0.029933259094191516\n\ \ },\n \"leaderboard_bbh_penguins_in_a_table\": {\n \"alias\": \" \ \ - leaderboard_bbh_penguins_in_a_table\",\n \"acc_norm,none\": 0.4315068493150685,\n\ \ \"acc_norm_stderr,none\": 0.04113130264537192\n },\n 
\"leaderboard_bbh_reasoning_about_colored_objects\"\ : {\n \"alias\": \" - leaderboard_bbh_reasoning_about_colored_objects\"\ ,\n \"acc_norm,none\": 0.548,\n \"acc_norm_stderr,none\": 0.03153986449255663\n\ \ },\n \"leaderboard_bbh_ruin_names\": {\n \"alias\": \" - leaderboard_bbh_ruin_names\"\ ,\n \"acc_norm,none\": 0.644,\n \"acc_norm_stderr,none\": 0.03034368065715322\n\ \ },\n \"leaderboard_bbh_salient_translation_error_detection\": {\n \ \ \"alias\": \" - leaderboard_bbh_salient_translation_error_detection\",\n \ \ \"acc_norm,none\": 0.468,\n \"acc_norm_stderr,none\": 0.031621252575725504\n\ \ },\n \"leaderboard_bbh_snarks\": {\n \"alias\": \" - leaderboard_bbh_snarks\"\ ,\n \"acc_norm,none\": 0.7247191011235955,\n \"acc_norm_stderr,none\"\ : 0.03357269922538226\n },\n \"leaderboard_bbh_sports_understanding\": {\n\ \ \"alias\": \" - leaderboard_bbh_sports_understanding\",\n \"acc_norm,none\"\ : 0.736,\n \"acc_norm_stderr,none\": 0.02793451895769091\n },\n \"\ leaderboard_bbh_temporal_sequences\": {\n \"alias\": \" - leaderboard_bbh_temporal_sequences\"\ ,\n \"acc_norm,none\": 0.272,\n \"acc_norm_stderr,none\": 0.02820008829631\n\ \ },\n \"leaderboard_bbh_tracking_shuffled_objects_five_objects\": {\n \ \ \"alias\": \" - leaderboard_bbh_tracking_shuffled_objects_five_objects\"\ ,\n \"acc_norm,none\": 0.196,\n \"acc_norm_stderr,none\": 0.02515685731325592\n\ \ },\n \"leaderboard_bbh_tracking_shuffled_objects_seven_objects\": {\n \ \ \"alias\": \" - leaderboard_bbh_tracking_shuffled_objects_seven_objects\"\ ,\n \"acc_norm,none\": 0.14,\n \"acc_norm_stderr,none\": 0.021989409645240272\n\ \ },\n \"leaderboard_bbh_tracking_shuffled_objects_three_objects\": {\n \ \ \"alias\": \" - leaderboard_bbh_tracking_shuffled_objects_three_objects\"\ ,\n \"acc_norm,none\": 0.268,\n \"acc_norm_stderr,none\": 0.02806876238252669\n\ \ },\n \"leaderboard_bbh_web_of_lies\": {\n \"alias\": \" - leaderboard_bbh_web_of_lies\"\ ,\n \"acc_norm,none\": 0.476,\n \"acc_norm_stderr,none\": 
0.03164968895968782\n\ \ },\n \"leaderboard_gpqa\": {\n \" \": \" \",\n \"alias\":\ \ \" - leaderboard_gpqa\"\n },\n \"leaderboard_gpqa_diamond\": {\n \ \ \"alias\": \" - leaderboard_gpqa_diamond\",\n \"acc_norm,none\": 0.2777777777777778,\n\ \ \"acc_norm_stderr,none\": 0.03191178226713547\n },\n \"leaderboard_gpqa_extended\"\ : {\n \"alias\": \" - leaderboard_gpqa_extended\",\n \"acc_norm,none\"\ : 0.2948717948717949,\n \"acc_norm_stderr,none\": 0.01953225605335248\n \ \ },\n \"leaderboard_gpqa_main\": {\n \"alias\": \" - leaderboard_gpqa_main\"\ ,\n \"acc_norm,none\": 0.27901785714285715,\n \"acc_norm_stderr,none\"\ : 0.021214094157265967\n },\n \"leaderboard_ifeval\": {\n \"alias\"\ : \" - leaderboard_ifeval\",\n \"prompt_level_strict_acc,none\": 0.36414048059149723,\n\ \ \"prompt_level_strict_acc_stderr,none\": 0.02070704795859199,\n \ \ \"inst_level_strict_acc,none\": 0.5,\n \"inst_level_strict_acc_stderr,none\"\ : \"N/A\",\n \"prompt_level_loose_acc,none\": 0.4343807763401109,\n \ \ \"prompt_level_loose_acc_stderr,none\": 0.021330473657564727,\n \"inst_level_loose_acc,none\"\ : 0.5671462829736211,\n \"inst_level_loose_acc_stderr,none\": \"N/A\"\n \ \ },\n \"leaderboard_math_hard\": {\n \" \": \" \",\n \"alias\"\ : \" - leaderboard_math_hard\"\n },\n \"leaderboard_math_algebra_hard\": {\n\ \ \"alias\": \" - leaderboard_math_algebra_hard\",\n \"exact_match,none\"\ : 0.08143322475570032,\n \"exact_match_stderr,none\": 0.015634913029180096\n\ \ },\n \"leaderboard_math_counting_and_prob_hard\": {\n \"alias\":\ \ \" - leaderboard_math_counting_and_prob_hard\",\n \"exact_match,none\"\ : 0.016260162601626018,\n \"exact_match_stderr,none\": 0.011450452676925665\n\ \ },\n \"leaderboard_math_geometry_hard\": {\n \"alias\": \" - leaderboard_math_geometry_hard\"\ ,\n \"exact_match,none\": 0.007575757575757576,\n \"exact_match_stderr,none\"\ : 0.0075757575757575656\n },\n \"leaderboard_math_intermediate_algebra_hard\"\ : {\n \"alias\": \" - 
leaderboard_math_intermediate_algebra_hard\",\n \ \ \"exact_match,none\": 0.014285714285714285,\n \"exact_match_stderr,none\"\ : 0.007104350893915322\n },\n \"leaderboard_math_num_theory_hard\": {\n \ \ \"alias\": \" - leaderboard_math_num_theory_hard\",\n \"exact_match,none\"\ : 0.05844155844155844,\n \"exact_match_stderr,none\": 0.01896438745195783\n\ \ },\n \"leaderboard_math_prealgebra_hard\": {\n \"alias\": \" - leaderboard_math_prealgebra_hard\"\ ,\n \"exact_match,none\": 0.11917098445595854,\n \"exact_match_stderr,none\"\ : 0.02338193534812143\n },\n \"leaderboard_math_precalculus_hard\": {\n \ \ \"alias\": \" - leaderboard_math_precalculus_hard\",\n \"exact_match,none\"\ : 0.014814814814814815,\n \"exact_match_stderr,none\": 0.01043649454959436\n\ \ },\n \"leaderboard_mmlu_pro\": {\n \"alias\": \" - leaderboard_mmlu_pro\"\ ,\n \"acc,none\": 0.3048537234042553,\n \"acc_stderr,none\": 0.004196942207232523\n\ \ },\n \"leaderboard_musr\": {\n \" \": \" \",\n \"alias\":\ \ \" - leaderboard_musr\"\n },\n \"leaderboard_musr_murder_mysteries\": {\n\ \ \"alias\": \" - leaderboard_musr_murder_mysteries\",\n \"acc_norm,none\"\ : 0.568,\n \"acc_norm_stderr,none\": 0.0313918107654294\n },\n \"leaderboard_musr_object_placements\"\ : {\n \"alias\": \" - leaderboard_musr_object_placements\",\n \"\ acc_norm,none\": 0.328125,\n \"acc_norm_stderr,none\": 0.029403146715355242\n\ \ },\n \"leaderboard_musr_team_allocation\": {\n \"alias\": \" - leaderboard_musr_team_allocation\"\ ,\n \"acc_norm,none\": 0.364,\n \"acc_norm_stderr,none\": 0.030491555220405555\n\ \ },\n \"toxigen\": {\n \"alias\": \"toxigen\",\n \"acc,none\"\ : 0.5702127659574469,\n \"acc_stderr,none\": 0.016155203301509467,\n \ \ \"acc_norm,none\": 0.5446808510638298,\n \"acc_norm_stderr,none\": 0.016251603395892635\n\ \ },\n \"wmdp\": {\n \"acc,none\": 0.5288985823336968,\n \"\ acc_stderr,none\": 0.008100262166921585,\n \"alias\": \"wmdp\"\n },\n\ \ \"wmdp_bio\": {\n \"alias\": \" - wmdp_bio\",\n \"acc,none\"\ : 
0.6559308719560094,\n \"acc_stderr,none\": 0.01332012602079775\n },\n\ \ \"wmdp_chem\": {\n \"alias\": \" - wmdp_chem\",\n \"acc,none\"\ : 0.49019607843137253,\n \"acc_stderr,none\": 0.024779315060043515\n },\n\ \ \"wmdp_cyber\": {\n \"alias\": \" - wmdp_cyber\",\n \"acc,none\"\ : 0.4554604932058379,\n \"acc_stderr,none\": 0.011175074595399846\n },\n\ \ \"xstest\": {\n \"alias\": \"xstest\",\n \"acc,none\": 0.4488888888888889,\n\ \ \"acc_stderr,none\": 0.023472850939482037,\n \"acc_norm,none\":\ \ 0.4444444444444444,\n \"acc_norm_stderr,none\": 0.023450349399618212\n\ \ }\n}\n```" repo_url: https://huggingface.co/yunconglong/DARE_TIES_13B leaderboard_url: '' point_of_contact: '' configs: - config_name: yunconglong__DARE_TIES_13B__BeaverTailsEval data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_BeaverTailsEval_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_BeaverTailsEval_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__CDNA data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_CDNA_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_CDNA_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__DTToxicity data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_DTToxicity_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_DTToxicity_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__JailbreakHub data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_JailbreakHub_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_JailbreakHub_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__SGXSTest data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_SGXSTest_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_SGXSTest_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__SaladBench data_files: 
- split: 2024_12_04T20_37_46.218361 path: - '**/samples_SaladBench_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_SaladBench_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__StrongREJECT data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_StrongREJECT_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_StrongREJECT_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__WildGuardTest data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_WildGuardTest_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_WildGuardTest_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_age data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_age_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_age_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_disabilitystatus data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_disabilitystatus_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_disabilitystatus_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_genderidentity data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_genderidentity_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_genderidentity_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_nationality data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_nationality_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_nationality_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_physicalappearance data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_physicalappearance_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - 
'**/samples_bbq_physicalappearance_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_raceethnicity data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_raceethnicity_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_raceethnicity_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_racexgender data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_racexgender_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_racexgender_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_racexses data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_racexses_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_racexses_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_religion data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_religion_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_religion_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_ses data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_ses_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_ses_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__bbq_sexualorientation data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_bbq_sexualorientation_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_bbq_sexualorientation_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_boolean_expressions data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_boolean_expressions_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_boolean_expressions_2024-12-04T20-37-46.218361.jsonl' - config_name: 
yunconglong__DARE_TIES_13B__leaderboard_bbh_causal_judgement data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_causal_judgement_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_causal_judgement_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_date_understanding data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_date_understanding_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_date_understanding_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_disambiguation_qa data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_disambiguation_qa_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_disambiguation_qa_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_formal_fallacies data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_formal_fallacies_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_formal_fallacies_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_geometric_shapes data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_geometric_shapes_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_geometric_shapes_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_hyperbaton data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_hyperbaton_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_hyperbaton_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_logical_deduction_five_objects data_files: - split: 2024_12_04T20_37_46.218361 
path: - '**/samples_leaderboard_bbh_logical_deduction_five_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_logical_deduction_five_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_logical_deduction_seven_objects data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_logical_deduction_seven_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_logical_deduction_seven_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_logical_deduction_three_objects data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_logical_deduction_three_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_logical_deduction_three_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_movie_recommendation data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_movie_recommendation_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_movie_recommendation_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_navigate data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_navigate_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_navigate_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_object_counting data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_object_counting_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_object_counting_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_penguins_in_a_table data_files: - split: 2024_12_04T20_37_46.218361 path: - 
'**/samples_leaderboard_bbh_penguins_in_a_table_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_penguins_in_a_table_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_reasoning_about_colored_objects data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_reasoning_about_colored_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_reasoning_about_colored_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_ruin_names data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_ruin_names_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_ruin_names_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_salient_translation_error_detection data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_salient_translation_error_detection_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_salient_translation_error_detection_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_snarks data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_snarks_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_snarks_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_sports_understanding data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_sports_understanding_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_sports_understanding_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_temporal_sequences data_files: - split: 2024_12_04T20_37_46.218361 path: - 
'**/samples_leaderboard_bbh_temporal_sequences_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_temporal_sequences_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_tracking_shuffled_objects_five_objects data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_tracking_shuffled_objects_five_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_tracking_shuffled_objects_five_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_tracking_shuffled_objects_seven_objects data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_tracking_shuffled_objects_seven_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_tracking_shuffled_objects_seven_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_tracking_shuffled_objects_three_objects data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_tracking_shuffled_objects_three_objects_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_tracking_shuffled_objects_three_objects_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_bbh_web_of_lies data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_bbh_web_of_lies_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_bbh_web_of_lies_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_gpqa_diamond data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_gpqa_diamond_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_gpqa_diamond_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_gpqa_extended 
data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_gpqa_extended_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_gpqa_extended_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_gpqa_main data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_gpqa_main_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_gpqa_main_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_ifeval data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_ifeval_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_ifeval_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_algebra_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_algebra_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_math_algebra_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_counting_and_prob_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_counting_and_prob_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_math_counting_and_prob_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_geometry_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_geometry_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_math_geometry_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_intermediate_algebra_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_intermediate_algebra_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - 
'**/samples_leaderboard_math_intermediate_algebra_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_num_theory_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_num_theory_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_math_num_theory_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_prealgebra_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_prealgebra_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_math_prealgebra_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_math_precalculus_hard data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_math_precalculus_hard_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_math_precalculus_hard_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_mmlu_pro data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_mmlu_pro_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_mmlu_pro_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_musr_murder_mysteries data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_musr_murder_mysteries_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_musr_murder_mysteries_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__leaderboard_musr_object_placements data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_musr_object_placements_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_musr_object_placements_2024-12-04T20-37-46.218361.jsonl' - config_name: 
yunconglong__DARE_TIES_13B__leaderboard_musr_team_allocation data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_leaderboard_musr_team_allocation_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_leaderboard_musr_team_allocation_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__toxigen data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_toxigen_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_toxigen_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__wmdp_bio data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_wmdp_bio_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_wmdp_bio_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__wmdp_chem data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_wmdp_chem_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_wmdp_chem_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__wmdp_cyber data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_wmdp_cyber_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_wmdp_cyber_2024-12-04T20-37-46.218361.jsonl' - config_name: yunconglong__DARE_TIES_13B__xstest data_files: - split: 2024_12_04T20_37_46.218361 path: - '**/samples_xstest_2024-12-04T20-37-46.218361.jsonl' - split: latest path: - '**/samples_xstest_2024-12-04T20-37-46.218361.jsonl'
---

# Dataset Card for Evaluation run of yunconglong/DARE_TIES_13B

<!-- Provide a quick summary of the dataset. -->

Dataset automatically created during the evaluation run of model [yunconglong/DARE_TIES_13B](https://huggingface.co/yunconglong/DARE_TIES_13B).

The dataset is composed of 62 configurations, each one corresponding to one of the evaluated tasks.

The dataset has been created from 2 runs.
Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "latest" split always points to the most recent results.

An additional configuration "results" stores all the aggregated results of the run.

To load the details from a run, you can for instance do the following:

```python
from datasets import load_dataset

data = load_dataset(
    "nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private",
    name="yunconglong__DARE_TIES_13B__BeaverTailsEval",
    split="latest",
)
```

## Latest results

These are the [latest results from run 2024-12-04T20-37-46.218361](https://huggingface.co/datasets/nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private/blob/main/yunconglong/DARE_TIES_13B/results_2024-12-04T20-37-46.218361.json). Note that the repository may contain results for other tasks if successive evaluations did not cover the same tasks; each one can be found in the "results" configuration and in the "latest" split of its eval:

```python
{ "all": { "BeaverTailsEval": { "alias": "BeaverTailsEval", "acc,none": 0.8714285714285714, "acc_stderr,none": 0.012660461716778634, "acc_norm,none": 0.12428571428571429, "acc_norm_stderr,none": 0.012478237164470317 }, "CDNA": { "alias": "CDNA", "acc,none": 0.9552457813646368, "acc_stderr,none": 0.003960876492273638, "acc_norm,none": 0.001834189288334556, "acc_norm_stderr,none": 0.0008196721291236438 }, "DTToxicity": { "alias": "DTToxicity", "acc,none": 0.4837228714524207, "acc_stderr,none": 0.010211440125201749, "acc_norm,none": 0.5, "acc_norm_stderr,none": 0.010216855368051905 }, "JailbreakHub": { "alias": "JailbreakHub", "acc,none": 0.12450462351387054, "acc_stderr,none": 0.002683311387044548, "acc_norm,none": 0.0939894319682959, "acc_norm_stderr,none": 0.002371687964555697 }, "SGXSTest": { "alias": "SGXSTest", "acc,none": 0.5, "acc_stderr,none": 0.0354440602504168, "acc_norm,none": 0.5, "acc_norm_stderr,none": 0.0354440602504168 }, "SaladBench": { "alias": "SaladBench", "acc,none":
0.49505208333333334, "acc_stderr,none": 0.008069370988058294, "acc_norm,none": 0.49505208333333334, "acc_norm_stderr,none": 0.008069370988058294 }, "StrongREJECT": { "alias": "StrongREJECT", "acc,none": 0.9744408945686901, "acc_stderr,none": 0.008934562241019864, "acc_norm,none": 0.2523961661341853, "acc_norm_stderr,none": 0.024592339166678388 }, "WildGuardTest": { "alias": "WildGuardTest", "acc,none": 0.6121739130434782, "acc_stderr,none": 0.011735113323084431, "acc_norm,none": 0.5617391304347826, "acc_norm_stderr,none": 0.011949921603028857 }, "bbq": { "acc_norm,none": 0.9339909731245298, "acc_norm_stderr,none": 0.0010120925842241903, "acc,none": 0.933854202284073, "acc_stderr,none": 0.001014159063390077, "alias": "bbq" }, "bbq_age": { "alias": " - bbq_age", "acc,none": 0.8347826086956521, "acc_stderr,none": 0.006122794490389976, "acc_norm,none": 0.8323369565217391, "acc_norm_stderr,none": 0.006158903051518932 }, "bbq_disabilitystatus": { "alias": " - bbq_disabilitystatus", "acc,none": 0.9113110539845758, "acc_stderr,none": 0.007209462202833219, "acc_norm,none": 0.9093830334190232, "acc_norm_stderr,none": 0.0072796916982102436 }, "bbq_genderidentity": { "alias": " - bbq_genderidentity", "acc,none": 0.9427009873060649, "acc_stderr,none": 0.0030862473264601695, "acc_norm,none": 0.9423483779971791, "acc_norm_stderr,none": 0.0030951498876854062 }, "bbq_nationality": { "alias": " - bbq_nationality", "acc,none": 0.9194805194805195, "acc_stderr,none": 0.004903621087010461, "acc_norm,none": 0.9185064935064935, "acc_norm_stderr,none": 0.004930577318136959 }, "bbq_physicalappearance": { "alias": " - bbq_physicalappearance", "acc,none": 0.8331218274111675, "acc_stderr,none": 0.009395366913005541, "acc_norm,none": 0.8318527918781726, "acc_norm_stderr,none": 0.009423837540123783 }, "bbq_raceethnicity": { "alias": " - bbq_raceethnicity", "acc,none": 0.9210755813953488, "acc_stderr,none": 0.0032508031761094938, "acc_norm,none": 0.9207848837209303, "acc_norm_stderr,none": 
0.0032562704476255767 }, "bbq_racexgender": { "alias": " - bbq_racexgender", "acc,none": 0.9611528822055138, "acc_stderr,none": 0.0015295821266427165, "acc_norm,none": 0.9608395989974937, "acc_norm_stderr,none": 0.0015354871080304484 }, "bbq_racexses": { "alias": " - bbq_racexses", "acc,none": 0.9707885304659498, "acc_stderr,none": 0.0015941397176377286, "acc_norm,none": 0.9756272401433692, "acc_norm_stderr,none": 0.0014597607249481903 }, "bbq_religion": { "alias": " - bbq_religion", "acc,none": 0.8375, "acc_stderr,none": 0.01065392165850614, "acc_norm,none": 0.835, "acc_norm_stderr,none": 0.01071952689631095 }, "bbq_ses": { "alias": " - bbq_ses", "acc,none": 0.9245337995337995, "acc_stderr,none": 0.003188457551106306, "acc_norm,none": 0.9220571095571095, "acc_norm_stderr,none": 0.00323601230652936 }, "bbq_sexualorientation": { "alias": " - bbq_sexualorientation", "acc,none": 0.9016203703703703, "acc_stderr,none": 0.01013815790835306, "acc_norm,none": 0.9016203703703703, "acc_norm_stderr,none": 0.01013815790835306 }, "leaderboard": { " ": " ", "alias": "leaderboard" }, "leaderboard_bbh": { " ": " ", "alias": " - leaderboard_bbh" }, "leaderboard_bbh_boolean_expressions": { "alias": " - leaderboard_bbh_boolean_expressions", "acc_norm,none": 0.8, "acc_norm_stderr,none": 0.02534897002097908 }, "leaderboard_bbh_causal_judgement": { "alias": " - leaderboard_bbh_causal_judgement", "acc_norm,none": 0.6470588235294118, "acc_norm_stderr,none": 0.03504019983419236 }, "leaderboard_bbh_date_understanding": { "alias": " - leaderboard_bbh_date_understanding", "acc_norm,none": 0.472, "acc_norm_stderr,none": 0.031636489531544396 }, "leaderboard_bbh_disambiguation_qa": { "alias": " - leaderboard_bbh_disambiguation_qa", "acc_norm,none": 0.68, "acc_norm_stderr,none": 0.02956172495524105 }, "leaderboard_bbh_formal_fallacies": { "alias": " - leaderboard_bbh_formal_fallacies", "acc_norm,none": 0.6, "acc_norm_stderr,none": 0.03104602102825324 }, "leaderboard_bbh_geometric_shapes": { 
"alias": " - leaderboard_bbh_geometric_shapes", "acc_norm,none": 0.36, "acc_norm_stderr,none": 0.030418764025174988 }, "leaderboard_bbh_hyperbaton": { "alias": " - leaderboard_bbh_hyperbaton", "acc_norm,none": 0.688, "acc_norm_stderr,none": 0.029361067575219817 }, "leaderboard_bbh_logical_deduction_five_objects": { "alias": " - leaderboard_bbh_logical_deduction_five_objects", "acc_norm,none": 0.48, "acc_norm_stderr,none": 0.031660853408495185 }, "leaderboard_bbh_logical_deduction_seven_objects": { "alias": " - leaderboard_bbh_logical_deduction_seven_objects", "acc_norm,none": 0.432, "acc_norm_stderr,none": 0.03139181076542941 }, "leaderboard_bbh_logical_deduction_three_objects": { "alias": " - leaderboard_bbh_logical_deduction_three_objects", "acc_norm,none": 0.692, "acc_norm_stderr,none": 0.029256928606501868 }, "leaderboard_bbh_movie_recommendation": { "alias": " - leaderboard_bbh_movie_recommendation", "acc_norm,none": 0.688, "acc_norm_stderr,none": 0.029361067575219817 }, "leaderboard_bbh_navigate": { "alias": " - leaderboard_bbh_navigate", "acc_norm,none": 0.604, "acc_norm_stderr,none": 0.030993197854577853 }, "leaderboard_bbh_object_counting": { "alias": " - leaderboard_bbh_object_counting", "acc_norm,none": 0.336, "acc_norm_stderr,none": 0.029933259094191516 }, "leaderboard_bbh_penguins_in_a_table": { "alias": " - leaderboard_bbh_penguins_in_a_table", "acc_norm,none": 0.4315068493150685, "acc_norm_stderr,none": 0.04113130264537192 }, "leaderboard_bbh_reasoning_about_colored_objects": { "alias": " - leaderboard_bbh_reasoning_about_colored_objects", "acc_norm,none": 0.548, "acc_norm_stderr,none": 0.03153986449255663 }, "leaderboard_bbh_ruin_names": { "alias": " - leaderboard_bbh_ruin_names", "acc_norm,none": 0.644, "acc_norm_stderr,none": 0.03034368065715322 }, "leaderboard_bbh_salient_translation_error_detection": { "alias": " - leaderboard_bbh_salient_translation_error_detection", "acc_norm,none": 0.468, "acc_norm_stderr,none": 0.031621252575725504 }, 
"leaderboard_bbh_snarks": { "alias": " - leaderboard_bbh_snarks", "acc_norm,none": 0.7247191011235955, "acc_norm_stderr,none": 0.03357269922538226 }, "leaderboard_bbh_sports_understanding": { "alias": " - leaderboard_bbh_sports_understanding", "acc_norm,none": 0.736, "acc_norm_stderr,none": 0.02793451895769091 }, "leaderboard_bbh_temporal_sequences": { "alias": " - leaderboard_bbh_temporal_sequences", "acc_norm,none": 0.272, "acc_norm_stderr,none": 0.02820008829631 }, "leaderboard_bbh_tracking_shuffled_objects_five_objects": { "alias": " - leaderboard_bbh_tracking_shuffled_objects_five_objects", "acc_norm,none": 0.196, "acc_norm_stderr,none": 0.02515685731325592 }, "leaderboard_bbh_tracking_shuffled_objects_seven_objects": { "alias": " - leaderboard_bbh_tracking_shuffled_objects_seven_objects", "acc_norm,none": 0.14, "acc_norm_stderr,none": 0.021989409645240272 }, "leaderboard_bbh_tracking_shuffled_objects_three_objects": { "alias": " - leaderboard_bbh_tracking_shuffled_objects_three_objects", "acc_norm,none": 0.268, "acc_norm_stderr,none": 0.02806876238252669 }, "leaderboard_bbh_web_of_lies": { "alias": " - leaderboard_bbh_web_of_lies", "acc_norm,none": 0.476, "acc_norm_stderr,none": 0.03164968895968782 }, "leaderboard_gpqa": { " ": " ", "alias": " - leaderboard_gpqa" }, "leaderboard_gpqa_diamond": { "alias": " - leaderboard_gpqa_diamond", "acc_norm,none": 0.2777777777777778, "acc_norm_stderr,none": 0.03191178226713547 }, "leaderboard_gpqa_extended": { "alias": " - leaderboard_gpqa_extended", "acc_norm,none": 0.2948717948717949, "acc_norm_stderr,none": 0.01953225605335248 }, "leaderboard_gpqa_main": { "alias": " - leaderboard_gpqa_main", "acc_norm,none": 0.27901785714285715, "acc_norm_stderr,none": 0.021214094157265967 }, "leaderboard_ifeval": { "alias": " - leaderboard_ifeval", "prompt_level_strict_acc,none": 0.36414048059149723, "prompt_level_strict_acc_stderr,none": 0.02070704795859199, "inst_level_strict_acc,none": 0.5, "inst_level_strict_acc_stderr,none": 
"N/A", "prompt_level_loose_acc,none": 0.4343807763401109, "prompt_level_loose_acc_stderr,none": 0.021330473657564727, "inst_level_loose_acc,none": 0.5671462829736211, "inst_level_loose_acc_stderr,none": "N/A" }, "leaderboard_math_hard": { " ": " ", "alias": " - leaderboard_math_hard" }, "leaderboard_math_algebra_hard": { "alias": " - leaderboard_math_algebra_hard", "exact_match,none": 0.08143322475570032, "exact_match_stderr,none": 0.015634913029180096 }, "leaderboard_math_counting_and_prob_hard": { "alias": " - leaderboard_math_counting_and_prob_hard", "exact_match,none": 0.016260162601626018, "exact_match_stderr,none": 0.011450452676925665 }, "leaderboard_math_geometry_hard": { "alias": " - leaderboard_math_geometry_hard", "exact_match,none": 0.007575757575757576, "exact_match_stderr,none": 0.0075757575757575656 }, "leaderboard_math_intermediate_algebra_hard": { "alias": " - leaderboard_math_intermediate_algebra_hard", "exact_match,none": 0.014285714285714285, "exact_match_stderr,none": 0.007104350893915322 }, "leaderboard_math_num_theory_hard": { "alias": " - leaderboard_math_num_theory_hard", "exact_match,none": 0.05844155844155844, "exact_match_stderr,none": 0.01896438745195783 }, "leaderboard_math_prealgebra_hard": { "alias": " - leaderboard_math_prealgebra_hard", "exact_match,none": 0.11917098445595854, "exact_match_stderr,none": 0.02338193534812143 }, "leaderboard_math_precalculus_hard": { "alias": " - leaderboard_math_precalculus_hard", "exact_match,none": 0.014814814814814815, "exact_match_stderr,none": 0.01043649454959436 }, "leaderboard_mmlu_pro": { "alias": " - leaderboard_mmlu_pro", "acc,none": 0.3048537234042553, "acc_stderr,none": 0.004196942207232523 }, "leaderboard_musr": { " ": " ", "alias": " - leaderboard_musr" }, "leaderboard_musr_murder_mysteries": { "alias": " - leaderboard_musr_murder_mysteries", "acc_norm,none": 0.568, "acc_norm_stderr,none": 0.0313918107654294 }, "leaderboard_musr_object_placements": { "alias": " - 
leaderboard_musr_object_placements", "acc_norm,none": 0.328125, "acc_norm_stderr,none": 0.029403146715355242 }, "leaderboard_musr_team_allocation": { "alias": " - leaderboard_musr_team_allocation", "acc_norm,none": 0.364, "acc_norm_stderr,none": 0.030491555220405555 }, "toxigen": { "alias": "toxigen", "acc,none": 0.5702127659574469, "acc_stderr,none": 0.016155203301509467, "acc_norm,none": 0.5446808510638298, "acc_norm_stderr,none": 0.016251603395892635 }, "wmdp": { "acc,none": 0.5288985823336968, "acc_stderr,none": 0.008100262166921585, "alias": "wmdp" }, "wmdp_bio": { "alias": " - wmdp_bio", "acc,none": 0.6559308719560094, "acc_stderr,none": 0.01332012602079775 }, "wmdp_chem": { "alias": " - wmdp_chem", "acc,none": 0.49019607843137253, "acc_stderr,none": 0.024779315060043515 }, "wmdp_cyber": { "alias": " - wmdp_cyber", "acc,none": 0.4554604932058379, "acc_stderr,none": 0.011175074595399846 }, "xstest": { "alias": "xstest", "acc,none": 0.4488888888888889, "acc_stderr,none": 0.023472850939482037, "acc_norm,none": 0.4444444444444444, "acc_norm_stderr,none": 0.023450349399618212 } }, "BeaverTailsEval": { "alias": "BeaverTailsEval", "acc,none": 0.8714285714285714, "acc_stderr,none": 0.012660461716778634, "acc_norm,none": 0.12428571428571429, "acc_norm_stderr,none": 0.012478237164470317 }, "CDNA": { "alias": "CDNA", "acc,none": 0.9552457813646368, "acc_stderr,none": 0.003960876492273638, "acc_norm,none": 0.001834189288334556, "acc_norm_stderr,none": 0.0008196721291236438 }, "DTToxicity": { "alias": "DTToxicity", "acc,none": 0.4837228714524207, "acc_stderr,none": 0.010211440125201749, "acc_norm,none": 0.5, "acc_norm_stderr,none": 0.010216855368051905 }, "JailbreakHub": { "alias": "JailbreakHub", "acc,none": 0.12450462351387054, "acc_stderr,none": 0.002683311387044548, "acc_norm,none": 0.0939894319682959, "acc_norm_stderr,none": 0.002371687964555697 }, "SGXSTest": { "alias": "SGXSTest", "acc,none": 0.5, "acc_stderr,none": 0.0354440602504168, "acc_norm,none": 0.5, 
"acc_norm_stderr,none": 0.0354440602504168 }, "SaladBench": { "alias": "SaladBench", "acc,none": 0.49505208333333334, "acc_stderr,none": 0.008069370988058294, "acc_norm,none": 0.49505208333333334, "acc_norm_stderr,none": 0.008069370988058294 }, "StrongREJECT": { "alias": "StrongREJECT", "acc,none": 0.9744408945686901, "acc_stderr,none": 0.008934562241019864, "acc_norm,none": 0.2523961661341853, "acc_norm_stderr,none": 0.024592339166678388 }, "WildGuardTest": { "alias": "WildGuardTest", "acc,none": 0.6121739130434782, "acc_stderr,none": 0.011735113323084431, "acc_norm,none": 0.5617391304347826, "acc_norm_stderr,none": 0.011949921603028857 }, "bbq": { "acc_norm,none": 0.9339909731245298, "acc_norm_stderr,none": 0.0010120925842241903, "acc,none": 0.933854202284073, "acc_stderr,none": 0.001014159063390077, "alias": "bbq" }, "bbq_age": { "alias": " - bbq_age", "acc,none": 0.8347826086956521, "acc_stderr,none": 0.006122794490389976, "acc_norm,none": 0.8323369565217391, "acc_norm_stderr,none": 0.006158903051518932 }, "bbq_disabilitystatus": { "alias": " - bbq_disabilitystatus", "acc,none": 0.9113110539845758, "acc_stderr,none": 0.007209462202833219, "acc_norm,none": 0.9093830334190232, "acc_norm_stderr,none": 0.0072796916982102436 }, "bbq_genderidentity": { "alias": " - bbq_genderidentity", "acc,none": 0.9427009873060649, "acc_stderr,none": 0.0030862473264601695, "acc_norm,none": 0.9423483779971791, "acc_norm_stderr,none": 0.0030951498876854062 }, "bbq_nationality": { "alias": " - bbq_nationality", "acc,none": 0.9194805194805195, "acc_stderr,none": 0.004903621087010461, "acc_norm,none": 0.9185064935064935, "acc_norm_stderr,none": 0.004930577318136959 }, "bbq_physicalappearance": { "alias": " - bbq_physicalappearance", "acc,none": 0.8331218274111675, "acc_stderr,none": 0.009395366913005541, "acc_norm,none": 0.8318527918781726, "acc_norm_stderr,none": 0.009423837540123783 }, "bbq_raceethnicity": { "alias": " - bbq_raceethnicity", "acc,none": 0.9210755813953488, 
"acc_stderr,none": 0.0032508031761094938, "acc_norm,none": 0.9207848837209303, "acc_norm_stderr,none": 0.0032562704476255767 }, "bbq_racexgender": { "alias": " - bbq_racexgender", "acc,none": 0.9611528822055138, "acc_stderr,none": 0.0015295821266427165, "acc_norm,none": 0.9608395989974937, "acc_norm_stderr,none": 0.0015354871080304484 }, "bbq_racexses": { "alias": " - bbq_racexses", "acc,none": 0.9707885304659498, "acc_stderr,none": 0.0015941397176377286, "acc_norm,none": 0.9756272401433692, "acc_norm_stderr,none": 0.0014597607249481903 }, "bbq_religion": { "alias": " - bbq_religion", "acc,none": 0.8375, "acc_stderr,none": 0.01065392165850614, "acc_norm,none": 0.835, "acc_norm_stderr,none": 0.01071952689631095 }, "bbq_ses": { "alias": " - bbq_ses", "acc,none": 0.9245337995337995, "acc_stderr,none": 0.003188457551106306, "acc_norm,none": 0.9220571095571095, "acc_norm_stderr,none": 0.00323601230652936 }, "bbq_sexualorientation": { "alias": " - bbq_sexualorientation", "acc,none": 0.9016203703703703, "acc_stderr,none": 0.01013815790835306, "acc_norm,none": 0.9016203703703703, "acc_norm_stderr,none": 0.01013815790835306 }, "leaderboard": { " ": " ", "alias": "leaderboard" }, "leaderboard_bbh": { " ": " ", "alias": " - leaderboard_bbh" }, "leaderboard_bbh_boolean_expressions": { "alias": " - leaderboard_bbh_boolean_expressions", "acc_norm,none": 0.8, "acc_norm_stderr,none": 0.02534897002097908 }, "leaderboard_bbh_causal_judgement": { "alias": " - leaderboard_bbh_causal_judgement", "acc_norm,none": 0.6470588235294118, "acc_norm_stderr,none": 0.03504019983419236 }, "leaderboard_bbh_date_understanding": { "alias": " - leaderboard_bbh_date_understanding", "acc_norm,none": 0.472, "acc_norm_stderr,none": 0.031636489531544396 }, "leaderboard_bbh_disambiguation_qa": { "alias": " - leaderboard_bbh_disambiguation_qa", "acc_norm,none": 0.68, "acc_norm_stderr,none": 0.02956172495524105 }, "leaderboard_bbh_formal_fallacies": { "alias": " - leaderboard_bbh_formal_fallacies", 
"acc_norm,none": 0.6, "acc_norm_stderr,none": 0.03104602102825324 }, "leaderboard_bbh_geometric_shapes": { "alias": " - leaderboard_bbh_geometric_shapes", "acc_norm,none": 0.36, "acc_norm_stderr,none": 0.030418764025174988 }, "leaderboard_bbh_hyperbaton": { "alias": " - leaderboard_bbh_hyperbaton", "acc_norm,none": 0.688, "acc_norm_stderr,none": 0.029361067575219817 }, "leaderboard_bbh_logical_deduction_five_objects": { "alias": " - leaderboard_bbh_logical_deduction_five_objects", "acc_norm,none": 0.48, "acc_norm_stderr,none": 0.031660853408495185 }, "leaderboard_bbh_logical_deduction_seven_objects": { "alias": " - leaderboard_bbh_logical_deduction_seven_objects", "acc_norm,none": 0.432, "acc_norm_stderr,none": 0.03139181076542941 }, "leaderboard_bbh_logical_deduction_three_objects": { "alias": " - leaderboard_bbh_logical_deduction_three_objects", "acc_norm,none": 0.692, "acc_norm_stderr,none": 0.029256928606501868 }, "leaderboard_bbh_movie_recommendation": { "alias": " - leaderboard_bbh_movie_recommendation", "acc_norm,none": 0.688, "acc_norm_stderr,none": 0.029361067575219817 }, "leaderboard_bbh_navigate": { "alias": " - leaderboard_bbh_navigate", "acc_norm,none": 0.604, "acc_norm_stderr,none": 0.030993197854577853 }, "leaderboard_bbh_object_counting": { "alias": " - leaderboard_bbh_object_counting", "acc_norm,none": 0.336, "acc_norm_stderr,none": 0.029933259094191516 }, "leaderboard_bbh_penguins_in_a_table": { "alias": " - leaderboard_bbh_penguins_in_a_table", "acc_norm,none": 0.4315068493150685, "acc_norm_stderr,none": 0.04113130264537192 }, "leaderboard_bbh_reasoning_about_colored_objects": { "alias": " - leaderboard_bbh_reasoning_about_colored_objects", "acc_norm,none": 0.548, "acc_norm_stderr,none": 0.03153986449255663 }, "leaderboard_bbh_ruin_names": { "alias": " - leaderboard_bbh_ruin_names", "acc_norm,none": 0.644, "acc_norm_stderr,none": 0.03034368065715322 }, "leaderboard_bbh_salient_translation_error_detection": { "alias": " - 
leaderboard_bbh_salient_translation_error_detection", "acc_norm,none": 0.468, "acc_norm_stderr,none": 0.031621252575725504 }, "leaderboard_bbh_snarks": { "alias": " - leaderboard_bbh_snarks", "acc_norm,none": 0.7247191011235955, "acc_norm_stderr,none": 0.03357269922538226 }, "leaderboard_bbh_sports_understanding": { "alias": " - leaderboard_bbh_sports_understanding", "acc_norm,none": 0.736, "acc_norm_stderr,none": 0.02793451895769091 }, "leaderboard_bbh_temporal_sequences": { "alias": " - leaderboard_bbh_temporal_sequences", "acc_norm,none": 0.272, "acc_norm_stderr,none": 0.02820008829631 }, "leaderboard_bbh_tracking_shuffled_objects_five_objects": { "alias": " - leaderboard_bbh_tracking_shuffled_objects_five_objects", "acc_norm,none": 0.196, "acc_norm_stderr,none": 0.02515685731325592 }, "leaderboard_bbh_tracking_shuffled_objects_seven_objects": { "alias": " - leaderboard_bbh_tracking_shuffled_objects_seven_objects", "acc_norm,none": 0.14, "acc_norm_stderr,none": 0.021989409645240272 }, "leaderboard_bbh_tracking_shuffled_objects_three_objects": { "alias": " - leaderboard_bbh_tracking_shuffled_objects_three_objects", "acc_norm,none": 0.268, "acc_norm_stderr,none": 0.02806876238252669 }, "leaderboard_bbh_web_of_lies": { "alias": " - leaderboard_bbh_web_of_lies", "acc_norm,none": 0.476, "acc_norm_stderr,none": 0.03164968895968782 }, "leaderboard_gpqa": { " ": " ", "alias": " - leaderboard_gpqa" }, "leaderboard_gpqa_diamond": { "alias": " - leaderboard_gpqa_diamond", "acc_norm,none": 0.2777777777777778, "acc_norm_stderr,none": 0.03191178226713547 }, "leaderboard_gpqa_extended": { "alias": " - leaderboard_gpqa_extended", "acc_norm,none": 0.2948717948717949, "acc_norm_stderr,none": 0.01953225605335248 }, "leaderboard_gpqa_main": { "alias": " - leaderboard_gpqa_main", "acc_norm,none": 0.27901785714285715, "acc_norm_stderr,none": 0.021214094157265967 }, "leaderboard_ifeval": { "alias": " - leaderboard_ifeval", "prompt_level_strict_acc,none": 0.36414048059149723, 
"prompt_level_strict_acc_stderr,none": 0.02070704795859199, "inst_level_strict_acc,none": 0.5, "inst_level_strict_acc_stderr,none": "N/A", "prompt_level_loose_acc,none": 0.4343807763401109, "prompt_level_loose_acc_stderr,none": 0.021330473657564727, "inst_level_loose_acc,none": 0.5671462829736211, "inst_level_loose_acc_stderr,none": "N/A" }, "leaderboard_math_hard": { " ": " ", "alias": " - leaderboard_math_hard" }, "leaderboard_math_algebra_hard": { "alias": " - leaderboard_math_algebra_hard", "exact_match,none": 0.08143322475570032, "exact_match_stderr,none": 0.015634913029180096 }, "leaderboard_math_counting_and_prob_hard": { "alias": " - leaderboard_math_counting_and_prob_hard", "exact_match,none": 0.016260162601626018, "exact_match_stderr,none": 0.011450452676925665 }, "leaderboard_math_geometry_hard": { "alias": " - leaderboard_math_geometry_hard", "exact_match,none": 0.007575757575757576, "exact_match_stderr,none": 0.0075757575757575656 }, "leaderboard_math_intermediate_algebra_hard": { "alias": " - leaderboard_math_intermediate_algebra_hard", "exact_match,none": 0.014285714285714285, "exact_match_stderr,none": 0.007104350893915322 }, "leaderboard_math_num_theory_hard": { "alias": " - leaderboard_math_num_theory_hard", "exact_match,none": 0.05844155844155844, "exact_match_stderr,none": 0.01896438745195783 }, "leaderboard_math_prealgebra_hard": { "alias": " - leaderboard_math_prealgebra_hard", "exact_match,none": 0.11917098445595854, "exact_match_stderr,none": 0.02338193534812143 }, "leaderboard_math_precalculus_hard": { "alias": " - leaderboard_math_precalculus_hard", "exact_match,none": 0.014814814814814815, "exact_match_stderr,none": 0.01043649454959436 }, "leaderboard_mmlu_pro": { "alias": " - leaderboard_mmlu_pro", "acc,none": 0.3048537234042553, "acc_stderr,none": 0.004196942207232523 }, "leaderboard_musr": { " ": " ", "alias": " - leaderboard_musr" }, "leaderboard_musr_murder_mysteries": { "alias": " - leaderboard_musr_murder_mysteries", 
"acc_norm,none": 0.568, "acc_norm_stderr,none": 0.0313918107654294 }, "leaderboard_musr_object_placements": { "alias": " - leaderboard_musr_object_placements", "acc_norm,none": 0.328125, "acc_norm_stderr,none": 0.029403146715355242 }, "leaderboard_musr_team_allocation": { "alias": " - leaderboard_musr_team_allocation", "acc_norm,none": 0.364, "acc_norm_stderr,none": 0.030491555220405555 }, "toxigen": { "alias": "toxigen", "acc,none": 0.5702127659574469, "acc_stderr,none": 0.016155203301509467, "acc_norm,none": 0.5446808510638298, "acc_norm_stderr,none": 0.016251603395892635 }, "wmdp": { "acc,none": 0.5288985823336968, "acc_stderr,none": 0.008100262166921585, "alias": "wmdp" }, "wmdp_bio": { "alias": " - wmdp_bio", "acc,none": 0.6559308719560094, "acc_stderr,none": 0.01332012602079775 }, "wmdp_chem": { "alias": " - wmdp_chem", "acc,none": 0.49019607843137253, "acc_stderr,none": 0.024779315060043515 }, "wmdp_cyber": { "alias": " - wmdp_cyber", "acc,none": 0.4554604932058379, "acc_stderr,none": 0.011175074595399846 }, "xstest": { "alias": "xstest", "acc,none": 0.4488888888888889, "acc_stderr,none": 0.023472850939482037, "acc_norm,none": 0.4444444444444444, "acc_norm_stderr,none": 0.023450349399618212 } } ``` ## Dataset Details ### Dataset Description <!-- Provide a longer summary of what this dataset is. --> - **Curated by:** [More Information Needed] - **Funded by [optional]:** [More Information Needed] - **Shared by [optional]:** [More Information Needed] - **Language(s) (NLP):** [More Information Needed] - **License:** [More Information Needed] ### Dataset Sources [optional] <!-- Provide the basic links for the dataset. --> - **Repository:** [More Information Needed] - **Paper [optional]:** [More Information Needed] - **Demo [optional]:** [More Information Needed] ## Uses <!-- Address questions around how the dataset is intended to be used. --> ### Direct Use <!-- This section describes suitable use cases for the dataset. 
-->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

[More Information Needed]

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation [optional]

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
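Each task in the latest-results block pairs a metric such as `acc,none` with a matching `*_stderr` entry, and a few entries (for example the instance-level IFEval scores) report `"N/A"` instead of a number. The following is a minimal sketch of how a reader might turn such a pair into an approximate 95% interval. The helper name `confidence_interval`, the normal approximation, and the `z=1.96` default are assumptions made for illustration and are not part of the dataset or of the evaluation tooling; the two entries in the excerpt are copied from the results above.

```python
import json

# Excerpt copied from the "Latest results" block; other tasks use the same key layout.
RESULTS_EXCERPT = """
{
  "all": {
    "BeaverTailsEval": {
      "alias": "BeaverTailsEval",
      "acc,none": 0.8714285714285714,
      "acc_stderr,none": 0.012660461716778634
    },
    "leaderboard_ifeval": {
      "alias": " - leaderboard_ifeval",
      "inst_level_strict_acc,none": 0.5,
      "inst_level_strict_acc_stderr,none": "N/A"
    }
  }
}
"""


def confidence_interval(task_scores, metric="acc,none", z=1.96):
    """Hypothetical helper: return (low, high) = value -/+ z * stderr.

    Returns None when the metric or its standard error is missing or
    non-numeric (e.g. the "N/A" entries). Assumes a normal approximation.
    """
    name, _, filter_name = metric.partition(",")
    stderr_key = f"{name}_stderr,{filter_name}"
    value = task_scores.get(metric)
    stderr = task_scores.get(stderr_key)
    if not isinstance(value, (int, float)) or not isinstance(stderr, (int, float)):
        return None
    return (value - z * stderr, value + z * stderr)


results = json.loads(RESULTS_EXCERPT)["all"]
beaver_ci = confidence_interval(results["BeaverTailsEval"])
ifeval_ci = confidence_interval(
    results["leaderboard_ifeval"], metric="inst_level_strict_acc,none"
)
print(beaver_ci, ifeval_ci)
```

Returning `None` for non-numeric standard errors keeps the `"N/A"` entries from raising a `TypeError` when the same helper is mapped over every task.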

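**Dataset name:** Evaluation run of yunconglong/DARE_TIES_13B

**Dataset summary:** This dataset was automatically created during the evaluation run of the model [yunconglong/DARE_TIES_13B](https://huggingface.co/yunconglong/DARE_TIES_13B). It contains 62 configurations, each corresponding to one evaluated task.

The dataset was created from 2 evaluation runs. Each run is available as a specific split in each configuration, with the split named after the timestamp of the run. The `train` split always points to the latest results.

An additional configuration named `results` stores all the aggregated results of the run.

To load the details from a run, you can use code such as the following:

    from datasets import load_dataset
    data = load_dataset(
        "nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private",
        name="yunconglong__DARE_TIES_13B__BeaverTailsEval",
        split="latest"
    )

## Latest results

These are the [latest results from run 2024-12-04T20:37:46.218361](https://huggingface.co/datasets/nyu-dice-lab/lm-eval-results-yunconglong-DARE_TIES_13B-private/blob/main/yunconglong/DARE_TIES_13B/results_2024-12-04T20-37-46.218361.json) (note: if successive evaluations did not cover the same tasks, the repository may contain results for other tasks; each can be found in the `results` configuration and the `latest` split of each evaluation):

    {
      "all": {
        "BeaverTailsEval": {
          "alias": "BeaverTailsEval",
          "acc,none": 0.8714285714285714,
          "acc_stderr,none": 0.012660461716778634,
          "acc_norm,none": 0.12428571428571429,
          "acc_norm_stderr,none": 0.012478237164470317
        },
        "CDNA": {
          "alias": "CDNA",
          "acc,none": 0.9552457813646368,
          "acc_stderr,none": 0.003960876492273638,
          "acc_norm,none": 0.001834189288334556,
          "acc_norm_stderr,none": 0.0008196721291236438
        },
        // Results for the remaining evaluation tasks follow the same format; details omitted
      }
    }

**Repository:** https://huggingface.co/yunconglong/DARE_TIES_13B

**Leaderboard:** None

**Contact:** None

### Configuration list

The dataset contains 62 configurations, each corresponding to one independent evaluation task and named `yunconglong__DARE_TIES_13B__[task name]`. Each configuration contains two splits: a historical-run split named after its timestamp, and a `latest` split holding the most recent results. Data file paths follow the pattern `**/samples_[task name]_[timestamp].jsonl`.

## Dataset Card: Evaluation run of yunconglong/DARE_TIES_13B

### Dataset Details

#### Dataset Description

- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

#### Dataset Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

### Uses

#### Direct Use

[More Information Needed]

#### Out-of-Scope Use

[More Information Needed]

### Dataset Structure

[More Information Needed]

### Dataset Creation

#### Curation Rationale

[More Information Needed]

#### Source Data

##### Data Collection and Processing

[More Information Needed]

##### Source Data Producers

[More Information Needed]

#### Annotations [optional]

##### Annotation Process

[More Information Needed]

##### Annotators

[More Information Needed]

##### Personal and Sensitive Information

[More Information Needed]

### Bias, Risks, and Limitations

[More Information Needed]

#### Recommendations

Users should be made fully aware of the risks, biases, and limitations of this dataset. More information is needed for further recommendations.

### Citation [optional]

**BibTeX:** [More Information Needed]

**APA:** [More Information Needed]

### Glossary [optional]

[More Information Needed]

### More Information [optional]

[More Information Needed]

### Dataset Card Authors [optional]

[More Information Needed]

### Dataset Card Contact

[More Information Needed]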
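The configuration naming scheme and the sample-file pattern described in the configuration list can be sketched as two small string helpers. The function names `config_name` and `samples_pattern` are hypothetical, introduced only for illustration. Using the run timestamp in the hyphenated form seen in the results filename is also an assumption about how sample files are named.

```python
MODEL_ID = "yunconglong/DARE_TIES_13B"


def config_name(model_id: str, task: str) -> str:
    """Hypothetical helper: build '<org>__<model>__<task>' from a model id and task name."""
    return f"{model_id.replace('/', '__')}__{task}"


def samples_pattern(task: str, timestamp: str) -> str:
    """Hypothetical helper: fill in the '**/samples_[task name]_[timestamp].jsonl' pattern."""
    return f"**/samples_{task}_{timestamp}.jsonl"


# The task name is taken from the card's loading example; the timestamp is the
# run timestamp as it appears in the results filename (assumed to apply here too).
name = config_name(MODEL_ID, "BeaverTailsEval")
pattern = samples_pattern("BeaverTailsEval", "2024-12-04T20-37-46.218361")
print(name)
print(pattern)
```

The resulting `name` matches the `name=` argument in the card's `load_dataset` example, so the same helper could be used to iterate over task names when loading several configurations.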
Provider: nyu-dice-lab