DarianNLP/mda_influence_scores_NEW_lr1e4_v3

Source: Hugging Face | Updated: 2026-04-22 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/DarianNLP/mda_influence_scores_NEW_lr1e4_v3
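hf-mirror.com mirrors the Hugging Face Hub, and `huggingface_hub` (which `datasets` uses for downloads) honours the `HF_ENDPOINT` environment variable. The following is a minimal sketch for fetching the data through the mirror. It assumes the `datasets` library is installed and that the data is published as a `train` split; this page does not state the split name. The helper names `configure_mirror` and `load_scores` are hypothetical.

```python
import os

# Repo id taken from the download link; the default split name below is an assumption.
REPO_ID = "DarianNLP/mda_influence_scores_NEW_lr1e4_v3"


def configure_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Point huggingface_hub at a mirror via the HF_ENDPOINT environment variable.

    huggingface_hub reads HF_ENDPOINT when it is first imported, so call this
    before importing `datasets` or `huggingface_hub`.
    """
    os.environ["HF_ENDPOINT"] = endpoint.rstrip("/")
    return os.environ["HF_ENDPOINT"]


def load_scores(split: str = "train"):
    """Download one split of the dataset (requires network access).

    The "train" split name is an assumption; check the dataset page if it fails.
    """
    from datasets import load_dataset  # imported lazily, after configure_mirror()

    return load_dataset(REPO_ID, split=split)
```

Typical use is `configure_mirror()` followed by `ds = load_scores()`. Indexing `ds[i]` then yields a dict keyed by the column names described below.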
Resource description:
The dataset card's dataset_info header declares these features:

Per-example fields:
- idx: int64
- prompt: string
- label: string
- source: string
- response: string
- grad_norm: float64
- delta_Y_mean: float64
- delta_Y_per_prompt: list of float64
- delta_Y_frac_increasing: float64
- delta_h10_mean_norm: float64
- delta_f_per_prompt: struct with one field per category in the category list below, each a list of float64

Influence fields, repeated for each of the prefixes harmful_natural, harmful_balanced and harmful_harmless, in that order (<prefix> stands for one of them):
- <prefix>_refusal_influence: float64
- <prefix>_influence_<category>: float64, one column per category in the category list below
- <prefix>_top5_most_influenced: list of {feature: string, influence: float64, ridge_weight: float64}
- <prefix>_top5_most_important: list of {feature: string, influence: float64, ridge_weight: float64}

Appendix and segment fields:
- appendix: list, beginning with refusal_influence (float64) and seed (int64); the header does not show whether the fields below belong to these list items or are top-level columns
- harmful_natural_seg_all_delta_Y_mean: float64
- harmful_natural_seg_all_delta_Y_frac_increasing: float64
- harmful_natural_seg_all_delta_f_mean_<category>: float64, for the first 44 categories in the list below (10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless through 4396_public_figure_medical_history_inquiries_harmless); the header is truncated after this entry

Categories (75, in header order):
- 10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
- 10737_general_benign_informational_explanation_requests_harmless
- 1083_trivia_lookup_and_benign_productivity_queries_harmless
- 10878_stereotype_based_demeaning_character_portrayals_harmful
- 10940_stereotyping_protected_groups_as_inferior_harmful
- 11223_competitive_skill_improvement_metaphors_harmless
- 11236_degrading_prejudice_toward_marginalized_groups_harmful
- 11382_copyright_compliance_and_policy_clarification_requests_harmless
- 11404_copyright_derivative_work_permission_requests_harmful
- 11495_creative_ideation_and_invention_prompts_harmless
- 11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
- 11756_copyrighted_media_permission_and_attribution_requests_harmless
- 11813_toxic_social_bias_discriminatory_harm_requests_harmless
- 11840_corporate_email_admin_document_text_harmful
- 11861_anti_fraud_and_verification_requests_harmless
- 12232_general_benign_informational_question_answering_harmless
- 1227_illicit_finance_fraud_evasion_requests_harmful
- 1636_privacy_intrusion_data_request_harmful
- 1728_general_purpose_benign_inquiry_informational_request_harmless
- 1742_mundane_creative_editing_and_trivia_requests_harmless
- 1748_comparison_framed_social_cultural_difference_questions_harmless
- 1923_harmful_persuasion_manipulation_requests_harmful
- 1986_violent_sadistic_scene_requests_harmful
- 2197_manipulative_abuse_grooming_solicitation_harmful
- 2286_email_document_subject_line_extraction_harmless
- 2684_corporate_email_attachment_logistics_harmful
- 2742_email_document_revision_and_legal_admin_text_harmful
- 2827_neutral_procedural_text_with_risky_content_overlap_harmless
- 286_email_subject_line_document_summarization_requests_harmless
- 2945_internal_business_email_logistics_harmless
- 3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
- 3277_practical_everyday_advice_problem_solving_requests_harmless
- 3368_graphic_violent_ritual_injury_descriptions_harmless
- 3734_discriminatory_insult_harassment_requests_harmful
- 3767_copyright_safe_creative_and_formatting_requests_weak_harmless
- 3772_biased_demeaning_requests_about_protected_groups_harmful
- 3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
- 3895_general_factual_cultural_trivia_questions_harmless
- 3915_abuse_harassment_and_humiliating_harm_requests_harmful
- 4112_historical_causation_of_genocides_and_revolutions_harmless
- 4202_sensitive_info_policy_complaint_requests_harmless
- 4205_coping_and_self_improvement_guidance_harmless
- 4319_general_benign_informational_queries_harmless
- 4396_public_figure_medical_history_inquiries_harmless
- 4590_stereotyping_and_generalization_prompts_harmless
- 4986_ethical_classification_recommendation_task_prompts_harmless
- 5098_corporate_email_admin_task_requests_weak_harmful
- 5449_rumor_seeking_and_secret_sharing_requests_harmful
- 5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
- 603_biased_inferiority_and_illicit_trade_harm_requests_harmful
- 6213_email_admin_subject_line_drafting_harmless
- 6238_definition_of_risky_terms_queries_harmless
- 6499_generic_benign_writing_tasks_harmless
- 6562_legitimate_verification_and_administrative_request_prompts_harmless
- 6733_copyright_defamation_misuse_requests_harmful
- 6815_harmful_advocacy_with_deceptive_framing_harmful
- 7063_sensitive_social_organizational_explanation_requests_harmless
- 7261_protective_safety_legal_compliance_advice_harmless
- 728_safety_health_advice_and_benign_information_requests_harmless
- 7294_deceptive_malicious_manipulation_requests_harmful
- 7480_email_subject_line_and_message_summary_requests_harmless
- 7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
- 7897_administrative_verification_request_language_harmless
- 7940_defamation_and_fake_news_fabrication_requests_harmful
- 7971_fictional_scenario_safety_secure_escape_questions_harmless
- 8063_general_factual_cultural_benign_queries_harmless
- 8140_email_task_management_correspondence_harmless
- 8184_risk_avoidance_verification_requests_harmless
- 823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
- 8240_trivia_and_classification_questions_harmless
- 910_corporate_email_meeting_logistics_harmless
- 9267_video_game_combat_action_scenario_requests_harmless
- 9558_word_list_sentence_puzzle_prompts_harmful
- 9623_misinformation_and_fabricated_narrative_requests_harmful
- 9994_enumerative_benign_business_science_prompts_harmless
harmful_natural_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - 
name: harmful_natural_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_Y_mean dtype: float64 - name: harmful_natural_seg_harmful_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_Y_mean dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful 
dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_Y_mean dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - 
name: harmful_natural_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_all_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful 
dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_harmful_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: 
float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless 
dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless 
dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful 
dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - 
name: harmful_balanced_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_all_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful 
dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmful_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmless_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - 
name: harmful_harmless_seg_harmless_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - 
name: harmful_harmless_seg_harmless_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful 
dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: 
float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - 
name: harmful_harmless_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - 
name: harmful_harmless_seg_harmless_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: 
float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: 
harmful_harmless_seg_harmless_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - 
name: harmful_harmless_seg_harmless_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: appendix_seed0_seg_all_delta_Y_mean dtype: float64 - name: appendix_seed0_seg_all_delta_Y_frac_increasing dtype: float64 - name: appendix_seed0_seg_harmful_refused_delta_Y_mean dtype: float64 - name: appendix_seed0_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: appendix_seed0_seg_harmful_complied_delta_Y_mean dtype: float64 - name: appendix_seed0_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: appendix_seed1_seg_all_delta_Y_mean dtype: float64 - name: appendix_seed1_seg_all_delta_Y_frac_increasing dtype: float64 - name: appendix_seed1_seg_harmful_refused_delta_Y_mean dtype: float64 - name: appendix_seed1_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: appendix_seed1_seg_harmful_complied_delta_Y_mean dtype: float64 - name: appendix_seed1_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: appendix_seed2_seg_all_delta_Y_mean dtype: float64 - name: appendix_seed2_seg_all_delta_Y_frac_increasing dtype: float64 - name: appendix_seed2_seg_harmful_refused_delta_Y_mean dtype: float64 - name: appendix_seed2_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: appendix_seed2_seg_harmful_complied_delta_Y_mean dtype: float64 - name: appendix_seed2_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: appendix_seed0_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: appendix_seed0_influence_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: 
appendix_seed0_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: appendix_seed0_influence_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: appendix_seed0_influence_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: appendix_seed0_influence_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: appendix_seed0_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: appendix_seed0_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: appendix_seed0_influence_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: appendix_seed0_influence_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: appendix_seed0_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: appendix_seed0_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: appendix_seed0_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: appendix_seed0_influence_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: appendix_seed0_influence_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: appendix_seed0_influence_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: appendix_seed0_influence_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: appendix_seed0_influence_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: appendix_seed0_influence_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: appendix_seed0_influence_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: 
appendix_seed0_influence_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: appendix_seed0_influence_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: appendix_seed0_influence_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: appendix_seed0_influence_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: appendix_seed0_influence_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: appendix_seed0_influence_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: appendix_seed0_influence_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: appendix_seed0_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: appendix_seed0_influence_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: appendix_seed0_influence_2945_internal_business_email_logistics_harmless dtype: float64 - name: appendix_seed0_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: appendix_seed0_influence_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: appendix_seed0_influence_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: appendix_seed0_influence_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: appendix_seed0_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: appendix_seed0_influence_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: appendix_seed0_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: appendix_seed0_influence_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: 
appendix_seed0_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: appendix_seed0_influence_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: appendix_seed0_influence_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: appendix_seed0_influence_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: appendix_seed0_influence_4319_general_benign_informational_queries_harmless dtype: float64 - name: appendix_seed0_influence_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: appendix_seed0_influence_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: appendix_seed0_influence_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: appendix_seed0_influence_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: appendix_seed0_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: appendix_seed0_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: appendix_seed0_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: appendix_seed0_influence_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: appendix_seed0_influence_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: appendix_seed0_influence_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: appendix_seed0_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: appendix_seed0_influence_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: appendix_seed0_influence_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: appendix_seed0_influence_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: 
appendix_seed0_influence_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: appendix_seed0_influence_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: appendix_seed0_influence_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: appendix_seed0_influence_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: appendix_seed0_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: appendix_seed0_influence_7897_administrative_verification_request_language_harmless dtype: float64 - name: appendix_seed0_influence_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: appendix_seed0_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: appendix_seed0_influence_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: appendix_seed0_influence_8140_email_task_management_correspondence_harmless dtype: float64 - name: appendix_seed0_influence_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: appendix_seed0_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: appendix_seed0_influence_8240_trivia_and_classification_questions_harmless dtype: float64 - name: appendix_seed0_influence_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: appendix_seed0_influence_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: appendix_seed0_influence_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: appendix_seed0_influence_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: appendix_seed0_influence_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: appendix_seed0_refusal_influence dtype: float64 - name: 
appendix_seed0_top20_most_influenced list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: appendix_seed0_top20_most_important list: - name: feature dtype: string - name: ridge_weight dtype: float64 - name: influence dtype: float64 - name: appendix_seed1_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: appendix_seed1_influence_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: appendix_seed1_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: appendix_seed1_influence_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: appendix_seed1_influence_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: appendix_seed1_influence_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: appendix_seed1_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: appendix_seed1_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: appendix_seed1_influence_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: appendix_seed1_influence_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: appendix_seed1_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: appendix_seed1_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: appendix_seed1_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: appendix_seed1_influence_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: appendix_seed1_influence_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: 
appendix_seed1_influence_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: appendix_seed1_influence_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: appendix_seed1_influence_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: appendix_seed1_influence_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: appendix_seed1_influence_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: appendix_seed1_influence_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: appendix_seed1_influence_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: appendix_seed1_influence_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: appendix_seed1_influence_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: appendix_seed1_influence_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: appendix_seed1_influence_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: appendix_seed1_influence_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: appendix_seed1_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: appendix_seed1_influence_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: appendix_seed1_influence_2945_internal_business_email_logistics_harmless dtype: float64 - name: appendix_seed1_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: appendix_seed1_influence_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: appendix_seed1_influence_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: appendix_seed1_influence_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: 
appendix_seed1_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: appendix_seed1_influence_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: appendix_seed1_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: appendix_seed1_influence_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: appendix_seed1_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: appendix_seed1_influence_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: appendix_seed1_influence_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: appendix_seed1_influence_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: appendix_seed1_influence_4319_general_benign_informational_queries_harmless dtype: float64 - name: appendix_seed1_influence_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: appendix_seed1_influence_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: appendix_seed1_influence_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: appendix_seed1_influence_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: appendix_seed1_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: appendix_seed1_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: appendix_seed1_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: appendix_seed1_influence_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: appendix_seed1_influence_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: appendix_seed1_influence_6499_generic_benign_writing_tasks_harmless 
dtype: float64 - name: appendix_seed1_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: appendix_seed1_influence_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: appendix_seed1_influence_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: appendix_seed1_influence_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: appendix_seed1_influence_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: appendix_seed1_influence_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: appendix_seed1_influence_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: appendix_seed1_influence_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: appendix_seed1_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: appendix_seed1_influence_7897_administrative_verification_request_language_harmless dtype: float64 - name: appendix_seed1_influence_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: appendix_seed1_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: appendix_seed1_influence_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: appendix_seed1_influence_8140_email_task_management_correspondence_harmless dtype: float64 - name: appendix_seed1_influence_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: appendix_seed1_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: appendix_seed1_influence_8240_trivia_and_classification_questions_harmless dtype: float64 - name: appendix_seed1_influence_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: 
appendix_seed1_influence_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: appendix_seed1_influence_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: appendix_seed1_influence_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: appendix_seed1_influence_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: appendix_seed1_refusal_influence dtype: float64 - name: appendix_seed1_top20_most_influenced list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: appendix_seed1_top20_most_important list: - name: feature dtype: string - name: ridge_weight dtype: float64 - name: influence dtype: float64 - name: appendix_seed2_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: appendix_seed2_influence_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: appendix_seed2_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: appendix_seed2_influence_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: appendix_seed2_influence_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: appendix_seed2_influence_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: appendix_seed2_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: appendix_seed2_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: appendix_seed2_influence_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: appendix_seed2_influence_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: appendix_seed2_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: 
appendix_seed2_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: appendix_seed2_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: appendix_seed2_influence_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: appendix_seed2_influence_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: appendix_seed2_influence_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: appendix_seed2_influence_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: appendix_seed2_influence_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: appendix_seed2_influence_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: appendix_seed2_influence_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: appendix_seed2_influence_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: appendix_seed2_influence_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: appendix_seed2_influence_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: appendix_seed2_influence_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: appendix_seed2_influence_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: appendix_seed2_influence_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: appendix_seed2_influence_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: appendix_seed2_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: appendix_seed2_influence_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: appendix_seed2_influence_2945_internal_business_email_logistics_harmless dtype: float64 - name: 
appendix_seed2_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: appendix_seed2_influence_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: appendix_seed2_influence_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: appendix_seed2_influence_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: appendix_seed2_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: appendix_seed2_influence_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: appendix_seed2_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: appendix_seed2_influence_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: appendix_seed2_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: appendix_seed2_influence_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: appendix_seed2_influence_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: appendix_seed2_influence_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: appendix_seed2_influence_4319_general_benign_informational_queries_harmless dtype: float64 - name: appendix_seed2_influence_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: appendix_seed2_influence_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: appendix_seed2_influence_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: appendix_seed2_influence_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: appendix_seed2_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: 
appendix_seed2_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: appendix_seed2_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: appendix_seed2_influence_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: appendix_seed2_influence_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: appendix_seed2_influence_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: appendix_seed2_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: appendix_seed2_influence_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: appendix_seed2_influence_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: appendix_seed2_influence_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: appendix_seed2_influence_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: appendix_seed2_influence_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: appendix_seed2_influence_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: appendix_seed2_influence_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: appendix_seed2_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: appendix_seed2_influence_7897_administrative_verification_request_language_harmless dtype: float64 - name: appendix_seed2_influence_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: appendix_seed2_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: appendix_seed2_influence_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: 
appendix_seed2_influence_8140_email_task_management_correspondence_harmless dtype: float64 - name: appendix_seed2_influence_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: appendix_seed2_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: appendix_seed2_influence_8240_trivia_and_classification_questions_harmless dtype: float64 - name: appendix_seed2_influence_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: appendix_seed2_influence_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: appendix_seed2_influence_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: appendix_seed2_influence_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: appendix_seed2_influence_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: appendix_seed2_refusal_influence dtype: float64 - name: appendix_seed2_top20_most_influenced list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: appendix_seed2_top20_most_important list: - name: feature dtype: string - name: ridge_weight dtype: float64 - name: influence dtype: float64 splits: - name: train num_bytes: 1052446705 num_examples: 220 download_size: 1062578491 dataset_size: 1052446705 configs: - config_name: default data_files: - split: train path: data/train-* ---
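The `appendix_seed2_influence_*` column names appear to encode a numeric cluster id, a short description, and a trailing `harmful`/`harmless` tag, sometimes preceded by `weak`. The card does not document this scheme. The sketch below parses names under that assumption and averages one row's per-cluster scores by tag. It also ranks an `appendix_seed2_top20_most_influenced`-style list by absolute influence, which is one possible ordering; the card does not say how the lists were ranked. The helper names (`parse_influence_column`, `mean_influence_by_tag`, `top_by_abs_influence`), the feature names `feature_a`/`feature_b`/`feature_c`, and all numeric values are hypothetical.

```python
import re
from statistics import mean

# Assumed naming scheme (inferred from the column list, not documented):
#   appendix_seed2_influence_<cluster_id>_<description>[_weak]_<harmful|harmless>
PREFIX = "appendix_seed2_influence_"
_NAME = re.compile(r"^(\d+)_(.+?)(_weak)?_(harmful|harmless)$")


def parse_influence_column(name, prefix=PREFIX):
    """Split a column name into (cluster_id, description, is_weak, tag).

    Returns None for columns that do not follow the assumed scheme,
    e.g. appendix_seed2_refusal_influence.
    """
    if not name.startswith(prefix):
        return None
    match = _NAME.match(name[len(prefix):])
    if match is None:
        return None
    cluster_id, description, weak, tag = match.groups()
    return int(cluster_id), description, weak is not None, tag


def mean_influence_by_tag(row, prefix=PREFIX):
    """Average one row's per-cluster influence scores separately by harmful/harmless tag."""
    groups = {"harmful": [], "harmless": []}
    for name, value in row.items():
        parsed = parse_influence_column(name, prefix)
        if parsed is not None and value is not None:
            groups[parsed[3]].append(value)
    return {tag: mean(values) for tag, values in groups.items() if values}


def top_by_abs_influence(entries, k=5):
    """Rank {feature, influence, ridge_weight} entries by |influence|, largest first."""
    return sorted(entries, key=lambda e: abs(e["influence"]), reverse=True)[:k]


# Made-up values for illustration only; real rows come from the train split.
row = {
    "appendix_seed2_influence_1227_illicit_finance_fraud_evasion_requests_harmful": 0.5,
    "appendix_seed2_influence_5098_corporate_email_admin_task_requests_weak_harmful": 0.25,
    "appendix_seed2_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless": -0.125,
    "appendix_seed2_refusal_influence": 0.9,  # skipped: not a per-cluster column
}
print(mean_influence_by_tag(row))  # → {'harmful': 0.375, 'harmless': -0.125}

entries = [
    {"feature": "feature_a", "influence": 0.1, "ridge_weight": 0.3},
    {"feature": "feature_b", "influence": -0.7, "ridge_weight": -0.2},
    {"feature": "feature_c", "influence": 0.4, "ridge_weight": 0.1},
]
print([e["feature"] for e in top_by_abs_influence(entries, k=2)])  # → ['feature_b', 'feature_c']
```

Real rows could be obtained with `datasets.load_dataset("DarianNLP/mda_influence_scores_NEW_lr1e4_v3", split="train")`. Passing `streaming=True` avoids fetching the roughly 1.06 GB download up front. The loading step is left out here so that the sketch runs offline.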

Dataset info:
Features:
- idx: int64
- prompt: string
- label: string
- source: string
- response: string
- grad_norm (gradient norm): float64
- delta_Y_mean (mean ΔY): float64
- delta_Y_per_prompt (per-prompt ΔY): list of float64
- delta_Y_frac_increasing (fraction of increasing ΔY): float64
- delta_h10_mean_norm (mean norm of Δh10): float64
- delta_f_per_prompt (per-prompt Δf): struct with one list-of-float64 field per prompt category:
  - 10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
  - 10737_general_benign_informational_explanation_requests_harmless
  - 1083_trivia_lookup_and_benign_productivity_queries_harmless
  - 10878_stereotype_based_demeaning_character_portrayals_harmful
  - 10940_stereotyping_protected_groups_as_inferior_harmful
  - ... (58 further fields in the same format omitted, each a list of floats for a different harmless/harmful prompt category)
- harmful_natural_refusal_influence: float64
- harmful_natural_influence_xxx (harmful natural influence for each prompt category): float64
- ... (61 further harmful natural influence fields in the same format omitted, one per harmless/harmful prompt category)
- harmful_natural_top5_most_influenced: list of structs:
  - feature: string
  - influence: float64
  - ridge_weight: float64
- harmful_natural_top5_most_important: list of structs in the same format
- ... (the harmful balanced influence series, the harmful harmless influence series, and the appendix fields are omitted)
Splits:
- train: 1052446705 bytes, 220 examples
Download size: 1062578491 bytes
Dataset size: 1052446705 bytes
Configs:
- default, data_files:
  - split: train, path: data/train-*
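The card lists `delta_Y_mean`, `delta_Y_per_prompt` and `delta_Y_frac_increasing` but does not define them. The sketch below assumes two definitions: `delta_Y_mean` is the arithmetic mean of the per-prompt list, and `delta_Y_frac_increasing` is the share of strictly positive entries. Under those assumptions, it checks each row for consistency. `summarize_delta_y` and `check_row` are hypothetical helpers, and the example row is invented.

```python
import math


def summarize_delta_y(per_prompt):
    """Recompute the assumed summary statistics from a row's per-prompt ΔY list.

    Assumption: delta_Y_mean is the arithmetic mean and
    delta_Y_frac_increasing is the fraction of strictly positive entries.
    """
    if not per_prompt:
        raise ValueError("per_prompt must be non-empty")
    n = len(per_prompt)
    return {
        "delta_Y_mean": math.fsum(per_prompt) / n,
        "delta_Y_frac_increasing": sum(1 for v in per_prompt if v > 0) / n,
    }


def check_row(row, rel_tol=1e-9, abs_tol=1e-12):
    """Return the names of summary columns that disagree with the recomputed values."""
    recomputed = summarize_delta_y(row["delta_Y_per_prompt"])
    return [
        key
        for key, value in recomputed.items()
        if not math.isclose(row[key], value, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Invented example row (not taken from the dataset).
example = {
    "delta_Y_per_prompt": [0.5, -0.25, 1.0, 0.75],
    "delta_Y_mean": 0.5,
    "delta_Y_frac_increasing": 0.75,
}
print(check_row(example))  # → []
```

If `check_row` reports mismatches on real rows, revisit the assumed definitions first. For example, the fraction might count `>= 0` instead of `> 0`.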
Provided by: DarianNLP