DarianNLP/mda_influence_scores_NEW_lr1e4_cat_coded
Source: Hugging Face | Updated: 2026-04-25 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/DarianNLP/mda_influence_scores_NEW_lr1e4_cat_coded
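
Many of the per-category columns in the schema that follows appear to share one naming pattern: a setting prefix (`harmful_natural`, `harmful_balanced`, or `harmful_harmless`), the token `influence`, a numeric id, a description, and a trailing `harmful`/`harmless` tag. The sketch below splits such a name into its parts. The pattern is inferred from the column names only, and the helper name `parse_influence_column` and the key names it returns are hypothetical, not part of the dataset.

```python
import re

# Assumed pattern, inferred from the column names in the schema:
#   <setting>_influence_<numeric id>_<description>_<harmful|harmless>
COLUMN_RE = re.compile(
    r"^(?P<setting>harmful_(?:natural|balanced|harmless))_influence_"
    r"(?P<category_id>\d+)_(?P<description>.+)_(?P<label>harmful|harmless)$"
)


def parse_influence_column(name):
    """Split a per-category influence column name into its parts.

    Returns a dict with keys setting, category_id, description, label,
    or None when the name does not follow the assumed pattern
    (for example `idx` or `harmful_natural_refusal_influence`).
    """
    match = COLUMN_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["category_id"] = int(parts["category_id"])
    return parts


example = parse_influence_column(
    "harmful_natural_influence_1227_illicit_finance_fraud_evasion_requests_harmful"
)
print(example)
```

Names such as `..._weak_harmless` keep `weak` as the last word of the description under this pattern; whether it should instead be treated as part of the tag is not stated in the source.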
Resource description:
---
dataset_info:
features:
- name: idx
dtype: int64
- name: prompt
dtype: string
- name: label
dtype: string
- name: source
dtype: string
- name: response
dtype: string
- name: grad_norm
dtype: float64
- name: delta_Y_mean
dtype: float64
- name: delta_Y_per_prompt
list: float64
- name: delta_Y_frac_increasing
dtype: float64
- name: delta_h10_mean_norm
dtype: float64
- name: delta_f_per_prompt
struct:
- name: 10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
list: float64
- name: 10737_general_benign_informational_explanation_requests_harmless
list: float64
- name: 1083_trivia_lookup_and_benign_productivity_queries_harmless
list: float64
- name: 10878_stereotype_based_demeaning_character_portrayals_harmful
list: float64
- name: 10940_stereotyping_protected_groups_as_inferior_harmful
list: float64
- name: 11223_competitive_skill_improvement_metaphors_harmless
list: float64
- name: 11236_degrading_prejudice_toward_marginalized_groups_harmful
list: float64
- name: 11382_copyright_compliance_and_policy_clarification_requests_harmless
list: float64
- name: 11404_copyright_derivative_work_permission_requests_harmful
list: float64
- name: 11495_creative_ideation_and_invention_prompts_harmless
list: float64
- name: 11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
list: float64
- name: 11756_copyrighted_media_permission_and_attribution_requests_harmless
list: float64
- name: 11813_toxic_social_bias_discriminatory_harm_requests_harmless
list: float64
- name: 11840_corporate_email_admin_document_text_harmful
list: float64
- name: 11861_anti_fraud_and_verification_requests_harmless
list: float64
- name: 12232_general_benign_informational_question_answering_harmless
list: float64
- name: 1227_illicit_finance_fraud_evasion_requests_harmful
list: float64
- name: 1636_privacy_intrusion_data_request_harmful
list: float64
- name: 1728_general_purpose_benign_inquiry_informational_request_harmless
list: float64
- name: 1742_mundane_creative_editing_and_trivia_requests_harmless
list: float64
- name: 1748_comparison_framed_social_cultural_difference_questions_harmless
list: float64
- name: 1923_harmful_persuasion_manipulation_requests_harmful
list: float64
- name: 1986_violent_sadistic_scene_requests_harmful
list: float64
- name: 2197_manipulative_abuse_grooming_solicitation_harmful
list: float64
- name: 2286_email_document_subject_line_extraction_harmless
list: float64
- name: 2684_corporate_email_attachment_logistics_harmful
list: float64
- name: 2742_email_document_revision_and_legal_admin_text_harmful
list: float64
- name: 2827_neutral_procedural_text_with_risky_content_overlap_harmless
list: float64
- name: 286_email_subject_line_document_summarization_requests_harmless
list: float64
- name: 2945_internal_business_email_logistics_harmless
list: float64
- name: 3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
list: float64
- name: 3277_practical_everyday_advice_problem_solving_requests_harmless
list: float64
- name: 3368_graphic_violent_ritual_injury_descriptions_harmless
list: float64
- name: 3734_discriminatory_insult_harassment_requests_harmful
list: float64
- name: 3767_copyright_safe_creative_and_formatting_requests_weak_harmless
list: float64
- name: 3772_biased_demeaning_requests_about_protected_groups_harmful
list: float64
- name: 3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
list: float64
- name: 3895_general_factual_cultural_trivia_questions_harmless
list: float64
- name: 3915_abuse_harassment_and_humiliating_harm_requests_harmful
list: float64
- name: 4112_historical_causation_of_genocides_and_revolutions_harmless
list: float64
- name: 4202_sensitive_info_policy_complaint_requests_harmless
list: float64
- name: 4205_coping_and_self_improvement_guidance_harmless
list: float64
- name: 4319_general_benign_informational_queries_harmless
list: float64
- name: 4396_public_figure_medical_history_inquiries_harmless
list: float64
- name: 4590_stereotyping_and_generalization_prompts_harmless
list: float64
- name: 4986_ethical_classification_recommendation_task_prompts_harmless
list: float64
- name: 5098_corporate_email_admin_task_requests_weak_harmful
list: float64
- name: 5449_rumor_seeking_and_secret_sharing_requests_harmful
list: float64
- name: 5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
list: float64
- name: 603_biased_inferiority_and_illicit_trade_harm_requests_harmful
list: float64
- name: 6213_email_admin_subject_line_drafting_harmless
list: float64
- name: 6238_definition_of_risky_terms_queries_harmless
list: float64
- name: 6499_generic_benign_writing_tasks_harmless
list: float64
- name: 6562_legitimate_verification_and_administrative_request_prompts_harmless
list: float64
- name: 6733_copyright_defamation_misuse_requests_harmful
list: float64
- name: 6815_harmful_advocacy_with_deceptive_framing_harmful
list: float64
- name: 7063_sensitive_social_organizational_explanation_requests_harmless
list: float64
- name: 7261_protective_safety_legal_compliance_advice_harmless
list: float64
- name: 728_safety_health_advice_and_benign_information_requests_harmless
list: float64
- name: 7294_deceptive_malicious_manipulation_requests_harmful
list: float64
- name: 7480_email_subject_line_and_message_summary_requests_harmless
list: float64
- name: 7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
list: float64
- name: 7897_administrative_verification_request_language_harmless
list: float64
- name: 7940_defamation_and_fake_news_fabrication_requests_harmful
list: float64
- name: 7971_fictional_scenario_safety_secure_escape_questions_harmless
list: float64
- name: 8063_general_factual_cultural_benign_queries_harmless
list: float64
- name: 8140_email_task_management_correspondence_harmless
list: float64
- name: 8184_risk_avoidance_verification_requests_harmless
list: float64
- name: 823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
list: float64
- name: 8240_trivia_and_classification_questions_harmless
list: float64
- name: 910_corporate_email_meeting_logistics_harmless
list: float64
- name: 9267_video_game_combat_action_scenario_requests_harmless
list: float64
- name: 9558_word_list_sentence_puzzle_prompts_harmful
list: float64
- name: 9623_misinformation_and_fabricated_narrative_requests_harmful
list: float64
- name: 9994_enumerative_benign_business_science_prompts_harmless
list: float64
- name: harmful_natural_refusal_influence
dtype: float64
- name: harmful_natural_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_natural_influence_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_natural_influence_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_natural_influence_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_natural_influence_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_natural_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_natural_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_natural_influence_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_natural_influence_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_natural_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_natural_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_natural_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_natural_influence_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_natural_influence_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_natural_influence_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_natural_influence_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_natural_influence_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_natural_influence_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_natural_influence_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_natural_influence_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_natural_influence_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_influence_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_natural_influence_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_natural_influence_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_natural_influence_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_natural_influence_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_natural_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_natural_influence_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_natural_influence_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_natural_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_natural_influence_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_natural_influence_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_natural_influence_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_natural_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_natural_influence_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_natural_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_natural_influence_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_natural_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_natural_influence_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_natural_influence_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_natural_influence_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_natural_influence_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_natural_influence_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_natural_influence_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_natural_influence_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_natural_influence_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_natural_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_natural_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_natural_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_natural_influence_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_natural_influence_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_natural_influence_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_natural_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_natural_influence_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_natural_influence_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_natural_influence_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_influence_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_natural_influence_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_natural_influence_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_influence_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_natural_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_natural_influence_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_natural_influence_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_natural_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_natural_influence_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_natural_influence_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_natural_influence_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_natural_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_natural_influence_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_natural_influence_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_natural_influence_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_natural_influence_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_natural_influence_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_natural_influence_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_natural_top5_most_influenced
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: harmful_natural_top5_most_important
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: harmful_balanced_refusal_influence
dtype: float64
- name: harmful_balanced_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_influence_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_balanced_influence_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_balanced_influence_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_balanced_influence_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_balanced_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_balanced_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_balanced_influence_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_balanced_influence_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_balanced_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_balanced_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_balanced_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_balanced_influence_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_balanced_influence_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_balanced_influence_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_balanced_influence_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_balanced_influence_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_balanced_influence_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_balanced_influence_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_balanced_influence_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_balanced_influence_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_influence_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_balanced_influence_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_balanced_influence_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_balanced_influence_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_balanced_influence_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_balanced_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_balanced_influence_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_balanced_influence_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_balanced_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_balanced_influence_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_balanced_influence_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_balanced_influence_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_balanced_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_balanced_influence_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_balanced_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_balanced_influence_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_balanced_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_balanced_influence_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_balanced_influence_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_balanced_influence_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_balanced_influence_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_balanced_influence_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_balanced_influence_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_balanced_influence_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_balanced_influence_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_balanced_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_balanced_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_balanced_influence_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_balanced_influence_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_balanced_influence_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_balanced_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_balanced_influence_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_balanced_influence_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_balanced_influence_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_influence_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_balanced_influence_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_balanced_influence_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_influence_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_balanced_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_balanced_influence_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_balanced_influence_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_balanced_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_balanced_influence_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_balanced_influence_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_balanced_influence_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_balanced_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_balanced_influence_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_balanced_influence_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_balanced_influence_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_balanced_influence_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_balanced_influence_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_balanced_influence_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_balanced_top5_most_influenced
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: harmful_balanced_top5_most_important
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: harmful_harmless_refusal_influence
dtype: float64
- name: harmful_harmless_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_influence_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_influence_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_influence_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_influence_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_influence_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_influence_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_influence_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_influence_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_influence_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_influence_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_influence_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_influence_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_influence_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_influence_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_influence_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_influence_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_influence_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_influence_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_influence_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_influence_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_influence_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_influence_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_influence_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_influence_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_influence_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_influence_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_influence_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_influence_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_influence_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_influence_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_influence_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_influence_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_influence_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_influence_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_influence_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_influence_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_influence_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_influence_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_influence_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_influence_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_influence_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_influence_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_influence_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_influence_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_influence_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_influence_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_influence_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_influence_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_influence_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_influence_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_influence_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_influence_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_influence_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_influence_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_influence_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_influence_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_top5_most_influenced
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: harmful_harmless_top5_most_important
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: appendix
list:
- name: refusal_influence
dtype: float64
- name: seed
dtype: int64
- name: harmful_natural_seg_all_delta_Y_mean
dtype: float64
- name: harmful_natural_seg_all_delta_Y_frac_increasing
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_natural_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_Y_mean
dtype: float64
- name: harmful_natural_seg_harmful_delta_Y_frac_increasing
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_Y_mean
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_Y_frac_increasing
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_Y_mean
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_Y_frac_increasing
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_natural_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_Y_mean
dtype: float64
- name: harmful_balanced_seg_all_delta_Y_frac_increasing
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_balanced_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_Y_mean
dtype: float64
- name: harmful_balanced_seg_harmful_delta_Y_frac_increasing
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_Y_mean
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_Y_frac_increasing
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_Y_mean
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_Y_frac_increasing
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_balanced_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_all_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_harmful_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_harmless_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_Y_mean
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_Y_frac_increasing
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_2945_internal_business_email_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4319_general_benign_informational_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7897_administrative_verification_request_language_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_8140_email_task_management_correspondence_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: harmful_harmless_seg_harmless_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: appendix_seed0_seg_all_delta_Y_mean
dtype: float64
- name: appendix_seed0_seg_all_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed0_seg_harmful_refused_delta_Y_mean
dtype: float64
- name: appendix_seed0_seg_harmful_refused_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed0_seg_harmful_complied_delta_Y_mean
dtype: float64
- name: appendix_seed0_seg_harmful_complied_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed1_seg_all_delta_Y_mean
dtype: float64
- name: appendix_seed1_seg_all_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed1_seg_harmful_refused_delta_Y_mean
dtype: float64
- name: appendix_seed1_seg_harmful_refused_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed1_seg_harmful_complied_delta_Y_mean
dtype: float64
- name: appendix_seed1_seg_harmful_complied_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed2_seg_all_delta_Y_mean
dtype: float64
- name: appendix_seed2_seg_all_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed2_seg_harmful_refused_delta_Y_mean
dtype: float64
- name: appendix_seed2_seg_harmful_refused_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed2_seg_harmful_complied_delta_Y_mean
dtype: float64
- name: appendix_seed2_seg_harmful_complied_delta_Y_frac_increasing
dtype: float64
- name: appendix_seed0_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: appendix_seed0_influence_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: appendix_seed0_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: appendix_seed0_influence_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: appendix_seed0_influence_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: appendix_seed0_influence_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: appendix_seed0_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: appendix_seed0_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: appendix_seed0_influence_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: appendix_seed0_influence_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: appendix_seed0_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: appendix_seed0_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: appendix_seed0_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: appendix_seed0_influence_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: appendix_seed0_influence_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: appendix_seed0_influence_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: appendix_seed0_influence_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: appendix_seed0_influence_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: appendix_seed0_influence_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: appendix_seed0_influence_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: appendix_seed0_influence_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: appendix_seed0_influence_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: appendix_seed0_influence_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: appendix_seed0_influence_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: appendix_seed0_influence_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: appendix_seed0_influence_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: appendix_seed0_influence_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: appendix_seed0_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: appendix_seed0_influence_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: appendix_seed0_influence_2945_internal_business_email_logistics_harmless
dtype: float64
- name: appendix_seed0_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: appendix_seed0_influence_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: appendix_seed0_influence_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: appendix_seed0_influence_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: appendix_seed0_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: appendix_seed0_influence_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: appendix_seed0_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: appendix_seed0_influence_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: appendix_seed0_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: appendix_seed0_influence_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: appendix_seed0_influence_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: appendix_seed0_influence_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: appendix_seed0_influence_4319_general_benign_informational_queries_harmless
dtype: float64
- name: appendix_seed0_influence_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: appendix_seed0_influence_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: appendix_seed0_influence_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: appendix_seed0_influence_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: appendix_seed0_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: appendix_seed0_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: appendix_seed0_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: appendix_seed0_influence_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: appendix_seed0_influence_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: appendix_seed0_influence_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: appendix_seed0_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: appendix_seed0_influence_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: appendix_seed0_influence_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: appendix_seed0_influence_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: appendix_seed0_influence_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: appendix_seed0_influence_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: appendix_seed0_influence_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: appendix_seed0_influence_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: appendix_seed0_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: appendix_seed0_influence_7897_administrative_verification_request_language_harmless
dtype: float64
- name: appendix_seed0_influence_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: appendix_seed0_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: appendix_seed0_influence_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: appendix_seed0_influence_8140_email_task_management_correspondence_harmless
dtype: float64
- name: appendix_seed0_influence_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: appendix_seed0_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: appendix_seed0_influence_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: appendix_seed0_influence_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: appendix_seed0_influence_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: appendix_seed0_influence_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: appendix_seed0_influence_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: appendix_seed0_influence_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: appendix_seed0_refusal_influence
dtype: float64
- name: appendix_seed0_top20_most_influenced
list:
  - name: feature
    dtype: string
  - name: influence
    dtype: float64
  - name: ridge_weight
    dtype: float64
- name: appendix_seed0_top20_most_important
list:
  - name: feature
    dtype: string
  - name: influence
    dtype: float64
  - name: ridge_weight
    dtype: float64
- name: appendix_seed1_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: appendix_seed1_influence_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: appendix_seed1_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: appendix_seed1_influence_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: appendix_seed1_influence_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: appendix_seed1_influence_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: appendix_seed1_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: appendix_seed1_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: appendix_seed1_influence_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: appendix_seed1_influence_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: appendix_seed1_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: appendix_seed1_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: appendix_seed1_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: appendix_seed1_influence_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: appendix_seed1_influence_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: appendix_seed1_influence_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: appendix_seed1_influence_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: appendix_seed1_influence_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: appendix_seed1_influence_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: appendix_seed1_influence_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: appendix_seed1_influence_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: appendix_seed1_influence_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: appendix_seed1_influence_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: appendix_seed1_influence_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: appendix_seed1_influence_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: appendix_seed1_influence_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: appendix_seed1_influence_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: appendix_seed1_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: appendix_seed1_influence_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: appendix_seed1_influence_2945_internal_business_email_logistics_harmless
dtype: float64
- name: appendix_seed1_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: appendix_seed1_influence_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: appendix_seed1_influence_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: appendix_seed1_influence_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: appendix_seed1_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: appendix_seed1_influence_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: appendix_seed1_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: appendix_seed1_influence_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: appendix_seed1_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: appendix_seed1_influence_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: appendix_seed1_influence_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: appendix_seed1_influence_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: appendix_seed1_influence_4319_general_benign_informational_queries_harmless
dtype: float64
- name: appendix_seed1_influence_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: appendix_seed1_influence_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: appendix_seed1_influence_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: appendix_seed1_influence_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: appendix_seed1_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: appendix_seed1_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: appendix_seed1_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: appendix_seed1_influence_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: appendix_seed1_influence_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: appendix_seed1_influence_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: appendix_seed1_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: appendix_seed1_influence_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: appendix_seed1_influence_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: appendix_seed1_influence_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: appendix_seed1_influence_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: appendix_seed1_influence_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: appendix_seed1_influence_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: appendix_seed1_influence_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: appendix_seed1_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: appendix_seed1_influence_7897_administrative_verification_request_language_harmless
dtype: float64
- name: appendix_seed1_influence_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: appendix_seed1_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: appendix_seed1_influence_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: appendix_seed1_influence_8140_email_task_management_correspondence_harmless
dtype: float64
- name: appendix_seed1_influence_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: appendix_seed1_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: appendix_seed1_influence_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: appendix_seed1_influence_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: appendix_seed1_influence_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: appendix_seed1_influence_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: appendix_seed1_influence_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: appendix_seed1_influence_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: appendix_seed1_refusal_influence
dtype: float64
- name: appendix_seed1_top20_most_influenced
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: appendix_seed1_top20_most_important
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: appendix_seed2_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless
dtype: float64
- name: appendix_seed2_influence_10737_general_benign_informational_explanation_requests_harmless
dtype: float64
- name: appendix_seed2_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless
dtype: float64
- name: appendix_seed2_influence_10878_stereotype_based_demeaning_character_portrayals_harmful
dtype: float64
- name: appendix_seed2_influence_10940_stereotyping_protected_groups_as_inferior_harmful
dtype: float64
- name: appendix_seed2_influence_11223_competitive_skill_improvement_metaphors_harmless
dtype: float64
- name: appendix_seed2_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful
dtype: float64
- name: appendix_seed2_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless
dtype: float64
- name: appendix_seed2_influence_11404_copyright_derivative_work_permission_requests_harmful
dtype: float64
- name: appendix_seed2_influence_11495_creative_ideation_and_invention_prompts_harmless
dtype: float64
- name: appendix_seed2_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless
dtype: float64
- name: appendix_seed2_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless
dtype: float64
- name: appendix_seed2_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless
dtype: float64
- name: appendix_seed2_influence_11840_corporate_email_admin_document_text_harmful
dtype: float64
- name: appendix_seed2_influence_11861_anti_fraud_and_verification_requests_harmless
dtype: float64
- name: appendix_seed2_influence_12232_general_benign_informational_question_answering_harmless
dtype: float64
- name: appendix_seed2_influence_1227_illicit_finance_fraud_evasion_requests_harmful
dtype: float64
- name: appendix_seed2_influence_1636_privacy_intrusion_data_request_harmful
dtype: float64
- name: appendix_seed2_influence_1728_general_purpose_benign_inquiry_informational_request_harmless
dtype: float64
- name: appendix_seed2_influence_1742_mundane_creative_editing_and_trivia_requests_harmless
dtype: float64
- name: appendix_seed2_influence_1748_comparison_framed_social_cultural_difference_questions_harmless
dtype: float64
- name: appendix_seed2_influence_1923_harmful_persuasion_manipulation_requests_harmful
dtype: float64
- name: appendix_seed2_influence_1986_violent_sadistic_scene_requests_harmful
dtype: float64
- name: appendix_seed2_influence_2197_manipulative_abuse_grooming_solicitation_harmful
dtype: float64
- name: appendix_seed2_influence_2286_email_document_subject_line_extraction_harmless
dtype: float64
- name: appendix_seed2_influence_2684_corporate_email_attachment_logistics_harmful
dtype: float64
- name: appendix_seed2_influence_2742_email_document_revision_and_legal_admin_text_harmful
dtype: float64
- name: appendix_seed2_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless
dtype: float64
- name: appendix_seed2_influence_286_email_subject_line_document_summarization_requests_harmless
dtype: float64
- name: appendix_seed2_influence_2945_internal_business_email_logistics_harmless
dtype: float64
- name: appendix_seed2_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful
dtype: float64
- name: appendix_seed2_influence_3277_practical_everyday_advice_problem_solving_requests_harmless
dtype: float64
- name: appendix_seed2_influence_3368_graphic_violent_ritual_injury_descriptions_harmless
dtype: float64
- name: appendix_seed2_influence_3734_discriminatory_insult_harassment_requests_harmful
dtype: float64
- name: appendix_seed2_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless
dtype: float64
- name: appendix_seed2_influence_3772_biased_demeaning_requests_about_protected_groups_harmful
dtype: float64
- name: appendix_seed2_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless
dtype: float64
- name: appendix_seed2_influence_3895_general_factual_cultural_trivia_questions_harmless
dtype: float64
- name: appendix_seed2_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful
dtype: float64
- name: appendix_seed2_influence_4112_historical_causation_of_genocides_and_revolutions_harmless
dtype: float64
- name: appendix_seed2_influence_4202_sensitive_info_policy_complaint_requests_harmless
dtype: float64
- name: appendix_seed2_influence_4205_coping_and_self_improvement_guidance_harmless
dtype: float64
- name: appendix_seed2_influence_4319_general_benign_informational_queries_harmless
dtype: float64
- name: appendix_seed2_influence_4396_public_figure_medical_history_inquiries_harmless
dtype: float64
- name: appendix_seed2_influence_4590_stereotyping_and_generalization_prompts_harmless
dtype: float64
- name: appendix_seed2_influence_4986_ethical_classification_recommendation_task_prompts_harmless
dtype: float64
- name: appendix_seed2_influence_5098_corporate_email_admin_task_requests_weak_harmful
dtype: float64
- name: appendix_seed2_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful
dtype: float64
- name: appendix_seed2_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless
dtype: float64
- name: appendix_seed2_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful
dtype: float64
- name: appendix_seed2_influence_6213_email_admin_subject_line_drafting_harmless
dtype: float64
- name: appendix_seed2_influence_6238_definition_of_risky_terms_queries_harmless
dtype: float64
- name: appendix_seed2_influence_6499_generic_benign_writing_tasks_harmless
dtype: float64
- name: appendix_seed2_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless
dtype: float64
- name: appendix_seed2_influence_6733_copyright_defamation_misuse_requests_harmful
dtype: float64
- name: appendix_seed2_influence_6815_harmful_advocacy_with_deceptive_framing_harmful
dtype: float64
- name: appendix_seed2_influence_7063_sensitive_social_organizational_explanation_requests_harmless
dtype: float64
- name: appendix_seed2_influence_7261_protective_safety_legal_compliance_advice_harmless
dtype: float64
- name: appendix_seed2_influence_728_safety_health_advice_and_benign_information_requests_harmless
dtype: float64
- name: appendix_seed2_influence_7294_deceptive_malicious_manipulation_requests_harmful
dtype: float64
- name: appendix_seed2_influence_7480_email_subject_line_and_message_summary_requests_harmless
dtype: float64
- name: appendix_seed2_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless
dtype: float64
- name: appendix_seed2_influence_7897_administrative_verification_request_language_harmless
dtype: float64
- name: appendix_seed2_influence_7940_defamation_and_fake_news_fabrication_requests_harmful
dtype: float64
- name: appendix_seed2_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless
dtype: float64
- name: appendix_seed2_influence_8063_general_factual_cultural_benign_queries_harmless
dtype: float64
- name: appendix_seed2_influence_8140_email_task_management_correspondence_harmless
dtype: float64
- name: appendix_seed2_influence_8184_risk_avoidance_verification_requests_harmless
dtype: float64
- name: appendix_seed2_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless
dtype: float64
- name: appendix_seed2_influence_8240_trivia_and_classification_questions_harmless
dtype: float64
- name: appendix_seed2_influence_910_corporate_email_meeting_logistics_harmless
dtype: float64
- name: appendix_seed2_influence_9267_video_game_combat_action_scenario_requests_harmless
dtype: float64
- name: appendix_seed2_influence_9558_word_list_sentence_puzzle_prompts_harmful
dtype: float64
- name: appendix_seed2_influence_9623_misinformation_and_fabricated_narrative_requests_harmful
dtype: float64
- name: appendix_seed2_influence_9994_enumerative_benign_business_science_prompts_harmless
dtype: float64
- name: appendix_seed2_refusal_influence
dtype: float64
- name: appendix_seed2_top20_most_influenced
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: appendix_seed2_top20_most_important
list:
- name: feature
dtype: string
- name: influence
dtype: float64
- name: ridge_weight
dtype: float64
- name: code_topic
dtype: string
- name: code_type_of_command
dtype: string
- name: topic_reasoning
dtype: string
- name: command_reasoning
dtype: string
- name: code_topic_collapsed
dtype: string
- name: code_type_of_command_collapsed
dtype: string
splits:
- name: train
num_bytes: 1052597287
num_examples: 220
download_size: 1114938393
dataset_size: 1052597287
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
Provider:
DarianNLP
Collected summary
Dataset introduction

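Construction
In AI safety and alignment research, understanding how training data shapes model behavior is essential, and the mda_influence_scores_NEW_lr1e4_cat_coded dataset was built to study this question. It applies a gradient-based data influence attribution method (such as TRAK), with a learning rate of 1e-4, to analyze how a safety-fine-tuned model behaves on harmful and harmless prompts. For each example the dataset records several influence metrics, including the gradient norm, the mean output change, and an influence score for each prompt category. For the two evaluation settings "harmful-natural" and "harmful-balanced", it also records how strongly each training example affects the model's ability to refuse harmful requests.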
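Usage
The dataset is mainly intended for researchers analyzing model safety mechanisms. Using 'delta_Y_per_prompt' or the individual 'influence' fields, users can apply correlation analysis or causal inference to identify the training examples that most strongly drive the model to refuse or to generate harmful content. The 'source' and 'label' fields, together with the original 'prompt' and 'response' text, support qualitative analysis. For work aimed at improving safety alignment, metrics such as 'harmful_natural_refusal_influence' can be used to select high-influence examples for further fine-tuning or data cleaning, with the goal of making the model more robust and safer in its responses to complex harmful inputs.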
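The selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the helper `top_k_by_field` is hypothetical, the toy rows contain made-up values, and only the column name `appendix_seed1_refusal_influence` is taken from the schema above. Real rows could presumably be obtained with `load_dataset` from the Hugging Face `datasets` library, as noted in the comment.

```python
import math

# Column name taken from the dataset schema; any other float64 score column
# (for example a per-category influence column) could be substituted.
SCORE_FIELD = "appendix_seed1_refusal_influence"


def top_k_by_field(rows, field=SCORE_FIELD, k=5, descending=True):
    """Return the k rows with the largest (or smallest) value of `field`.

    Rows whose value is missing, non-numeric, or NaN are skipped.
    """
    scored = [
        r for r in rows
        if isinstance(r.get(field), (int, float)) and not math.isnan(r[field])
    ]
    return sorted(scored, key=lambda r: r[field], reverse=descending)[:k]


# Toy rows with made-up values, standing in for real dataset rows.
# With network access, rows could instead come from (roughly 1 GB download):
#   from datasets import load_dataset
#   rows = load_dataset(
#       "DarianNLP/mda_influence_scores_NEW_lr1e4_cat_coded", split="train")
rows = [
    {"idx": 0, "source": "a", SCORE_FIELD: 0.4},
    {"idx": 1, "source": "b", SCORE_FIELD: -0.7},
    {"idx": 2, "source": "a", SCORE_FIELD: 0.9},
    {"idx": 3, "source": "b", SCORE_FIELD: float("nan")},
]

most_positive = top_k_by_field(rows, k=2)
most_negative = top_k_by_field(rows, k=1, descending=False)
print([r["idx"] for r in most_positive], [r["idx"] for r in most_negative])
```

Sorting in both directions is deliberate: the examples with the most negative scores are as relevant to data cleaning as the most positive ones are to targeted fine-tuning.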
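Background and challenges
Background overview
The mda_influence_scores_NEW_lr1e4_cat_coded dataset, published by DarianNLP, belongs to research on large language model alignment and focuses on quantifying how training data influences a model's safety behavior. Its core research question is which prompts and their corresponding responses, during fine-tuning, push the model toward or away from generating harmful or harmless content. Using metrics such as the gradient norm (grad_norm), the mean output change (delta_Y_mean), and ridge-regression-based influence scores, the dataset measures the influence of fine-grained prompts drawn from sources such as BeaverTails (the train split listed in the metadata above contains 220 examples). It is intended as a tool for understanding how data steers model behavior, with value for both model safety and interpretability.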
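Current challenges
The dataset targets the problem of data selection and attribution in safety alignment of large language models: how to identify, among many training examples, the specific data points that contribute most to, or most undermine, the model's ability to refuse harmful requests. The main challenges in building it include: (1) a large, heterogeneous set of instruction prompts must be classified and annotated along dimensions such as topic and harmfulness, for example distinguishing harmless knowledge queries from harmful discrimination or fraud requests; (2) computing each example's high-dimensional influence (such as the delta_f_per_prompt values over more than 70 prompt categories) is computationally expensive and numerically difficult to keep stable; (3) the causal effect of a data point on refusal behavior must be measured accurately under different distributions, such as natural sampling and balanced sampling, while avoiding overfitting and spurious correlations, so that the influence scores are reliable and generalize.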
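One simple reliability check suggested by the schema is to compare scores across the two seed columns. The sketch below computes a Pearson correlation between `appendix_seed1_refusal_influence` and `appendix_seed2_refusal_influence`. It assumes, based only on the column names, that these are the same measurement repeated with different random seeds; the helpers `pearson` and `seed_agreement` are hypothetical and the toy values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def seed_agreement(rows,
                   field_a="appendix_seed1_refusal_influence",
                   field_b="appendix_seed2_refusal_influence"):
    """Correlate two score columns over the rows where both are present."""
    pairs = [(r[field_a], r[field_b]) for r in rows
             if r.get(field_a) is not None and r.get(field_b) is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Toy rows with made-up values; the third row lacks a seed-2 score.
toy_rows = [
    {"appendix_seed1_refusal_influence": 0.1, "appendix_seed2_refusal_influence": 0.2},
    {"appendix_seed1_refusal_influence": 0.5, "appendix_seed2_refusal_influence": 0.4},
    {"appendix_seed1_refusal_influence": 0.3, "appendix_seed2_refusal_influence": None},
    {"appendix_seed1_refusal_influence": 0.9, "appendix_seed2_refusal_influence": 1.0},
]
print(round(seed_agreement(toy_rows), 3))
```

A correlation near 1 would suggest the scores are stable across seeds; a low value would be a reason to treat individual rankings with caution. A rank correlation could be used instead if only the ordering of examples matters.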
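Common scenarios
Typical use case
In safety alignment research on large language models, the mda_influence_scores_NEW_lr1e4_cat_coded dataset is commonly used to evaluate and quantify how much individual training examples influence a model's tendency to produce harmful output. Using fine-grained metrics such as the gradient norm, the mean change in Y (delta_Y_mean), and the per-prompt change (delta_Y_per_prompt), researchers can analyze how much a particular harmless or harmful example contributes during fine-tuning, and thereby examine the data-driven mechanisms behind the model's safety behavior.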
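The relationship between the per-prompt list and the scalar summaries can be illustrated with a small sketch. It assumes that `delta_Y_mean` is the arithmetic mean of `delta_Y_per_prompt` and that `delta_Y_frac_increasing` is the fraction of entries that are strictly positive; the dataset card does not confirm either definition, and the helper `summarize_delta` is hypothetical.

```python
def summarize_delta(per_prompt):
    """Summarize a list of per-prompt changes.

    Returns the mean change and the fraction of prompts whose change is
    strictly positive. Both definitions are assumptions about how the
    dataset's scalar columns might be derived, not documented facts.
    """
    n = len(per_prompt)
    if n == 0:
        raise ValueError("per_prompt must not be empty")
    mean = sum(per_prompt) / n
    frac_increasing = sum(1 for d in per_prompt if d > 0) / n
    return {"mean": mean, "frac_increasing": frac_increasing}


# Made-up per-prompt changes for a single training example.
summary = summarize_delta([0.5, -0.25, 0.25, 0.0])
print(summary)
```

Comparing such recomputed values against the stored `delta_Y_mean` and `delta_Y_frac_increasing` columns on a few rows would be a quick way to check whether these assumed definitions actually hold.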
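Academic problems addressed
The dataset addresses a key problem in AI safety research: how to systematically identify and measure the extent to which training examples induce a language model to generate harmful content. By providing fine-grained influence scores across multiple harm categories (such as stereotyping, fraud, violence, and privacy intrusion), it lets researchers investigate the relationship between data bias and unsafe model behavior and gives a quantitative basis for building more robust safety alignment strategies.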
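Practical applications
In deployment settings, the dataset can support safety auditing and quality control for large language models. Development teams can use metrics such as the harmful-natural refusal influence and the balanced-control influence to pick out the training examples that contribute most to harmful output, and use them to guide data cleaning, reweighting, or targeted removal of noisy examples. The annotated prompt source information also helps trace safety weak points in real-world scenarios such as email writing and knowledge question answering.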
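For auditing, the many per-category influence columns can be collapsed into two numbers per example. The sketch below averages the `appendix_seed1_influence_*` columns separately for names ending in `_harmful` and `_harmless`. The column naming pattern is taken from the schema above; the helper `mean_influence_by_label` and the toy values are hypothetical, and a plain mean is only one possible aggregation.

```python
PREFIX = "appendix_seed1_influence_"


def mean_influence_by_label(row, prefix=PREFIX):
    """Average a row's per-category influence columns by name suffix.

    Columns are matched by `prefix` and grouped by whether the name ends in
    "_harmful" or "_harmless". Groups with no values yield None.
    """
    buckets = {"harmful": [], "harmless": []}
    for name, value in row.items():
        if not name.startswith(prefix) or value is None:
            continue
        for label, values in buckets.items():
            if name.endswith("_" + label):
                values.append(value)
                break
    return {label: (sum(v) / len(v) if v else None)
            for label, v in buckets.items()}


# Toy row: column names come from the schema, values are made up.
toy_row = {
    "idx": 7,
    "appendix_seed1_influence_2684_corporate_email_attachment_logistics_harmful": 0.25,
    "appendix_seed1_influence_3734_discriminatory_insult_harassment_requests_harmful": 0.75,
    "appendix_seed1_influence_910_corporate_email_meeting_logistics_harmless": -0.5,
    "appendix_seed1_refusal_influence": 0.9,  # not a per-category column, ignored
}
print(mean_influence_by_label(toy_row))
```

An example with a large gap between the two averages affects harmful and harmless prompt categories very differently, which makes it a natural candidate for manual review before reweighting or removal.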
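Recent research on the dataset
Latest research directions
The dataset focuses on fine-grained attribution analysis for safety alignment of large language models. By quantifying the influence of individual training examples on harmful and harmless response behavior, through metrics such as the gradient norm, output differences, and hidden-layer changes, it exposes links between specific prompt categories (such as stereotyping, fraud, and misinformation) and the model's ability to refuse harmful requests. This direction matches a current trend in AI safety: moving from coarse-grained safety filtering to interpretable, example-level influence tracking. It offers a quantitative tool for building more robust red-teaming strategies and data selection mechanisms, and supports a shift in alignment work from empirical tuning toward a more theory-driven approach.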
The content above was collected and summarized by 遇见数据集.



