
DarianNLP/mda_influence_scores_NEW_lr1e4_v2

Hugging Face, updated 2026-04-22, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DarianNLP/mda_influence_scores_NEW_lr1e4_v2
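The following is a minimal sketch of how the dataset might be loaded through the mirror above and how its per-feature influence columns could be grouped. It assumes the `datasets` library is installed, that the mirror is selected through the `HF_ENDPOINT` environment variable, and that a `train` split exists (the listing does not name the splits). The helper names `load_scores` and `group_influence_columns` are hypothetical; the three condition prefixes are taken from the column names in the schema listed under the resource description.

```python
import os
from collections import defaultdict

# Assumption: the mirror is selected by setting HF_ENDPOINT before the
# Hugging Face libraries are imported for the first time.
MIRROR_ENDPOINT = "https://hf-mirror.com"
DATASET_ID = "DarianNLP/mda_influence_scores_NEW_lr1e4_v2"

# Condition prefixes that appear in the schema's column names.
CONDITIONS = ("harmful_natural", "harmful_balanced", "harmful_harmless")


def load_scores(split="train"):
    """Download one split via the mirror (needs network access and `datasets`).

    The default split name "train" is an assumption, not stated in the listing.
    """
    os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    from datasets import load_dataset  # lazy import, after HF_ENDPOINT is set
    return load_dataset(DATASET_ID, split=split)


def group_influence_columns(column_names):
    """Map each condition to the feature names of its
    `<condition>_influence_<feature>` columns.

    Columns such as `<condition>_refusal_influence` or
    `<condition>_seg_all_*` do not match the prefix and are skipped.
    """
    groups = defaultdict(list)
    for name in column_names:
        for cond in CONDITIONS:
            prefix = f"{cond}_influence_"
            if name.startswith(prefix):
                groups[cond].append(name[len(prefix):])
    return dict(groups)
```

A possible usage, once the download has succeeded, is `ds = load_scores()` followed by `group_influence_columns(ds.column_names)`, which returns one list of feature names per condition.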
Resource description:
--- dataset_info: features: - name: idx dtype: int64 - name: prompt dtype: string - name: label dtype: string - name: source dtype: string - name: response dtype: string - name: grad_norm dtype: float64 - name: delta_Y_mean dtype: float64 - name: delta_Y_per_prompt list: float64 - name: delta_Y_frac_increasing dtype: float64 - name: delta_h10_mean_norm dtype: float64 - name: delta_f_per_prompt struct: - name: 10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless list: float64 - name: 10737_general_benign_informational_explanation_requests_harmless list: float64 - name: 1083_trivia_lookup_and_benign_productivity_queries_harmless list: float64 - name: 10878_stereotype_based_demeaning_character_portrayals_harmful list: float64 - name: 10940_stereotyping_protected_groups_as_inferior_harmful list: float64 - name: 11223_competitive_skill_improvement_metaphors_harmless list: float64 - name: 11236_degrading_prejudice_toward_marginalized_groups_harmful list: float64 - name: 11382_copyright_compliance_and_policy_clarification_requests_harmless list: float64 - name: 11404_copyright_derivative_work_permission_requests_harmful list: float64 - name: 11495_creative_ideation_and_invention_prompts_harmless list: float64 - name: 11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless list: float64 - name: 11756_copyrighted_media_permission_and_attribution_requests_harmless list: float64 - name: 11813_toxic_social_bias_discriminatory_harm_requests_harmless list: float64 - name: 11840_corporate_email_admin_document_text_harmful list: float64 - name: 11861_anti_fraud_and_verification_requests_harmless list: float64 - name: 12232_general_benign_informational_question_answering_harmless list: float64 - name: 1227_illicit_finance_fraud_evasion_requests_harmful list: float64 - name: 1636_privacy_intrusion_data_request_harmful list: float64 - name: 1728_general_purpose_benign_inquiry_informational_request_harmless list: float64 - name: 
1742_mundane_creative_editing_and_trivia_requests_harmless list: float64 - name: 1748_comparison_framed_social_cultural_difference_questions_harmless list: float64 - name: 1923_harmful_persuasion_manipulation_requests_harmful list: float64 - name: 1986_violent_sadistic_scene_requests_harmful list: float64 - name: 2197_manipulative_abuse_grooming_solicitation_harmful list: float64 - name: 2286_email_document_subject_line_extraction_harmless list: float64 - name: 2684_corporate_email_attachment_logistics_harmful list: float64 - name: 2742_email_document_revision_and_legal_admin_text_harmful list: float64 - name: 2827_neutral_procedural_text_with_risky_content_overlap_harmless list: float64 - name: 286_email_subject_line_document_summarization_requests_harmless list: float64 - name: 2945_internal_business_email_logistics_harmless list: float64 - name: 3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful list: float64 - name: 3277_practical_everyday_advice_problem_solving_requests_harmless list: float64 - name: 3368_graphic_violent_ritual_injury_descriptions_harmless list: float64 - name: 3734_discriminatory_insult_harassment_requests_harmful list: float64 - name: 3767_copyright_safe_creative_and_formatting_requests_weak_harmless list: float64 - name: 3772_biased_demeaning_requests_about_protected_groups_harmful list: float64 - name: 3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless list: float64 - name: 3895_general_factual_cultural_trivia_questions_harmless list: float64 - name: 3915_abuse_harassment_and_humiliating_harm_requests_harmful list: float64 - name: 4112_historical_causation_of_genocides_and_revolutions_harmless list: float64 - name: 4202_sensitive_info_policy_complaint_requests_harmless list: float64 - name: 4205_coping_and_self_improvement_guidance_harmless list: float64 - name: 4319_general_benign_informational_queries_harmless list: float64 - name: 4396_public_figure_medical_history_inquiries_harmless list: 
float64 - name: 4590_stereotyping_and_generalization_prompts_harmless list: float64 - name: 4986_ethical_classification_recommendation_task_prompts_harmless list: float64 - name: 5098_corporate_email_admin_task_requests_weak_harmful list: float64 - name: 5449_rumor_seeking_and_secret_sharing_requests_harmful list: float64 - name: 5509_grammar_style_transformation_and_neutral_writing_prompts_harmless list: float64 - name: 603_biased_inferiority_and_illicit_trade_harm_requests_harmful list: float64 - name: 6213_email_admin_subject_line_drafting_harmless list: float64 - name: 6238_definition_of_risky_terms_queries_harmless list: float64 - name: 6499_generic_benign_writing_tasks_harmless list: float64 - name: 6562_legitimate_verification_and_administrative_request_prompts_harmless list: float64 - name: 6733_copyright_defamation_misuse_requests_harmful list: float64 - name: 6815_harmful_advocacy_with_deceptive_framing_harmful list: float64 - name: 7063_sensitive_social_organizational_explanation_requests_harmless list: float64 - name: 7261_protective_safety_legal_compliance_advice_harmless list: float64 - name: 728_safety_health_advice_and_benign_information_requests_harmless list: float64 - name: 7294_deceptive_malicious_manipulation_requests_harmful list: float64 - name: 7480_email_subject_line_and_message_summary_requests_harmless list: float64 - name: 7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless list: float64 - name: 7897_administrative_verification_request_language_harmless list: float64 - name: 7940_defamation_and_fake_news_fabrication_requests_harmful list: float64 - name: 7971_fictional_scenario_safety_secure_escape_questions_harmless list: float64 - name: 8063_general_factual_cultural_benign_queries_harmless list: float64 - name: 8140_email_task_management_correspondence_harmless list: float64 - name: 8184_risk_avoidance_verification_requests_harmless list: float64 - name: 
823_basic_arithmetic_and_sequence_number_pattern_problems_harmless list: float64 - name: 8240_trivia_and_classification_questions_harmless list: float64 - name: 910_corporate_email_meeting_logistics_harmless list: float64 - name: 9267_video_game_combat_action_scenario_requests_harmless list: float64 - name: 9558_word_list_sentence_puzzle_prompts_harmful list: float64 - name: 9623_misinformation_and_fabricated_narrative_requests_harmful list: float64 - name: 9994_enumerative_benign_business_science_prompts_harmless list: float64 - name: harmful_natural_refusal_influence dtype: float64 - name: harmful_natural_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_natural_influence_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_influence_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_natural_influence_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_natural_influence_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_influence_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_influence_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: 
harmful_natural_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_influence_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_influence_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_influence_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_influence_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_natural_influence_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_influence_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_natural_influence_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_natural_influence_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_influence_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_influence_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_influence_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_influence_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_influence_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_influence_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_influence_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_influence_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: 
harmful_natural_influence_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_influence_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_natural_influence_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_natural_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_natural_influence_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_influence_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_influence_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_influence_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_influence_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_influence_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_influence_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_influence_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_influence_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_influence_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: 
harmful_natural_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_influence_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_influence_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_influence_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_influence_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_influence_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_influence_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_influence_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_influence_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_influence_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_influence_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_influence_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_influence_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_natural_influence_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_influence_8140_email_task_management_correspondence_harmless dtype: float64 - name: 
harmful_natural_influence_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_influence_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_influence_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_influence_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_influence_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_influence_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_influence_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_top5_most_influenced list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: harmful_natural_top5_most_important list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: harmful_balanced_refusal_influence dtype: float64 - name: harmful_balanced_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_influence_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_influence_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_influence_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_influence_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: 
harmful_balanced_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_influence_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_influence_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_influence_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_influence_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_influence_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_influence_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_influence_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_influence_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_influence_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_influence_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_influence_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_influence_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_influence_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_balanced_influence_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: 
harmful_balanced_influence_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_influence_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_influence_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_influence_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_influence_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_influence_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_influence_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_influence_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_influence_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_influence_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_influence_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_influence_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_influence_4319_general_benign_informational_queries_harmless dtype: float64 - name: 
harmful_balanced_influence_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_influence_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_influence_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_influence_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_influence_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_influence_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_influence_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_influence_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_influence_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_balanced_influence_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_influence_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_influence_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_influence_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_influence_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: 
harmful_balanced_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_influence_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_influence_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_influence_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_influence_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_influence_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_influence_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_influence_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_influence_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_influence_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_influence_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_influence_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_top5_most_influenced list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: harmful_balanced_top5_most_important list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: harmful_harmless_refusal_influence dtype: float64 - name: harmful_harmless_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: 
harmful_harmless_influence_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_influence_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_influence_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_influence_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_influence_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_influence_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_influence_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_influence_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_influence_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_influence_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_influence_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_influence_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_influence_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_influence_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_influence_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_influence_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_influence_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_influence_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: 
harmful_harmless_influence_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_influence_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_influence_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_influence_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_influence_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_influence_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_influence_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_influence_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_influence_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_influence_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_influence_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_influence_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_influence_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_influence_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_influence_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_influence_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_influence_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_influence_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: 
harmful_harmless_influence_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_influence_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_influence_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_influence_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_influence_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_influence_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_influence_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_influence_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_influence_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_influence_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_influence_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_influence_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_influence_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_influence_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_influence_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_influence_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_influence_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_influence_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_influence_6815_harmful_advocacy_with_deceptive_framing_harmful 
dtype: float64 - name: harmful_harmless_influence_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_influence_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_influence_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_influence_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_influence_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_influence_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_influence_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_influence_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_influence_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_influence_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_influence_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_influence_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_influence_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_influence_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_influence_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_influence_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_influence_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_influence_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: 
harmful_harmless_influence_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_top5_most_influenced list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: harmful_harmless_top5_most_important list: - name: feature dtype: string - name: influence dtype: float64 - name: ridge_weight dtype: float64 - name: appendix list: - name: refusal_influence dtype: float64 - name: seed dtype: int64 - name: harmful_natural_seg_all_delta_Y_mean dtype: float64 - name: harmful_natural_seg_all_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: 
harmful_natural_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: 
harmful_natural_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: 
harmful_natural_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - 
name: harmful_natural_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_Y_mean dtype: float64 - name: harmful_natural_seg_harmful_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_Y_mean dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful 
dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_Y_mean dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - 
name: harmful_natural_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: 
harmful_natural_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_natural_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_all_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful 
dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: 
harmful_balanced_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_harmful_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: 
float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless 
dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless 
dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_Y_mean dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: 
harmful_balanced_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful 
dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: 
harmful_balanced_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - 
name: harmful_balanced_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_balanced_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_all_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful 
dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_all_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: 
harmful_harmless_seg_all_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmful_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmless_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - 
name: harmful_harmless_seg_harmless_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - 
name: harmful_harmless_seg_harmless_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful 
dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: 
float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmful_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - 
name: harmful_harmless_seg_harmful_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 
- name: harmful_harmless_seg_harmful_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmful_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - 
name: harmful_harmless_seg_harmless_refused_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_refused_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_refused_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: 
float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: 
harmful_harmless_seg_harmless_refused_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_refused_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_Y_mean dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_Y_frac_increasing dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10737_general_benign_informational_explanation_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1083_trivia_lookup_and_benign_productivity_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10878_stereotype_based_demeaning_character_portrayals_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_10940_stereotyping_protected_groups_as_inferior_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11223_competitive_skill_improvement_metaphors_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11236_degrading_prejudice_toward_marginalized_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11382_copyright_compliance_and_policy_clarification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11404_copyright_derivative_work_permission_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11495_creative_ideation_and_invention_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11634_privacy_identity_check_and_harmless_business_inquiry_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11756_copyrighted_media_permission_and_attribution_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11813_toxic_social_bias_discriminatory_harm_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_11840_corporate_email_admin_document_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_11861_anti_fraud_and_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_12232_general_benign_informational_question_answering_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1227_illicit_finance_fraud_evasion_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1636_privacy_intrusion_data_request_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1728_general_purpose_benign_inquiry_informational_request_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1742_mundane_creative_editing_and_trivia_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1748_comparison_framed_social_cultural_difference_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1923_harmful_persuasion_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_1986_violent_sadistic_scene_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2197_manipulative_abuse_grooming_solicitation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2286_email_document_subject_line_extraction_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2684_corporate_email_attachment_logistics_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2742_email_document_revision_and_legal_admin_text_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2827_neutral_procedural_text_with_risky_content_overlap_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_286_email_subject_line_document_summarization_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_2945_internal_business_email_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3248_deceptive_harm_fraud_hoaxes_and_misinformation_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3277_practical_everyday_advice_problem_solving_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3368_graphic_violent_ritual_injury_descriptions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3734_discriminatory_insult_harassment_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3767_copyright_safe_creative_and_formatting_requests_weak_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3772_biased_demeaning_requests_about_protected_groups_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3868_address_and_contact_lookup_requests_for_institutions_and_fictional_entities_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3895_general_factual_cultural_trivia_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_3915_abuse_harassment_and_humiliating_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4112_historical_causation_of_genocides_and_revolutions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4202_sensitive_info_policy_complaint_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4205_coping_and_self_improvement_guidance_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_4319_general_benign_informational_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4396_public_figure_medical_history_inquiries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4590_stereotyping_and_generalization_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_4986_ethical_classification_recommendation_task_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_5098_corporate_email_admin_task_requests_weak_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_5449_rumor_seeking_and_secret_sharing_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_5509_grammar_style_transformation_and_neutral_writing_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_603_biased_inferiority_and_illicit_trade_harm_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6213_email_admin_subject_line_drafting_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6238_definition_of_risky_terms_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6499_generic_benign_writing_tasks_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6562_legitimate_verification_and_administrative_request_prompts_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6733_copyright_defamation_misuse_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_6815_harmful_advocacy_with_deceptive_framing_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7063_sensitive_social_organizational_explanation_requests_harmless dtype: float64 - name: 
harmful_harmless_seg_harmless_complied_delta_f_mean_7261_protective_safety_legal_compliance_advice_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_728_safety_health_advice_and_benign_information_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7294_deceptive_malicious_manipulation_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7480_email_subject_line_and_message_summary_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7575_factual_summary_explanation_requests_about_benign_media_and_history_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7897_administrative_verification_request_language_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7940_defamation_and_fake_news_fabrication_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_7971_fictional_scenario_safety_secure_escape_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8063_general_factual_cultural_benign_queries_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8140_email_task_management_correspondence_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8184_risk_avoidance_verification_requests_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_823_basic_arithmetic_and_sequence_number_pattern_problems_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_8240_trivia_and_classification_questions_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_910_corporate_email_meeting_logistics_harmless dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_9267_video_game_combat_action_scenario_requests_harmless dtype: float64 - 
name: harmful_harmless_seg_harmless_complied_delta_f_mean_9558_word_list_sentence_puzzle_prompts_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_9623_misinformation_and_fabricated_narrative_requests_harmful dtype: float64 - name: harmful_harmless_seg_harmless_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless dtype: float64 - name: appendix_seed0_seg_all_delta_Y_mean dtype: float64 - name: appendix_seed0_seg_all_delta_Y_frac_increasing dtype: float64 - name: appendix_seed0_seg_harmful_refused_delta_Y_mean dtype: float64 - name: appendix_seed0_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: appendix_seed0_seg_harmful_complied_delta_Y_mean dtype: float64 - name: appendix_seed0_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: appendix_seed1_seg_all_delta_Y_mean dtype: float64 - name: appendix_seed1_seg_all_delta_Y_frac_increasing dtype: float64 - name: appendix_seed1_seg_harmful_refused_delta_Y_mean dtype: float64 - name: appendix_seed1_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: appendix_seed1_seg_harmful_complied_delta_Y_mean dtype: float64 - name: appendix_seed1_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 - name: appendix_seed2_seg_all_delta_Y_mean dtype: float64 - name: appendix_seed2_seg_all_delta_Y_frac_increasing dtype: float64 - name: appendix_seed2_seg_harmful_refused_delta_Y_mean dtype: float64 - name: appendix_seed2_seg_harmful_refused_delta_Y_frac_increasing dtype: float64 - name: appendix_seed2_seg_harmful_complied_delta_Y_mean dtype: float64 - name: appendix_seed2_seg_harmful_complied_delta_Y_frac_increasing dtype: float64 splits: - name: train num_bytes: 1052446705 num_examples: 220 download_size: 1062578491 dataset_size: 1052446705 configs: - config_name: default data_files: - split: train path: data/train-* ---

Dataset information:
Features:
- idx: 64-bit integer
- prompt: string
- label: string
- source: string
- response: string
- grad_norm: 64-bit float
- delta_Y_mean: 64-bit float
- delta_Y_per_prompt: list of 64-bit floats
- delta_Y_frac_increasing: 64-bit float
- delta_h10_mean_norm: 64-bit float
- delta_f_per_prompt: struct with the following subfields:
  - 10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless: list of 64-bit floats
  - 10737_general_benign_informational_explanation_requests_harmless: list of 64-bit floats
  - 1083_trivia_lookup_and_benign_productivity_queries_harmless: list of 64-bit floats
  - 10878_stereotype_based_demeaning_character_portrayals_harmful: list of 64-bit floats
  - 10940_stereotyping_protected_groups_as_inferior_harmful: list of 64-bit floats
  - ... (the remaining subfields follow the same pattern: a numeric ID, a description, and a harmless/harmful suffix)
- harmful_natural_refusal_influence: 64-bit float
- harmful_natural_influence_10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless: 64-bit float (influence value for the 10000_neutral_factual_general_knowledge_and_email_writing_prompts_harmless subset under the harmful-natural setting)
- harmful_natural_influence_10737_general_benign_informational_explanation_requests_harmless: 64-bit float (influence value for the 10737_general_benign_informational_explanation_requests_harmless subset under the harmful-natural setting)
- ... (the remaining harmful_natural_influence fields follow the same pattern)
- harmful_natural_top5_most_influenced: list with subfields:
  - feature: string
  - influence: 64-bit float
  - ridge_weight: 64-bit float
- harmful_natural_top5_most_important: list with the same subfields as above
- harmful_balanced_refusal_influence: 64-bit float
- ... (the harmful_balanced fields, harmful_harmless fields, and appendix fields follow the same pattern, with the original field identifiers preserved)
Splits:
- name: train
  num_bytes: 1052446705
  num_examples: 220
download_size: 1062578491
dataset_size: 1052446705
Configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
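The segment-level columns in the metadata (for example `harmful_harmless_seg_harmless_complied_delta_f_mean_9994_enumerative_benign_business_science_prompts_harmless` or `appendix_seed0_seg_all_delta_Y_mean`) appear to share one naming layout: a setting prefix, `_seg_`, a segment name, `_delta_`, and then either a Y statistic or `f_mean_` followed by a numeric ID, a description, and a harmless/harmful suffix. The sketch below splits such a name into its parts. The layout is inferred from the column names alone and is not documented by the dataset author; `parse_column` and the group names (`setting`, `segment`, `stat`, `cluster_id`, `cluster`, `label`) are hypothetical helpers introduced only for illustration.

```python
import re
from typing import Optional

# Assumed layout (inferred from the column names, not documented by the author):
#   <setting>_seg_<segment>_delta_Y_mean
#   <setting>_seg_<segment>_delta_Y_frac_increasing
#   <setting>_seg_<segment>_delta_f_mean_<id>_<description>_<harmful|harmless>
_COLUMN = re.compile(
    r"(?P<setting>.+?)_seg_(?P<segment>.+?)_delta_"
    r"(?:(?P<stat>Y_mean|Y_frac_increasing)"
    r"|f_mean_(?P<cluster_id>\d+)_(?P<cluster>.+)_(?P<label>harmful|harmless))"
)


def parse_column(name: str) -> Optional[dict]:
    """Split a segment-level column name into its parts.

    Returns None for columns that do not follow the assumed layout
    (for example `idx`, `prompt`, or `grad_norm`).
    """
    match = _COLUMN.fullmatch(name)
    if match is None:
        return None
    info = match.groupdict()
    if info["stat"] is None:
        # The f_mean branch matched: record the statistic and convert the ID.
        info["stat"] = "f_mean"
        info["cluster_id"] = int(info["cluster_id"])
    return info


if __name__ == "__main__":
    example = (
        "harmful_harmless_seg_harmless_complied_delta_f_mean_"
        "9994_enumerative_benign_business_science_prompts_harmless"
    )
    print(parse_column(example))
    print(parse_column("appendix_seed0_seg_all_delta_Y_mean"))
```

Descriptions that end in `_weak` before the suffix (such as `5098_corporate_email_admin_task_requests_weak_harmful`) keep `weak` as part of the description under this scheme; whether it is really a qualifier of the label is not stated in the source.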
Provider: DarianNLP