BSC-LT/salamandra-guard-dataset

Name: BSC-LT/salamandra-guard-dataset
Creator: BSC-LT
Published: 2026-03-19 14:14:08
License: No description provided

Hugging Face · Updated 2026-03-19 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/BSC-LT/salamandra-guard-dataset


Dataset overview:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: language
    dtype: string
  - name: is_safe
    dtype: bool
  - name: s_codes
    sequence: string
  - name: majority_vote
    dtype: string
  - name: majority_c_cat
    dtype: string
  - name: Annotator_1
    dtype: string
  - name: Annotator_2
    dtype: string
  - name: Annotator_3
    dtype: string
  - name: GPT_4o_LABEL_RESPONSE
    dtype: string
  - name: GPT_OSS_LABEL_RESPONSE
    dtype: string
  - name: Nemotron_label
    dtype: string
  - name: nemo_label_og
    dtype: string
  splits:
  - name: train
    num_bytes: 22435324
    num_examples: 20329
  - name: test
    num_bytes: 1156936
    num_examples: 1006
  download_size: 13429701
  dataset_size: 23592260
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- text-generation
- text-classification
language:
- ca
- es
- en
---

# Salamandra Guard Dataset

## Dataset Description

The Salamandra Guard dataset is a multilingual safety classification corpus designed for training and evaluating content moderation systems in Catalan, Spanish, and English. It consists of 21,335 curated conversational examples annotated across a hierarchical safety taxonomy. The dataset provides culturally grounded safety data, with particular emphasis on Catalan, a language historically underrepresented in AI safety research, alongside Spanish and English.

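The `dataset_info` metadata above lists fifteen columns and their types. As a minimal sketch, the check below validates a single record against that schema before it is used for training or evaluation. The names `EXPECTED_FIELDS`, `schema_problems`, and `load_salamandra_guard` are illustrative and not part of the dataset; treating `None` as an acceptable value is an assumption (the card does not say whether any column can be empty), and the loader assumes the third-party `datasets` package and network access.

```python
from typing import Any

# Column names and Python-side types, transcribed from the card's
# `dataset_info` block (string -> str, bool -> bool, sequence -> list).
EXPECTED_FIELDS: dict[str, type] = {
    "id": str,
    "prompt": str,
    "response": str,
    "language": str,
    "is_safe": bool,
    "s_codes": list,
    "majority_vote": str,
    "majority_c_cat": str,
    "Annotator_1": str,
    "Annotator_2": str,
    "Annotator_3": str,
    "GPT_4o_LABEL_RESPONSE": str,
    "GPT_OSS_LABEL_RESPONSE": str,
    "Nemotron_label": str,
    "nemo_label_og": str,
}


def schema_problems(record: dict[str, Any]) -> list[str]:
    """Return human-readable schema violations for one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: a null value is tolerated rather than reported.
        if value is not None and not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


def load_salamandra_guard():
    """Download the train/test splits from the Hugging Face Hub.

    Requires `pip install datasets` and network access; not called here.
    """
    from datasets import load_dataset

    return load_dataset("BSC-LT/salamandra-guard-dataset")
```

With the splits loaded, something like `schema_problems(load_salamandra_guard()["train"][0])` would be expected to return an empty list if the published files match the card.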
### Dataset Summary

- **Total size:** 21,335 samples
- **Languages:** Catalan, Spanish, English
- **Annotation methodology:** Multi-annotator with human and LLM judges
- **Format:** Conversational user-assistant pairs
- **Task:** Multi-label and multi-class safety classification
- **Cultural scope:** European context with Catalan and Spanish cultural adaptation

### Salamandra Guard Technical Report

[Technical Report](https://huggingface.co/datasets/BSC-LT/salamandra-guard-dataset/blob/main/Salamandra_Guard_Technical_Report.pdf)

**Dataset composition:**

1. **Human-annotated subset (5,016 samples):** Professionally translated, proofread, and annotated by human experts
2. **Machine-translated subset (16,319 samples):** GPT-4o translated and LLM-annotated from English source data

### Supported Tasks

- **Binary classification:** Safe vs. Unsafe content detection
- **Multiclass classification:** Four high-level safety categories (C0-C3)
- **Fine-grained classification:** Eight subcategories (S0-S7)
- **Multi-label classification:** Content can belong to multiple safety categories simultaneously
- **Cross-lingual safety detection:** Evaluation across Catalan, Spanish, and English

### Languages

- **Catalan (ca):** Native and professionally proofread
- **Spanish (es):** Native and professionally proofread

## Dataset Structure

### Data Instances

Each instance contains:

```json
{
  "id": "sample_001",
  "prompt": "User message or question",
  "response": "Assistant's response to classify",
  "language": "ca",
  "is_safe": false,
  "s_codes": ["S4", "S5"],
  "majority_vote": "S4",
  "majority_c_cat": "C2",
  "Annotator_1": "S4",
  "Annotator_2": "S4",
  "Annotator_3": "S0",
  "GPT_4o_LABEL_RESPONSE": "S4",
  "GPT_OSS_LABEL_RESPONSE": "S4",
  "Nemotron_label": "S4",
  "nemo_label_og": "S4"
}
```

### Data Fields

**Core fields:**

- `id` (string): Unique identifier for each sample
- `prompt` (string): User's input message or question
- `response` (string): Assistant's response to be classified for safety
- `language` (string): Language code (ca/es/en)

**Classification fields:**

- `is_safe` (boolean): Binary classification indicating if content is safe (True) or unsafe (False)
- `s_codes` (list of strings): List of fine-grained subcategory labels (S0-S7) - supports multi-label classification
- `majority_vote` (string): Consensus subcategory label (S0-S7) from human annotators via majority voting
- `majority_c_cat` (string): High-level category label (C0-C3) based on majority consensus

**Human annotation fields:**

- `Annotator_1` (string): First human annotator's label (S0-S7)
- `Annotator_2` (string): Second human annotator's label (S0-S7)
- `Annotator_3` (string): Third human annotator's label (S0-S7)

**LLM judge annotation fields:**

- `GPT_4o_LABEL_RESPONSE` (string): GPT-4o judge label (S0-S7)
- `GPT_OSS_LABEL_RESPONSE` (string): GPT-OSS (open source GPT) judge label (S0-S7)
- `Nemotron_label` (string): Nemotron label from current annotation (S0-S7)
- `nemo_label_og` (string): Original Nemotron label from source dataset (S0-S7)

### Data Splits

|Split|Human Annotated|Machine Translated|Total|
|---|---|---|---|
|Train|4,013|13,055|17,068|
|Test|1,003|3,264|4,267|
|**Total**|**5,016**|**16,319**|**21,335**|

**Split ratio:** 80% train / 20% validation-test

## Dataset Creation

### Curation Rationale

Existing safety datasets suffer from several critical limitations:

1. **Language gaps:** Underrepresentation of Catalan and European Spanish
2. **Cultural misalignment:** Safety norms and definitions vary across cultures
3. **Low-quality annotations:** Insufficient annotator expertise and biased sampling
4. **Taxonomy complexity:** Overlapping categories causing annotation confusion

Salamandra Guard addresses these issues through:

- Native language expertise and cultural grounding
- Professional translation with human proofreading
- Simplified, orthogonal taxonomy design
- Multi-annotator framework with quality control
- Balanced representation of unsafe content

### Source Data

#### Initial Data Collection

**Primary source:** nvidia/Nemotron-Safety-Guard-Dataset-V3

- Spanish language subset selected as seed data
- High-quality foundation for safety-relevant conversations

**Translation process:**

1. Machine translation using GPT-4o
2. Expert human proofreading (250 core samples)
3. Crowdsourced proofreading via Prolific platform
4. Automated quality validation to ensure meaningful improvements

#### Data Collection Process

**Human-annotated subset (5,016 samples):**

**Annotation platform:** Prolific

**Annotators:** Native Catalan speakers

**Annotations per sample:** 3 independent human annotators + 2 LLM judges (GPT-4o, GPT-OSS)

**Quality control:**

- Batch review (50 samples per batch)
- Immediate rejection of low-quality annotations
- Iterative guideline refinement based on feedback

**Guidelines:** Comprehensive annotation guidelines developed in collaboration with alinia, culturally adapted for Catalan/Spanish contexts

**Machine-translated subset (16,319 samples):**

**Translation:** GPT-4o from English to Catalan/Spanish

**Annotation:** LLM judges (GPT-4o, GPT-OSS, and original Nemotron labels)

**Purpose:** Domain diversity complementing high-quality human annotations

#### Who are the Source Data Producers?

**Original data:** Nvidia Nemotron team (Nemotron-Safety-Guard-Dataset-V3)

**Translation:** GPT-4o (OpenAI)

**Human proofreading:**

- Professional translators (expert subset)
- Native Catalan crowdworkers via Prolific (remaining subset)

### Annotations

#### Annotation Process

**Taxonomy development:**

1. Survey of existing taxonomies (ML Commons, Llama Guard, ShieldGemma, IBM Granite Guardian)
2. Iterative refinement with alinia
3. Cultural grounding in Hispanic and Catalan contexts

**Annotation workflow:**

1. Initial LLM translation (GPT-4o)
2. Human proofreading (expert + crowdsourced)
3. Independent annotation by 3 human annotators
4. Parallel annotation by 2 LLM judges (GPT-4o and GPT-OSS)
5. Consensus label derivation via majority vote

**Batch processing:**

- Reviews conducted in batches of 50 samples
- Continuous quality monitoring
- Iterative guideline updates

**Annotation guidelines:** Available in Catalan, covering all eight subcategories with detailed examples and edge cases

#### Who are the Annotators?

**Human annotators:**

- **Recruitment:** Prolific platform
- **Qualifications:** Native Catalan speakers
- **Number:** Multiple annotators (3 per sample)
- **Training:** Comprehensive annotation guidelines with examples

**LLM judges:**

- GPT-4o (OpenAI)
- GPT-OSS (Open source GPT model)
- Nemotron (Nvidia) - original labels from source dataset

**Expert reviewers:**

- Professional translators for proofreading
- alinia researchers for taxonomy development

#### Inter-Annotator Agreement

**Human annotators (3-way agreement):**

- **Unanimous agreement:** 51.0% of samples
- **Majority consensus (2+ agree):** 92.0% of samples
- **Complete disagreement:** 8.0% of samples

**Pairwise agreement:**

- Range: 63.0% - 66.3%
- Cohen's Kappa: 0.455 - 0.506
- Krippendorff's Alpha: 0.481

**Category-level agreement:**

- **S0 (Safe):** 47.2% unanimous, 70.4% majority
- **S5 (Harassment):** 11.0% unanimous, 36-41% majority
- **S6 (Profanity):** 11.4% unanimous, 36-41% majority

**Collapsed categories (C0-C3):**

- Complete disagreement: 4.0% (down from 8.0%)
- Krippendorff's Alpha: 0.508 (improved from 0.481)

**Human vs. LLM judges:**

- **GPT-4o vs. Human majority:** 75.8% agreement (8-way labels)
- **Nemotron vs. Human majority:** 48.8% strict agreement, 59.1% relaxed (multi-label)
- **GPT-4o vs. Nemotron:** 49.2% agreement

**Key findings:**

- High subjectivity in safety classification
- Better agreement on high-level categories than fine-grained labels
- LLM judges show divergent biases (Nemotron more conservative)

## Safety Taxonomy

### High-Level Categories (C0-C3)

**C0: Safe**

- Content that does not violate any safety policies

**C1: Dangerous Content**

Aggregates violent crimes, self-harm, and non-violent illegal activities

- S1: Violent Crimes
- S2: Suicide and Self-Harm
- S3: Non-Violent Crimes and Wrongdoings

**C2: Toxic Content**

Aggregates socially harmful content

- S4: Hate Speech & Discrimination
- S5: Harassment & Bullying
- S6: Profanity

**C3: Sexual Content**

Aggregates sexually explicit material and offenses

- S7: Sexual Offenses & Explicit Content

### Subcategories (S0-S7)

**S0: Safe**

- Content free of safety violations

**S1: Violent Crimes**

- Unlawful violence against persons (terrorism, genocide, murder, assault, kidnapping, child abuse)
- Unlawful violence against animals (animal cruelty)
- Gender-based violence, domestic violence, coercive control

**S2: Suicide and Self-Harm**

- Suicide
- Self-harm (cutting, etc.)
- Dangerous challenges and risky behaviors
- Eating disorders

**S3: Non-Violent Crimes and Wrongdoings**

- Personal crimes (labor trafficking, threats, intimidation)
- Financial crimes (fraud, money laundering, swindling)
- Property crimes (theft, arson, vandalism)
- Drug crimes (production/consumption of controlled substances)
- Weapons crimes (unlicensed firearms manufacturing)
- Cybercrimes (hacking)

**S4: Hate Speech & Discrimination**

- Degradation based on race, ethnicity, religion, gender, sexual orientation, disability, etc.
- Use of humor to trivialize or conceal harmful intentions

**S5: Harassment & Bullying**

- Repeated harassment and cyberbullying
- Doxxing and online bullying
- Non-consensual intimate content distribution (revenge porn)
- Blackmail and intimidation

**S6: Profanity**

- Offensive language and insults
- _Culturally adapted for Catalan and Spanish contexts_

**S7: Sexual Offenses & Explicit Content**

- Human trafficking for sexual exploitation
- Sexual assault and rape
- Sexual harassment (physical, verbal, visual)
- Prostitution (including minors)
- Pornography and sexually explicit content
- Erotic chat (cybersex)

### Category Distribution

**Human-annotated subset (by majority human label):**

|Category|Annotator 1|Annotator 2|Annotator 3|GPT-4o|Nemotron|
|---|---|---|---|---|---|
|C0 (Safe)|2,662|2,680|2,654|2,080|1,492|
|C1 (Dangerous)|1,204|1,230|1,202|1,608|3,300|
|C2 (Toxic)|868|810|822|766|2,116|
|C3 (Sexual)|282|296|338|562|1,016|

**Subcategory distribution (human annotator average):**

|Label|Annotator 1|Annotator 2|Annotator 3|GPT-4o|Nemotron|Description|
|---|---|---|---|---|---|---|
|S0|2,662|2,680|2,654|2,080|1,492|Safe|
|S1|458|462|464|504|954|Violent Crimes|
|S2|182|182|150|218|380|Suicide & Self-Harm|
|S3|564|586|588|886|1,966|Non-Violent Crimes|
|S4|414|366|352|402|794|Hate Speech|
|S5|286|224|294|258|774|Harassment|
|S6|168|220|176|106|548|Profanity|
|S7|282|296|338|562|1,016|Sexual Content|

**Machine-translated subset:**

High-level categories:

|Category|GPT-4o|Nemotron|
|---|---|---|
|C0 (Safe)|8,976|9,066|
|C1 (Dangerous)|5,088|7,580|
|C2 (Toxic)|1,503|2,540|
|C3 (Sexual)|752|736|

Subcategories:

|Label|GPT-4o|Nemotron|Description|
|---|---|---|---|
|S0|8,976|9,066|Safe|
|S1|1,391|2,138|Violent Crimes|
|S2|510|529|Suicide & Self-Harm|
|S3|3,187|4,913|Non-Violent Crimes|
|S4|837|1,009|Hate Speech|
|S5|482|968|Harassment|
|S6|184|563|Profanity|
|S7|752|736|Sexual Content|

## Bias, Risks, and Limitations

### Dataset Limitations

1. **Annotation subjectivity:** Moderate inter-annotator agreement (κ: 0.455-0.506) reflects inherent difficulty in safety classification
2. **LLM judge divergence:** Significant disagreement between GPT-4o and Nemotron labels; Nemotron is more conservative (higher false positive tendency)
3. **Quality variability:** Crowdsourced proofreading subset may have lower linguistic quality than expert-reviewed samples
4. **Category difficulty:** Low agreement on S5 (Harassment) and S6 (Profanity) indicates these are harder to classify consistently
5. **Response-focused:** Dataset emphasizes LLM response moderation, not adversarial request detection
6. **Cultural specificity:** Profanity (S6) definitions are culturally adapted; may not transfer to other Spanish/Catalan dialects

### Bias Considerations

**Annotator bias:**

- Reflects perspectives of native Catalan speakers
- Professional vs. crowdsourced annotators may have different standards

**LLM judge bias:**

- GPT-4o more aligned with human judgments (75.8% agreement)
- Nemotron more conservative, flags more content as unsafe
- Original Nemotron labels differ significantly from human consensus

**Sampling bias:**

- Machine-translated subset may not capture authentic Catalan/Spanish usage patterns
- Source data (Nemotron) reflects specific LLM safety priorities

**Label distribution:**

- Higher representation of C1 (Dangerous) in machine-translated subset
- Some rare categories (S2, S6) have limited examples

### Recommendations

Users should:

- **Acknowledge subjectivity:** Safety classification involves cultural and personal judgment
- **Consider context:** Use multiple annotator labels for nuanced understanding
- **Validate on domain:** Test dataset applicability for specific use cases
- **Account for label noise:** Majority labels represent consensus but not universal truth
- **Combine with human review:** Especially for high-stakes moderation decisions
- **Understand cultural framing:** Catalan/Spanish safety norms may differ from other contexts
- **Monitor distribution shifts:** Model performance may degrade on content distributions different from this dataset

## Additional Information

### Contact

For further information, please send an email to langtech@bsc.es.

### Copyright

Copyright(c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

### Licensing Information

Apache 2.0

### Dataset Curators

alinia in collaboration with:

- Nvidia (source data)
- Prolific platform (crowdsourced annotation)
- Professional translation services

### Citation Information

**BibTeX:**

```bibtex
@misc{salamandra_guard_dataset_2025,
  title={Salamandra Guard Dataset},
  author={alinia},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/bsc/salamandra-guard-dataset}}
}
```

### Contributions

This dataset builds upon:

- **Nemotron-Safety-Guard-Dataset-V3** (Nvidia)
- Annotation guidelines inspired by ML Commons, Llama Guard, ShieldGemma, and IBM Granite Guardian
- Cultural expertise from Catalan language community
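
The card describes consensus labels derived by majority vote over three human annotators, and a taxonomy in which subcategories S0-S7 collapse into categories C0-C3. The following is a minimal sketch of that logic, not the curators' actual pipeline. The names `S_TO_C`, `majority_label`, and `collapsed_majority` are illustrative, and returning `None` when no label reaches two votes is an assumption, since the card does not say how complete disagreements are labeled.

```python
from collections import Counter
from typing import Optional

# Subcategory -> high-level category, as listed in the Safety Taxonomy section.
S_TO_C = {
    "S0": "C0",
    "S1": "C1", "S2": "C1", "S3": "C1",
    "S4": "C2", "S5": "C2", "S6": "C2",
    "S7": "C3",
}


def majority_label(labels: list[str]) -> Optional[str]:
    """Return the label given by at least two annotators, else None.

    Assumption: complete disagreement yields None; the card does not
    specify a tie-breaking rule.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None


def collapsed_majority(labels: list[str]) -> Optional[str]:
    """Map each S-label to its C-category first, then take the majority."""
    return majority_label([S_TO_C[label] for label in labels])


# The example instance from the card: two annotators chose S4, one chose S0.
votes = ["S4", "S4", "S0"]
print(majority_label(votes))      # S4
print(collapsed_majority(votes))  # C2

# Three different subcategories, two of which share a parent category.
split_votes = ["S4", "S5", "S1"]
print(majority_label(split_votes))      # None
print(collapsed_majority(split_votes))  # C2
```

The second case shows why agreement can improve after collapsing: annotators who disagree on S4 versus S5 still agree on C2, which is consistent with the drop in complete disagreement from 8.0% to 4.0% reported above.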

Provider:

BSC-LT
