BSC-LT/salamandra-guard-dataset
Source: Hugging Face. Updated: 2026-03-19. Indexed: 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BSC-LT/salamandra-guard-dataset
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: prompt
    dtype: string
  - name: response
    dtype: string
  - name: language
    dtype: string
  - name: is_safe
    dtype: bool
  - name: s_codes
    sequence: string
  - name: majority_vote
    dtype: string
  - name: majority_c_cat
    dtype: string
  - name: Annotator_1
    dtype: string
  - name: Annotator_2
    dtype: string
  - name: Annotator_3
    dtype: string
  - name: GPT_4o_LABEL_RESPONSE
    dtype: string
  - name: GPT_OSS_LABEL_RESPONSE
    dtype: string
  - name: Nemotron_label
    dtype: string
  - name: nemo_label_og
    dtype: string
  splits:
  - name: train
    num_bytes: 22435324
    num_examples: 20329
  - name: test
    num_bytes: 1156936
    num_examples: 1006
  download_size: 13429701
  dataset_size: 23592260
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- text-generation
- text-classification
language:
- ca
- es
- en
---
# Salamandra Guard Dataset
## Dataset Description
The Salamandra Guard dataset is a multilingual safety classification corpus for training and evaluating content moderation systems in Catalan, Spanish, and English. It consists of 21,335 curated conversational examples annotated with a hierarchical safety taxonomy (four high-level categories, eight subcategories).
The dataset targets culturally grounded safety data, with particular emphasis on Catalan, a language historically underrepresented in AI safety research, alongside Spanish and English.
### Dataset Summary
- **Total size:** 21,335 samples
- **Languages:** Catalan, Spanish, English
- **Annotation methodology:** Multi-annotator with human and LLM judges
- **Format:** Conversational user-assistant pairs
- **Task:** Multi-label and multi-class safety classification
- **Cultural scope:** European context with Catalan and Spanish cultural adaptation
### Salamandra Guard Technical Report
[Technical Report](https://huggingface.co/datasets/BSC-LT/salamandra-guard-dataset/blob/main/Salamandra_Guard_Technical_Report.pdf)
**Dataset composition:**
1. **Human-annotated subset (5,016 samples):** Professionally translated, proofread, and annotated by human experts
2. **Machine-translated subset (16,319 samples):** GPT-4o translated and LLM-annotated from English source data
### Supported Tasks
- **Binary classification:** Safe vs. Unsafe content detection
- **Multiclass classification:** Four high-level safety categories (C0-C3)
- **Fine-grained classification:** Eight subcategories (S0-S7)
- **Multi-label classification:** Content can belong to multiple safety categories simultaneously
- **Cross-lingual safety detection:** Evaluation across Catalan, Spanish, and English
### Languages
- **Catalan (ca):** Native and professionally proofread
- **Spanish (es):** Native and professionally proofread
## Dataset Structure
### Data Instances
Each instance contains:
```json
{
"id": "sample_001",
"prompt": "User message or question",
"response": "Assistant's response to classify",
"language": "ca",
"is_safe": false,
"s_codes": ["S4", "S5"],
"majority_vote": "S4",
"majority_c_cat": "C2",
"Annotator_1": "S4",
"Annotator_2": "S4",
"Annotator_3": "S0",
"GPT_4o_LABEL_RESPONSE": "S4",
"GPT_OSS_LABEL_RESPONSE": "S4",
"Nemotron_label": "S4",
"nemo_label_og": "S4"
}
```
### Data Fields
**Core fields:**
- `id` (string): Unique identifier for each sample
- `prompt` (string): User's input message or question
- `response` (string): Assistant's response to be classified for safety
- `language` (string): Language code (ca/es/en)
**Classification fields:**
- `is_safe` (boolean): Binary classification indicating if content is safe (True) or unsafe (False)
- `s_codes` (list of strings): List of fine-grained subcategory labels (S0-S7) - supports multi-label classification
- `majority_vote` (string): Consensus subcategory label (S0-S7) from human annotators via majority voting
- `majority_c_cat` (string): High-level category label (C0-C3) based on majority consensus
**Human annotation fields:**
- `Annotator_1` (string): First human annotator's label (S0-S7)
- `Annotator_2` (string): Second human annotator's label (S0-S7)
- `Annotator_3` (string): Third human annotator's label (S0-S7)
**LLM judge annotation fields:**
- `GPT_4o_LABEL_RESPONSE` (string): GPT-4o judge label (S0-S7)
- `GPT_OSS_LABEL_RESPONSE` (string): GPT-OSS (open source GPT) judge label (S0-S7)
- `Nemotron_label` (string): Nemotron label from current annotation (S0-S7)
- `nemo_label_og` (string): Original Nemotron label from source dataset (S0-S7)
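
A minimal sketch of how the fields above might map onto the supported tasks. The helper `build_targets` and the 0/1 encoding of the binary target are illustrative assumptions, not part of any dataset tooling; the example record is the one shown under Data Instances.

```python
from typing import Dict, List

# Subcategory codes S0..S7, as listed in this card.
S_LABELS: List[str] = [f"S{i}" for i in range(8)]


def build_targets(record: Dict) -> Dict:
    """Derive one target per supported task from a single record (hypothetical helper)."""
    codes = set(record["s_codes"])
    return {
        "binary": 0 if record["is_safe"] else 1,       # assumption: 1 means unsafe
        "multiclass": record["majority_c_cat"],        # high-level category, C0-C3
        "fine_grained": record["majority_vote"],       # subcategory, S0-S7
        "multi_label": [1 if s in codes else 0 for s in S_LABELS],  # multi-hot over S0-S7
    }


example = {
    "id": "sample_001",
    "language": "ca",
    "is_safe": False,
    "s_codes": ["S4", "S5"],
    "majority_vote": "S4",
    "majority_c_cat": "C2",
}

targets = build_targets(example)
print(targets)
```

Keeping all four targets side by side makes it easy to train a single model with several heads or to pick one task at a time.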
### Data Splits
|Split|Human Annotated|Machine Translated|Total|
|---|---|---|---|
|Train|4,013|13,055|17,068|
|Test|1,003|3,264|4,267|
|**Total**|**5,016**|**16,319**|**21,335**|
**Split ratio:** 80% train / 20% test
## Dataset Creation
### Curation Rationale
Existing safety datasets suffer from several critical limitations:
1. **Language gaps:** Underrepresentation of Catalan and European Spanish
2. **Cultural misalignment:** Safety norms and definitions vary across cultures
3. **Low-quality annotations:** Insufficient annotator expertise and biased sampling
4. **Taxonomy complexity:** Overlapping categories causing annotation confusion
Salamandra Guard addresses these issues through:
- Native language expertise and cultural grounding
- Professional translation with human proofreading
- Simplified, orthogonal taxonomy design
- Multi-annotator framework with quality control
- Balanced representation of unsafe content
### Source Data
#### Initial Data Collection
**Primary source:** nvidia/Nemotron-Safety-Guard-Dataset-V3
- Spanish language subset selected as seed data
- High-quality foundation for safety-relevant conversations
**Translation process:**
1. Machine translation using GPT-4o
2. Expert human proofreading (250 core samples)
3. Crowdsourced proofreading via Prolific platform
4. Automated quality validation to ensure meaningful improvements
#### Data Collection Process
**Human-annotated subset (5,016 samples):**
- **Annotation platform:** Prolific
- **Annotators:** Native Catalan speakers
- **Annotations per sample:** 3 independent human annotators + 2 LLM judges (GPT-4o, GPT-OSS)

**Quality control:**
- Batch review (50 samples per batch)
- Immediate rejection of low-quality annotations
- Iterative guideline refinement based on feedback
**Guidelines:** Comprehensive annotation guidelines developed in collaboration with alinia, culturally adapted for Catalan/Spanish contexts
**Machine-translated subset (16,319 samples):**
- **Translation:** GPT-4o from English to Catalan/Spanish
- **Annotation:** LLM judges (GPT-4o, GPT-OSS, and original Nemotron labels)
- **Purpose:** Domain diversity complementing high-quality human annotations
#### Who are the Source Data Producers?
- **Original data:** Nvidia Nemotron team (Nemotron-Safety-Guard-Dataset-V3)
- **Translation:** GPT-4o (OpenAI)

**Human proofreading:**
- Professional translators (expert subset)
- Native Catalan crowdworkers via Prolific (remaining subset)
### Annotations
#### Annotation Process
**Taxonomy development:**
1. Survey of existing taxonomies (ML Commons, Llama Guard, ShieldGemma, IBM Granite Guardian)
2. Iterative refinement with alinia
3. Cultural grounding in Hispanic and Catalan contexts
**Annotation workflow:**
1. Initial LLM translation (GPT-4o)
2. Human proofreading (expert + crowdsourced)
3. Independent annotation by 3 human annotators
4. Parallel annotation by 2 LLM judges (GPT-4o and GPT-OSS)
5. Consensus label derivation via majority vote
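
Step 5 can be sketched as a strict majority vote over the three human labels. The function name `majority_label` and the choice to return `None` when all annotators disagree are assumptions for illustration; the card does not state how complete disagreements are resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require a strict majority; a three-way split yields no consensus label.
    return label if count > len(labels) / 2 else None


two_of_three = majority_label(["S4", "S4", "S0"])
no_consensus = majority_label(["S4", "S5", "S6"])
print(two_of_three, no_consensus)
```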
**Batch processing:**
- Reviews conducted in batches of 50 samples
- Continuous quality monitoring
- Iterative guideline updates
**Annotation guidelines:** Available in Catalan, covering all eight subcategories with detailed examples and edge cases
#### Who are the Annotators?
**Human annotators:**
- **Recruitment:** Prolific platform
- **Qualifications:** Native Catalan speakers
- **Number:** Multiple annotators (3 per sample)
- **Training:** Comprehensive annotation guidelines with examples
**LLM judges:**
- GPT-4o (OpenAI)
- GPT-OSS (Open source GPT model)
- Nemotron (Nvidia) - original labels from source dataset
**Expert reviewers:**
- Professional translators for proofreading
- alinia researchers for taxonomy development
#### Inter-Annotator Agreement
**Human annotators (3-way agreement):**
- **Unanimous agreement:** 51.0% of samples
- **Majority consensus (2+ agree):** 92.0% of samples
- **Complete disagreement:** 8.0% of samples
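
The three figures above can be reproduced from per-sample label triples roughly as follows. This is a sketch on made-up toy rows, not the dataset's own evaluation script; `agreement_breakdown` is a hypothetical helper. Note that "majority" includes unanimous samples, which is why majority and complete disagreement sum to 100%.

```python
from typing import Dict, List, Tuple


def agreement_breakdown(rows: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Share of samples with unanimous, majority (2+), and no agreement among 3 annotators."""
    n = len(rows)
    unanimous = sum(1 for r in rows if len(set(r)) == 1)
    disagree = sum(1 for r in rows if len(set(r)) == 3)
    return {
        "unanimous": unanimous / n,
        "majority": (n - disagree) / n,  # 2 or 3 annotators agree
        "complete_disagreement": disagree / n,
    }


# Toy label triples (Annotator_1, Annotator_2, Annotator_3); not real data.
toy_rows = [
    ("S0", "S0", "S0"),
    ("S4", "S4", "S0"),
    ("S1", "S3", "S1"),
    ("S4", "S5", "S6"),
]
stats = agreement_breakdown(toy_rows)
print(stats)
```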
**Pairwise agreement:**
- Range: 63.0% - 66.3%
- Cohen's Kappa: 0.455 - 0.506
- Krippendorff's Alpha: 0.481
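
For reference, pairwise Cohen's kappa corrects raw agreement for the agreement expected by chance. The sketch below is the textbook unweighted form on toy labels; the card does not say which implementation produced the reported values, so treat `cohen_kappa` as an illustrative assumption.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa between two annotators' label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of samples with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, not real data.
annotator_1 = ["S0", "S0", "S4", "S1"]
annotator_2 = ["S0", "S4", "S4", "S1"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case raw agreement is 75%, but kappa is lower because some of that agreement would be expected by chance.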
**Category-level agreement:**
- **S0 (Safe):** 47.2% unanimous, 70.4% majority
- **S5 (Harassment):** 11.0% unanimous, 36-41% majority
- **S6 (Profanity):** 11.4% unanimous, 36-41% majority
**Collapsed categories (C0-C3):**
- Complete disagreement: 4.0% (down from 8.0%)
- Krippendorff's Alpha: 0.508 (improved from 0.481)
**Human vs. LLM judges:**
- **GPT-4o vs. Human majority:** 75.8% agreement (8-way labels)
- **Nemotron vs. Human majority:** 48.8% strict agreement, 59.1% relaxed (multi-label)
- **GPT-4o vs. Nemotron:** 49.2% agreement
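
The card does not define "strict" and "relaxed" agreement. One plausible reading, assumed here, is that strict agreement requires the judge's label set to equal the single human majority label, while relaxed agreement only requires the human label to appear among the judge's labels. The sketch uses toy data and a hypothetical helper name.

```python
from typing import List, Set, Tuple


def strict_and_relaxed(human: List[str], judge: List[Set[str]]) -> Tuple[float, float]:
    """Agreement of a multi-label judge with single human majority labels.

    Assumed definitions (not confirmed by the dataset card):
      strict  - the judge's label set is exactly {human label}
      relaxed - the human label is contained in the judge's label set
    """
    n = len(human)
    strict = sum(1 for h, j in zip(human, judge) if j == {h})
    relaxed = sum(1 for h, j in zip(human, judge) if h in j)
    return strict / n, relaxed / n


# Toy data, not real labels.
human_majority = ["S4", "S0", "S3", "S1"]
judge_labels = [{"S4"}, {"S0"}, {"S3", "S1"}, {"S5"}]
strict_rate, relaxed_rate = strict_and_relaxed(human_majority, judge_labels)
print(strict_rate, relaxed_rate)
```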
**Key findings:**
- High subjectivity in safety classification
- Better agreement on high-level categories than fine-grained labels
- LLM judges show divergent biases (Nemotron more conservative)
## Safety Taxonomy
### High-Level Categories (C0-C3)
**C0: Safe**
- Content that does not violate any safety policies
**C1: Dangerous Content.** Aggregates violent crimes, self-harm, and non-violent illegal activities:
- S1: Violent Crimes
- S2: Suicide and Self-Harm
- S3: Non-Violent Crimes and Wrongdoings
**C2: Toxic Content.** Aggregates socially harmful content:
- S4: Hate Speech & Discrimination
- S5: Harassment & Bullying
- S6: Profanity
**C3: Sexual Content.** Aggregates sexually explicit material and offenses:
- S7: Sexual Offenses & Explicit Content
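
The hierarchy above amounts to a fixed lookup from subcategory to high-level category. The sketch below encodes it as a dictionary (the names `S_TO_C` and `collapse` are illustrative) and shows how three annotators who disagree at the S-level can still agree at the C-level.

```python
from typing import Dict, List

# Subcategory -> high-level category, as listed in the taxonomy.
S_TO_C: Dict[str, str] = {
    "S0": "C0",
    "S1": "C1", "S2": "C1", "S3": "C1",
    "S4": "C2", "S5": "C2", "S6": "C2",
    "S7": "C3",
}


def collapse(labels: List[str]) -> List[str]:
    """Map fine-grained S-codes to their high-level C-categories."""
    return [S_TO_C[s] for s in labels]


votes = ["S4", "S5", "S6"]           # toy example: complete disagreement on subcategories
collapsed = collapse(votes)
print(len(set(votes)), len(set(collapsed)))
```

Here the three distinct subcategory votes all collapse to C2, which illustrates why agreement tends to be higher on the collapsed labels.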
### Subcategories (S0-S7)
**S0: Safe**
- Content free of safety violations
**S1: Violent Crimes**
- Unlawful violence against persons (terrorism, genocide, murder, assault, kidnapping, child abuse)
- Unlawful violence against animals (animal cruelty)
- Gender-based violence, domestic violence, coercive control
**S2: Suicide and Self-Harm**
- Suicide
- Self-harm (cutting, etc.)
- Dangerous challenges and risky behaviors
- Eating disorders
**S3: Non-Violent Crimes and Wrongdoings**
- Personal crimes (labor trafficking, threats, intimidation)
- Financial crimes (fraud, money laundering, swindling)
- Property crimes (theft, arson, vandalism)
- Drug crimes (production/consumption of controlled substances)
- Weapons crimes (unlicensed firearms manufacturing)
- Cybercrimes (hacking)
**S4: Hate Speech & Discrimination**
- Degradation based on race, ethnicity, religion, gender, sexual orientation, disability, etc.
- Use of humor to trivialize or conceal harmful intentions
**S5: Harassment & Bullying**
- Repeated harassment and cyberbullying
- Doxxing and online bullying
- Non-consensual intimate content distribution (revenge porn)
- Blackmail and intimidation
**S6: Profanity**
- Offensive language and insults
- _Culturally adapted for Catalan and Spanish contexts_
**S7: Sexual Offenses & Explicit Content**
- Human trafficking for sexual exploitation
- Sexual assault and rape
- Sexual harassment (physical, verbal, visual)
- Prostitution (including minors)
- Pornography and sexually explicit content
- Erotic chat (cybersex)
### Category Distribution
**Human-annotated subset (label counts per annotator and LLM judge):**
|Category|Annotator 1|Annotator 2|Annotator 3|GPT-4o|Nemotron|
|---|---|---|---|---|---|
|C0 (Safe)|2,662|2,680|2,654|2,080|1,492|
|C1 (Dangerous)|1,204|1,230|1,202|1,608|3,300|
|C2 (Toxic)|868|810|822|766|2,116|
|C3 (Sexual)|282|296|338|562|1,016|
**Subcategory distribution (per annotator and LLM judge):**
|Label|Annotator 1|Annotator 2|Annotator 3|GPT-4o|Nemotron|Description|
|---|---|---|---|---|---|---|
|S0|2,662|2,680|2,654|2,080|1,492|Safe|
|S1|458|462|464|504|954|Violent Crimes|
|S2|182|182|150|218|380|Suicide & Self-Harm|
|S3|564|586|588|886|1,966|Non-Violent Crimes|
|S4|414|366|352|402|794|Hate Speech|
|S5|286|224|294|258|774|Harassment|
|S6|168|220|176|106|548|Profanity|
|S7|282|296|338|562|1,016|Sexual Content|
**Machine-translated subset:**
High-level categories:
|Category|GPT-4o|Nemotron|
|---|---|---|
|C0 (Safe)|8,976|9,066|
|C1 (Dangerous)|5,088|7,580|
|C2 (Toxic)|1,503|2,540|
|C3 (Sexual)|752|736|
Subcategories:
|Label|GPT-4o|Nemotron|Description|
|---|---|---|---|
|S0|8,976|9,066|Safe|
|S1|1,391|2,138|Violent Crimes|
|S2|510|529|Suicide & Self-Harm|
|S3|3,187|4,913|Non-Violent Crimes|
|S4|837|1,009|Hate Speech|
|S5|482|968|Harassment|
|S6|184|563|Profanity|
|S7|752|736|Sexual Content|
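
As a quick consistency check, the subcategory counts in the machine-translated tables can be rolled up into the high-level counts. The numbers below are copied from the two tables above; the variable names are illustrative. The GPT-4o column sums to the subset size (16,319), whereas the Nemotron column sums to more, which would be consistent with Nemotron assigning multiple labels per sample (an inference, not something the tables state).

```python
from typing import Dict

S_TO_C: Dict[str, str] = {
    "S0": "C0", "S1": "C1", "S2": "C1", "S3": "C1",
    "S4": "C2", "S5": "C2", "S6": "C2", "S7": "C3",
}

# Counts copied from the machine-translated subset tables.
gpt4o_s = {"S0": 8976, "S1": 1391, "S2": 510, "S3": 3187,
           "S4": 837, "S5": 482, "S6": 184, "S7": 752}
gpt4o_c = {"C0": 8976, "C1": 5088, "C2": 1503, "C3": 752}
nemotron_c = {"C0": 9066, "C1": 7580, "C2": 2540, "C3": 736}

# Roll subcategory counts up into high-level categories.
rolled: Dict[str, int] = {}
for s_code, count in gpt4o_s.items():
    c_code = S_TO_C[s_code]
    rolled[c_code] = rolled.get(c_code, 0) + count

gpt4o_total = sum(gpt4o_c.values())
nemotron_total = sum(nemotron_c.values())
print(rolled == gpt4o_c, gpt4o_total, nemotron_total)
```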
## Bias, Risks, and Limitations
### Dataset Limitations
1. **Annotation subjectivity:** Moderate inter-annotator agreement (κ: 0.455-0.506) reflects inherent difficulty in safety classification
2. **LLM judge divergence:** Significant disagreement between GPT-4o and Nemotron labels; Nemotron is more conservative (higher false positive tendency)
3. **Quality variability:** Crowdsourced proofreading subset may have lower linguistic quality than expert-reviewed samples
4. **Category difficulty:** Low agreement on S5 (Harassment) and S6 (Profanity) indicates these are harder to classify consistently
5. **Response-focused:** Dataset emphasizes LLM response moderation, not adversarial request detection
6. **Cultural specificity:** Profanity (S6) definitions are culturally adapted; may not transfer to other Spanish/Catalan dialects
### Bias Considerations
**Annotator bias:**
- Reflects perspectives of native Catalan speakers
- Professional vs. crowdsourced annotators may have different standards
**LLM judge bias:**
- GPT-4o more aligned with human judgments (75.8% agreement)
- Nemotron more conservative, flags more content as unsafe
- Original Nemotron labels differ significantly from human consensus
**Sampling bias:**
- Machine-translated subset may not capture authentic Catalan/Spanish usage patterns
- Source data (Nemotron) reflects specific LLM safety priorities
**Label distribution:**
- Higher representation of C1 (Dangerous) in machine-translated subset
- Some rare categories (S2, S6) have limited examples
### Recommendations
Users should:
- **Acknowledge subjectivity:** Safety classification involves cultural and personal judgment
- **Consider context:** Use multiple annotator labels for nuanced understanding
- **Validate on domain:** Test dataset applicability for specific use cases
- **Account for label noise:** Majority labels represent consensus but not universal truth
- **Combine with human review:** Especially for high-stakes moderation decisions
- **Understand cultural framing:** Catalan/Spanish safety norms may differ from other contexts
- **Monitor distribution shifts:** Model performance may degrade on content distributions different from this dataset
## Additional Information
### Contact
For further information, please send an email to langtech@bsc.es.
### Copyright
Copyright (c) 2026 by Language Technologies Lab, Barcelona Supercomputing Center.
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.
### Licensing Information
Apache 2.0
### Dataset Curators
alinia in collaboration with:
- Nvidia (source data)
- Prolific platform (crowdsourced annotation)
- Professional translation services
### Citation Information
**BibTeX:**
```bibtex
@misc{salamandra_guard_dataset_2025,
title={Salamandra Guard Dataset},
author={alinia},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/datasets/BSC-LT/salamandra-guard-dataset}}
}
```
### Contributions
This dataset builds upon:
- **Nemotron-Safety-Guard-Dataset-V3** (Nvidia)
- Annotation guidelines inspired by ML Commons, Llama Guard, ShieldGemma, and IBM Granite Guardian
- Cultural expertise from Catalan language community
Provider: BSC-LT



