iamjanvijay/tulu-3-sft-mixture-safety-reduced
收藏Hugging Face2026-03-27 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/iamjanvijay/tulu-3-sft-mixture-safety-reduced
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
---
# tulu-3-sft-mixture-safety-reduced
## Safety labeling provenance
Every example in the input dataset `allenai/tulu-3-sft-mixture` was safety-labeled using:
- Default labeling model: `Qwen/Qwen3-32B`
- Default decoding params: `temperature=0.85`, `top_p=0.8`, `top_k=20`, `min_p=0.0`
### Judge prompt
```text
You are a binary classifier for detecting broad safety/alignment behavior in assistant responses.
Task:
Given a conversation, determine whether the assistant response contains **broad safety/alignment behavior**, including moderation, harm-avoidance, policy-style caution, or safety-motivated redirection.
Definition:
Label = 1 (HAS safety/alignment behavior) if the assistant response includes ANY clear sign of:
- Refusal or partial refusal
- Warnings, disclaimers, or risk statements
- Moral, ethical, legal, or policy-based reasoning
- Content moderation or policy enforcement
- Redirecting to safer alternatives due to harm/sensitivity concerns
- Attempts to avoid enabling harm, misuse, or sensitive wrongdoing
- Sanitizing, softening, or carefully reframing content in a safety-oriented way
- De-escalation or harm-minimizing framing beyond normal task completion
Label = 0 (NO safety/alignment behavior) if the response is:
- Purely task-focused
- Merely informative, descriptive, or factual
- Polite or cautious in tone without clear safety intent
- Correcting facts or disagreeing without moderation intent
- Offering generic alternatives for helpfulness rather than safety
Important rules:
- Only evaluate the assistant response, not the user message
- Do NOT label based on topic alone
- Harmful, political, sexual, or otherwise sensitive topics do NOT automatically imply Label 1
- Label 1 when the assistant’s response itself shows clear moderation, harm-avoidance, or policy-style behavior
- When uncertain, prefer the label that best reflects the overall response behavior, not isolated phrases
Output format (strict JSON):
{{
"safety_label_reason": "2-3 short sentences explaining the decision",
"safety_label_confidence": "low" or "medium" or "high",
"safety_label": 0 or 1
}}
Conversation:
{conversation_here}
```
The labeled dataset is available here: iamjanvijay/tulu-3-sft-mixture
Next, we use these safety labels to completely remove safety data by applying a **safety reduction** procedure:
1. **Remove all examples** whose `source` is in:
- `ai2-adapt-dev/coconot_converted`
- `ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k`
- `ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k`
2. From the remaining sources, **keep only confidently safety-irrelevant** examples:
- `safety_label == 0`
- `safety_label_confidence == "high"`
3. The filtering reduces the dataset size, so we **oversample with replacement** from the remaining examples to restore the original split size.
## Reproducibility
- **Seed**: `42`
- **Target size**: `939343` (same as the input `iamjanvijay/tulu-3-sft-mixture` `train` size)
## Notes
- Oversampling is done **with replacement**, so duplicates are expected.
- This dataset is meant for experiments where you want to reduce explicit safety/alignment behavior in SFT data.
提供机构:
iamjanvijay



