ajagota71/fairl-beaver-prompts-10k
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ajagota71/fairl-beaver-prompts-10k
Description:
# FAIRL Beaver Prompts (10K Stratified)
Canonical prompt set for FAIRL Beaver experiments. 10,000 prompts sampled from
PKU-Alignment/PKU-SafeRLHF with balanced stratification across harm categories.
## Stratification
| Group | N | Description |
|---|---|---|
| Crime/Fraud | 1,499 | Cybercrime, economic crime, white-collar crime |
| Violence/Harm | 1,501 | Violence, physical harm, trafficking |
| Manipulation/Psych | 1,500 | Mental manipulation, psychological harm |
| Privacy/Security | 1,200 | Privacy violation, national security |
| Substance/Health | 1,100 | Drugs, public health, environment |
| Discrimination | 500 | Discriminatory behavior, sexual content |
| Safe | 2,700 | Non-adversarial prompts |
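
As a quick sanity check on the table, the group sizes can be summed and converted to shares. This is a minimal sketch: the counts are copied from the table above, while the names `GROUP_COUNTS`, `group_shares`, and `harm_total` are illustrative and not part of the dataset.

```python
# Group sizes copied from the stratification table above.
GROUP_COUNTS = {
    "Crime/Fraud": 1499,
    "Violence/Harm": 1501,
    "Manipulation/Psych": 1500,
    "Privacy/Security": 1200,
    "Substance/Health": 1100,
    "Discrimination": 500,
    "Safe": 2700,
}


def group_shares(counts):
    """Return each group's fraction of the total prompt count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(GROUP_COUNTS.values())
harm_total = total - GROUP_COUNTS["Safe"]  # prompts in the six harm groups
shares = group_shares(GROUP_COUNTS)

print(total)       # 10000
print(harm_total)  # 7300
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The seven groups add up to exactly 10,000 prompts, with the Safe group accounting for 27% of the set.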
## Columns
- `prompt_idx`: Integer index (0-9999)
- `raw_prompt`: Original prompt text
- `formatted_prompt`: PKU conversation template (`BEGINNING OF CONVERSATION: USER: {prompt} ASSISTANT:`)
- `primary_group`: Harm category group (7 values)
- `prompt_source`: Generator model (Alpaca3-70B, Beavertails, WizardLM-30B-Uncensored)
- `severity`: Max severity across responses (0=safe, 1=mild, 2=moderate, 3=severe)
- `is_adversarial`: Whether any response to this prompt was labeled unsafe
- `n_harm_cats`: Number of harm categories triggered
- `harm_cats`: Comma-separated harm category names
- `prompt_len`: Character length of raw prompt
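
The derived columns can be sketched from the raw fields as follows. This assumes `formatted_prompt` is the template with the raw prompt substituted verbatim, that `prompt_len` equals `len(raw_prompt)`, and that `harm_cats` is split on commas with surrounding whitespace ignored; none of this is confirmed by the dataset's build script. The helper names and the example row are hypothetical.

```python
from typing import Dict, List, Union

# Template quoted in the column description for `formatted_prompt`.
PKU_TEMPLATE = "BEGINNING OF CONVERSATION: USER: {prompt} ASSISTANT:"


def format_prompt(raw_prompt: str) -> str:
    """Wrap a raw prompt in the PKU conversation template."""
    return PKU_TEMPLATE.format(prompt=raw_prompt)


def parse_harm_cats(harm_cats: str) -> List[str]:
    """Split the comma-separated category string; empty string -> []."""
    return [c.strip() for c in harm_cats.split(",") if c.strip()]


def derive_columns(raw_prompt: str, harm_cats: str) -> Dict[str, Union[str, int]]:
    """Recompute the derived columns for one row (assumed definitions)."""
    return {
        "formatted_prompt": format_prompt(raw_prompt),
        "n_harm_cats": len(parse_harm_cats(harm_cats)),
        "prompt_len": len(raw_prompt),
    }


# Hypothetical row; the category names are illustrative only.
example = derive_columns(
    "How do I reset my router password?",
    "Cybercrime, Privacy Violation",
)
print(example["formatted_prompt"])
print(example["n_harm_cats"], example["prompt_len"])
```

Comparing these recomputed values against the stored columns is one way to check whether the assumptions hold for the actual data.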
## Usage
Use the same prompts, unchanged and in the same form, for Beaver v1, v2, and v3
experiments to enable controlled cross-version comparison.
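
One way to confirm that every run saw the same prompts is to fingerprint the prompt set and compare digests across versions. The sketch below is an assumption-laden example, not the authors' procedure: `load_prompts` relies on the Hugging Face `datasets` library and assumes the data lives in a split named `train`, and `prompt_fingerprint` is a hypothetical helper.

```python
import hashlib


def load_prompts(split="train"):
    """Load the prompt set from the Hub (split name is an assumption)."""
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset("ajagota71/fairl-beaver-prompts-10k", split=split)


def prompt_fingerprint(rows):
    """Hash (prompt_idx, formatted_prompt) pairs, independent of row order."""
    digest = hashlib.sha256()
    for idx, text in sorted(rows):
        digest.update(f"{idx}\t".encode("utf-8"))
        digest.update(text.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Toy rows standing in for the prompts fed to two experiment versions.
run_v1 = [(0, "BEGINNING OF CONVERSATION: USER: a ASSISTANT:"),
          (1, "BEGINNING OF CONVERSATION: USER: b ASSISTANT:")]
run_v2 = list(reversed(run_v1))                    # same prompts, other order
run_v3 = [(0, run_v1[0][1]), (1, "edited prompt")]  # one prompt changed

print(prompt_fingerprint(run_v1) == prompt_fingerprint(run_v2))  # True
print(prompt_fingerprint(run_v1) == prompt_fingerprint(run_v3))  # False
```

With the real data, the rows would come from something like `[(r["prompt_idx"], r["formatted_prompt"]) for r in load_prompts()]`, and the digest can be logged alongside each experiment's results.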
Provided by:
ajagota71



