jkminder/dolma3-safety-annotations
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jkminder/dolma3-safety-annotations
Description:
---
license: apache-2.0
task_categories:
- text-classification
tags:
- safety
- content-moderation
- annotations
size_categories:
- 100M<n<1B
---
# Safety Annotations for dolma3_mix
Safety score annotations for a 20K-shard subset of [allenai/dolma3_mix-6T](https://huggingface.co/datasets/allenai/dolma3_mix-6T) using
[locuslab/safety-classifier_gte-large-en-v1.5](https://huggingface.co/locuslab/safety-classifier_gte-large-en-v1.5).
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Row identifier (matches source dataset) |
| `safety_score` | int8 | Argmax safety class (0-5) |
| `safety_probs` | list[float32] | Full 6-class probability distribution |
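
As a minimal sketch of how the two annotation columns relate, the snippet below recomputes the argmax class from a 6-element probability list and compares it with `safety_score`. The row is hypothetical illustration data, not taken from the dataset, and the tie-breaking rule (lowest index wins) is an assumption rather than documented classifier behavior.

```python
from typing import Sequence

NUM_CLASSES = 6  # scores 0-5, per the schema above


def argmax_class(probs: Sequence[float]) -> int:
    """Return the index of the largest probability.

    Ties resolve to the lowest index (an assumption for this sketch).
    """
    if len(probs) != NUM_CLASSES:
        raise ValueError(f"expected {NUM_CLASSES} probabilities, got {len(probs)}")
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical row, for illustration only.
row = {
    "id": "example-0001",
    "safety_score": 2,
    "safety_probs": [0.10, 0.20, 0.45, 0.15, 0.07, 0.03],
}

# The stored score should match the argmax of the stored distribution.
is_consistent = argmax_class(row["safety_probs"]) == row["safety_score"]
```

Keeping the full distribution alongside the argmax allows softer filters than a hard class cutoff, for example thresholding on the probability mass assigned to the higher classes.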
## Safety scale
| Score | Label | Count | Percentage |
|-------|-------|------:|------------|
| 0 | safe | 302,972,734 | 77.39% |
| 1 | minimal | 38,143,123 | 9.74% |
| 2 | mild | 32,004,998 | 8.18% |
| 3 | moderate | 10,592,853 | 2.71% |
| 4 | significant | 3,990,755 | 1.02% |
| 5 | severe | 3,769,706 | 0.96% |
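
The counts in the table can be used to estimate how much data a score cutoff would retain. The sketch below computes the share of annotated rows at or below a chosen score; the cutoff of 2 is an arbitrary example, not a recommendation from the dataset authors.

```python
# Per-class counts copied from the table above.
COUNTS = {
    0: 302_972_734,  # safe
    1: 38_143_123,   # minimal
    2: 32_004_998,   # mild
    3: 10_592_853,   # moderate
    4: 3_990_755,    # significant
    5: 3_769_706,    # severe
}
TOTAL = sum(COUNTS.values())


def share_at_or_below(max_score: int) -> float:
    """Fraction of annotated rows whose safety_score is <= max_score."""
    kept = sum(n for score, n in COUNTS.items() if score <= max_score)
    return kept / TOTAL


# Example: keeping scores 0-2 retains roughly 95.3% of rows.
retained = share_at_or_below(2)
```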
## Usage
This dataset contains only annotations, not text. Join on `id` with the source dataset to pair each document's text with its safety scores.
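
The join on `id` can be sketched in plain Python as follows. This is an illustration on toy rows, assuming the source rows expose `id` and `text` fields; the `text` field name and the sample values are assumptions, and `join_on_id` is a hypothetical helper, not part of the pipeline code. With hundreds of millions of annotations, an in-memory dictionary like this is only practical per shard; a dataframe or query engine would be the more realistic choice at full scale.

```python
from typing import Dict, Iterable, Iterator


def join_on_id(
    source_rows: Iterable[Dict],
    annotations: Iterable[Dict],
) -> Iterator[Dict]:
    """Inner-join source rows with safety annotations on the `id` column."""
    by_id = {ann["id"]: ann for ann in annotations}
    for row in source_rows:
        ann = by_id.get(row["id"])
        if ann is None:
            continue  # no annotation for this row (outside the annotated subset)
        yield {
            **row,
            "safety_score": ann["safety_score"],
            "safety_probs": ann["safety_probs"],
        }


# Toy data for illustration only; field names other than `id`,
# `safety_score`, and `safety_probs` are assumptions.
source = [
    {"id": "a", "text": "first document"},
    {"id": "b", "text": "second document"},
]
annotations = [
    {"id": "a", "safety_score": 0, "safety_probs": [0.9, 0.05, 0.02, 0.01, 0.01, 0.01]},
]

joined = list(join_on_id(source, annotations))
```

Rows without a matching annotation are dropped here, which mirrors an inner join; switch to yielding unmatched rows with empty score fields if a left join is needed.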
## Details
- ~600 GPU-hours on NVIDIA GH200 120GB
- Total unique annotations: **391,474,169**
- Pipeline code: [epfl-dlab/model-raising-data](https://github.com/epfl-dlab/model-raising-data)
Provided by:
jkminder



