
jkminder/dolma3-safety-annotations

Hugging Face | Updated 2026-03-23 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jkminder/dolma3-safety-annotations
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
tags:
- safety
- content-moderation
- annotations
size_categories:
- 100M<n<1B
---

# Safety Annotations for dolma3_mix

Safety score annotations for a 20K-shard subset of [allenai/dolma3_mix-6T](https://huggingface.co/datasets/allenai/dolma3_mix-6T) using [locuslab/safety-classifier_gte-large-en-v1.5](https://huggingface.co/locuslab/safety-classifier_gte-large-en-v1.5).

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Row identifier (matches source dataset) |
| `safety_score` | int8 | Argmax safety class (0-5) |
| `safety_probs` | list[float32] | Full 6-class probability distribution |

## Safety scale

| Score | Label | Count | Percentage |
|-------|-------|------:|------------|
| 0 | safe | 302,972,734 | 77.39% |
| 1 | minimal | 38,143,123 | 9.74% |
| 2 | mild | 32,004,998 | 8.18% |
| 3 | moderate | 10,592,853 | 2.71% |
| 4 | significant | 3,990,755 | 1.02% |
| 5 | severe | 3,769,706 | 0.96% |

## Usage

This dataset contains only annotations, not text. Join on `id` with the source dataset to get text + safety scores.

## Details

- ~600 GPU-hours on NVIDIA GH200 120GB
- Total unique annotations: **391,474,169**
- Pipeline code: [epfl-dlab/model-raising-data](https://github.com/epfl-dlab/model-raising-data)
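
The join described under Usage can be sketched in plain Python as below. This is a minimal illustration on toy rows, not the pipeline's actual code: the helper names `join_on_id` and `argmax`, the `max_score` parameter, the `text` field on the source rows, and all example rows are assumptions introduced here. Only the `id`, `safety_score`, and `safety_probs` columns and the six labels come from the card.

```python
from typing import Dict, Iterable, Iterator, List

# Labels from the "Safety scale" table, indexed by score 0-5.
LABELS = ["safe", "minimal", "mild", "moderate", "significant", "severe"]


def argmax(probs: List[float]) -> int:
    """Return the index of the largest probability (first index wins on ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


def join_on_id(
    source_rows: Iterable[Dict],
    annotation_rows: Iterable[Dict],
    max_score: int = 5,
) -> Iterator[Dict]:
    """Attach safety annotations to source rows that share the same `id`.

    Hypothetical helper: rows without an annotation are skipped, as are rows
    whose `safety_score` exceeds `max_score`.
    """
    # In-memory lookup; only suitable for small samples.
    by_id = {row["id"]: row for row in annotation_rows}
    for doc in source_rows:
        ann = by_id.get(doc["id"])
        if ann is None or ann["safety_score"] > max_score:
            continue
        yield {
            **doc,
            "safety_score": ann["safety_score"],
            "safety_label": LABELS[ann["safety_score"]],
            "safety_probs": ann["safety_probs"],
        }


# Toy rows, invented for illustration (the `text` field name is an assumption).
source = [
    {"id": "doc-a", "text": "first toy document"},
    {"id": "doc-b", "text": "second toy document"},
    {"id": "doc-c", "text": "third toy document, not annotated"},
]
annotations = [
    {"id": "doc-a", "safety_score": 0,
     "safety_probs": [0.90, 0.05, 0.02, 0.01, 0.01, 0.01]},
    {"id": "doc-b", "safety_score": 3,
     "safety_probs": [0.10, 0.10, 0.20, 0.50, 0.05, 0.05]},
]

# Keep only documents scored "safe" or "minimal".
kept = list(join_on_id(source, annotations, max_score=1))
for row in kept:
    print(row["id"], row["safety_label"], row["text"])
```

A dictionary lookup like this works only for samples that fit in memory. With roughly 391 million annotation rows, a shard-wise join in a dataframe or SQL engine would likely be the more practical choice.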
Provider:
jkminder