jkminder/dolma3_mix-1T-annotated
Source: Hugging Face | Updated: 2026-03-24 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/jkminder/dolma3_mix-1T-annotated
Resource description:
---
configs:
- config_name: annotated
  data_files:
  - split: train
    path: "annotated/*.parquet"
- config_name: unannotated
  data_files:
  - split: train
    path: "unannotated/*.parquet"
---
# jkminder/dolma3_mix-1T-annotated
Annotation-based subsample of dolma3_mix-1T.
## Subsets
- **`annotated`**: rows marked for annotation (`has_annotation=True`), i.e. all rows with `safety_score >= 3` plus a matched random sample of lower-scoring rows.
- **`unannotated`**: the remaining rows (`has_annotation=False`).
Both subsets include `is_bad` (bool): `True` if `safety_score >= 3`.
## Usage
```python
from datasets import load_dataset
annotated = load_dataset("jkminder/dolma3_mix-1T-annotated", "annotated")
unannotated = load_dataset("jkminder/dolma3_mix-1T-annotated", "unannotated")
```
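Because the two subsets together hold roughly a billion rows, it may be more practical to stream them than to download everything up front. The sketch below is one possible approach, not part of the original card: `take_flagged` and `stream_annotated` are hypothetical helper names, and the only column it relies on is `is_bad`, which the card documents. The in-memory `demo_rows` are made-up stand-ins used to show the filter without network access.

```python
from itertools import islice
from typing import Iterable


def take_flagged(rows: Iterable[dict], n: int) -> list:
    """Return the first n rows whose `is_bad` column is True."""
    return list(islice((row for row in rows if row["is_bad"]), n))


def stream_annotated():
    """Open the `annotated` subset lazily (hypothetical helper)."""
    # Imported here so take_flagged can be used without the datasets library.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of downloading all shards.
    return load_dataset(
        "jkminder/dolma3_mix-1T-annotated",
        "annotated",
        split="train",
        streaming=True,
    )


# Made-up stand-in rows; real rows carry more columns than shown here.
demo_rows = [
    {"id": 0, "is_bad": False},
    {"id": 1, "is_bad": True},
    {"id": 2, "is_bad": True},
    {"id": 3, "is_bad": True},
]
flagged = take_flagged(demo_rows, 2)
```

Against the real data, the same filter would be applied as `take_flagged(stream_annotated(), 5)`, which stops reading as soon as five flagged rows have been seen.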
## Stats
| | Rows | Tokens |
|---|---|---|
| Annotated | 102,772,028 | 110.30B |
| Unannotated | 925,065,551 | 889.74B |
| **Total** | **1,027,837,579** | **1.00T** |
Annotation ratio (by tokens): 11.03% | Seed: 42 | Threshold: 3
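As a quick consistency check on the figures above, the totals and the annotation ratio can be recomputed from the per-subset numbers. This is only arithmetic on the published values; the variable names are illustrative. It suggests that the 11.03% figure is a share of tokens, since the share of rows comes out close to 10%.

```python
# Values copied from the Stats table (token counts in billions).
annotated_rows = 102_772_028
unannotated_rows = 925_065_551
annotated_tokens_b = 110.30
unannotated_tokens_b = 889.74

total_rows = annotated_rows + unannotated_rows
total_tokens_b = annotated_tokens_b + unannotated_tokens_b

# Share of the data that falls in the annotated subset, in percent.
token_ratio_pct = 100 * annotated_tokens_b / total_tokens_b
row_ratio_pct = 100 * annotated_rows / total_rows

print(f"rows:   {total_rows:,} total, {row_ratio_pct:.2f}% annotated")
print(f"tokens: {total_tokens_b:.2f}B total, {token_ratio_pct:.2f}% annotated")
```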
Provider:
jkminder



