
jkminder/dolma3_mix-1T-annotated

Source: Hugging Face | Updated: 2026-03-24 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/jkminder/dolma3_mix-1T-annotated
Description:
---
configs:
- config_name: annotated
  data_files:
  - split: train
    path: "annotated/*.parquet"
- config_name: unannotated
  data_files:
  - split: train
    path: "unannotated/*.parquet"
---

# jkminder/dolma3_mix-1T-annotated

Annotation-based subsample of dolma3_mix-1T.

## Subsets

- **`annotated`**: rows marked for annotation (`has_annotation=True`): safety_score >= 3 plus a matched random sample of lower-score rows.
- **`unannotated`**: the remaining rows (`has_annotation=False`).

Both subsets include `is_bad` (bool): `True` if `safety_score >= 3`.

## Usage

```python
from datasets import load_dataset

annotated = load_dataset("jkminder/dolma3_mix-1T-annotated", "annotated")
unannotated = load_dataset("jkminder/dolma3_mix-1T-annotated", "unannotated")
```

## Stats

| Subset | Rows | Tokens |
|---|---|---|
| Annotated | 102,772,028 | 110.30B |
| Unannotated | 925,065,551 | 889.74B |
| **Total** | **1,027,837,579** | **1.00T** |

Annotation ratio: 11.03% | Seed: 42 | Threshold: 3
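The card does not say whether the annotation ratio is measured over rows or over tokens. The short check below recomputes both from the figures in the Stats table. It is only a sanity check on the published numbers, not part of the dataset's tooling, and the variable names are illustrative. Under these figures, the reported 11.03% appears to match the token share rather than the row share.

```python
# Figures copied from the Stats table; token counts are in billions.
rows = {"annotated": 102_772_028, "unannotated": 925_065_551}
tokens_b = {"annotated": 110.30, "unannotated": 889.74}

total_rows = sum(rows.values())
total_tokens_b = sum(tokens_b.values())

# Share of the annotated subset, measured two ways.
row_ratio = rows["annotated"] / total_rows
token_ratio = tokens_b["annotated"] / total_tokens_b

print(f"total rows:   {total_rows:,}")
print(f"total tokens: {total_tokens_b / 1000:.2f}T")
print(f"annotated share by rows:   {row_ratio:.2%}")
print(f"annotated share by tokens: {token_ratio:.2%}")
```

The row total reproduces the table's 1,027,837,579 exactly, while the token total is subject to the two-decimal rounding of the published per-subset counts.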
Provider:
jkminder