hkust-nlp/PreSelect-100B
Source: Hugging Face | Updated: 2025-03-04 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/hkust-nlp/PreSelect-100B
Description:
PreSelect-100B is a curated pretraining dataset of roughly 100B tokens that performs well on a range of benchmarks. It was produced by filtering a randomly sampled subset of DCLM-refinedweb with PreSelect-Classifier at a 10% threshold. DCLM-refinedweb is a cleaned version of raw Common Crawl data that has not undergone any model-based filtering.
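As a rough illustration of threshold-based selection, the sketch below keeps the highest-scoring fraction of a document pool. It assumes that "10% threshold" means retaining the top-scoring 10% of documents, which the description does not state explicitly. The function name `select_top_fraction` and the scores are hypothetical placeholders, not PreSelect-Classifier output.

```python
import math


def select_top_fraction(docs, scores, fraction=0.10):
    """Keep the highest-scoring `fraction` of docs, preserving input order.

    Assumption: the threshold is a fraction of documents to retain,
    not a cut-off on the raw classifier score.
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Number of documents to keep, rounded up so small pools keep at least one.
    keep = math.ceil(len(docs) * fraction)

    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

    # Restore the original ordering among the retained documents.
    kept_indices = sorted(ranked[:keep])
    return [docs[i] for i in kept_indices]


# Toy pool of ten documents with made-up quality scores.
toy_docs = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
toy_scores = [0.10, 0.90, 0.30, 0.20, 0.50, 0.40, 0.80, 0.60, 0.70, 0.05]

selected = select_top_fraction(toy_docs, toy_scores, fraction=0.10)
print(selected)
```

With ten documents and a fraction of 0.10, exactly one document is retained. At the scale of a web corpus, the same idea would be applied to classifier scores computed in a streaming or sharded fashion rather than held in a list.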
Provider:
hkust-nlp



