PreSelect-100B
Source: ModelScope community (updated 2025-12-05, indexed 2025-05-03)
Download link: https://modelscope.cn/datasets/hkust-nlp/PreSelect-100B
Description:
<p align="center">
📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a>    |    🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a>    |    🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a>    |    📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
<br>
</p>
PreSelect-100B is a curated pretraining dataset of roughly 100B tokens; models trained on it perform strongly on a range of benchmarks.
It was filtered with [PreSelect-Classifier](https://huggingface.co/hkust-nlp/PreSelect-classifier) at a 10% threshold. The source pool is a randomly sampled subset of [DCLM-refinedweb](https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html), a cleaned version of raw Common Crawl data that has not undergone any model-based filtering.
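As a rough illustration of threshold-based filtering, the sketch below keeps the highest-scoring fraction of a document pool. It assumes that the 10% threshold means retaining the top 10% of documents by classifier score. `select_top_fraction` and `toy_score` are hypothetical names introduced here, and `toy_score` is only a stand-in for the real classifier's score, not the PreSelect pipeline itself.

```python
from typing import Callable, List, Sequence


def select_top_fraction(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.10,
) -> List[str]:
    """Keep the highest-scoring `keep_fraction` of docs, preserving input order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    scores = [score_fn(d) for d in docs]
    # Floor with a small tolerance for float error, but always keep at least one doc.
    n_keep = max(1, int(len(docs) * keep_fraction + 1e-9))
    # Rank indices by score, highest first; ties are broken by original position.
    ranked = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:n_keep])
    return [d for i, d in enumerate(docs) if i in keep]


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a classifier score: share of distinct words."""
    words = text.split()
    return len(set(words)) / max(1, len(words))


if __name__ == "__main__":
    pool = ["spam spam spam spam", "a short varied sentence", "buy buy now now"]
    # With three documents and keep_fraction=0.10, the minimum of one document is kept.
    print(select_top_fraction(pool, toy_score))
```

In a real pipeline, `score_fn` would wrap the released classifier, and the cutoff would be computed over the whole pool rather than per batch so that the retained fraction is consistent.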
### Benchmark results
Training on the PreSelect-curated dataset achieves better results than other data selection methods on various downstream tasks. Comparisons with those methods are shown below.

### Citation
If you find this work helpful, please kindly cite as:
```
@article{shum2025predictivedataselectiondata,
title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
journal={arXiv preprint arXiv:2503.00808},
year={2025},
eprint={2503.00808},
}
```
Provider: maas
Created: 2025-02-18



