
PreSelect-100B

ModelScope Community | Updated 2025-12-05 | Indexed 2025-05-03
Download link:
https://modelscope.cn/datasets/hkust-nlp/PreSelect-100B
Description:
<p align="center">
📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
<br>
</p>

PreSelect-100B is a curated pretraining dataset of roughly 100B tokens that performs strongly on a range of downstream benchmarks. It is filtered by the [PreSelect-Classifier](https://huggingface.co/hkust-nlp/PreSelect-classifier) at a 10% threshold. The source pool is a randomly sampled subset of [DCLM-refinedweb](https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html), a cleaned version of raw Common Crawl data that has not undergone any model-based filtering.

### Benchmark results

Training on the PreSelect-curated dataset achieves better results than other data selection methods on various downstream tasks. The comparison is shown below.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/641c9662043963b1c0a1df52/_2eDuE5K06giMepA_lNSp.png)

### Citation

If you find this work helpful, please cite it as:

```
@article{shum2025predictivedataselectiondata,
  title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal={arXiv preprint arXiv:2503.00808},
  year={2025},
  eprint={2503.00808},
}
```
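
### Illustrative sketch: threshold-based selection

The description above says the pool is filtered by a classifier "at 10% threshold". As a rough illustration only, the sketch below assumes this means keeping the top-scoring 10% of documents in the pool. The function names `select_top_fraction` and `toy_score` are hypothetical, and `toy_score` is a placeholder heuristic, not the PreSelect classifier; in practice the score would come from the released fastText model.

```python
import math
from typing import Callable, List, Sequence


def select_top_fraction(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    fraction: float = 0.10,
) -> List[str]:
    """Keep the highest-scoring `fraction` of docs, preserving input order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return []
    scores = [score_fn(d) for d in docs]
    # Always keep at least one document so small pools are not emptied.
    keep = max(1, math.floor(len(docs) * fraction))
    # Rank by descending score; ties are broken by original position.
    ranked = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    kept = set(ranked[:keep])
    return [d for i, d in enumerate(docs) if i in kept]


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a classifier score: share of alphabetic characters."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)


if __name__ == "__main__":
    # Assumed toy pool: 20 strings with an increasing share of letters.
    pool = ["a" * i + "1" * (20 - i) for i in range(20)]
    selected = select_top_fraction(pool, toy_score, fraction=0.10)
    print(len(selected), "of", len(pool), "documents kept")
```

Ranking the whole pool and cutting at a fixed fraction is one simple reading of a percentage threshold; a production pipeline over a web-scale pool would more likely estimate the score cutoff from a sample and filter in a streaming pass.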

Provider:
maas
Created:
2025-02-18