PreSelect-100B

Name: PreSelect-100B
Creator: maas
Published: 2025-12-05 12:10:18
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-05-03

Download link:

https://modelscope.cn/datasets/hkust-nlp/PreSelect-100B

Resource overview:

📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>

PreSelect-100B is a curated pretraining dataset of roughly 100B tokens that performs strongly on a range of benchmarks. It was filtered with the [PreSelect-Classifier](https://huggingface.co/hkust-nlp/PreSelect-classifier) at a 10% threshold. The candidate pool is a randomly sampled subset of [DCLM-refinedweb](https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html), a cleaned version of raw Common Crawl data to which no model-based filtering has been applied.

### Benchmark results

Training on the PreSelect-curated dataset achieves better results than other data selection methods on various downstream tasks. The comparison is shown below.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/641c9662043963b1c0a1df52/_2eDuE5K06giMepA_lNSp.png)

### Citation

If you find this work helpful, please cite it as:

```
@article{shum2025predictivedataselectiondata,
  title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal={arXiv preprint arXiv:2503.00808},
  year={2025},
  eprint={2503.00808},
}
```
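### Illustrative sketch: threshold-based selection

The filtering step described above can be pictured with a minimal sketch. It assumes that "10% threshold" means keeping the highest-scoring 10% of documents in the pool; that reading is an assumption, not something this card states. The names `select_top_fraction` and `toy_score` are hypothetical. The toy scorer is only a stand-in so the example runs on its own, and in practice the score would presumably come from the released fastText classifier.

```python
import math
from typing import Callable, Iterable, List, Tuple


def select_top_fraction(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    fraction: float = 0.10,
) -> List[Tuple[float, str]]:
    """Return the highest-scoring `fraction` of docs as (score, doc) pairs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    scored = [(score_fn(doc), doc) for doc in docs]
    if not scored:
        return []
    # Keep at least one document, rounding the cut-off up.
    keep = max(1, math.ceil(len(scored) * fraction))
    # Sort by score only, highest first; the sort is stable, so ties
    # keep their original input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a classifier score: unique-word ratio."""
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0


if __name__ == "__main__":
    pool = ["a a a a", "a b c d", "a a b b", "a b a c"]
    for score, doc in select_top_fraction(pool, toy_score, fraction=0.5):
        print(f"{score:.2f}  {doc}")
```

Ranking the whole pool and cutting at a fixed fraction keeps the selected size predictable regardless of how the scores are distributed. A fixed score cut-off would be an alternative design, but its yield depends on the score distribution of the pool.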

Provider:

maas

Created:

2025-02-18
