trend-cybertron/Primus-FineWeb
Source: Hugging Face | Updated: 2025-08-08 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/trend-cybertron/Primus-FineWeb
Description:
The Primus-FineWeb dataset is an open-source collection of cybersecurity-related text filtered from FineWeb, intended for training cybersecurity large language models. Each FineWeb text is scored with a TinyBERT binary classifier, and only texts scoring above 0.003 are kept. After deduplication, the resulting cybersecurity corpus contains 2.57 billion tokens.
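The score-then-deduplicate procedure described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' pipeline: `toy_score` is a hypothetical keyword stand-in for the TinyBERT classifier, and exact-match hashing is an assumed deduplication method, since the description does not say which one was used. Only the 0.003 threshold comes from the description.

```python
import hashlib
from typing import Callable, Iterable, Iterator

THRESHOLD = 0.003  # cutoff quoted in the dataset description


def filter_and_dedupe(
    texts: Iterable[str],
    score: Callable[[str], float],
    threshold: float = THRESHOLD,
) -> Iterator[str]:
    """Yield texts scoring above the threshold, skipping exact duplicates."""
    seen = set()
    for text in texts:
        # Keep only texts the classifier rates above the cutoff.
        if score(text) <= threshold:
            continue
        # Assumption: exact-match dedup on a hash of the stripped text.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the TinyBERT binary classifier."""
    keywords = ("malware", "exploit", "phishing")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.0


if __name__ == "__main__":
    sample = ["New malware found", "Cake recipe", "New malware found"]
    for kept in filter_and_dedupe(sample, toy_score):
        print(kept)
```

In a real setting the `score` argument would wrap the actual classifier's positive-class probability, and the generator form lets the filter stream over a corpus without loading it into memory.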
Provider:
trend-cybertron



