SkyPile-150B: A Comprehensive Large-Scale Chinese Dataset

超神经 | Updated 2024-01-11 | Listed 2024-05-15

Download link:

https://hyper.ai/cn/datasets/28906

Resource overview:

SkyPile-150B is a comprehensive, large-scale Chinese dataset designed specifically for pre-training large language models. It is drawn from a large number of publicly accessible Chinese web pages. To ensure quality, the data goes through strict filtering, extensive deduplication, and thorough removal of sensitive content. The researchers also used fastText and BERT to filter out low-quality data.
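
The overview names filtering and deduplication but does not spell out the actual pipeline. As a rough illustration only, the sketch below shows one way rule-based filtering and exact deduplication could look on a small in-memory corpus. The thresholds, function names, and the CJK-ratio rule are assumptions made for this example, not the SkyPile-150B method, and the model-based fastText/BERT quality filter is not reproduced here.

```python
import hashlib
import re
import unicodedata

# Hypothetical thresholds, chosen only for illustration.
MIN_CHARS = 20
MIN_CJK_RATIO = 0.5

_CJK = re.compile(r"[\u4e00-\u9fff]")
_WS = re.compile(r"\s+")


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivially different copies hash alike."""
    return _WS.sub(" ", unicodedata.normalize("NFKC", text)).strip()


def passes_rules(text: str) -> bool:
    """Cheap rule-based filter: long enough and mostly Chinese characters."""
    if len(text) < MIN_CHARS:
        return False
    non_space = [c for c in text if not c.isspace()]
    cjk = sum(1 for c in non_space if _CJK.match(c))
    return cjk / len(non_space) >= MIN_CJK_RATIO


def clean_corpus(docs):
    """Yield normalized documents that pass the rules, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        text = normalize(doc)
        if not passes_rules(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Calling `list(clean_corpus(docs))` on a list of raw page texts returns the surviving documents in their original order. Hashing the normalized text only catches exact copies; near-duplicate detection and sensitive-content screening would need additional steps that this sketch leaves out.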

Created:

2024-01-11

Summary

Dataset introduction