
SkyPile-150B

Source: ModelScope community · Updated 2026-05-22 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/modelscope/SkyPile-150B
Resource description:
# SkyPile-150B

## Dataset Summary

SkyPile-150B is a large-scale Chinese dataset designed for pre-training large language models. It is derived from a broad range of publicly accessible Chinese web pages. Rigorous filtering, extensive deduplication, and thorough sensitive-data filtering were applied to ensure its quality. We also used fastText and BERT to filter out low-quality data.

The publicly accessible portion of SkyPile-150B contains approximately 233 million unique web pages, each with more than 1,000 Chinese characters on average. In total, the dataset includes approximately 150 billion tokens and 620 gigabytes of plain text.

## Language

The SkyPile-150B dataset consists exclusively of Chinese data.

## Data Field Explanation

- text: the processed and cleaned text extracted from each page.

## Dataset Safety

We used more than 2 million (200w) rules and the BERT-base model to identify sensitive data in the dataset, and then removed any harmful entries we detected.

## Sensitive Information and Bias

Despite our best efforts, SkyPile-150B is built from publicly available web pages and may therefore contain sensitive information such as email addresses, phone numbers, or IP addresses. We have tried to minimize this through deduplication and low-quality filtering, but users of SkyPile-150B should remain vigilant.

The Internet is rife with potentially toxic or biased data. We have attempted to mitigate this with specific URL filtering methods, but we encourage users to stay aware of this potential issue.

## Social Impact of the Dataset

The open-source release of SkyPile-150B reflects our commitment to improving access to high-quality web data, which has traditionally been a closely guarded resource among model developers. We believe this release will make high-performance large language models more accessible and more widespread, and thereby contribute significantly to the advancement of the field.

## License Agreement

Community use of the SkyPile dataset is subject to the Skywork Community License. The SkyPile dataset supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by the terms and conditions of the Skywork Community License as well as Apache 2.0.

## Contact Us and Citation

If you find our work helpful, please cite our paper:

```
@misc{wei2023skywork,
  title={Skywork: A More Open Bilingual Foundation Model},
  author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
  year={2023},
  eprint={2310.19341},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
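
## Example: Screening `text` for Sensitive Patterns

The card above notes that records expose a single `text` field and that email addresses, phone numbers, or IP addresses may remain in the data. The following is a minimal sketch of how a user might screen records before training. It assumes the data has been downloaded as JSON Lines (one JSON object per line), which the card does not state. The regular expressions and the function names `find_sensitive` and `scan_jsonl` are illustrative assumptions, not the rules the dataset authors used.

```python
import io
import json
import re

# Illustrative patterns only (assumptions, not the dataset authors' rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Mainland China mobile numbers: 11 digits, first digit 1, second digit 3-9.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Dotted-quad IPv4 candidates; octet ranges are checked separately below.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_sensitive(text):
    """Return a dict mapping pattern name to the list of matches in text."""
    ips = [
        candidate
        for candidate in IPV4_RE.findall(text)
        if all(int(part) <= 255 for part in candidate.split("."))
    ]
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "ipv4": ips,
    }


def scan_jsonl(lines):
    """Yield (line_number, hits) for each record whose `text` has a match.

    `lines` is any iterable of JSON Lines strings, e.g. an open file.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        hits = find_sensitive(record.get("text", ""))
        if any(hits.values()):
            yield line_number, hits


if __name__ == "__main__":
    # Two made-up records standing in for a downloaded shard.
    sample = io.StringIO(
        '{"text": "联系邮箱 test@example.com,电话 13800138000。"}\n'
        '{"text": "这是一段普通的中文文本。"}\n'
    )
    for line_number, hits in scan_jsonl(sample):
        print(line_number, hits)
```

Simple patterns like these miss many formats (landlines, IPv6, obfuscated addresses) and can flag harmless strings, so a pass of this kind is a first screen rather than a guarantee.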

Provider:
maas
Created:
2023-11-06