RefinedWeb Dataset
Source: arXiv. Updated: 2023-06-02. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2306.01116v1
Description:
A high-quality dataset extracted from CommonCrawl, comprising 600 billion tokens and tailored for training large language models.
Created:
2023-06-02



