occiglot/occiglot-fineweb-v1.0
Source: Hugging Face. Updated: 2024-11-16. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/occiglot/occiglot-fineweb-v1.0
Description:
Occiglot Fineweb v1.0 is a multilingual text generation dataset containing approximately 430M heavily cleaned documents from 10 languages. The dataset builds on existing curated datasets and pre-filtered web data, and has undergone language-specific filtering and deduplication. The dataset provides data at three levels of processing: after filtering, after local deduplication, and after global deduplication. The data sources mainly include LLM-Dataset and Web-Data. The dataset also provides detailed statistics, including the number of documents and tokens for each language.
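The difference between the local and global deduplication levels can be illustrated with a minimal sketch. It assumes that "local" means removing duplicates within a single source and "global" means removing duplicates across all sources; the listing does not spell this out. It also uses exact matching on a normalized hash, which is a simplification and not necessarily the method the dataset authors used. The function names and the source labels `curated` and `web` are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash a document after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_local(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assumed 'local' level: drop duplicates within each source separately."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # reset per source, so cross-source copies survive
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def dedup_global(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assumed 'global' level: drop duplicates across all sources,
    keeping the first occurrence in iteration order."""
    seen = set()  # shared across every source
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical toy input: two sources with overlapping documents.
sources = {
    "curated": ["Hallo Welt", "hallo  welt", "Bonjour"],
    "web": ["Bonjour", "Ciao"],
}
local = dedup_local(sources)
globally_unique = dedup_global(local)
```

In this toy input, the local pass removes the near-identical copy inside `curated` but leaves "Bonjour" in both sources; the global pass then keeps it only in the first source that contains it.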
Provider:
occiglot



