dignity045/Collective-Corpus
Source: Hugging Face. Updated 2025-08-12. Indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/dignity045/Collective-Corpus
Description:
Collective-Corpus is a large-scale multilingual dataset of more than 500 billion tokens, designed for training Transformer-based language models from scratch and for fine-tuning them across a range of domains. Its large, diverse multilingual text sources have been cleaned, deduplicated, and filtered for quality. The dataset aims to cover the full language-model lifecycle, from raw pretraining to domain-specific fine-tuning.
Provider:
dignity045



