
dignity045/Collective-Corpus

Source: Hugging Face. Updated 2025-08-12; indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/dignity045/Collective-Corpus
Description:
Collective-Corpus is a large-scale multilingual dataset of more than 500 billion tokens, designed for training Transformer-based language models from scratch and fine-tuning them across a variety of domains. Its diverse multilingual text sources are cleaned, deduplicated, and filtered for quality, and the dataset aims to cover the full language-model lifecycle, from raw pretraining to domain-specific fine-tuning.
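At this scale, a full download is rarely practical for a first look, so streaming a few records is one way to inspect the data. The following is a minimal sketch, not the publisher's documented workflow: it assumes the repository loads with the Hugging Face `datasets` library under its default configuration and exposes a split named `train`. Neither is confirmed by this listing. The helper names `preview` and `stream_corpus` are hypothetical.

```python
from itertools import islice

# Repository name as given in this listing.
REPO_ID = "dignity045/Collective-Corpus"


def preview(records, n=3):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(records, n))


def stream_corpus(repo_id=REPO_ID, split="train"):
    """Open the dataset lazily instead of downloading it in full.

    Assumption: a split called "train" exists and the default
    configuration loads; check the dataset card before relying on this.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# Example use (requires network access and the `datasets` package):
#     for record in preview(stream_corpus()):
#         print(record)
```

Streaming returns an iterable, so `preview` only pulls the records it needs; the field names in each record depend on the dataset's actual schema, which is not described here.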
Provided by:
dignity045