ontocord/MixtureVitae
Hugging Face | Updated 2025-06-12 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/ontocord/MixtureVitae
Description:
MixtureVitae is an open-source, permissively licensed, high-quality pretraining dataset for large language models (LLMs), covering a wide variety of modalities, domains, and languages. It includes over 1 trillion tokens of diverse text and multimodal content, filtered for copyright permissiveness and enriched with high-quality synthetic data. Its aim is to facilitate the development of transparent, open-access AI while reducing legal uncertainty around copyright and data provenance.
Provider:
ontocord



