ontocord/MixtureVitae

Name: ontocord/MixtureVitae
Creator: ontocord
Published: 2025-06-12 15:01:22
License: No description available

Hugging Face | Updated: 2025-06-12 | Indexed: 2025-04-12

Download link:

https://hf-mirror.com/datasets/ontocord/MixtureVitae

Description:

MixtureVitae is an open-source, permissively licensed, high-quality pretraining dataset for large language models (LLMs), spanning a wide variety of modalities, domains, and languages. It includes over 1 trillion tokens of diverse text and multimodal content, filtered for copyright permissiveness and enriched with high-quality synthetic data. It aims to support the development of transparent, open-access AI while reducing legal uncertainty around copyright and data provenance.

Provider:

ontocord
