mixture-vitae/MixtureVitae-2TT
Hugging Face · Updated 2025-11-12 · Indexed 2025-10-25
Download link:
https://hf-mirror.com/datasets/mixture-vitae/MixtureVitae-2TT
Description:
Aurora-M2 is a multilingual, permissive, synthetic, and decontaminated pre-training dataset based on the MixtureVitae permissive dataset. It draws on government websites, CC-BY websites, public domain sources, and more, and is expected to reach about 2 trillion tokens when complete. About half of the dataset is synthetic, including a large amount of permissively licensed code, math, and science reasoning traces; the synthetic data is itself generated from permissively licensed sources. The dataset is designed for pretraining foundation LLMs and covers business, politics, formatted text, law, math, news, science and tech, software, Stack Exchange, wiki, YouTube, and more. It is meant to be easy to use with fewer licensing hurdles, and it avoids datasets generated by commercial models. The dataset is licensed under the ODC-By license for the work that is not derived from the underlying data.
Provider:
mixture-vitae



