ontocord/MixtureVitae-300BT
Source: Hugging Face | Updated: 2025-09-10 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/ontocord/MixtureVitae-300BT
Description:
MixtureVitae (Working Version): Text-Only Permissive Subset is a simplified version of the text portion of the MixtureVitae permissive dataset. It draws on CC-BY-licensed, public-domain, and government-website sources. The dataset covers domains such as business, politics, law, science, and technology, and includes a large amount of synthetic data. It is designed to support the pretraining of foundation large language models (LLMs). The dataset's permissive licensing is intended to reduce copyright risk for researchers.
Provider:
ontocord



