ontocord/MixtureVitae-v1

Name: ontocord/MixtureVitae-v1
Creator: ontocord
Published: 2026-01-17 17:17:24
License: No description available

Hugging Face. Updated: 2026-01-17. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/ontocord/MixtureVitae-v1

Description:

MixtureVitae is a 422B-token open pretraining dataset introduced in the paper *MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources*. It is designed to train competitive large language models using only permissively licensed and low-risk data. It integrates three major components: Curated Sources (~210B tokens), Instruction & Reasoning (~178B tokens), and Web (~34B tokens).

Curated Sources covers high-quality domain text such as SEC filings, arXiv/PubMed, patents, MegaWika, scientific, news, and legal corpora, and code from The Stack v1. Instruction & Reasoning covers synthetic instruction, question-answering, math, and code data generated from permissively licensed seeds. Web covers filtered crawl data that is either permissively licensed or has been re-filtered.

Each text sequence in the dataset consists of one or more documents concatenated together and separated by special tokens.
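The component sizes and the concatenated layout described above can be checked with a short sketch. The token counts come from the description; the separator string `<|endoftext|>` and the helper names are assumptions made for illustration only, since the description says "special tokens" without naming them.

```python
# Approximate token counts in billions, as stated in the dataset description.
COMPONENT_TOKENS_B = {
    "Curated Sources": 210,
    "Instruction & Reasoning": 178,
    "Web": 34,
}

# ASSUMPTION: the description only mentions "special tokens"; this string is a
# placeholder and may not match the marker actually used in the dataset.
ASSUMED_SEPARATOR = "<|endoftext|>"


def component_shares(tokens_b):
    """Return each component's share of the total, as a percentage rounded to 1 decimal."""
    total = sum(tokens_b.values())
    return {name: round(100 * count / total, 1) for name, count in tokens_b.items()}


def split_documents(sequence, separator=ASSUMED_SEPARATOR):
    """Split one concatenated text sequence into its non-empty documents."""
    return [doc.strip() for doc in sequence.split(separator) if doc.strip()]


if __name__ == "__main__":
    # The three components should add up to the stated 422B total.
    print(sum(COMPONENT_TOKENS_B.values()))
    print(component_shares(COMPONENT_TOKENS_B))

    # A made-up sequence, used only to show the splitting step.
    sample = "First document." + ASSUMED_SEPARATOR + "Second document." + ASSUMED_SEPARATOR
    print(split_documents(sample))
```

Before relying on the split, inspect a few real records to confirm which separator the dataset uses and pass it via the `separator` argument.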

Provider:

ontocord
