MixtureVitae-v1
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/ontocord/MixtureVitae-v1
Resource description:
# MixtureVitae
## Dataset Summary
**MixtureVitae** is a **422B-token open pretraining dataset** introduced in the paper
[*MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources*](https://arxiv.org/abs/2509.25531).
The dataset is designed to answer a key question:
> *Can we train competitive large language models using only permissive-licensed and low-risk data, without resorting to unrestricted web scrapes?*
To this end, MixtureVitae prioritizes **permissive licensing, legal safety, and transparent provenance** while still maintaining high performance across reasoning, instruction following, and general NLP tasks.
## Dataset Composition
MixtureVitae integrates three major components (≈ 422B tokens total):
- **Curated Sources (~210B tokens)**
High-quality domain text: SEC filings, arXiv/PubMed, patents, MegaWika, science/news/legal corpora, The Stack v1 code (~12% of total).
- **Instruction & Reasoning (~178B tokens)**
Synthetic instruction/QA/math/code data, generated from permissive seeds (e.g., Magpie, MetaMathQA, OpenMathInstruct, UltraFeedback, Glaive-AI, OpenThoughts).
- **Web (~34B tokens)**
Selected permissive or re-filtered crawls (Nemotron-CC, MagaCorpus, FineFineWeb).
**By license tier:**
- Tier 1: 352B (explicit open licenses & PD)
- Tier 2: 52B (curated permissive repositories like The Stack v1)
- Tier 3: 18B (civic/government works)
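The two breakdowns above should each add up to the stated total of roughly 422B tokens. The following is a minimal sketch that checks this and derives percentage shares; the dictionary and function names are illustrative and not part of the dataset, and the counts are the rounded figures quoted in this card.

```python
# Token counts in billions, copied from the composition lists above.
COMPONENTS = {
    "Curated Sources": 210,
    "Instruction & Reasoning": 178,
    "Web": 34,
}
LICENSE_TIERS = {
    "Tier 1": 352,
    "Tier 2": 52,
    "Tier 3": 18,
}


def shares(breakdown):
    """Return each entry's share of the breakdown total, in percent."""
    total = sum(breakdown.values())
    return {name: round(100 * count / total, 1) for name, count in breakdown.items()}


component_total = sum(COMPONENTS.values())
tier_total = sum(LICENSE_TIERS.values())
component_shares = shares(COMPONENTS)
tier_shares = shares(LICENSE_TIERS)

print(f"components: {component_total}B, license tiers: {tier_total}B")
for name, pct in {**component_shares, **tier_shares}.items():
    print(f"{name}: {pct}%")
```

Both sums come to 422, so the component view and the license-tier view describe the same token budget from two angles.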
## Dataset Structure
Each example in MixtureVitae consists of one or more documents concatenated into a text sequence.
- Documents are separated by the special token `<|endoftext|>`. We recommend replacing it with the `eos` token of the tokenizer you use to train your model.
- We have used `<think>` and `</think>` tokens in some reasoning datasets. You may wish to add these special tokens to your tokenizer.
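The handling described above can be sketched with plain string operations. This is an illustrative example, not the authors' preprocessing code: the helper names are hypothetical, the sample text is made up, and `"</s>"` stands in for whatever `eos` token your tokenizer actually defines.

```python
import re

DOC_SEPARATOR = "<|endoftext|>"  # separator named in this dataset card
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_documents(example_text):
    """Split one example into its constituent documents, dropping empty pieces."""
    return [doc for doc in example_text.split(DOC_SEPARATOR) if doc.strip()]


def replace_separator(example_text, eos_token):
    """Swap the dataset separator for the target tokenizer's eos token string."""
    return example_text.replace(DOC_SEPARATOR, eos_token)


def extract_reasoning(document):
    """Return the text inside each <think>...</think> span of a document."""
    return [span.strip() for span in THINK_PATTERN.findall(document)]


# Made-up sample with two documents, the second containing a reasoning span.
sample = (
    "First document.<|endoftext|>"
    "<think>2 + 2 = 4</think>The answer is 4.<|endoftext|>"
)

docs = split_documents(sample)
converted = replace_separator(sample, "</s>")  # "</s>" is an assumed eos token
reasoning = extract_reasoning(docs[1])
```

If you keep `<think>` and `</think>` as single tokens, they also need to be registered with the tokenizer. With Hugging Face `transformers` this is typically done through `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`; that step depends on your training stack and is not shown here.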
## Limitations & Considerations
- Not 100% free of legal risk; license heuristics may miss edge cases.
- No full cross-dataset deduplication → potential near-duplicates.
- Domain balance favors reasoning/math/instruction, underrepresents other genres.
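Because no full cross-dataset deduplication was applied, users may want to run their own near-duplicate check on the subsets they train on. The sketch below uses word-shingle Jaccard similarity, a common general-purpose technique and not a method described in this card; the function names, the shingle size, and the 0.8 threshold are assumptions chosen for illustration.

```python
def shingles(text, size=3):
    """Return the set of lowercase word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_pairs(documents, threshold=0.8, size=3):
    """Return index pairs whose shingle sets have Jaccard similarity >= threshold."""
    sets = [shingles(doc, size) for doc in documents]
    return [
        (i, j)
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Made-up documents: the first two differ by one trailing word.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river",
    "an unrelated sentence about quarterly filings and patent claims",
]
pairs = near_duplicate_pairs(docs)
```

The pairwise comparison is quadratic in the number of documents, so it only suits small samples. At the scale of this dataset, an approximate scheme such as MinHash with locality-sensitive hashing would be the more realistic choice.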
## How to Cite
```bibtex
@misc{nguyen2025mixturevitaeopenwebscalepretraining,
title={MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources},
author={Huu Nguyen and Victor May and Harsh Raj and Marianna Nezhurina and Yishan Wang and Yanqi Luo and Minh Chien Vu and Taishi Nakamura and Ken Tsui and Van Khue Nguyen and David Salinas and Aleksandra Krasnodębska and Christoph Schuhmann and Mats Leon Richter and Xuan-Son Vu and Jenia Jitsev},
year={2025},
eprint={2509.25531},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.25531},
}
```
Provider: maas
Created: 2025-11-30



