bytedance-research/MAGACorpus
Source: Hugging Face | Updated: 2025-02-15 | Indexed: 2025-04-08
Download link:
https://hf-mirror.com/datasets/bytedance-research/MAGACorpus
Description:
Massive Genre-Audience Corpus (MAGACorpus) is a synthetic pretraining corpus expanded from the FineWeb-EDU-Dedup subset of the SmolLM Corpus. A two-stage synthesis process rewrites each source document into 5 new documents, yielding a 3.9× expansion in token count, and a large set of (genre, audience) pairs keeps the rewrites diverse. The corpus is intended for model pretraining and supports training models from scratch at the 134M, 377M, and 1.7B parameter scales.
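
The two figures in the description (5 rewrites per document, 3.9× token expansion) together imply a rough average length for each rewrite. The sketch below works through that arithmetic. It assumes that the 3.9× figure compares total synthetic tokens to total source tokens and that every source document yields exactly 5 rewrites; neither assumption is confirmed by the listing. The function names and the 1,000-token source size are hypothetical and used only for illustration.

```python
import math


def mean_rewrite_ratio(token_expansion: float, docs_per_source: int) -> float:
    """Average length of one rewritten document relative to its source.

    Assumption: token_expansion = (total synthetic tokens) / (total source tokens),
    and every source document produces exactly docs_per_source rewrites.
    """
    if docs_per_source <= 0:
        raise ValueError("docs_per_source must be positive")
    return token_expansion / docs_per_source


def synthetic_tokens(source_tokens: float, token_expansion: float) -> float:
    """Total synthetic tokens produced from a given number of source tokens."""
    return source_tokens * token_expansion


# Figures taken from the description above.
TOKEN_EXPANSION = 3.9
DOCS_PER_SOURCE = 5

ratio = mean_rewrite_ratio(TOKEN_EXPANSION, DOCS_PER_SOURCE)

# Hypothetical 1,000-token source document, for illustration only.
total = synthetic_tokens(1_000, TOKEN_EXPANSION)
per_rewrite = total / DOCS_PER_SOURCE

print(f"mean rewrite length ratio: {ratio:.2f}")
print(f"synthetic tokens from 1,000 source tokens: {total:.0f}")
print(f"average tokens per rewrite: {per_rewrite:.0f}")
```

Under these assumptions, each rewrite averages somewhat less than the length of its source document, so the expansion comes from the number of rewrites rather than from longer text.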
Provided by:
bytedance-research



