MAGACorpus
Source: ModelScope community. Updated 2025-12-04; indexed 2025-05-03.
Download link:
https://modelscope.cn/datasets/ByteDance-Seed/MAGACorpus
Description:
# Massive Genre-Audience Augment Fineweb-Edu Corpus
This dataset is a synthetic pretraining corpus described in the paper [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2502.04235).
<img src="https://cdn-uploads.huggingface.co/production/uploads/64b764bffdb702b3d8640610/WIEom2dItQvCyQciQW9pz.png" width="800">
Overview of the synthesis framework. Our method expands the original corpus through a two-stage synthesis process.
Each document is reformulated into 5 new documents, achieving a 3.9× expansion in token count while maintaining diversity through a massive set of (genre, audience) pairs.
We build MGACorpus on top of [SmolLM Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), expanding the fineweb-edu-dedup source from 195B tokens to 770B tokens.
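As a quick sanity check, the quoted expansion factor can be recomputed from the two token counts above. This is only a minimal arithmetic sketch; the variable names are illustrative and not part of the dataset or the paper.

```python
# Token counts quoted in this card, in billions of tokens.
original_tokens_b = 195   # fineweb-edu-dedup source
expanded_tokens_b = 770   # size after expansion

# Ratio of expanded size to original size.
expansion_ratio = expanded_tokens_b / original_tokens_b

# One decimal place matches the precision used in the card.
print(f"expansion ratio: {expansion_ratio:.1f}x")
```

Rounded to one decimal place, the ratio agrees with the 3.9× figure stated above.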
## Performance
Our baseline is trained on the SmolLM-Corpus dataset,
and the experiments add MGACorpus as incremental data.
<img src="https://cdn-uploads.huggingface.co/production/uploads/64b764bffdb702b3d8640610/QB4wPWUlp-nqYMOpn5LwP.png" width="800">
Training dynamics of two common scenarios under data-constrained conditions:
- (1) expanding a 50B high-quality dataset to a 500B training budget (entire set repetition).
- (2) expanding a 500B mixed-quality dataset to a 700B training budget (subset repetition).
<img src="https://cdn-uploads.huggingface.co/production/uploads/64b764bffdb702b3d8640610/4KRquxzZVW861EN-luxJ1.png" width="750">
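To make the two scenarios concrete, the sketch below computes how much repetition each one implies if the training budget were filled purely by repeating the available data. That reading of "repetition" is our assumption, and `repetition_stats` is a hypothetical helper, not something taken from the paper or its code.

```python
def repetition_stats(dataset_tokens_b: float, budget_tokens_b: float) -> dict:
    """Return repetition figures for a dataset stretched to a training budget.

    Both arguments are in billions of tokens. Assumes the budget is filled
    only by re-reading the dataset (no synthetic data added).
    """
    if dataset_tokens_b <= 0:
        raise ValueError("dataset_tokens_b must be positive")
    return {
        # Average number of times each token is seen during training.
        "average_passes": budget_tokens_b / dataset_tokens_b,
        # Tokens in the budget that are repeats of already-seen data.
        "repeated_tokens_b": max(budget_tokens_b - dataset_tokens_b, 0),
    }


# Scenario (1): 50B high-quality dataset, 500B budget.
scenario_1 = repetition_stats(50, 500)
# Scenario (2): 500B mixed-quality dataset, 700B budget.
scenario_2 = repetition_stats(500, 700)

print(scenario_1)
print(scenario_2)
```

Under this assumption, scenario (1) corresponds to about 10 passes over the whole dataset, while scenario (2) needs 200B repeated tokens, which the card says come from repeating a subset rather than the entire set.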
## Dataset Schema
```
root
 |-- meta: struct (nullable = true)
 |    |-- chunk_id: string (nullable = true)
 |    |-- docid: string (nullable = true)
 |    |-- meta_extra: string (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- split: string (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- audience: string (nullable = true)
 |    |-- raw_text: string (nullable = true)
 |-- content_split: string (nullable = true)
```
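The schema above can be mirrored in Python type hints, which helps when writing code that has to cope with nullable fields. This is a sketch only: the `Meta` and `Record` classes, the `genre_audience` helper, and every value in `example` are hypothetical and made up for illustration; only the field names come from the schema.

```python
from typing import Optional, Tuple, TypedDict


class Meta(TypedDict):
    chunk_id: Optional[str]
    docid: Optional[str]
    meta_extra: Optional[str]
    source: Optional[str]
    split: Optional[str]
    genre: Optional[str]
    audience: Optional[str]
    raw_text: Optional[str]


class Record(TypedDict):
    meta: Optional[Meta]
    content_split: Optional[str]


def genre_audience(record: Record) -> Tuple[Optional[str], Optional[str]]:
    """Return the (genre, audience) pair, tolerating a null `meta` struct."""
    meta = record.get("meta") or {}
    return meta.get("genre"), meta.get("audience")


# Made-up record for illustration; real values will differ.
example: Record = {
    "meta": {
        "chunk_id": "chunk-0",
        "docid": "doc-0",
        "meta_extra": None,
        "source": "fineweb-edu-dedup",
        "split": "train",
        "genre": "tutorial",
        "audience": "high-school students",
        "raw_text": "Original source text.",
    },
    "content_split": "Reformulated text.",
}

print(genre_audience(example))
```

Because every field is declared nullable, the helper falls back to an empty mapping when `meta` is missing or null instead of raising.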
## Loading the dataset
```python
from datasets import load_dataset

# Load the train split from the Hugging Face Hub.
ds = load_dataset("ByteDance-Seed/mga-fineweb-edu", split='train')
# Inspect the first record.
print(ds[0])
```
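Given the size of the corpus, it may be more practical to stream records than to download the full split first. The sketch below uses the `streaming=True` option of `load_dataset`; `stream_train_split` and `count_genres` are hypothetical helper names, and the genre tally is just one example of a lightweight inspection.

```python
from collections import Counter
from itertools import islice


def stream_train_split():
    """Return a lazily evaluated iterator over the train split."""
    # Imported here so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "ByteDance-Seed/mga-fineweb-edu", split="train", streaming=True
    )


def count_genres(records, limit=1000):
    """Count `meta.genre` values over at most `limit` records."""
    counts = Counter()
    for record in islice(records, limit):
        # `meta` is nullable in the schema, so guard against None.
        meta = record.get("meta") or {}
        counts[meta.get("genre")] += 1
    return counts


# Usage (requires network access and the `datasets` package):
# print(count_genres(stream_train_split(), limit=100).most_common(5))
```

`count_genres` accepts any iterable of record dictionaries, so it can be tried on a handful of hand-written records before pointing it at the streamed dataset.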
## Data Source Statement
Content in the `meta.raw_text` field is derived from the FineWeb-EDU-Dedup subset of [SmolLM Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), licensed under the [ODC-By](https://opendatacommons.org/licenses/by/1-0/) license.
Other text fields follow the same license.
## Disclaimer
Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided "as is" and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
## Citation
```
@article{hao2025reformulation,
title = {Reformulation for Pretraining Data Augmentation},
author = {Hao, Xintong and Zhu, Ruijie and Zhang, Ge and Shen, Ke and Li, Chenggang},
journal={arXiv preprint arXiv:2502.04235},
url = {https://arxiv.org/abs/2502.04235}
}
```
Provider:
maas
Created:
2025-04-28



