mga-fineweb-edu
Source: ModelScope community (updated 2025-12-04, indexed 2025-05-24)
Download link: https://modelscope.cn/datasets/ByteDance-Seed/mga-fineweb-edu
# Massive Genre-Audience Augment Fineweb-Edu Corpus
This dataset is a synthetic pretraining corpus described in the paper [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2502.04235).
<img src="https://cdn-uploads.huggingface.co/production/uploads/64b764bffdb702b3d8640610/WIEom2dItQvCyQciQW9pz.png" width="800">
Overview of the synthesis framework. Our method expands the original corpus through a two-stage synthesis process.
Each document is reformulated into 5 new documents, achieving a 3.9× expansion in token count while maintaining diversity through a massive set of (genre, audience) pairs.
We build MGACorpus on top of the [SmolLM Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), expanding the fineweb-edu-dedup source from 195B tokens to 770B tokens.
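As a quick sanity check on the figures above, the reported totals can be compared with the stated expansion factor. This is only arithmetic on the numbers quoted in this card; the variable names are illustrative.

```python
# Token counts quoted in this card, in billions of tokens.
source_tokens_b = 195    # fineweb-edu-dedup source
expanded_tokens_b = 770  # MGACorpus

# Expansion ratio implied by the two reported totals.
implied_ratio = expanded_tokens_b / source_tokens_b
print(f"implied expansion: {implied_ratio:.2f}x")

# Size predicted by applying the stated 3.9x factor to the source.
predicted_b = source_tokens_b * 3.9
print(f"195B x 3.9 = {predicted_b:.1f}B")
```

The two figures agree to within rounding: the totals imply roughly 3.95×, and 3.9× applied to 195B gives about 760B.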
## Performance
Our baseline is trained on the SmolLM-Corpus dataset,
and our experiments add MGACorpus as incremental data.
<img src="https://cdn-uploads.huggingface.co/production/uploads/64b764bffdb702b3d8640610/QB4wPWUlp-nqYMOpn5LwP.png" width="800">
Training dynamics of two common scenarios under data-constrained conditions:
- (1) expanding a 50B high-quality dataset to a 500B training budget (entire set repetition).
- (2) expanding a 500B mixed-quality dataset to a 700B training budget (subset repetition).
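
To make the two budgets concrete, the sketch below computes how many passes over the unique data each scenario would need if the budget were filled by plain repetition alone, with no synthetic data. The helper name `repetition_factor` is hypothetical, and the calculation is only arithmetic on the figures quoted above, not a reproduction of the paper's setup.

```python
def repetition_factor(unique_tokens_b: float, budget_tokens_b: float) -> float:
    """Average number of passes over the unique data needed to fill the budget."""
    if unique_tokens_b <= 0:
        raise ValueError("unique_tokens_b must be positive")
    return budget_tokens_b / unique_tokens_b


# Scenario (1): 50B unique tokens, 500B training budget.
scenario_1 = repetition_factor(50, 500)
# Scenario (2): 500B unique tokens, 700B training budget.
scenario_2 = repetition_factor(500, 700)

print(f"scenario 1: {scenario_1:.1f} passes on average")
print(f"scenario 2: {scenario_2:.1f} passes on average")
```

Under that assumption, scenario (1) repeats the whole set about ten times, whereas scenario (2) needs only 200B extra tokens, so repeating a subset is enough.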
<img src="https://cdn-uploads.huggingface.co/production/uploads/64b764bffdb702b3d8640610/4KRquxzZVW861EN-luxJ1.png" width="750">
## Dataset Schema
```
root
 |-- meta: struct (nullable = true)
 |    |-- chunk_id: string (nullable = true)
 |    |-- docid: string (nullable = true)
 |    |-- meta_extra: string (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- split: string (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- audience: string (nullable = true)
 |    |-- raw_text: string (nullable = true)
 |-- content_split: string (nullable = true)
```
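Since every field is nullable, code that reads records should tolerate missing values. The following is a minimal sketch, assuming records arrive as nested Python dicts shaped like the schema above. The helper `summarize_record` and all sample values are hypothetical and not taken from the dataset, and this card does not spell out the meaning of each field.

```python
from typing import Any, Optional


def summarize_record(record: dict[str, Any]) -> dict[str, Optional[str]]:
    """Pull the (genre, audience) pair and short text previews out of one record."""
    meta = record.get("meta") or {}            # the meta struct itself may be None
    raw_text = meta.get("raw_text") or ""      # nullable string
    content = record.get("content_split") or ""
    return {
        "docid": meta.get("docid"),
        "genre": meta.get("genre"),
        "audience": meta.get("audience"),
        "raw_preview": raw_text[:40],
        "content_preview": content[:40],
    }


# Hypothetical record, for illustration only.
sample = {
    "meta": {
        "chunk_id": "c0",
        "docid": "d0",
        "meta_extra": None,
        "source": "example-source",
        "split": "train",
        "genre": "tutorial",
        "audience": "high-school students",
        "raw_text": "Some source text.",
    },
    "content_split": "Some other text.",
}

print(summarize_record(sample))
# A record with null fields is handled without raising.
print(summarize_record({"meta": None, "content_split": None}))
```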
## Loading the dataset
```python
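from datasets import load_dataset

# Load the train split (without streaming, this downloads the data first).
ds = load_dataset("ByteDance-Seed/mga-fineweb-edu", split="train")
print(ds[0])  # inspect the first record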
```
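Because the corpus is large (about 770B tokens per the overview), it may be more practical to stream records than to download everything first. The sketch below uses the `streaming=True` option of `load_dataset`; the helper names `take` and `stream_head` are hypothetical, and whether streaming works for this particular repository has not been verified here.

```python
from itertools import islice
from typing import Any, Iterable, List


def take(records: Iterable[Any], n: int) -> List[Any]:
    """Return the first n items of any iterable, reading no more than needed."""
    return list(islice(records, n))


def stream_head(n: int = 3) -> List[Any]:
    """Fetch the first n records lazily instead of downloading the whole split."""
    # Imported here so that `take` stays usable without the datasets package.
    from datasets import load_dataset

    ds = load_dataset(
        "ByteDance-Seed/mga-fineweb-edu", split="train", streaming=True
    )
    return take(ds, n)
```

Calling `stream_head(3)` requires network access and should return the first three records as dicts.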
## Data Source Statement
Content in the `meta.raw_text` field is derived from the FineWeb-EDU-Dedup subset of the [SmolLM Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), which is licensed under [ODC-By](https://opendatacommons.org/licenses/by/1-0/).
Other text fields follow the same license.
## Disclaimer
Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided "as is" and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
## Citation
```
@article{hao2025reformulation,
  title   = {Reformulation for Pretraining Data Augmentation},
  author  = {Hao, Xintong and Zhu, Ruijie and Zhang, Ge and Shen, Ke and Li, Chenggang},
  journal = {arXiv preprint arXiv:2502.04235},
  year    = {2025},
  url     = {https://arxiv.org/abs/2502.04235}
}
```
Provider: maas
Created: 2025-05-19



