SEA-PILE-v1
ModelScope community · Updated 2025-12-04 · Indexed 2025-12-06
Download link: https://modelscope.cn/datasets/aisingapore/SEA-PILE-v1
<div>
<img src="SEAPilev1.png"/>
</div>
# SEA-LION-Pile
SEA-LION-Pile is the pretraining dataset for SEA-LION, a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This repository contains the cleaned mC4 portion of the SEA-LION-Pile.
The remainder of the SEA-LION-Pile dataset may be downloaded from the links provided below.
## Dataset Details
SEA-LION was trained on approximately 980B tokens drawn from the following data:
| Data Source | Unique Tokens | Multiplier | Total Tokens | Percentage |
|---------------------------|:-------------:|:----------:|:------------:|:----------:|
| RefinedWeb - English | 571.3B | 1 | 571.3B | 58.20% |
| mC4 - Chinese | 91.2B | 1 | 91.2B | 9.29% |
| mC4 - Indonesian | 3.68B | 4 | 14.7B | 1.50% |
| mC4 - Malay | 0.72B | 4 | 2.9B | 0.29% |
| mC4 - Filipino | 1.32B | 4 | 5.3B | 0.54% |
| mC4 - Burmese | 1.2B | 4 | 4.9B | 0.49% |
| mC4 - Vietnamese | 63.4B | 1 | 63.4B | 6.46% |
| mC4 - Thai | 5.8B | 2 | 11.6B | 1.18% |
| WangChanBERTa - Thai | 5B | 2 | 10B | 1.02% |
| mC4 - Lao | 0.27B | 4 | 1.1B | 0.12% |
| mC4 - Khmer | 0.97B | 4 | 3.9B | 0.40% |
| mC4 - Tamil | 2.55B | 4 | 10.2B | 1.04% |
| the Stack - Python | 20.9B | 2 | 41.8B | 4.26% |
| the Stack - Javascript | 55.6B | 1 | 55.6B | 5.66% |
| the Stack - Shell | 1.25B | 2 | 2.5B | 0.26% |
| the Stack - SQL | 6.4B | 2 | 12.8B | 1.31% |
| the Stack - Markdown | 26.6B | 1 | 26.6B | 2.71% |
| RedPajama - StackExchange | 21.2B | 1 | 21.2B | 2.16% |
| RedPajama - ArXiv | 30.6B | 1 | 30.6B | 3.12% |
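The columns of the table above are related by simple arithmetic, which the following minimal sketch checks for a few rows. It assumes that the Multiplier column is the number of times a source is repeated during training, which the card does not state explicitly, and the helper names `total_tokens` and `percentage` are hypothetical. Under that reading, Total Tokens = Unique Tokens × Multiplier, and each percentage is the row's share of the summed Total Tokens column. That sum comes to about 981.6B, slightly above the 980B headline figure, presumably because of rounding.

```python
# "Total Tokens" column copied from the table above, in billions of tokens.
TABLE_TOTALS_B = [
    571.3, 91.2, 14.7, 2.9, 5.3, 4.9, 63.4, 11.6, 10.0, 1.1,
    3.9, 10.2, 41.8, 55.6, 2.5, 12.8, 26.6, 21.2, 30.6,
]

# Denominator for the Percentage column: the sum of all per-source totals.
CORPUS_TOTAL_B = sum(TABLE_TOTALS_B)


def total_tokens(unique_b: float, multiplier: int) -> float:
    """Total tokens in billions, assuming the multiplier is a repetition count."""
    return unique_b * multiplier


def percentage(total_b: float, corpus_total_b: float = CORPUS_TOTAL_B) -> float:
    """Share of the whole corpus contributed by one source, in percent."""
    return 100.0 * total_b / corpus_total_b


# A few rows from the table: (source, unique tokens in billions, multiplier).
for source, unique_b, multiplier in [
    ("RefinedWeb - English", 571.3, 1),
    ("mC4 - Thai", 5.8, 2),
    ("the Stack - Python", 20.9, 2),
]:
    total_b = total_tokens(unique_b, multiplier)
    print(f"{source}: {total_b:.1f}B tokens, {percentage(total_b):.2f}% of corpus")
```

For the rows shown, the computed values agree with the printed table to the displayed precision.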
### Additional SEA-LION-Pile (non-mC4) Data Sources
This section contains the links to the additional datasets that form the SEA-LION-Pile.
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- [the Stack (Python, Javascript, Shell, SQL, Markdown)](https://huggingface.co/datasets/bigcode/the-stack-dedup)
- [RedPajama (StackExchange, ArXiv)](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- WangChanBERTa
  - [scb_mt_enth_2020](https://huggingface.co/datasets/scb_mt_enth_2020)
  - [prachathai67k](https://huggingface.co/datasets/prachathai67k)
  - [thaisum](https://huggingface.co/datasets/thaisum)
  - [Opus - bible-uedin](https://opus.nlpl.eu/bible-uedin.php)
  - [Opus - Tanzil](https://opus.nlpl.eu/Tanzil.php)
  - [Opus - Opensubtitles](https://opus.nlpl.eu/OpenSubtitles-v2018.php)
  - [Opus - QED](https://opus.nlpl.eu/QED.php)
  - [Opus - Ted2020](https://opus.nlpl.eu/TED2020.php)
  - [Opus - Oscar](https://oscar-project.org/post/news-23-01)
### Limitations
- As toxic or biased data is prevalent on the internet, it is likely our dataset contains such content.
- Despite our best efforts to filter out content that does not qualify as natural language and to deduplicate documents, our pipeline may still let through documents that are erroneous or redundant.
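
To illustrate why redundant documents can survive deduplication, here is a minimal sketch of exact deduplication by content hash. It is not the SEA-LION pipeline, whose implementation this card does not describe. The function names `normalize` and `dedup_exact` and the sample strings are hypothetical.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized form is seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# The second string differs from the first only in case and spacing.
docs = ["Selamat pagi.", "selamat  pagi.", "Xin chào."]
unique_docs = list(dedup_exact(docs))

# A near-duplicate with one extra word has a different hash and is kept.
near_duplicates = list(dedup_exact(["a b c", "a b c d"]))
```

Exact hashing only catches documents that are identical after normalization. Near-duplicates, such as pages that differ by a single word, pass through unchanged, which is one way redundant content can remain in a web-scale corpus.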
### License
This public extract of mC4 is made available under the [ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/) license; users should also abide by the [CommonCrawl ToU](https://commoncrawl.org/terms-of-use/).
For the licenses of all other datasets, please refer to their individual pages linked above.
We endeavor to ensure that the data used is permissible and have chosen datasets from creators who have processes to exclude copyrighted or disputed data. For other new data, we have obtained permission to use and distribute it.
## References
```bibtex
@misc{lowphansirikul2021wangchanberta,
title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
year={2021},
eprint={2101.09635},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{refinedweb,
title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
journal={arXiv preprint arXiv:2306.01116},
eprint={2306.01116},
eprinttype = {arXiv},
url={https://arxiv.org/abs/2306.01116},
year={2023}
}
@article{Kocetkov2022TheStack,
title={The Stack: 3 TB of permissively licensed source code},
author={Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Bahdanau, Dzmitry and von Werra, Leandro and de Vries, Harm},
journal={Preprint},
year={2022}
}
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = apr,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
Provider: maas
Created: 2025-11-25



