WangchanLION-Web
ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/aisingapore/WangchanLION-Web
Resource description:
## Citation
```
@misc{phatthiyaphaibun2025mangosteenopenthaicorpus,
title={Mangosteen: An Open Thai Corpus for Language Model Pretraining},
author={Wannaphong Phatthiyaphaibun and Can Udomcharoenchaikit and Pakpoom Singkorapoom and Kunat Pipatanakul and Ekapol Chuangsuwanich and Peerat Limkonchotiwat and Sarana Nutanong},
year={2025},
eprint={2507.14664},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.14664},
}
```
We collected additional Thai text from various sources that is unlikely to be included in Common Crawl, 425,304 documents in total. We deduplicated these non-CC documents and then divided them into a training set, which supplements the web data, and a validation set.
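The deduplicate-then-split step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the exact-match hashing, the helper names `dedup_exact` and `split_train_valid`, and the `valid_fraction` value are all assumptions, since the source does not say how the non-CC documents were deduplicated or how large the validation set is.

```python
import hashlib
import random


def dedup_exact(docs):
    """Keep the first occurrence of each text, comparing after whitespace normalization."""
    seen = set()
    unique = []
    for doc in docs:
        # Collapse runs of whitespace so trivially different copies hash the same.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def split_train_valid(docs, valid_fraction=0.01, seed=0):
    """Shuffle deterministically, then hold out a fraction for validation.

    valid_fraction is a hypothetical value, not one taken from the source.
    """
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy documents; the second differs from the first only in spacing.
docs = ["สวัสดี ครับ", "สวัสดี  ครับ", "ข่าววันนี้", "บทความตัวอย่าง"]
unique = dedup_exact(docs)
train, valid = split_train_valid(unique, valid_fraction=0.25)
print(len(unique), len(train), len(valid))
```

Deduplicating before the split matters: otherwise a document in the validation set could also appear in the training set and inflate validation scores.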
We include Common Crawl and FineWeb2 as follows (`M` denotes million and `B` denotes billion):
| Source | Documents | Tokens (B) |
|--------------------|:---------:|:----------:|
| CC-Derived Dataset | 29.7 M | 45.9 |
We also propose a new data cleaning pipeline to filter out low-quality data. We adopt Dolma's data collection approach, applying five major components:
- Language identification: Instead of relying on fastText as the language identifier, as Dolma does, we use a rule-based approach for Thai script, which is more efficient in both performance and speed.
- Deduplication by URL: We use a Bloom filter to remove documents with duplicate URLs.
- Quality filters: We follow the original Dolma practice of applying the C4 and Gopher rules, but we examined the rules and modified them to be more compatible with the Thai language.
- Content filters: We upgraded the content filters to remove not-safe-for-work (NSFW) content, phone numbers, and gambling content in Thai more efficiently than the existing filters.
- Deduplication by text overlap: We use the same Bloom filter as the Dolma pipeline to remove overlapping text within our corpus.
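To make three of these components concrete, here is a minimal sketch of rule-based Thai script identification and Bloom-filter deduplication by URL and by paragraph. It is an illustration under stated assumptions, not the Mangosteen or Dolma implementation: the 0.5 threshold, the Bloom filter sizing, the paragraph-level granularity, and all function and class names are hypothetical.

```python
import hashlib


def thai_ratio(text):
    """Fraction of non-whitespace characters in the Thai Unicode block (U+0E00 to U+0E7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0e00" <= c <= "\u0e7f" for c in chars) / len(chars)


def is_thai(text, threshold=0.5):
    # threshold is a hypothetical cut-off, not a value from the source.
    return thai_ratio(text) >= threshold


class BloomFilter:
    """Minimal Bloom filter: num_hashes bit positions per item in a fixed bit array."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            # A different salt per index gives num_hashes distinct hash functions.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedup_by_url(docs):
    """Yield only the first document seen for each URL."""
    seen = BloomFilter()
    for doc in docs:
        if doc["url"] not in seen:
            seen.add(doc["url"])
            yield doc


def drop_repeated_paragraphs(docs):
    """Drop paragraphs already seen anywhere earlier in the corpus."""
    seen = BloomFilter()
    for doc in docs:
        kept = []
        for para in doc["text"].split("\n"):
            key = para.strip()
            if key and key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:
            yield {**doc, "text": "\n".join(kept)}


docs = [
    {"url": "https://example.com/a", "text": "สวัสดีครับ\nยินดีต้อนรับ"},
    {"url": "https://example.com/a", "text": "สวัสดีครับ\nยินดีต้อนรับ"},  # duplicate URL
    {"url": "https://example.com/b", "text": "สวัสดีครับ\nข่าววันนี้"},  # repeats a paragraph
    {"url": "https://example.com/c", "text": "Hello world, this is English."},
]
thai_docs = [d for d in docs if is_thai(d["text"])]
cleaned = list(drop_repeated_paragraphs(dedup_by_url(thai_docs)))
for d in cleaned:
    print(d["url"], repr(d["text"]))
```

A Bloom filter trades a small false-positive rate (occasionally dropping an unseen item) for constant memory, which is why it suits web-scale deduplication better than an exact set of every URL or paragraph.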
## Resources
- Pre-training data (web): https://huggingface.co/datasets/aisingapore/WangchanLION-Web
- Pre-training data (curated): https://huggingface.co/datasets/aisingapore/WangchanLION-Curated
- Pre-training model: https://huggingface.co/aisingapore/WangchanLION-v3
- SFT model: https://huggingface.co/aisingapore/WangchanLION-v3-IT
- Paper: https://arxiv.org/abs/2507.14664
- Blog: https://sea-lion.ai/sea-lion-wangchanlionv3/
- GitHub: https://github.com/vistec-AI/Mangosteen
Provider: maas
Created: 2025-11-25



