WangchanLION-Curated
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/aisingapore/WangchanLION-Curated
Description:
## Citation
```
@misc{phatthiyaphaibun2025mangosteenopenthaicorpus,
title={Mangosteen: An Open Thai Corpus for Language Model Pretraining},
author={Wannaphong Phatthiyaphaibun and Can Udomcharoenchaikit and Pakpoom Singkorapoom and Kunat Pipatanakul and Ekapol Chuangsuwanich and Peerat Limkonchotiwat and Sarana Nutanong},
year={2025},
eprint={2507.14664},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.14664},
}
```
We collected additional Thai text from various sources that is unlikely to be included in Common Crawl. In total, we collected 425,304 documents. We deduplicated these non-Common-Crawl (noncc) documents so that they can later be divided into training sets, to complement the web data, and a validation set.
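The deduplication method is not specified here. As a minimal sketch, assuming exact-match deduplication after light normalization (the function names `normalize` and `deduplicate` are hypothetical and are not taken from the project's code), the step could look like this:

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct normalized document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        # Hash the normalized text so only a fixed-size digest is stored per document.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Toy input (assumption): the first two strings differ only in whitespace.
docs = ["สวัสดี ครับ", "สวัสดี  ครับ", "ขอบคุณ"]
unique_docs = deduplicate(docs)
```

Exact matching only removes identical copies. Near-duplicate detection (for example MinHash) would be a different technique and may be closer to what a large-scale pipeline needs.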
The following table shows the analysis of the data after deduplication.
| Type                 | Source                                                     | Number of documents | Number of words | Cultural? | Human-crafted? |      Evaluation topics?       |
|----------------------|------------------------------------------------------------|---------------------|-----------------|:---------:|:------------:|:-----------------------------:|
| encyclopedic | th.wikibooks.org | 1,179 | 756,175 | No | | Common safety |
| | th.wikipedia.org | 162,189 | 74,151,797 | No | | Common safety |
| | th.wikiquote.org | 929 | 156,910 | No | | Common safety |
| | th.wikisource.org | 1,890 | 5,601,908 | No | | Common safety |
| finance | airesearch/cmdf_vistec | 86,813 | 348,189,390 | No | | Common safety |
| government document | lift catalog data | 1 | 6,931 | No | Yes/No | Common safety |
| | data.go.th | 9,488 | 2,766,950 | No | No | Common safety |
| | envilink.go.th | 1 | 10,659 | | | |
| | government data catalog smart plus | 427 | 5,290,950 | Yes | Yes/No | Common safety/ Cross-lingual |
| | https://ratchakitcha.soc.go.th | 59,744 | 175,031,218 | Yes | Yes | Culture safety/ Cross-lingual |
| | nakhoratchasima data catalog | 1 | 68,000 | Yes | Yes | Country safety/ Cross-lingual |
| | opdc data portal | 3,148 | 598,829 | | | |
| | open-d | 7 | 284,100 | | | |
| | royal thai government | 1 | 41,356 | | | |
| | National Economic and Social Development Board | 1 | 37,663 | | | |
| | pythainlp/thailand-policy-statements | 60 | 226,087 | | | |
| legal | pythainlp/thai-cc-license | 6 | 50,727 | | | |
| | pythainlp/thai-constitution-corpus | 20 | 444,313 | | | |
| | pythainlp/thailaw-v1.0 | 52,317 | 79,715,118 | | | |
| academic literature | government data catalog smart plus | 427 | 5,290,950 | | | |
| | openbase.in.th | 4,173 | 165,909,425 | | | |
| | platform for social empowerment and transformation | 90 | 209,716 | | | |
| | pythainlp/thai-it-books | 7 | 174,644 | | | |
| | pythainlp/thai-tnhc2-books | 353 | 22,002,703 | | | |
| | pythainlp/tlcv2.0_oa | 361 | 2,970,463 | | | |
| | TDRI | 25 | 2,801,106 | | | |
| | Bangkok Open Data | 10 | 2,334 | | | |
| | Open educational resources repository | 14 | 47,951 | | | |
| | CMU Journal of Law and Social Sciences | 47 | 37,976 | | | |
| | E-journal of education studies, Burapha University | 68 | 59,137 | | | |
| | Chulalongkorn University Law Journal | 64 | 46,425 | | | |
| | Lanna Journal of Health Promotion and Environmental Health | 53 | 52,320 | | | |
| | Journal of Educational Studies, Burapha University | 64 | 56,320 | | | |
| | Journal of Yanasangwon Research Institute | 65 | 47,286 | | | |
| | Journal of Food and Drug Administration | 79 | 115,541 | | | |
| | https://github.com/kongruksiamza/ebook-for-education | 8 | 83,432 | | | |
| | social technology institute | 3 | 11,152 | | | |
| youtube | youtube | 17,826 | 46,613,632 | | | |
Documents from some sources cannot be used directly or processed easily. Because these documents are in PDF format, the text must be extracted with optical character recognition (OCR). The table below shows the number and proportion of documents that required OCR.
| Source                                | Number of documents | Number of documents requiring OCR | Percentage of documents requiring OCR (%) |
|--------------------------------------:|--------------------:|-------------------------------------:|--------------------------------------:|
| data.go.th | 9488 | 18 | 0.189713 |
| government data catalog smart plus | 427 | 198 | 46.370023 |
| ebook construction | 8 | 8 | 100 |
| openbase.in.th | 4173 | 3443 | 82.50659 |
| opendata.nesdc.go.th | 7 | 3 | 42.857143 |
| royal thai government | 1 | 1 | 100 |
| Open educational resources repository | 14 | 2 | 14.285714 |
The model used for text extraction is https://github.com/VikParuchuri/marker.
In the future, a vision-language model (VLM) is worth trying, such as https://olmocr.allenai.org/ or Typhoon2 Vision.
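The percentage column in the OCR table above is the number of documents requiring OCR divided by the number of documents from that source, times 100. The sketch below recomputes it from the counts in the table; the helper name `ocr_percentage` is hypothetical, and the overall figure covers only the seven sources listed in that table.

```python
# (source, number of documents, documents requiring OCR), copied from the OCR table.
ocr_counts = [
    ("data.go.th", 9488, 18),
    ("government data catalog smart plus", 427, 198),
    ("ebook construction", 8, 8),
    ("openbase.in.th", 4173, 3443),
    ("opendata.nesdc.go.th", 7, 3),
    ("royal thai government", 1, 1),
    ("Open educational resources repository", 14, 2),
]


def ocr_percentage(total: int, needing_ocr: int) -> float:
    """Return the share of documents that required OCR, in percent."""
    return 100.0 * needing_ocr / total


for source, total, needing_ocr in ocr_counts:
    print(f"{source}: {ocr_percentage(total, needing_ocr):.6f}%")

# Aggregate over the listed sources only, not over the whole corpus.
total_docs = sum(total for _, total, _ in ocr_counts)
total_ocr = sum(needing_ocr for _, _, needing_ocr in ocr_counts)
overall = ocr_percentage(total_docs, total_ocr)
```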
Data from various sources can be classified into 6 types as shown in the table below.
| Domain       | Count   | Proportion (%) |
|-------------:|--------:|---------------:|
| Encyclopedic | 166,187 |          41.34 |
| Finance      |  86,813 |          21.59 |
| Government | 72,879 | 18.13 |
| Legal | 52,343 | 13.02 |
| YouTube | 17,837 | 4.43 |
| Education | 5,911 | 1.47 |
The additional documents we collected are confirmed to be open and to carry a license that allows redistribution. The share of each license is shown in the table below.
| License         | Count  | Proportion (%) |
|----------------:|-------:|-----------:|
| CC BY-SA 4.0 | 166187 | 41.388233 |
| CC0 | 112871 | 28.110088 |
| CC BY 4.0 | 112407 | 27.994531 |
| CC BY-NC-SA 4.0 | 4173 | 1.03927 |
| ODC-BY | 3769 | 0.938655 |
| CC BY-NC 4.0 | 1853 | 0.461483 |
| CC BY-NC-ND 4.0 | 250 | 0.062262 |
| CC BY 3.0 | 13 | 0.003238 |
| GFDL | 6 | 0.001494 |
| OGL | 3 | 0.000747 |
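Because the corpus mixes licenses, downstream users may want to tally documents by license terms before use. The sketch below, built from the counts in the license table, separates licenses carrying a NonCommercial (NC) clause from the rest. Detecting the clause from the `-NC` substring in the license name is an assumption made for illustration, and the helper name `is_noncommercial` is hypothetical. Other obligations, such as ShareAlike or NoDerivatives terms, are not modeled and should be checked against the license texts.

```python
# License name -> number of documents, copied from the license table.
license_counts = {
    "CC BY-SA 4.0": 166187,
    "CC0": 112871,
    "CC BY 4.0": 112407,
    "CC BY-NC-SA 4.0": 4173,
    "ODC-BY": 3769,
    "CC BY-NC 4.0": 1853,
    "CC BY-NC-ND 4.0": 250,
    "CC BY 3.0": 13,
    "GFDL": 6,
    "OGL": 3,
}


def is_noncommercial(license_name: str) -> bool:
    """Heuristic (assumption): Creative Commons names mark NonCommercial with '-NC'."""
    return "-NC" in license_name


total = sum(license_counts.values())
nc_total = sum(
    count for name, count in license_counts.items() if is_noncommercial(name)
)

# Share of documents whose license name carries no NC clause, in percent.
non_nc_share = 100.0 * (total - nc_total) / total
print(f"{total - nc_total} of {total} documents ({non_nc_share:.2f}%) have no NC clause")
```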
## Resources
- Pre-training data (web): https://huggingface.co/datasets/aisingapore/WangchanLION-Web
- Pre-training data (curated): https://huggingface.co/datasets/aisingapore/WangchanLION-Curated
- Pre-training model: https://huggingface.co/aisingapore/WangchanLION-v3
- SFT model: https://huggingface.co/aisingapore/WangchanLION-v3-IT
- Paper: https://arxiv.org/abs/2507.14664
- Blog: https://sea-lion.ai/sea-lion-wangchanlionv3/
- GitHub: https://github.com/vistec-AI/Mangosteen
Provider:
maas
Created:
2025-11-25



