aisingapore/WangchanLION-Web

Source: Hugging Face. Updated: 2025-09-03. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/aisingapore/WangchanLION-Web
Description:
Magostreen is a Thai text dataset of non-Common Crawl (non-CC) Thai documents collected from various sources, totaling 425,304 documents. The documents were deduplicated and then split into training and validation sets. The dataset also includes data from Common Crawl and Fineweb2. All data was processed with a new data cleaning pipeline that performs language identification, URL-based deduplication, quality filtering, content filtering, and deduplication based on text overlap.
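The cleaning steps named in the description can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function names, the `url` and `text` record fields, the Thai-character ratio used as a stand-in for language identification, the length check used as a stand-in for quality filtering, and all thresholds are assumptions chosen for illustration. Content filtering is omitted because the description does not say what it checks.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Comparison key for a URL (hypothetical rule: drop scheme, query, fragment, trailing slash)."""
    parts = urlsplit(url.strip().lower())
    return parts.netloc + parts.path.rstrip("/")


def thai_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Thai Unicode block (U+0E00 to U+0E7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0e00" <= c <= "\u0e7f" for c in chars) / len(chars)


def char_shingles(text: str, n: int = 5) -> set:
    """Character n-grams. Thai is written without spaces between words,
    so characters are used here to avoid needing a word tokenizer."""
    compact = "".join(text.split())
    if len(compact) <= n:
        return {compact}
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def clean(docs, min_thai=0.5, min_chars=20, max_overlap=0.8):
    """Apply the sketched filters in order and return the surviving records."""
    seen_urls = set()
    kept, kept_shingles = [], []
    for doc in docs:
        text = doc["text"]
        # 1. Language identification (assumed proxy: Thai character ratio).
        if thai_ratio(text) < min_thai:
            continue
        # 2. URL-based deduplication.
        key = normalize_url(doc["url"])
        if key in seen_urls:
            continue
        seen_urls.add(key)
        # 3. Quality filtering (assumed proxy: minimum length).
        if len(text) < min_chars:
            continue
        # 4. Text-overlap deduplication against documents already kept.
        shingles = char_shingles(text)
        if any(jaccard(shingles, s) >= max_overlap for s in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(shingles)
    return kept
```

The pairwise comparison in step 4 is quadratic in the number of kept documents, which is acceptable for a demonstration only. At the scale of hundreds of thousands of documents, an approximate method such as MinHash with locality-sensitive hashing would be the usual substitute.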
Provider:
aisingapore