
LTW2V: The Large Thai Word2Vec

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7280276
Description:
LTW2V is the Large Thai Word2Vec, a set of Thai word vectors trained on the OSCAR corpus (Open Super-large Crawled Aggregated coRpus). For version 1.0, the text was segmented into words with the newmm tokenizer in PyThaiNLP 4.0. Before training, we cleaned the dataset with a pre-processing script adapted from thai2fit. We trained two models, one with a window size of 5 and one with a window size of 15. Both were trained with Gensim for 50 epochs, so they can be used from Gensim.

About Word2Vec:
- vector dimension = 400
- window size = 5, 15
- word minimum count = 5

Source code on GitHub: https://github.com/PyThaiNLP/large-thaiword2vec

Files:
- LTW2V_v1.0-window5.bin - window size 5, with newmm in PyThaiNLP 4.0 for word segmentation and Gensim 4.0.
- LTW2V_v1.0-window15.bin - window size 15, with newmm in PyThaiNLP 4.0 for word segmentation and Gensim 4.0.
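Since the description says the models can be used from Gensim, a minimal loading sketch may be helpful. It rests on assumptions the listing does not confirm: that the .bin files are stored in the original word2vec binary format (if they were written with Gensim's native save, `KeyedVectors.load` would be the call to try instead), and that the files have already been downloaded into a local directory. The helper names `model_path` and `load_ltw2v` are hypothetical and exist only for illustration.

```python
from pathlib import Path

# File names as given in the listing, keyed by training window size.
MODEL_FILES = {
    5: "LTW2V_v1.0-window5.bin",
    15: "LTW2V_v1.0-window15.bin",
}


def model_path(window, directory="."):
    """Return the local path of the LTW2V file for a given window size.

    Hypothetical helper: it only maps a window size to a file name.
    """
    if window not in MODEL_FILES:
        raise ValueError(f"window must be one of {sorted(MODEL_FILES)}")
    return Path(directory) / MODEL_FILES[window]


def load_ltw2v(window=5, directory="."):
    """Load the vectors with Gensim.

    Assumption: the file is in word2vec binary format. Gensim is imported
    here so that the path helper above works without it installed.
    """
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(
        str(model_path(window, directory)), binary=True
    )


# Example usage (requires gensim and the downloaded file):
#   wv = load_ltw2v(window=5, directory="models")
#   wv.vector_size               # the listing states a dimension of 400
#   wv.most_similar("แมว", topn=5)  # nearest neighbours of the Thai word for "cat"
```

Because the models were trained on newmm-segmented text, query words are most likely to be found in the vocabulary when input text is tokenized the same way before lookup.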
Created:
2022-11-04