LTW2V: The Large Thai Word2Vec
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7280276
Description:
LTW2V is the Large Thai Word2Vec. It was trained on the OSCAR corpus (Open Super-large Crawled Aggregated coRpus).
Version 1.0 uses newmm in PyThaiNLP 4.0 for word segmentation. We cleaned the dataset before training with a pre-processing script adapted from thai2fit, then trained two models: one with a window size of 5 and one with a window size of 15. Both were trained with Gensim for 50 epochs, so the models can be loaded with Gensim.
About Word2Vec
vector dimension = 400
window size = 5 or 15 (one model each)
minimum word count = 5
Source code at GitHub: https://github.com/PyThaiNLP/large-thaiword2vec
Files
LTW2V_v1.0-window5.bin - window size 5, segmented with newmm in PyThaiNLP 4.0, trained with Gensim 4.0.
LTW2V_v1.0-window15.bin - window size 15, segmented with newmm in PyThaiNLP 4.0, trained with Gensim 4.0.
Created:
2022-11-04



