LTW2V: The Large Thai Word2Vec

Mendeley Data | Updated: 2024-05-10 | Indexed: 2024-06-27

Download link:

https://zenodo.org/records/7280277


Description:

LTW2V is the Large Thai Word2Vec. Version 1.0 was trained on the OSCAR corpus (Open Super-large Crawled Aggregated coRpus), using the newmm tokenizer in PyThaiNLP 4.0 for word segmentation. We cleaned the dataset before training with a pre-processing script customized from thai2fit, then trained two models: one with a window size of 5 and one with a window size of 15. Both were trained with Gensim for 50 epochs, so you can load them with Gensim.

About Word2Vec
vector dimension = 400
window size = 5, 15
word minimum count = 5

Source code at GitHub: https://github.com/PyThaiNLP/large-thaiword2vec

File
LTW2V_v1.0-window5.bin - window size 5, with newmm in PyThaiNLP 4.0 for word segmentation and Gensim 4.0.
LTW2V_v1.0-window15.bin - window size 15, with newmm in PyThaiNLP 4.0 for word segmentation and Gensim 4.0.
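Since the description says the models can be used from Gensim, the following is a minimal sketch of how loading and querying might look. It assumes the `.bin` files are stored in the word2vec binary format (readable with `KeyedVectors.load_word2vec_format`); if they were saved with Gensim's native `save`, `KeyedVectors.load` would be the call to try instead. The helper names `model_filename`, `load_ltw2v`, and `similar_words`, the `TRAINING_PARAMS` dictionary, and its mapping to Gensim 4.x keyword names are illustrative and do not come from the dataset authors.

```python
from pathlib import Path

# Hyperparameters listed in the description, written with Gensim 4.x
# Word2Vec keyword names (this mapping is an assumption for illustration).
TRAINING_PARAMS = {"vector_size": 400, "min_count": 5, "epochs": 50}
WINDOW_SIZES = (5, 15)


def model_filename(window: int) -> str:
    """Return the released file name for a given window size."""
    if window not in WINDOW_SIZES:
        raise ValueError(f"no released model for window={window}")
    return f"LTW2V_v1.0-window{window}.bin"


def load_ltw2v(model_dir: str, window: int = 5):
    """Load the vectors, assuming the .bin is in word2vec binary format."""
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models import KeyedVectors

    path = Path(model_dir) / model_filename(window)
    return KeyedVectors.load_word2vec_format(str(path), binary=True)


def similar_words(vectors, text: str, topn: int = 5):
    """Segment `text` with newmm, then return the nearest neighbours of
    the first token found in the model vocabulary (empty list if none)."""
    # newmm is the segmenter named in the description; requires pythainlp.
    from pythainlp.tokenize import word_tokenize

    for token in word_tokenize(text, engine="newmm"):
        if token in vectors.key_to_index:
            return vectors.most_similar(token, topn=topn)
    return []
```

With the files downloaded into a local directory, `vectors = load_ltw2v("path/to/models", window=15)` followed by `similar_words(vectors, some_thai_text)` would return `(word, similarity)` pairs. Segmenting queries with the same newmm engine used in training keeps the tokens consistent with the model vocabulary.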


Created:

2023-06-28
