A word2vec model file built from the French Wikipedia XML Dump using gensim.
Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-28.
Download link:
https://zenodo.org/record/162792
Description:
A word2vec model file built from the French Wikipedia XML dump using gensim. The data published here includes three model files (you need all three of them in the same folder) as well as the Python script used to build the model (for documentation). The Wikipedia dump was downloaded on October 7, 2016 from https://dumps.wikimedia.org/. Before building the model, plain text was extracted from the dump. The size of that dataset is about 500 million words or 3.6 GB of plain text. The principal parameters for building the model were the following: no lemmatization was performed, tokenization was done using the "\W" regular expression (any non-word character splits tokens), and the model was built with 500 dimensions.
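The tokenization rule and the loading step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script published with the dataset: the function names `tokenize` and `load_model` and the model path are hypothetical, and the sketch assumes that empty strings produced by the split are discarded. `Word2Vec.load` is the standard gensim call for models stored with `save`; gensim writes large arrays to companion files next to the main file, which is consistent with the note that all three files must sit in the same folder. Whether a current gensim release can read a model saved in 2016 is not confirmed here.

```python
import re

# Assumption: tokens are the non-empty pieces left after splitting on runs of
# non-word characters. The published script may differ in details.
_NON_WORD = re.compile(r"\W+")


def tokenize(text):
    """Split text on non-word characters; no lemmatization, no lowercasing."""
    return [token for token in _NON_WORD.split(text) if token]


def load_model(path):
    """Load a saved gensim word2vec model from a hypothetical path.

    The companion files written by gensim must be in the same folder as `path`.
    """
    # Imported here so the tokenizer above works without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec.load(path)


if __name__ == "__main__":
    print(tokenize("L'arbre est grand, aujourd'hui."))
    # → ['L', 'arbre', 'est', 'grand', 'aujourd', 'hui']
```

Note that the apostrophe is a non-word character, so elided forms such as "L'arbre" and "aujourd'hui" are split into two tokens. In Python 3, `\W` is Unicode-aware for text strings, so accented letters stay inside their tokens; the behavior of the original 2016 script may have differed if it ran under other settings.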
Created:
2023-06-28



