
Spanish CBOW Word Embeddings in Floret

Indexed from NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/7314097
Description:
The embeddings have been trained with the corpus from the National Library of Spain (Biblioteca Nacional de España or BNE) using floret with the following hyperparameters:

    mode: str = "floret"
    model: str = "cbow"
    dim: int = 300
    mincount: int = 10
    minn: int = 5
    maxn: int = 6
    neg: int = 10
    hashcount: int = 2
    bucket: int = 50000
    thread: int = 128

Detailed information about the corpus can be found here

The processing took place on an HPC node equipped with an AMD EPYC 7742 (@ 2.250GHz) processor with 128 threads.

How to use

First initialize the spaCy vectors from the floret table (.floret file):

    spacy init vectors es floret_embeddings_bne_es.floret floret_embeddings_bne_es --mode floret

Then use the vectors from Python:

    import spacy

    # Load the floret vectors
    floret_embeddings = spacy.load("floret_embeddings_bne_es")

    # Get the embeddings of some words
    playa = floret_embeddings.vocab["playa"]
    frío = floret_embeddings.vocab["frío"]
    invierno = floret_embeddings.vocab["invierno"]
    verano = floret_embeddings.vocab["verano"]

    # Get some similarities
    print(frío.similarity(invierno))
    print(frío.similarity(verano))
    # frío should be more similar to invierno than verano.

    print(playa.similarity(invierno))
    print(playa.similarity(verano))
    # playa should be more similar to verano than invierno.

Intended Uses and Limitations

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this card will be updated.

Authors

The Text Mining Unit from Barcelona Supercomputing Center.

Contact Information

For further information, send an email to plantl-gob-es@bsc.es

Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Copyright

Copyright (c) 2021 Secretaría de Estado de Digitalización e Inteligencia Artificial
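The subword hyperparameters listed above (minn, maxn, hashcount, bucket) may be easier to read with a small illustration. The sketch below is not floret's implementation: it assumes that a word is wrapped in "<" and ">" markers, that the wrapped word and its character n-grams of length minn to maxn are all table entries, and that each entry is mapped to hashcount rows of a table with bucket rows. The function names char_ngrams and rows_for are hypothetical, and the BLAKE2 hash is only a stand-in, so the row numbers it prints will not match those in the actual .floret file.

```python
import hashlib

# Values taken from the hyperparameter list above.
MINN, MAXN = 5, 6   # shortest and longest character n-gram
HASHCOUNT = 2       # rows each entry is hashed into
BUCKET = 50000      # total rows in the table


def char_ngrams(word, minn=MINN, maxn=MAXN, bow="<", eow=">"):
    """Return the wrapped word followed by its n-grams of length minn..maxn.

    Assumption: "<" and ">" mark the beginning and end of the word.
    """
    wrapped = f"{bow}{word}{eow}"
    ngrams = [wrapped]
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def rows_for(entry, hashcount=HASHCOUNT, bucket=BUCKET):
    """Map one entry to `hashcount` row indices in a table of `bucket` rows.

    Stand-in hash (salted BLAKE2), chosen only to keep the sketch
    dependency-free; it is not the hash function floret uses.
    """
    rows = []
    for seed in range(hashcount):
        digest = hashlib.blake2b(
            entry.encode("utf-8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % bucket)
    return rows


if __name__ == "__main__":
    for entry in char_ngrams("playa"):
        print(entry, rows_for(entry))
```

Because every entry lands in one shared table of fixed size, a word that was never seen in training still receives rows through its n-grams, which is why lookups such as floret_embeddings.vocab["playa"] can return a vector for rare or misspelled words.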
Created:
2022-11-23