Spanish Biomedical Corpus
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://data.mendeley.com/datasets/7btd42m2sc
Description:
Embeddings
This repository contains word embeddings generated from Spanish biomedical text corpora.
Corpus details
The corpus was assembled from the Spanish-language texts of several multilingual biomedical sources:
IBECS (Spanish Bibliographical Index in Health Sciences): a corpus that collects scientific journals covering multiple fields in the health sciences. Contains titles and abstracts from 168,198 records in English and Spanish.
SciELO (Scientific Electronic Library Online): a corpus that gathers electronic publications of full-text articles from scientific journals of Latin America, South Africa, and Spain. Contains titles and abstracts from 161,710 records in English and Spanish.
PubMed: a free search engine used to access the MEDLINE database of the NLM (https://www.ncbi.nlm.nih.gov/pubmed/). Contains titles and abstracts from 127,619 records.
MedlinePlus: a corpus of health topics, drugs and supplements, laboratory test information, and medical encyclopedia texts. Contains 7,033 articles in English and Spanish.
UFAL Medical Corpus: a collection of parallel corpora of medical and general-domain texts.
All corpus data files can be found at the following link: http://temu.bsc.es/mespen/
Pre-trained Models
FastText
We used the FastText (Bojanowski et al., 2016) implementation to train our word embeddings on the preprocessed Spanish Biomedical corpus (FastText-SBC). In addition, we trained a concept embedding model by replacing biomedical concepts in the Spanish Biomedical corpus with their unique SNOMED-CT Spanish Edition identifier (SNOMED-SBC). We used the PyMedTermino library (Lamy et al., 2015) for concept indexing, using full-text search and fuzzy search with a threshold.
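The concept-replacement step can be illustrated with a minimal sketch. It assumes a small, hypothetical term-to-identifier table (the identifiers below are placeholders, not real SNOMED-CT codes) and uses exact, case-insensitive matching only. It does not reproduce the PyMedTermino full-text or fuzzy search described above.

```python
import re

# Hypothetical lookup table: the identifiers are placeholders for illustration,
# not real SNOMED-CT codes.
CONCEPTS = {
    "infarto de miocardio": "SCTID_0001",
    "diabetes mellitus": "SCTID_0002",
    "diabetes": "SCTID_0003",
}


def build_pattern(terms):
    """Compile one alternation, longest terms first, so that multi-word
    concepts are preferred over shorter terms they contain."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def replace_concepts(text, concepts=CONCEPTS):
    """Replace every known concept mention with its identifier token."""
    pattern = build_pattern(concepts)
    return pattern.sub(lambda m: concepts[m.group(1).lower()], text)


print(replace_concepts("Paciente con diabetes mellitus e infarto de miocardio."))
```

After such a substitution each concept occurs in the corpus as a single token, so the embedding model learns one vector per identifier rather than one per surface word.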
Training Parameters
Dimension = 300
epoch=10,20
min_count=20
neg=20
t=6e-5
thread=7
encoding='utf8'
min subword-ngram = 3
max subword-ngram = 6
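As a rough illustration, the parameters above can be mapped onto the flags of the fastText command-line tool. This is a sketch under assumptions: the source does not state whether skipgram or cbow was used, so `skipgram` is a guess, and the input and output file names are hypothetical. The `encoding` setting has no command-line counterpart and is omitted. The script only builds and prints the commands and does not run any training.

```python
import shlex

# Settings shared by both runs, keyed by fastText CLI flag name.
PARAMS = {
    "dim": 300,      # Dimension = 300
    "minCount": 20,  # min_count = 20
    "neg": 20,       # number of negative samples
    "t": 6e-5,       # sampling threshold
    "thread": 7,
    "minn": 3,       # min subword n-gram length
    "maxn": 6,       # max subword n-gram length
}


def fasttext_command(input_path, output_prefix, epoch, mode="skipgram"):
    """Build the argument list for one training run.

    `mode` is an assumption: the source does not say skipgram or cbow.
    """
    args = ["fasttext", mode, "-input", input_path,
            "-output", output_prefix, "-epoch", str(epoch)]
    for flag, value in PARAMS.items():
        args += ["-" + flag, str(value)]
    return args


# One command per epoch setting listed above (10 and 20).
# File names are hypothetical placeholders.
for epoch in (10, 20):
    cmd = fasttext_command("sbc_preprocessed.txt", f"fasttext_sbc_e{epoch}", epoch)
    print(shlex.join(cmd))
```

The two epoch values correspond to the two FastText-SBC models listed under the embedding links.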
Links to the embeddings
FastText-SBC, epoch 10
FastText-SBC, epoch 20
SNOMED-SBC
Created:
2021-03-31



