
MESINESP: Medical Semantic Indexing in Spanish - Train dataset

Indexed from NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/3826491
Description:
Please use the MESINESP2 corpus (the second edition of the shared task) instead: it is more carefully curated, of higher quality, and organized by document type (scientific articles, patents and clinical trials).

INTRODUCTION:
The Mesinesp (Spanish BioASQ track, see https://temu.bsc.es/mesinesp) training set has a total of 369,368 records.

The training dataset contains all records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. The URL used to retrieve records is as follows:
http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&
We have filtered out empty abstracts and non-Spanish abstracts.

The training dataset was crawled on 10/22/2019 (22 October 2019). The data is therefore a snapshot of that moment and may change over time. In fact, it is very likely that the data will undergo minor changes, as the different databases that make up LILACS and IBECS may add or modify the indexes.

ZIP STRUCTURE:
The training data sets contain 369,368 records from 26,609 different journals. Two different data sets are distributed, as described below:
- Original Train set with 369,368 records that also include the qualifiers, as retrieved from VHL.
- Pre-processed Train set with the 318,658 records with at least one DeCS code and with no qualifiers.
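As a rough illustration of the "at least one DeCS code" criterion, the following minimal Python sketch keeps only records whose code list is non-empty. It assumes the `decsCodes` field name from the corpus format of this dataset; the helper name `keep_coded_records` and the sample records are hypothetical. It does not strip qualifiers, because the description does not say how qualifiers are encoded, so it is not a reproduction of the authors' pre-processing.

```python
def keep_coded_records(articles):
    """Keep only records that carry at least one DeCS code.

    Assumption: each record is a dict with an optional "decsCodes" list.
    Qualifier removal is NOT handled here.
    """
    return [a for a in articles if a.get("decsCodes")]


# Made-up records, for illustration only
records = [
    {"id": "a1", "decsCodes": ["code1", "code2"]},
    {"id": "a2", "decsCodes": []},
    {"id": "a3"},
]

coded = keep_coded_records(records)
print([r["id"] for r in coded])
```

Records with an empty or missing `decsCodes` list are dropped, which mirrors the stated criterion but not necessarily the exact procedure used to build the distributed file.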
STATISTICS:
Abstracts’ length (measured in characters):
  Min: 12   Avg: 1140.41   Median: 1094   Max: 9428
Number of DeCS codes per file:
  Min: 1   Avg: 8.12   Median: 7   Max: 53

CORPUS FORMAT:
The training data sets are distributed as a JSON file with the following format:

    {
      "articles": [
        {
          "id": "Id of the article",
          "title": "Title of the article",
          "abstractText": "Content of the abstract",
          "journal": "Name of the journal",
          "year": 2018,
          "db": "Name of the database",
          "decsCodes": [
            "code1",
            "code2",
            "code3"
          ]
        }
      ]
    }

Note that the decsCodes field lists the DeCS IDs assigned to a record in the source data. Since the original XML data contain descriptors (not codes), we provide a DeCS conversion table (https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip) with:
- DeCS codes
- Preferred descriptor (the label used in the European DeCS 2019 set)
- List of synonyms (the descriptors and synonyms from both European and Latin Spanish DeCS 2019 data sets, separated by pipes)

For more details on the Latin and European Spanish DeCS codes see http://decs.bvs.br and http://decses.bvsalud.org/ respectively.

Please cite: Krallinger M, Krithara A, Nentidis A, Paliouras G, Villegas M. BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering. In European Conference on Information Retrieval 2020 Apr 14 (pp. 550-556). Springer, Cham.

Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial
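The corpus format and statistics described above can be explored with a short Python sketch. It parses a tiny made-up corpus in the documented JSON layout and computes the same kind of summary (min, average, median, max) for abstract length and number of DeCS codes. The helper names (`summarize`, `corpus_statistics`, `parse_decs_row`) are hypothetical. The column order of the conversion table (code, preferred descriptor, pipe-separated synonyms) is an assumption based on the order in which the description lists them, and the actual TSV may differ or include a header row.

```python
import json
from statistics import mean, median


def summarize(values):
    """Return min, average, median and max of a non-empty list of numbers."""
    return {"min": min(values), "avg": mean(values),
            "median": median(values), "max": max(values)}


def corpus_statistics(corpus):
    """Summarize abstract length (in characters) and DeCS codes per record."""
    articles = corpus["articles"]
    abstract_lengths = [len(a["abstractText"]) for a in articles]
    code_counts = [len(a["decsCodes"]) for a in articles]
    return {"abstract_chars": summarize(abstract_lengths),
            "decs_codes": summarize(code_counts)}


def parse_decs_row(line):
    """Split one conversion-table row into its three parts.

    Assumption: tab-separated columns in the order
    code, preferred descriptor, pipe-separated synonyms.
    """
    code, preferred, synonyms = line.rstrip("\n").split("\t")
    return {"code": code, "preferred": preferred,
            "synonyms": [s for s in synonyms.split("|") if s]}


# Made-up two-record corpus in the documented layout, for illustration only
sample = json.loads("""
{
  "articles": [
    {"id": "a1", "title": "t1", "abstractText": "abc", "journal": "j1",
     "year": 2018, "db": "IBECS", "decsCodes": ["code1"]},
    {"id": "a2", "title": "t2", "abstractText": "abcde", "journal": "j2",
     "year": 2017, "db": "LILACS", "decsCodes": ["code1", "code2", "code3"]}
  ]
}
""")

stats = corpus_statistics(sample)
row = parse_decs_row("code1\tpreferred label\tsynonym one|synonym two")
print(stats)
print(row)
```

For the real training file, the same functions would be applied to the object returned by `json.load` on the opened file; the full corpus is large, so loading it in one call needs enough memory to hold all records.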
Created:
2022-11-05