
MESINESP: Post-workshop datasets. Silver Standard and annotator records

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/3946557
Resource description:
Note: Prefer the MESINESP2 corpus (the second edition of the shared task), which is more carefully curated, of higher quality, and organized by document type (scientific articles, patents and clinical trials).

The MESINESP Challenge (the Spanish BioASQ track, see https://temu.bsc.es/mesinesp) was held in May-June 2020. Following strong participation and the manual annotation of an evaluation dataset, two additional datasets are now released:

1) "all_annotations_withIDsv3.tsv" is a tab-separated file with all manual annotations (both validated and non-validated) of the evaluation dataset prepared for the competition. It contains the following fields:

annotatorName: Human annotator ID
documentId: Document ID in the source database
decsCode: A DeCS code that was added to the document or validated
timestamp: When the code was added
validated: Whether the code had been validated by another annotator at that point
SpanishTerm: The Spanish descriptor corresponding to the DeCS code
mesinespId: The internal document ID in the distributed evaluation file
dataset: Whether the document is part of the evaluation or the test sets (the example rows show the value "dev")
source: The database the document was taken from

Example:

annotatorName  documentId      decsCode  timestamp                 validated  SpanishTerm  mesinespId        dataset  source
A7             biblio-1001069  6893      2020-01-17T11:27:07.000Z  false      caballos     mesinesp-dev-671  dev      LILACS
A7             biblio-1001069  4345      2020-01-17T11:27:12.000Z  false      perros       mesinesp-dev-671  dev      LILACS

2) A "Silver Standard" created from the 24 system runs submitted by 6 participating teams. It lists every DeCS code submitted for each document in the test set, together with information that helps assess the reliability and source of each assignment for anyone who wants to use this dataset to enrich their training data. It contains more than 5.8 million data points and is structured as follows:

SubmissionName: Alias of the team that submitted the run
REALdocumentId: The real ID of the document
mesinespId: The ID assigned to the document in the MESINESP evaluation dataset
docSource: The source database
decsCode: The DeCS code assigned to the document by the team's system
SpanishTerm: The Spanish descriptor of the DeCS code
MiF: The micro-F1 scored by that system's run
MiR: The micro-recall scored by that system's run
MiP: The micro-precision scored by that system's run
Acc: The accuracy scored by that system's run
consensus: The number of runs in which that DeCS code was assigned to this document by the participating teams (maximum 24)

Example:

SubmissionName  REALdocumentId  mesinespId                 docSource  decsCode  SpanishTerm   MiF     MiR     MiP     Acc     consensus
AN              ibc-177565      mesinesp-evaluation-00001  IBECS      28567     riesgo        0.2054  0.1930  0.2196  0.1198  4
AN              ibc-177565      mesinesp-evaluation-00001  IBECS      15335     trabajo       0.2054  0.1930  0.2196  0.1198  4
AN              ibc-177565      mesinesp-evaluation-00001  IBECS      33182     conocimiento  0.2054  0.1930  0.2196  0.1198  7

For a detailed description of the Challenge, please cite:

Anastasios Nentidis, Anastasia Krithara, Konstantinos Bougiatiotis, Martin Krallinger, Carlos Rodriguez-Penagos, Marta Villegas and Georgios Paliouras. Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering (2020). Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). Thessaloniki, Greece, September 22-25, 2020.

Citation:

@inproceedings{durusan2019overview,
  title={Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering},
  author={Nentidis, Anastasios and Krithara, Anastasia and Bougiatiotis, Konstantinos and Krallinger, Martin and Rodriguez-Penagos, Carlos and Villegas, Marta and Paliouras, Georgios},
  booktitle={Experimental IR Meets Multilinguality, Multimodality, and Interaction Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), Thessaloniki, Greece, September 22--25, 2020, Proceedings},
  volume={12260},
  year={2020},
  organization={Springer}
}

Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial
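
Because the Silver Standard records a consensus count for every submitted DeCS code, one plausible way to use it for training-data enrichment is to keep only the codes that enough runs agreed on. The following is a minimal sketch of that idea, not an official loader: it assumes the file is tab-separated with a header row matching the field names listed in the description, the helper names read_tsv and codes_by_consensus are hypothetical, and the threshold of 5 is an arbitrary illustration. The sample rows are the three example rows from the description.

```python
import csv
import io
from collections import defaultdict

# Column names as listed in the Silver Standard description.
HEADER = ["SubmissionName", "REALdocumentId", "mesinespId", "docSource",
          "decsCode", "SpanishTerm", "MiF", "MiR", "MiP", "Acc", "consensus"]

# The three example rows from the description, rebuilt as tab-separated text.
EXAMPLE_ROWS = [
    ["AN", "ibc-177565", "mesinesp-evaluation-00001", "IBECS", "28567",
     "riesgo", "0.2054", "0.1930", "0.2196", "0.1198", "4"],
    ["AN", "ibc-177565", "mesinesp-evaluation-00001", "IBECS", "15335",
     "trabajo", "0.2054", "0.1930", "0.2196", "0.1198", "4"],
    ["AN", "ibc-177565", "mesinesp-evaluation-00001", "IBECS", "33182",
     "conocimiento", "0.2054", "0.1930", "0.2196", "0.1198", "7"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + EXAMPLE_ROWS)


def read_tsv(handle):
    """Read a tab-separated file with a header line into a list of dicts.

    Assumption: the real file starts with a header row; this is not
    confirmed by the description.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def codes_by_consensus(rows, min_consensus):
    """Map each mesinespId to the DeCS codes assigned by >= min_consensus runs.

    A (document, code) pair is repeated once per run that submitted it,
    so the codes are collected in a set to drop the repeats.
    """
    selected = defaultdict(set)
    for row in rows:
        if int(row["consensus"]) >= min_consensus:
            selected[row["mesinespId"]].add(row["decsCode"])
    return dict(selected)


rows = read_tsv(io.StringIO(SAMPLE))
kept = codes_by_consensus(rows, min_consensus=5)  # 5 is an illustrative threshold
print(kept)  # {'mesinesp-evaluation-00001': {'33182'}}
```

To apply the same filter to the downloaded file, the in-memory sample can be replaced with a file handle opened with encoding="utf-8" and newline="". A higher threshold trades coverage for reliability, and the per-run MiF, MiR, MiP and Acc columns could be used in a similar way to keep only codes from better-scoring runs.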

Created:
2022-11-05