Neoplasm topography and morphology corpus
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5555431
Description:
Pathology reports provide valuable information for cancer registries to understand, plan and implement strategies to mitigate the impact of cancer. However, coding key information from unstructured reports is a time-consuming manual process carried out by experts. Here we report an automatic deep learning-based system that recognizes tumor morphology and topography mentions in Spanish free text and suggests codes from the International Classification of Diseases for Oncology (ICD-O). This work used the morphology guidelines and the Cantemist resource, an open corpus annotated with tumor morphology mentions created by the Barcelona Supercomputing Center, together with topography guidelines that we developed, inspired by the former. In this way we generated an annotated internal corpus of tumor morphology and topography mentions. We applied transfer learning from state-of-the-art pre-trained language models to create a Named Entity Recognition (NER) model. The mentions found with this architecture are subsequently coded using a search engine tailored to the ICD-O codes. Our NER models achieved an F1 score of 0.86 and 0.90 for tumor morphology and topography, respectively. The overall automatic coding system achieved an accuracy at five suggestions of 0.72 and 0.65 for tumor morphology and topography, respectively. Our results demonstrate the feasibility of implementing NLP tools in the routine workflow of a cancer center to extract and code valuable information from pathology reports.
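The description reports "accuracy at five suggestions" without spelling out how it is computed. A minimal sketch of one common reading follows, assuming a mention counts as correct when its gold ICD-O code appears among the first five ranked suggestions. The function name `accuracy_at_k` and the example codes are illustrative only and are not taken from the dataset or the authors' code.

```python
from typing import Sequence


def accuracy_at_k(gold: Sequence[str],
                  ranked: Sequence[Sequence[str]],
                  k: int = 5) -> float:
    """Fraction of mentions whose gold code is among the top-k suggestions.

    Assumption: this is one plausible definition of "accuracy at five
    suggestions"; the source does not give the exact formula.
    """
    if len(gold) != len(ranked):
        raise ValueError("gold and ranked must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for code, suggestions in zip(gold, ranked)
               if code in suggestions[:k])
    return hits / len(gold)


# Illustrative ICD-O morphology codes (hypothetical example data).
gold_codes = ["8500/3", "8140/3", "8070/3"]
suggested = [
    ["8500/3", "8500/2"],   # gold code ranked first
    ["8010/3", "8140/3"],   # gold code ranked second
    ["8000/3"],             # gold code missing
]

print(accuracy_at_k(gold_codes, suggested, k=5))
print(accuracy_at_k(gold_codes, suggested, k=1))
```

Under this reading, a larger `k` can only keep or raise the score, which is why a top-five figure is usually more forgiving than exact-match accuracy.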
The Cantemist corpus [https://doi.org/10.5281/zenodo.3773228], a tumor morphology corpus created in Spain, was developed at the Barcelona Supercomputing Center (funded by the "Plan de Tecnologías del Lenguaje"): "Miranda-Escalada, A., Farré, E., & Krallinger, M. (2020). Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings."
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
We are releasing the dataset in 2 formats:
corpus_raw.zip: Contains the raw text files for each document along with its annotation file in Standoff format
corpus.zip: Contains the corpus already tokenized and annotated using the IOB2 format. The corpus is separated into train, test and development subsets.
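As a rough guide to working with the tokenized release, the sketch below reads IOB2-tagged text and groups tokens into entity mentions. It assumes a CoNLL-style layout (one token per line, the tag in the last whitespace-separated column, a blank line between sentences). The actual column layout and the label names in corpus.zip are not stated here, so `MORPHOLOGY` and `TOPOGRAPHY` in the sample are hypothetical placeholders.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB2 tag) pairs


def parse_iob2(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:               # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: token is the first column, tag is the last column.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (label, mention text) pairs."""
    mentions: List[Tuple[str, str]] = []
    label, tokens = None, []
    for token, tag in sentence:
        if tag.startswith("B-"):
            if label is not None:
                mentions.append((label, " ".join(tokens)))
            label, tokens = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open mention,
            # closes whatever mention is open.
            if label is not None:
                mentions.append((label, " ".join(tokens)))
            label, tokens = None, []
    if label is not None:
        mentions.append((label, " ".join(tokens)))
    return mentions


# Hypothetical sample; label names are placeholders, not the corpus labels.
SAMPLE = """carcinoma B-MORPHOLOGY
ductal I-MORPHOLOGY
infiltrante I-MORPHOLOGY
de O
mama B-TOPOGRAPHY
. O

Sin O
metástasis B-MORPHOLOGY
"""

for sent in parse_iob2(SAMPLE):
    print(extract_mentions(sent))
```

In practice the same functions could be applied to the text of each file in the train, test and development subsets after unzipping, once the real column layout has been checked against a sample file.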
Created:
2021-10-15



