On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies
Source: DataCite Commons | Updated: 2026-02-02 | Indexed: 2026-04-25
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.tdz08kqc1
Description:
Objectives: Mapping clinical classification systems, such as the
International Classification of Diseases (ICD), is essential yet
challenging. While manual mapping remains labour-intensive and
lacks scalability, existing embedding-based automatic mapping methods,
particularly those leveraging transformer-based pre-trained encoders,
encounter two persistent challenges: (1) linguistic variation and (2)
varying granular details in clinical conditions. Materials and methods: We
introduce an automatic mapping method that combines the representational
power of pre-trained encoders with the reasoning capability of large
language models (LLMs). For each ICD code, we generate: (1)
hierarchy-augmented (HA) and (2) LLM-generated (LG) descriptions to
capture rich semantic nuances, addressing linguistic variation.
Furthermore, we introduce a prompting framework (PR) that leverages LLM
reasoning to handle granularity mismatches, including source-to-parent
mappings. Results: Chapter-wise mappings were performed between ICD
versions (ICD-9-CM↔ICD-10-CM and ICD-10-AM↔ICD-11) using multiple LLMs.
The proposed approach consistently outperformed the baseline across all
ICD pairs and chapters. For example, combining hierarchy-augmented
descriptions with Qwen3-8B–generated descriptions yielded an average Top-1
accuracy improvement of 6.67% across the mapping cases. A small-scale
pilot study further indicated that HA+LG remains effective in more
challenging one-to-many mappings. Discussion and conclusions: Our findings
demonstrate that integrating the representational power of pre-trained
encoders with LLM reasoning offers a robust, scalable strategy for
automatic ICD mapping.
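
The hierarchy-augmented (HA) retrieval step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the codes (S0, S1, T1, T2) and their descriptions are hypothetical toy entries rather than real ICD records, the function names are invented, and a bag-of-words vector stands in for the pre-trained encoder. The LLM-generated descriptions and the prompting framework are not modelled.

```python
import math
from collections import Counter

# Hypothetical toy codes and descriptions, for illustration only (not real ICD entries).
SOURCE_DESC = {"S0": "diabetes mellitus", "S1": "without complication"}
SOURCE_PARENT = {"S1": "S0"}  # S1 is a child of S0
TARGET_DESC = {
    "T1": "type 2 diabetes mellitus without complications",
    "T2": "essential hypertension without complication",
}
GOLD = {"S1": "T1"}  # assumed reference mapping for the toy example


def embed(text):
    """Toy bag-of-words vector; a stand-in for a pre-trained encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchy_augmented(code, descriptions, parents):
    """Prepend the descriptions of all ancestors to a code's own description."""
    parts = []
    node = code
    while node is not None:
        parts.append(descriptions[node])
        node = parents.get(node)
    return " ".join(reversed(parts))


def rank_targets(query, targets):
    """Return (target code, similarity) pairs, most similar first."""
    q = embed(query)
    scored = [(code, cosine(q, embed(text))) for code, text in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top1_accuracy(source_texts, targets, gold):
    """Fraction of source codes whose best-ranked target equals the gold target."""
    hits = sum(
        rank_targets(text, targets)[0][0] == gold[code]
        for code, text in source_texts.items()
    )
    return hits / len(source_texts)


plain = {"S1": SOURCE_DESC["S1"]}
augmented = {"S1": hierarchy_augmented("S1", SOURCE_DESC, SOURCE_PARENT)}

print(top1_accuracy(plain, TARGET_DESC, GOLD))      # 0.0
print(top1_accuracy(augmented, TARGET_DESC, GOLD))  # 1.0
```

In this toy case the leaf description alone ("without complication") matches the wrong target, while prepending the parent description recovers the intended one. With a real encoder the effect would depend on the model and the data, so the numbers here say nothing about the reported results.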
Provider:
Dryad
Created:
2026-01-13



