On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies

Creator: Dryad
Published: 2026-02-02 12:55:06
License: No description available

Source: DataCite Commons. Updated: 2026-02-02. Indexed: 2026-04-25.

Download link:

https://datadryad.org/dataset/doi:10.5061/dryad.tdz08kqc1


Resource description:

Objectives: Mapping between clinical classification systems, such as versions of the International Classification of Diseases (ICD), is essential yet challenging. Manual mapping is labour-intensive and does not scale, while existing embedding-based automatic mapping methods, particularly those built on transformer-based pre-trained encoders, face two persistent challenges: (1) linguistic variation and (2) differing levels of granularity in how clinical conditions are described.

Materials and methods: We introduce an automatic mapping method that combines the representational power of pre-trained encoders with the reasoning capability of large language models (LLMs). For each ICD code, we generate (1) hierarchy-augmented (HA) and (2) LLM-generated (LG) descriptions to capture rich semantic nuances, addressing linguistic variation. We further introduce a prompting framework (PR) that leverages LLM reasoning to handle granularity mismatches, including source-to-parent mappings.

Results: Chapter-wise mappings were performed between ICD versions (ICD-9-CM↔ICD-10-CM and ICD-10-AM↔ICD-11) using multiple LLMs. The proposed approach consistently outperformed the baseline across all ICD pairs and chapters. For example, combining hierarchy-augmented descriptions with Qwen3-8B–generated descriptions yielded an average Top-1 accuracy improvement of 6.67% across the mapping cases. A small-scale pilot study further indicated that HA+LG remains effective in more challenging one-to-many mappings.

Discussion and conclusions: Our findings demonstrate that integrating the representational power of pre-trained encoders with LLM reasoning offers a robust, scalable strategy for automatic ICD mapping.
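The hierarchy-augmented (HA) matching and Top-1 evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the codes (S0.1, T0.1, and so on) and their descriptions are hypothetical toy data rather than real ICD entries, the bag-of-words `embed` function is only a stand-in for a pre-trained encoder, and every function name is an assumption made for illustration. The LLM-generated descriptions and the prompting framework are not covered.

```python
import math
from collections import Counter

def hierarchy_augmented(code, descriptions, parents):
    """Join a code's description with those of its ancestors, root first."""
    chain = []
    current = code
    while current is not None:
        chain.append(descriptions[current])
        current = parents.get(current)  # None once the root is reached
    return " ; ".join(reversed(chain))

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    cleaned = text.lower().replace(";", " ").replace(",", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_targets(source_text, target_texts):
    """Return target codes sorted by descending similarity to the source."""
    src = embed(source_text)
    scored = [(cosine(src, embed(text)), code)
              for code, text in target_texts.items()]
    return [code for _, code in sorted(scored, key=lambda x: (-x[0], x[1]))]

def top1_accuracy(predictions, gold):
    """Fraction of source codes whose top-ranked target is the gold target."""
    hits = sum(1 for src, ranked in predictions.items()
               if ranked and ranked[0] == gold[src])
    return hits / len(predictions)

# Hypothetical toy hierarchies (not real ICD codes or descriptions).
source_desc = {
    "S0": "Diseases of the respiratory system",
    "S0.1": "Pneumonia, organism unspecified",
    "S1": "Diseases of the circulatory system",
    "S1.1": "Heart failure, unspecified",
}
source_parents = {"S0.1": "S0", "S1.1": "S1"}
target_desc = {
    "T0": "Respiratory diseases",
    "T0.1": "Pneumonia, unspecified organism",
    "T1": "Circulatory diseases",
    "T1.1": "Heart failure, unspecified",
}
target_parents = {"T0.1": "T0", "T1.1": "T1"}
gold = {"S0.1": "T0.1", "S1.1": "T1.1"}

# Build HA descriptions for the leaf codes on the target side.
target_texts = {code: hierarchy_augmented(code, target_desc, target_parents)
                for code in ("T0.1", "T1.1")}

# Rank targets for each source leaf and score the result.
predictions = {
    src: rank_targets(
        hierarchy_augmented(src, source_desc, source_parents), target_texts)
    for src in gold
}
accuracy = top1_accuracy(predictions, gold)
```

In this sketch the ancestor descriptions add chapter-level words to each leaf description, which is the intuition behind hierarchy augmentation; with a real encoder, the same ranking step would operate on dense vectors instead of word counts.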

Provider:

Dryad

Created:

2026-01-13
