On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies
Source: DataONE | Updated: 2026-01-13 | Indexed: 2026-01-24
Download link:
https://search.dataone.org/view/sha256:4a89705276b25ad65ce5fa97acdd6659a934ea59dc647f8a21aef7734d263c63
Resource description:
Objectives: Mapping clinical classification systems, such as the International Classification of Diseases (ICD), is essential yet challenging. While the manual mapping method remains labour-intensive and lacks scalability, existing embedding-based automatic mapping methods, particularly those leveraging transformer-based pre-trained encoders, encounter two persistent challenges: (1) linguistic variation and (2) varying granular details in clinical conditions.
Materials and methods: We introduce an automatic mapping method that combines the representational power of pre-trained encoders with the reasoning capability of large language models (LLMs). For each ICD code, we generate two descriptions, (1) a hierarchy-augmented (HA) description and (2) an LLM-generated (LG) description, to capture rich semantic nuances and thereby address linguistic variation. Furthermore, we introduce a prompting framework (PR) that leverages LLM reasoning to handle granularity mismatches, including source-to-parent mappings.
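To make the idea of a hierarchy-augmented description concrete, the following is a minimal sketch, assuming that an HA description is formed by prefixing a code's title with the titles of its ancestors. The abstract does not state the exact format, so the separator, the `TOY_HIERARCHY` table, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy hierarchy: code -> (title, parent code or None).
# These entries are illustrative and are not read from the dataset files.
TOY_HIERARCHY: Dict[str, Tuple[str, Optional[str]]] = {
    "K20-K31": ("Diseases of esophagus, stomach and duodenum", None),
    "K25": ("Gastric ulcer", "K20-K31"),
    "K25.0": ("Acute gastric ulcer with hemorrhage", "K25"),
}


def ancestor_titles(code: str, hierarchy: Dict[str, Tuple[str, Optional[str]]]) -> List[str]:
    """Return the titles of all ancestors of `code`, ordered root first."""
    titles: List[str] = []
    parent = hierarchy[code][1]
    while parent is not None:
        titles.append(hierarchy[parent][0])
        parent = hierarchy[parent][1]
    titles.reverse()
    return titles


def hierarchy_augmented_description(
    code: str,
    hierarchy: Dict[str, Tuple[str, Optional[str]]],
    sep: str = " > ",  # assumed separator, not taken from the source
) -> str:
    """Join ancestor titles and the code's own title into one string."""
    return sep.join(ancestor_titles(code, hierarchy) + [hierarchy[code][0]])


if __name__ == "__main__":
    print(hierarchy_augmented_description("K25.0", TOY_HIERARCHY))
```

A string built this way carries the parent context that a bare leaf title lacks, which is one plausible reason such descriptions could help an encoder separate codes with near-identical wording.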
Results: Chapter-wise m...
# On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies
[Access this dataset on Dryad](https://doi.org/10.5061/dryad.tdz08kqc1)
This dataset (data.tar.xz) comprises chapter-wise International Classification of Diseases (ICD) codes across four ICD versions: ICD-9-CM, ICD-10-CM, ICD-10-AM, and ICD-11. It covers three clinical chapters: Diseases of the Digestive System (Dig), Infectious and Parasitic Diseases (Inf), and Diseases of the Respiratory System (Resp). In addition, the dataset includes ground-truth mappings for multiple ICD version pairs. To support reproducibility and downstream research, we also provide the LLM-generated descriptions and the raw embeddings.
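As a rough illustration of how raw embeddings and ground-truth mappings might be used together, the sketch below ranks target codes by cosine similarity to each source code and scores the ranking against a ground-truth table. It is not the authors' pipeline: the in-memory dictionaries stand in for whatever file layout the archive actually uses, the code identifiers and vectors are made up, and top-k accuracy is used only as an example metric.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_targets(
    source_vec: Sequence[float],
    target_embeddings: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[str]:
    """Return the k target codes most similar to the source vector."""
    ranked = sorted(
        target_embeddings,
        key=lambda code: cosine(source_vec, target_embeddings[code]),
        reverse=True,
    )
    return ranked[:k]


def top_k_accuracy(
    source_embeddings: Dict[str, Sequence[float]],
    target_embeddings: Dict[str, Sequence[float]],
    ground_truth: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of source codes with at least one correct target in the top k."""
    hits = 0
    for code, vec in source_embeddings.items():
        candidates = set(top_k_targets(vec, target_embeddings, k))
        if ground_truth[code] & candidates:
            hits += 1
    return hits / len(source_embeddings)


# Made-up 3-dimensional vectors and code names, for illustration only.
SRC = {"S1": [1.0, 0.0, 0.1], "S2": [0.0, 1.0, 0.0]}
TGT = {"T1": [0.9, 0.1, 0.0], "T2": [0.1, 0.9, 0.2], "T3": [0.0, 0.0, 1.0]}
GT = {"S1": {"T1"}, "S2": {"T2"}}

if __name__ == "__main__":
    print(top_k_targets(SRC["S1"], TGT, k=2))
    print(top_k_accuracy(SRC, TGT, GT, k=1))
```

Storing the ground truth as a set per source code keeps the sketch usable for one-to-many mappings, which commonly arise when two classification versions differ in granularity.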
## Description of the data and file structure
### File Structure Overview
```
./
├── *.csv      <- ICD code files (root level)
├── gt/        <- Ground-truth mappings
├── summary/   <-...
```
Created:
2026-01-14



