On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies

Source: DataONE. Updated 2026-01-13. Indexed 2026-01-24.

Download link:

https://search.dataone.org/view/sha256:4a89705276b25ad65ce5fa97acdd6659a934ea59dc647f8a21aef7734d263c63


Resource description:

Objectives: Mapping clinical classification systems, such as the International Classification of Diseases (ICD), is essential yet challenging. Manual mapping is labour-intensive and does not scale, while existing embedding-based automatic mapping methods, particularly those leveraging transformer-based pre-trained encoders, face two persistent challenges: (1) linguistic variation and (2) varying levels of granularity in clinical conditions.

Materials and methods: We introduce an automatic mapping method that combines the representational power of pre-trained encoders with the reasoning capability of large language models (LLMs). For each ICD code, we generate (1) hierarchy-augmented (HA) and (2) LLM-generated (LG) descriptions to capture rich semantic nuances, addressing linguistic variation. Furthermore, we introduce a prompting framework (PR) that leverages LLM reasoning to handle granularity mismatches, including source-to-parent mappings.

Results: Chapter-wise m...

# On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies

[Access this dataset on Dryad](https://doi.org/10.5061/dryad.tdz08kqc1)

This dataset (data.tar.xz) comprises chapter-wise International Classification of Diseases (ICD) codes across four ICD versions: ICD-9-CM, ICD-10-CM, ICD-10-AM, and ICD-11. It covers three clinical chapters: Diseases of the Digestive System (Dig), Infectious and Parasitic Diseases (Inf), and Diseases of the Respiratory System (Resp). In addition, the dataset includes ground-truth mappings for multiple ICD version pairs. To support reproducibility and downstream research, we also provide the LLM-generated descriptions and the raw embeddings.

## Description of the data and file structure

### File Structure Overview

```
./
├── *.csv       <- ICD code files (root level)
├── gt/         <- Ground-truth mappings
├── summary/    <- ...
```
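The abstract describes embedding-based mapping only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that each code already has an embedding vector, that a source code is mapped to the target code with the highest cosine similarity, and that a hierarchy-augmented description is formed by prefixing a code's description with its ancestor titles. The function names, the ` > ` separator, the placeholder code labels (`SRC_A`, `TGT_X`, and so on), and the toy vectors are all hypothetical.

```python
import math


def hierarchy_augmented(description, ancestors):
    """Join ancestor titles and the code's own description.

    The ' > ' separator is an assumption made for this sketch.
    """
    return " > ".join(list(ancestors) + [description])


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_codes(source_emb, target_emb):
    """Map each source code to its most similar target code."""
    mapping = {}
    for s_code, s_vec in source_emb.items():
        best = max(target_emb, key=lambda t: cosine(s_vec, target_emb[t]))
        mapping[s_code] = (best, cosine(s_vec, target_emb[best]))
    return mapping


# Toy 3-dimensional vectors standing in for encoder outputs (made up).
source = {"SRC_A": [0.9, 0.1, 0.0], "SRC_B": [0.0, 0.2, 0.9]}
target = {"TGT_X": [1.0, 0.0, 0.0], "TGT_Y": [0.0, 0.0, 1.0]}

mapping = map_codes(source, target)

# Illustrative hierarchy-augmented text for one digestive-chapter condition.
text = hierarchy_augmented(
    "Acute appendicitis",
    ["Diseases of the digestive system", "Diseases of appendix"],
)
```

In practice the vectors would come from a pre-trained encoder applied to the HA or LG descriptions, or from the raw embeddings shipped with the dataset. A plain nearest-neighbour lookup like this one does not address granularity mismatches, which the abstract assigns to the LLM prompting framework.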


Created:

2026-01-14
