
singh-aditya/MACCROBAT_biomedical_ner

Source: Hugging Face. Updated 2023-11-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/singh-aditya/MACCROBAT_biomedical_ner
Resource description:
---
license: mit
task_categories:
- token-classification
language:
- en
tags:
- biology
- medical
size_categories:
- 1M<n<10M
field:
- data
---

# MACCROBAT-biomedical-ner

This is the same data as the [MACCROBAT release on figshare](https://figshare.com/articles/dataset/MACCROBAT2018/9764942); the only difference is that it has been converted into the Hugging Face dataset format so that it can be loaded easily and used wherever needed.

The original format was converted to the Hugging Face dataset format in the following steps (**see the `create_dataset.py` file for details**):

* Read each corresponding pair of `*.txt` and `*.ann` files.
* Use `pandas` to convert the `*.ann` file into a dataframe.
* Process the dataframe and convert the NER label information into:
  ```JSON
  {
      "text": "ner-text",
      "label": "ner-label",
      "start": 10,
      "end": 20
  }
  ```
* Convert the standard labels into `B-Tag` and `I-Tag`, where `B-` marks the beginning of a tag and `I-` marks the inside of a tag.
* Finally, create the JSON and upload it here.

## Source Data

This ZIP-compressed file contains 200 source documents (in plain text, one sentence per line) and 200 annotation documents (in brat standoff format). Documents are named using PubMed document IDs, e.g. "15939911.txt" contains text from the document "A young man with palpitations and Ebstein's anomaly of the tricuspid valve" by Marcu and Donohue. Text is from PubMed Central full-text documents but has been edited to include only clinical case report details. All annotations were created manually.

"MACCROBAT2020" is the second release of this dataset, following "MACCROBAT2018". The consistency and format of annotations have been improved in the newest version.

## Uses

Use the snippet below to load the data. With some additional processing, it can be used to fine-tune a medical NER model.
```Python
from datasets import load_dataset

# load the data
medical_ner_data = load_dataset("singh-aditya/MACCROBAT_biomedical_ner")
print(medical_ner_data)
```

```
DatasetDict({
    train: Dataset({
        features: ['ner_labels', 'tokens', 'full_text', 'ner_info'],
        num_rows: 200
    })
})
```

## Dataset Structure

```
{ 'full_text': "CASE: A 28-year-old previously healthy man presented with a 6-week history of palpitations.\nThe symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.\nExcept for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.\nAn electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.\nTransthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).\nThe anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).\nContrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).\nThe patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).\nHis post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right 
ventricle.\nThe patient reported no recurrence of palpitations at follow-up 6 months after the ablation.\n", 'ner_info': [ { 'text': '28-year-old', 'label': 'AGE', 'start': 8, 'end': 19 }, {'text': 'previously healthy', 'label': 'HISTORY', 'start': 20, 'end': 38}, {'text': 'man', 'label': 'SEX', 'start': 39, 'end': 42}, {'text': 'presented', 'label': 'CLINICAL_EVENT', 'start': 43, 'end': 52}, {'text': '6-week', 'label': 'DURATION', 'start': 60, 'end': 66}, {'text': 'palpitations', 'label': 'SIGN_SYMPTOM', 'start': 78, 'end': 90}, {'text': 'symptoms', 'label': 'COREFERENCE', 'start': 96, 'end': 104}, {'text': 'rest', 'label': 'CLINICAL_EVENT', 'start': 121, 'end': 125}, {'text': '2–3 times per week', 'label': 'FREQUENCY', 'start': 127, 'end': 145}, {'text': 'up to 30 minutes at a time', 'label': 'DETAILED_DESCRIPTION', 'start': 154, 'end': 180}, {'text': 'dyspnea', 'label': 'SIGN_SYMPTOM', 'start': 206, 'end': 213}, {'text': 'grade 2/6', 'label': 'LAB_VALUE', 'start': 228, 'end': 237}, {'text': 'holosystolic', 'label': 'DETAILED_DESCRIPTION', 'start': 238, 'end': 250}, {'text': 'tricuspid', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 251, 'end': 260}, {'text': 'regurgitation murmur', 'label': 'SIGN_SYMPTOM', 'start': 261, 'end': 281}, {'text': 'left sternal border', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 301, 'end': 320}, {'text': 'inspiratory accentuation', 'label': 'DETAILED_DESCRIPTION', 'start': 326, 'end': 350}, {'text': 'physical examination', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 353, 'end': 373}, {'text': 'unremarkable', 'label': 'LAB_VALUE', 'start': 382, 'end': 394}, {'text': 'electrocardiogram', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 408, 'end': 425}, {'text': 'ECG', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 427, 'end': 430}, {'text': 'normal', 'label': 'LAB_VALUE', 'start': 441, 'end': 447}, {'text': 'sinus rhythm', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 448, 'end': 460}, {'text': 'Wolff– Parkinson– White pre-excitation pattern', 'label': 
'SIGN_SYMPTOM', 'start': 467, 'end': 513}, {'text': 'right-sided', 'label': 'DETAILED_DESCRIPTION', 'start': 542, 'end': 553}, {'text': 'accessory pathway', 'label': 'DISEASE_DISORDER', 'start': 554, 'end': 571}, {'text': 'Transthoracic', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 573, 'end': 586}, {'text': 'echocardiography', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 587, 'end': 603}, {'text': "Ebstein's anomaly", 'label': 'DISEASE_DISORDER', 'start': 633, 'end': 650}, {'text': 'tricuspid valve', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 658, 'end': 673}, {'text': 'apical displacement', 'label': 'SIGN_SYMPTOM', 'start': 680, 'end': 699}, {'text': 'valve', 'label': 'COREFERENCE', 'start': 707, 'end': 712}, {'text': 'atrialized', 'label': 'DISEASE_DISORDER', 'start': 734, 'end': 744}, {'text': 'right ventricle', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 746, 'end': 761}, {'text': 'right atrium', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 793, 'end': 805}, {'text': 'inlet', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 814, 'end': 819}, {'text': 'right ventricle', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 844, 'end': 859}, {'text': 'anterior tricuspid valve leaflet', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 874, 'end': 906}, {'text': 'elongated', 'label': 'SIGN_SYMPTOM', 'start': 911, 'end': 920}, {'text': 'septal leaflet', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 950, 'end': 964}, {'text': 'rudimentary', 'label': 'SIGN_SYMPTOM', 'start': 969, 'end': 980}, {'text': 'Contrast', 'label': 'DETAILED_DESCRIPTION', 'start': 1002, 'end': 1010}, {'text': 'echocardiography', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 1011, 'end': 1027}, {'text': 'using saline', 'label': 'DETAILED_DESCRIPTION', 'start': 1028, 'end': 1040}, {'text': 'patent foramen ovale', 'label': 'DISEASE_DISORDER', 'start': 1052, 'end': 1072}, {'text': 'right-to-left shunting', 'label': 'SIGN_SYMPTOM', 'start': 1078, 'end': 1100}, {'text': 'bubbles', 'label': 'SIGN_SYMPTOM', 'start': 1105, 'end': 1112}, {'text': 
'left atrium', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 1120, 'end': 1131}, {'text': 'electrophysiologic study', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 1167, 'end': 1191}, {'text': 'mapping', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 1197, 'end': 1204}, {'text': 'accessory pathway', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 1212, 'end': 1229}, {'text': 'radiofrequency', 'label': 'DETAILED_DESCRIPTION', 'start': 1243, 'end': 1257}, {'text': 'ablation', 'label': 'THERAPEUTIC_PROCEDURE', 'start': 1258, 'end': 1266}, {'text': 'ablation catheter', 'label': 'THERAPEUTIC_PROCEDURE', 'start': 1363, 'end': 1380}, {'text': 'ECG', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 1401, 'end': 1404}, {'text': 'prolonged', 'label': 'LAB_VALUE', 'start': 1414, 'end': 1423}, {'text': 'PR interval', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 1424, 'end': 1435}, {'text': 'odd', 'label': 'LAB_VALUE', 'start': 1443, 'end': 1446}, {'text': '“second”', 'label': 'LAB_VALUE', 'start': 1447, 'end': 1455}, {'text': 'QRS complex', 'label': 'DIAGNOSTIC_PROCEDURE', 'start': 1456, 'end': 1467}, {'text': 'leads III, aVF and V2–V4', 'label': 'DETAILED_DESCRIPTION', 'start': 1471, 'end': 1495}, {'text': 'abnormal impulse conduction', 'label': 'DISEASE_DISORDER', 'start': 1528, 'end': 1555}, {'text': 'atrialized', 'label': 'DISEASE_DISORDER', 'start': 1564, 'end': 1574}, {'text': 'right ventricle', 'label': 'BIOLOGICAL_STRUCTURE', 'start': 1576, 'end': 1591}, {'text': 'palpitations', 'label': 'SIGN_SYMPTOM', 'start': 1631, 'end': 1643}, {'text': 'follow-up', 'label': 'CLINICAL_EVENT', 'start': 1647, 'end': 1656}, {'text': '6 months after', 'label': 'DATE', 'start': 1657, 'end': 1671}], 'tokens': ['CASE: A ', '28-year-old', ' ', 'previously healthy', ' ', 'man', ' ', 'presented', ' with a ', '6-week', ' history of ', 'palpitations', '.\nThe ', 'symptoms', ' occurred during ', 'rest', ', ', '2–3 times per week', ', lasted ', 'up to 30 minutes at a time', ' and were associated with ', 'dyspnea', 
'.\nExcept for a ', 'grade 2/6', ' ', 'holosystolic', ' ', 'tricuspid', ' ', 'regurgitation murmur', ' (best heard at the ', 'left sternal border', ' with ', 'inspiratory accentuation', '), ', 'physical examination', ' yielded ', 'unremarkable', ' findings.\nAn ', 'electrocardiogram', ' (', 'ECG', ') revealed ', 'normal', ' ', 'sinus rhythm', ' and a ', 'Wolff– Parkinson– White pre-excitation pattern', ' (Fig.1: Top), produced by a ', 'right-sided', ' ', 'accessory pathway', '.\n', 'Transthoracic', ' ', 'echocardiography', ' demonstrated the presence of ', "Ebstein's anomaly", ' of the ', 'tricuspid valve', ', with ', 'apical displacement', ' of the ', 'valve', ' and formation of an “', 'atrialized', '” ', 'right ventricle', ' (a functional unit between the ', 'right atrium', ' and the ', 'inlet', ' [inflow] portion of the ', 'right ventricle', ') (Fig.2).\nThe ', 'anterior tricuspid valve leaflet', ' was ', 'elongated', ' (Fig.2C, arrow), whereas the ', 'septal leaflet', ' was ', 'rudimentary', ' (Fig.2C, arrowhead).\n', 'Contrast', ' ', 'echocardiography', ' ', 'using saline', ' revealed a ', 'patent foramen ovale', ' with ', 'right-to-left shunting', ' and ', 'bubbles', ' in the ', 'left atrium', ' (Fig.2D).\nThe patient underwent an ', 'electrophysiologic study', ' with ', 'mapping', ' of the ', 'accessory pathway', ', followed by ', 'radiofrequency', ' ', 'ablation', ' (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ', 'ablation catheter', ').\nHis post-ablation ', 'ECG', ' showed a ', 'prolonged', ' ', 'PR interval', ' and an ', 'odd', ' ', '“second”', ' ', 'QRS complex', ' in ', 'leads III, aVF and V2–V4', ' (Fig.1Bottom), a consequence of ', 'abnormal impulse conduction', ' in the “', 'atrialized', '” ', 'right ventricle', '.\nThe patient reported no recurrence of ', 'palpitations', ' at ', 'follow-up', ' ', '6 months after', ' the ablation.\n'], 'ner_labels': [0, 5, 0, 39, 0, 65, 0, 13, 0, 32, 0, 69, 0, 18, 
0, 13, 0, 35, 0, 22, 0, 69, 0, 42, 0, 22, 0, 12, 0, 69, 0, 12, 0, 22, 0, 24, 0, 42, 0, 24, 0, 24, 0, 42, 0, 24, 0, 69, 0, 22, 0, 26, 0, 12, 0, 24, 0, 26, 0, 12, 0, 69, 0, 18, 0, 26, 0, 12, 0, 12, 0, 12, 0, 12, 0, 12, 0, 69, 0, 12, 0, 69, 0, 22, 0, 24, 0, 22, 0, 26, 0, 69, 0, 69, 0, 12, 0, 24, 0, 24, 0, 12, 0, 22, 0, 75, 0, 75, 0, 24, 0, 42, 0, 24, 0, 42, 0, 42, 0, 24, 0, 22, 0, 26, 0, 26, 0, 12, 0, 69, 0, 13, 0, 19, 0]}
```

## NER-Labels

```Python
# Order matters: the position of each label is the id used in `ner_labels`.
NER_lables = [
    "O",
    "B-ACTIVITY", "I-ACTIVITY",
    "I-ADMINISTRATION", "B-ADMINISTRATION",
    "B-AGE", "I-AGE",
    "I-AREA", "B-AREA",
    "B-BIOLOGICAL_ATTRIBUTE", "I-BIOLOGICAL_ATTRIBUTE",
    "I-BIOLOGICAL_STRUCTURE", "B-BIOLOGICAL_STRUCTURE",
    "B-CLINICAL_EVENT", "I-CLINICAL_EVENT",
    "B-COLOR", "I-COLOR",
    "I-COREFERENCE", "B-COREFERENCE",
    "B-DATE", "I-DATE",
    "I-DETAILED_DESCRIPTION", "B-DETAILED_DESCRIPTION",
    "I-DIAGNOSTIC_PROCEDURE", "B-DIAGNOSTIC_PROCEDURE",
    "I-DISEASE_DISORDER", "B-DISEASE_DISORDER",
    "B-DISTANCE", "I-DISTANCE",
    "B-DOSAGE", "I-DOSAGE",
    "I-DURATION", "B-DURATION",
    "I-FAMILY_HISTORY", "B-FAMILY_HISTORY",
    "B-FREQUENCY", "I-FREQUENCY",
    "I-HEIGHT", "B-HEIGHT",
    "B-HISTORY", "I-HISTORY",
    "I-LAB_VALUE", "B-LAB_VALUE",
    "I-MASS", "B-MASS",
    "I-MEDICATION", "B-MEDICATION",
    "I-NONBIOLOGICAL_LOCATION", "B-NONBIOLOGICAL_LOCATION",
    "I-OCCUPATION", "B-OCCUPATION",
    "B-OTHER_ENTITY", "I-OTHER_ENTITY",
    "B-OTHER_EVENT", "I-OTHER_EVENT",
    "I-OUTCOME", "B-OUTCOME",
    "I-PERSONAL_BACKGROUND", "B-PERSONAL_BACKGROUND",
    "B-QUALITATIVE_CONCEPT", "I-QUALITATIVE_CONCEPT",
    "I-QUANTITATIVE_CONCEPT", "B-QUANTITATIVE_CONCEPT",
    "B-SEVERITY", "I-SEVERITY",
    "B-SEX", "I-SEX",
    "B-SHAPE", "I-SHAPE",
    "B-SIGN_SYMPTOM", "I-SIGN_SYMPTOM",
    "B-SUBJECT", "I-SUBJECT",
    "B-TEXTURE", "I-TEXTURE",
    "B-THERAPEUTIC_PROCEDURE", "I-THERAPEUTIC_PROCEDURE",
    "I-TIME", "B-TIME",
    "B-VOLUME", "I-VOLUME",
    "I-WEIGHT", "B-WEIGHT",
]
```

**BibTeX:**

```bibtex
@article{Caufield2020,
  author = "J. Harry Caufield",
  title = "{MACCROBAT}",
  year = "2020",
  month = "1",
  url = "https://figshare.com/articles/dataset/MACCROBAT2018/9764942",
  doi = "10.6084/m9.figshare.9764942.v2"
}
```
Provider:
singh-aditya
Summary of original information

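MACCROBAT-biomedical-ner dataset overview

Dataset description

This dataset contains the same data as the original figshare release linked in the description above; the only difference is that it has been converted into the Hugging Face dataset format so that it is easy to load and use.

Data conversion steps

  1. Read each corresponding pair of `*.txt` and `*.ann` files.

  2. Use `pandas` to convert the `*.ann` file into a dataframe.

  3. Process the dataframe and convert the NER label information into the following format: `{ "text": "ner-text", "label": "ner-label", "start": 10, "end": 20 }`

  4. Convert the standard labels into `B-Tag` and `I-Tag`, where `B` marks the beginning of a tag and `I` marks the inside of a tag.

  5. Finally, create the JSON and upload it.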
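
The conversion steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the contents of `create_dataset.py` (which, per the description, uses `pandas`). The function name `parse_ann` and the sample annotation lines are hypothetical. The sketch assumes the usual brat standoff layout for text-bound annotations, `ID<TAB>LABEL START END<TAB>TEXT`, and skips every other annotation type.

```python
def parse_ann(ann_text):
    """Turn the text-bound ('T') lines of a brat .ann file into span dicts."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # events, relations, attributes and notes are ignored here
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line: no covered text
        header, text = parts[1], parts[2]
        label, offsets = header.split(" ", 1)
        # Discontinuous spans look like "10 15;20 25"; keep only the outer bounds.
        pieces = [piece.split() for piece in offsets.split(";")]
        start = int(pieces[0][0])
        end = int(pieces[-1][1])
        spans.append({"text": text, "label": label, "start": start, "end": end})
    return sorted(spans, key=lambda span: span["start"])


# Hypothetical annotation lines, shaped like the sample record's entities.
SAMPLE_ANN = (
    "T1\tAGE 8 19\t28-year-old\n"
    "T2\tSEX 39 42\tman\n"
    "E1\tCLINICAL_EVENT:T3\n"
    "T3\tCLINICAL_EVENT 43 52\tpresented\n"
)

for span in parse_ann(SAMPLE_ANN):
    print(span)
```

Collapsing discontinuous spans to their outer bounds is a simplifying assumption of this sketch; whether the published conversion does the same is not stated in the source.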

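Source data

The ZIP-compressed file contains 200 source documents (one sentence per line) and 200 annotation documents (in brat standoff format). Documents are named using PubMed document IDs; for example, "15939911.txt" contains the text of the article "A young man with palpitations and Ebstein's anomaly of the tricuspid valve" by Marcu and Donohue. All annotations were created manually.

"MACCROBAT2020" is the second release of this dataset, following "MACCROBAT2018". The consistency and format of the annotations have been improved in the newest release.

Dataset structure

The dataset contains the following fields:

  • full_text: the complete text of the document.
  • ner_info: the NER annotations, each with the entity text, label, start offset, and end offset.
  • tokens: the text split into chunks; in the sample record each annotated entity span is one token and the text between entities forms the remaining tokens.
  • ner_labels: the NER label indices, one per token, pointing into the NER label list.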
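
How the fields fit together can be illustrated with the first few tokens of the sample record from the dataset card. This is a minimal sketch: `ID2LABEL` is a hand-picked subset of the label list (an assumption made for brevity; real code would index into the full `NER_lables` list), and `decode` is a hypothetical helper, not part of the dataset or of the `datasets` library.

```python
# Subset of the label list, keyed by index (assumption: ids are list positions).
ID2LABEL = {0: "O", 5: "B-AGE", 13: "B-CLINICAL_EVENT", 39: "B-HISTORY", 65: "B-SEX"}

# First eight tokens and label ids of the sample record, with matching text.
record = {
    "full_text": "CASE: A 28-year-old previously healthy man presented",
    "tokens": ["CASE: A ", "28-year-old", " ", "previously healthy",
               " ", "man", " ", "presented"],
    "ner_labels": [0, 5, 0, 39, 0, 65, 0, 13],
}


def decode(rec, id2label):
    """Pair every token with its label name."""
    if len(rec["tokens"]) != len(rec["ner_labels"]):
        raise ValueError("tokens and ner_labels must have the same length")
    return [(tok, id2label[i]) for tok, i in zip(rec["tokens"], rec["ner_labels"])]


pairs = decode(record, ID2LABEL)
entities = [(tok, lab) for tok, lab in pairs if lab != "O"]
joined = "".join(record["tokens"])

print(entities)
print(joined == record["full_text"])
```

In this excerpt the tokens concatenate back to the text exactly, whitespace included. Whether that holds for every record is an assumption worth checking before relying on it.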


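Collected summary
Dataset introduction
Construction
In the field of biomedical information extraction, the MACCROBAT dataset is derived from clinical case report text in PubMed Central and was built through curation and manual annotation. The original release contains 200 plain-text documents, one sentence per line, paired with 200 annotation files in brat standoff format. During conversion, the text and annotation files are read, the annotations are turned into a structured dataframe with the pandas library, and the result is processed into JSON records holding each entity's text, label, and start and end offsets. The labels follow the BIO scheme: standard labels are converted into B-Tag and I-Tag to distinguish the beginning of an entity from its inside, yielding a structured dataset suited to named entity recognition.
Characteristics
The dataset focuses on biomedical named entity recognition and covers clinical case reports with a wide range of entity categories, including age, history, signs and symptoms, diagnostic procedures, diseases and disorders, and biological structures. The dataset card's `size_categories` tag reads 1M<n<10M, although the loaded `train` split has 200 rows, one per source document. The annotations are fine-grained, with explicit entity boundaries and categories, and the source notes that annotation consistency was improved in the 2020 release. The labels use BIO format, which supports training and evaluating sequence labeling models. Each record provides the original text, the token sequence, the label ids, and the full list of entity annotations, which makes the dataset a useful base resource for biomedical natural language processing research.
Usage
The dataset can be loaded directly with the Hugging Face `datasets` library, which simplifies preprocessing. Once loaded, it appears as a `DatasetDict` with a single `train` split whose features are `ner_labels`, `tokens`, `full_text`, and `ner_info`. Users can fine-tune models on it for named entity recognition in the biomedical domain. The entity annotations support custom label mappings and conversion to sequence labels, which makes the data straightforward to integrate into existing natural language processing frameworks. Records can be sliced and batched freely for training deep learning models; because only a `train` split is provided, validation and test sets have to be split off by the user.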
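
One possible form of the sequence-label conversion mentioned above is sketched below. It assumes, based on the sample record, that each entity span is stored as a single multi-word token carrying a `B-` label. The hypothetical helper `to_word_level` splits such chunks on whitespace and gives the continuation words the matching `I-` label. Label ids are assumed to have been decoded to strings already.

```python
def to_word_level(tokens, labels):
    """Expand chunk-level tokens into whitespace-separated words with BIO tags."""
    words, tags = [], []
    for chunk, label in zip(tokens, labels):
        # Whitespace-only chunks split into nothing and are dropped.
        for position, word in enumerate(chunk.split()):
            words.append(word)
            if label == "O":
                tags.append("O")
            elif position == 0:
                tags.append(label)  # first word keeps the chunk's own label
            else:
                tags.append("I-" + label[2:])  # later words continue the entity
    return words, tags


# Chunk-level excerpt from the sample record, with labels already decoded.
chunk_tokens = ["CASE: A ", "28-year-old", " ", "previously healthy", " ", "man"]
chunk_labels = ["O", "B-AGE", "O", "B-HISTORY", "O", "B-SEX"]

words, tags = to_word_level(chunk_tokens, chunk_labels)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

Punctuation stays attached to its word in this sketch (for example `CASE:`), and a subword tokenizer would still need its own alignment step before fine-tuning.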
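Background and challenges
Background overview
In biomedical information extraction, named entity recognition (NER) is a core task for parsing unstructured clinical text and extracting key medical concepts. The MACCROBAT dataset was first released by J. Harry Caufield and colleagues in 2018 and updated to its second release in 2020, with the aim of providing fine-grained entity annotations for clinical case reports. It is drawn from 200 PubMed Central full-text documents, annotated manually, and covers dozens of entity categories such as age, symptoms, diseases, and anatomical structures, making it a valuable training resource for medical NER models. Beyond clinical natural language processing itself, it provides a data foundation for applications such as electronic health record analysis and medical knowledge graph construction.
Current challenges
The MACCROBAT dataset targets the difficulties of named entity recognition in clinical text: the diversity of medical terminology, fuzzy entity boundaries, and context-dependent ambiguity. In the sample record, for example, "accessory pathway" is labelled DISEASE_DISORDER in one sentence and BIOLOGICAL_STRUCTURE in another, so a model needs to understand the surrounding context rather than memorize surface forms. On the construction side, the main challenges are keeping annotations consistent and controlling their quality. Clinical text contains many specialist abbreviations, synonyms, and nested entities, so manual annotation has to cope with terminology that is hard to standardize and keep different annotators in agreement. In addition, converting the original brat format into a structured dataset requires mapping entity offsets and labels exactly, without losing information or introducing format errors. All of this bears directly on the reliability and usability of the dataset.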
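
The offset-mapping concern above can be checked mechanically. The sketch below is an illustration only: the helper name `find_bad_spans` is hypothetical, and the spans are taken from the beginning of the sample record, where `end` is exclusive (Python slice convention).

```python
def find_bad_spans(full_text, ner_info):
    """Return the annotations whose offsets do not reproduce their text."""
    return [
        ent for ent in ner_info
        if full_text[ent["start"]:ent["end"]] != ent["text"]
    ]


full_text = (
    "CASE: A 28-year-old previously healthy man presented "
    "with a 6-week history of palpitations.\n"
)
good_spans = [
    {"text": "28-year-old", "label": "AGE", "start": 8, "end": 19},
    {"text": "previously healthy", "label": "HISTORY", "start": 20, "end": 38},
    {"text": "man", "label": "SEX", "start": 39, "end": 42},
]
# A deliberately shifted span, to show what a conversion error looks like.
shifted = {"text": "man", "label": "SEX", "start": 40, "end": 43}

print(find_bad_spans(full_text, good_spans))
print(find_bad_spans(full_text, good_spans + [shifted]))
```

Running such a check over every record after conversion is one inexpensive way to catch off-by-one errors.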
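Common scenarios
Typical use cases
In biomedical natural language processing, the MACCROBAT dataset is commonly used to train and evaluate named entity recognition models. It is drawn from clinical case report text in PubMed Central, annotated manually in detail, and covers entity categories such as age, symptoms, diseases, and anatomical structures. Its standardized format lets researchers build deep learning models that automatically identify key clinical information in medical literature, improving the accuracy and efficiency of text mining.
Derived and related work
Work built around the dataset includes fine-tuning frameworks for biomedical NER based on pretrained models such as BERT, and transfer learning approaches for carrying entity annotations across corpora. This work has been used to examine how the dataset affects model generalization and has extended its use to downstream tasks such as clinical event extraction and relation discovery.
Recent research on the dataset
Latest research directions
In biomedical natural language processing, the MACCROBAT dataset is used as a resource for current research because of its fine-grained annotation of clinical case reports. It supports the development of fine-grained entity recognition models, particularly for complex medical terminology and context-dependent entity boundaries. Current work focuses on transfer learning with pretrained language models such as BioBERT and ClinicalBERT to improve generalization on low-resource medical text. Multi-task learning frameworks that combine graph neural networks with attention mechanisms are also being explored to recognize entities and infer their semantic relations at the same time, with the goal of building more complete clinical knowledge graphs. These directions are relevant to automated information extraction from electronic health records and provide a data foundation for precision medicine and clinical decision support systems.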
The above content was collected and summarized by 遇见数据集.