
dnamodel/xraydar-reports

Source: Hugging Face | Updated: 2026-03-18 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/dnamodel/xraydar-reports
Resource description:
---
license: other
license_name: non-commercial
task_categories:
- token-classification
- text-classification
tags:
- medical-nlp
- chest-x-ray
- radiology
- report-classification
- named-entity-recognition
- segmentation
language:
- en
pretty_name: X-Raydar Annotated Radiology Reports
size_categories:
- 10K<n<100K
---

# X-Raydar Annotated Radiology Reports

Manually annotated chest X-ray radiology reports for multi-label classification and span-level segmentation. Each report is annotated with **45 radiological finding categories** at both the report level and the token/span level.

- **Website:** [x-raydar.info](https://x-raydar.info)
- **X-ray classifier:** [dnamodel/xraydar-cv](https://huggingface.co/dnamodel/xraydar-cv)
- **Report classifier:** [dnamodel/xraydar-nlp](https://huggingface.co/dnamodel/xraydar-nlp)
- **CV code:** [gmontana/xraydar-cv](https://github.com/gmontana/xraydar-cv)
- **NLP code:** [gmontana/xraydar-nlp](https://github.com/gmontana/xraydar-nlp)

## Dataset Description

- **29,756 radiology reports** from chest X-ray examinations
- **45 finding categories** annotated at the span level
- **IOBE segmentation tags** (Outside, Begin, Inside, End) marking topically meaningful passages
- Reports are word-tokenized with span annotations indicating which words relate to which findings

This dataset was used to train and evaluate the RoBERTaX model described in the papers below.
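To make the IOBE tagging scheme concrete, here is a minimal sketch that groups a tag sequence into segments. The function name `iobe_segments` is hypothetical, and the card does not say how single-token segments or malformed sequences are tagged, so the tolerant handling below (closing an open segment at `O`, at a new `B`, or at the end of the sequence) is an assumption rather than the authors' method.

```python
from typing import List, Tuple


def iobe_segments(tags: List[str]) -> List[Tuple[int, int]]:
    """Group IOBE tags into (start, end) token index pairs, end inclusive.

    Assumption: a segment opens at "B", continues through "I", and closes
    at "E". Unterminated segments are closed leniently instead of raising.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the currently open segment began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # previous segment never saw an "E"
                segments.append((start, i - 1))
            start = i
        elif tag == "I":
            if start is None:  # "I" without a preceding "B"
                start = i
        elif tag == "E":
            segments.append((i if start is None else start, i))
            start = None
        else:  # "O" or any unexpected tag closes an open segment
            if start is not None:
                segments.append((start, i - 1))
                start = None
    if start is not None:  # sequence ended inside a segment
        segments.append((start, len(tags) - 1))
    return segments


example_tags = ["B", "I", "E", "O", "B", "E", "O"]
print(iobe_segments(example_tags))
```

Returning inclusive index pairs keeps the output easy to compare against token positions; whether that matches the convention used by the span annotations is not stated in the card.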
## Data Format

Each record in the JSONL file contains:

| Field | Type | Description |
|-------|------|-------------|
| `xray_id` | string | Unique identifier |
| `text` | string | Raw report text |
| `tokens` | list[string] | Whitespace-tokenized words |
| `iobe_tags` | list[string] | Per-token segmentation tags: O (outside), B (begin), I (inside), E (end) |
| `spans` | list[object] | Annotated spans: `{label, start, end}` with word-level indices |
| `labels` | list[string] | Report-level labels (findings present anywhere in the report) |

### Example

```json
{
  "xray_id": "314",
  "text": "Severe left lateral wall pain. The Horizontal fissure is pulled upwards...",
  "tokens": ["Severe", "left", "lateral", "wall", "pain.", ...],
  "iobe_tags": ["B", "I", "I", "I", "I", ...],
  "spans": [
    {"label": "apical_fibrosis", "start": 7, "end": 42},
    {"label": "normal", "start": 43, "end": 50},
    {"label": "volume_loss", "start": 7, "end": 42}
  ],
  "labels": ["apical_fibrosis", "normal", "other", "volume_loss"]
}
```

## Finding Categories (45 classes)

| # | Label | # | Label |
|---|-------|---|-------|
| 0 | abnormal_non_clinically_important | 23 | normal |
| 1 | aortic_calcification | 24 | object |
| 2 | apical_fibrosis | 25 | other |
| 3 | atelectasis | 26 | paraspinal_mass |
| 4 | axillary_abnormality | 27 | paratracheal_hilar_enlargement |
| 5 | bronchial_wall_thickening | 28 | parenchymal_lesion |
| 6 | bulla | 29 | pleural_abnormality |
| 7 | cardiomegaly | 30 | pleural_effusion |
| 8 | cavitating_lung_lesion | 31 | pneumomediastinum |
| 9 | clavicle_fracture | 32 | pneumoperitoneum |
| 10 | comparison | 33 | pneumothorax |
| 11 | consolidation | 34 | possible_diagnosis |
| 12 | coronary_calcification | 35 | recommendation |
| 13 | dextrocardia | 36 | rib_fracture |
| 14 | dilated_bowel | 37 | rib_lesion |
| 15 | emphysema | 38 | scoliosis |
| 16 | ground_glass_opacification | 39 | subcutaneous_emphysema |
| 17 | hemidiaphragm_elevated | 40 | technical_issue |
| 18 | hernia | 41 | undefined_sentence |
| 19 | hyperexpanded_lungs | 42 | unfolded_aorta |
| 20 | interstitial_shadowing | 43 | upper_lobe_blood_diversion |
| 21 | mediastinum_displaced | 44 | volume_loss |
| 22 | mediastinum_widened | | |

## Usage

```python
from huggingface_hub import hf_hub_download
import json

# Download the JSONL file from the dataset repository (cached locally).
path = hf_hub_download(
    repo_id="dnamodel/xraydar-reports",
    filename="xraydar-reports.jsonl",
    repo_type="dataset"
)

# Each line of the file is one JSON-encoded report record.
with open(path, encoding="utf-8") as f:
    reports = [json.loads(line) for line in f]

print(f"{len(reports)} reports loaded")
print(reports[0]["text"][:100])
print(reports[0]["labels"])
```

## Citation

If you use this dataset, please cite both papers:

```bibtex
@inproceedings{zhu2024multitask,
  title={A Multi-Task Transformer Model for Fine-grained Labelling of Chest {X}-Ray Reports},
  author={Zhu, Yuanyi and Liakata, Maria and Montana, Giovanni},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={862--875},
  year={2024},
  address={Torino, Italia},
  publisher={ELRA and ICCL}
}

@article{cid2024development,
  title={Development and validation of open-source deep neural networks for comprehensive chest x-ray reading: a retrospective, multicentre study},
  author={Cid, Yan Digilov and Macpherson, Matt and others},
  journal={The Lancet Digital Health},
  volume={6},
  number={1},
  pages={e44--e57},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/S2589-7500(23)00218-2}
}
```

## License

For academic research and non-commercial evaluation only. See [x-raydar.info](https://x-raydar.info) for terms and conditions.

## Contact

Giovanni Montana: [g.montana@warwick.ac.uk](mailto:g.montana@warwick.ac.uk)
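Following on from the `spans` and `labels` fields described under Data Format, the sketch below shows one way to recover the words covered by a span and to turn report-level labels into a multi-hot vector for multi-label classification. The helper names `span_text` and `multi_hot`, the toy record, and the three-class subset are all hypothetical. The card does not say whether a span's `end` index is inclusive; the example record has one span ending at 42 and the next starting at 43, which suggests inclusive, so that is the default here, labeled as an assumption.

```python
from typing import Dict, List

# Toy record in the documented shape; the values are made up for illustration.
record = {
    "tokens": ["Heart", "size", "is", "enlarged.", "No", "pneumothorax."],
    "spans": [
        {"label": "cardiomegaly", "start": 0, "end": 3},
        {"label": "normal", "start": 4, "end": 5},
    ],
    "labels": ["cardiomegaly", "normal"],
}

# Small subset of the 45 class names; use the full list from the table in practice.
CLASS_NAMES = ["cardiomegaly", "normal", "pneumothorax"]


def span_text(tokens: List[str], span: Dict, end_inclusive: bool = True) -> str:
    """Join the tokens covered by a span.

    Assumption: `end` is inclusive; pass end_inclusive=False if it is not.
    """
    stop = span["end"] + 1 if end_inclusive else span["end"]
    return " ".join(tokens[span["start"]:stop])


def multi_hot(labels: List[str], class_names: List[str]) -> List[int]:
    """Encode report-level labels as a 0/1 vector ordered like class_names."""
    index = {name: i for i, name in enumerate(class_names)}
    vector = [0] * len(class_names)
    for label in labels:
        vector[index[label]] = 1  # raises KeyError on an unknown label
    return vector


for span in record["spans"]:
    print(span["label"], "->", span_text(record["tokens"], span))
print(multi_hot(record["labels"], CLASS_NAMES))
```

In the example record shown in the card, `labels` contains `other` although no span carries that label, so report-level targets are better taken from `labels` directly than derived as the union of span labels.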
Provider: dnamodel