dnamodel/xraydar-reports
Source: Hugging Face. Updated 2026-03-18; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/dnamodel/xraydar-reports
Resource description:
---
license: other
license_name: non-commercial
task_categories:
- token-classification
- text-classification
tags:
- medical-nlp
- chest-x-ray
- radiology
- report-classification
- named-entity-recognition
- segmentation
language:
- en
pretty_name: X-Raydar Annotated Radiology Reports
size_categories:
- 10K<n<100K
---
# X-Raydar Annotated Radiology Reports
Manually annotated chest X-ray radiology reports for multi-label classification and span-level segmentation. Each report is annotated with **45 radiological finding categories** at both the report level and the token/span level.
**Website:** [x-raydar.info](https://x-raydar.info)
**X-ray classifier:** [dnamodel/xraydar-cv](https://huggingface.co/dnamodel/xraydar-cv)
**Report classifier:** [dnamodel/xraydar-nlp](https://huggingface.co/dnamodel/xraydar-nlp)
**CV code:** [gmontana/xraydar-cv](https://github.com/gmontana/xraydar-cv)
**NLP code:** [gmontana/xraydar-nlp](https://github.com/gmontana/xraydar-nlp)
## Dataset Description
- **29,756 radiology reports** from chest X-ray examinations
- **45 finding categories** annotated at the span level
- **IOBE segmentation tags** (Outside, Begin, Inside, End) marking topically meaningful passages
- Reports are word-tokenized with span annotations indicating which words relate to which findings
This dataset was used to train and evaluate the RoBERTaX model described in the papers below.
## Data Format
Each record in the JSONL file contains:
| Field | Type | Description |
|-------|------|-------------|
| `xray_id` | string | Unique identifier |
| `text` | string | Raw report text |
| `tokens` | list[string] | Whitespace-tokenized words |
| `iobe_tags` | list[string] | Per-token segmentation tags: O (outside), B (begin), I (inside), E (end) |
| `spans` | list[object] | Annotated spans: `{label, start, end}` with word-level indices |
| `labels` | list[string] | Report-level labels (findings present anywhere in the report) |
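
The `iobe_tags` field can be turned back into passage boundaries with a small decoder. The following is a minimal sketch, not the authors' code: the function name `iobe_to_segments` is hypothetical, and it assumes that a passage opens at `B`, continues through `I`, and closes at `E`. It also tolerates passages that are left open, which the dataset may or may not contain.

```python
from typing import List, Tuple


def iobe_to_segments(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) word-index pairs, end inclusive, one per tagged passage.

    Assumption: B opens a passage, I continues it, E closes it, O is outside.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the currently open passage, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new passage begins; close any passage left open without an E tag.
            if start is not None:
                segments.append((start, i - 1))
            start = i
        elif tag == "E":
            # Close the open passage; a stray E becomes a one-token passage.
            segments.append((start if start is not None else i, i))
            start = None
        elif tag == "O":
            # Outside token; close any passage left open without an E tag.
            if start is not None:
                segments.append((start, i - 1))
                start = None
        # "I" continues the open passage; an "I" with no open passage is ignored.
    if start is not None:
        # The tag sequence ended while a passage was still open.
        segments.append((start, len(tags) - 1))
    return segments
```

Applied to a record, `iobe_to_segments(record["iobe_tags"])` yields index pairs that can be used to slice `record["tokens"]`.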
### Example
```json
{
  "xray_id": "314",
  "text": "Severe left lateral wall pain. The Horizontal fissure is pulled upwards...",
  "tokens": ["Severe", "left", "lateral", "wall", "pain.", ...],
  "iobe_tags": ["B", "I", "I", "I", "I", ...],
  "spans": [
    {"label": "apical_fibrosis", "start": 7, "end": 42},
    {"label": "normal", "start": 43, "end": 50},
    {"label": "volume_loss", "start": 7, "end": 42}
  ],
  "labels": ["apical_fibrosis", "normal", "other", "volume_loss"]
}
```
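Because `spans` stores word-level indices, the text behind each finding can be recovered by slicing `tokens`. The sketch below is illustrative only. The helper names `span_text` and `spans_by_label` are hypothetical, and the card does not say whether `end` is inclusive. In the example above one span ends at 42 and the next starts at 43, which is consistent with inclusive ends, so that is the default here; pass `end_inclusive=False` if inspection of the data shows otherwise. The toy record is made up for demonstration and is not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, List


def span_text(record: dict, span: dict, end_inclusive: bool = True) -> str:
    """Join the tokens covered by one span back into a string."""
    # Assumption: `end` is an inclusive word index unless told otherwise.
    stop = span["end"] + 1 if end_inclusive else span["end"]
    return " ".join(record["tokens"][span["start"]:stop])


def spans_by_label(record: dict, end_inclusive: bool = True) -> Dict[str, List[str]]:
    """Group the text of every annotated span under its finding label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for span in record["spans"]:
        grouped[span["label"]].append(span_text(record, span, end_inclusive))
    return dict(grouped)


# Hypothetical toy record that follows the documented schema.
toy_record = {
    "tokens": ["Heart", "size", "is", "enlarged.", "No", "pneumothorax."],
    "spans": [
        {"label": "cardiomegaly", "start": 0, "end": 3},
        {"label": "normal", "start": 4, "end": 5},
    ],
}

print(spans_by_label(toy_record))
```

Note that several labels can share the same span, as `apical_fibrosis` and `volume_loss` do in the example record, so the same passage may appear under more than one key.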
## Finding Categories (45 classes)
| # | Label | # | Label |
|---|-------|---|-------|
| 0 | abnormal_non_clinically_important | 23 | normal |
| 1 | aortic_calcification | 24 | object |
| 2 | apical_fibrosis | 25 | other |
| 3 | atelectasis | 26 | paraspinal_mass |
| 4 | axillary_abnormality | 27 | paratracheal_hilar_enlargement |
| 5 | bronchial_wall_thickening | 28 | parenchymal_lesion |
| 6 | bulla | 29 | pleural_abnormality |
| 7 | cardiomegaly | 30 | pleural_effusion |
| 8 | cavitating_lung_lesion | 31 | pneumomediastinum |
| 9 | clavicle_fracture | 32 | pneumoperitoneum |
| 10 | comparison | 33 | pneumothorax |
| 11 | consolidation | 34 | possible_diagnosis |
| 12 | coronary_calcification | 35 | recommendation |
| 13 | dextrocardia | 36 | rib_fracture |
| 14 | dilated_bowel | 37 | rib_lesion |
| 15 | emphysema | 38 | scoliosis |
| 16 | ground_glass_opacification | 39 | subcutaneous_emphysema |
| 17 | hemidiaphragm_elevated | 40 | technical_issue |
| 18 | hernia | 41 | undefined_sentence |
| 19 | hyperexpanded_lungs | 42 | unfolded_aorta |
| 20 | interstitial_shadowing | 43 | upper_lobe_blood_diversion |
| 21 | mediastinum_displaced | 44 | volume_loss |
| 22 | mediastinum_widened | | |
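
For multi-label classification, the report-level `labels` list is usually converted to a fixed-length binary vector. The sketch below builds such a vector from the table above. It assumes that the numbers in the table are the intended class indices, which the card does not state explicitly, and the names `LABELS`, `LABEL_TO_ID`, and `multi_hot` are hypothetical.

```python
from typing import Iterable, List

# The 45 finding categories, in the order given in the table (index 0 to 44).
LABELS = [
    "abnormal_non_clinically_important", "aortic_calcification", "apical_fibrosis",
    "atelectasis", "axillary_abnormality", "bronchial_wall_thickening",
    "bulla", "cardiomegaly", "cavitating_lung_lesion",
    "clavicle_fracture", "comparison", "consolidation",
    "coronary_calcification", "dextrocardia", "dilated_bowel",
    "emphysema", "ground_glass_opacification", "hemidiaphragm_elevated",
    "hernia", "hyperexpanded_lungs", "interstitial_shadowing",
    "mediastinum_displaced", "mediastinum_widened", "normal",
    "object", "other", "paraspinal_mass",
    "paratracheal_hilar_enlargement", "parenchymal_lesion", "pleural_abnormality",
    "pleural_effusion", "pneumomediastinum", "pneumoperitoneum",
    "pneumothorax", "possible_diagnosis", "recommendation",
    "rib_fracture", "rib_lesion", "scoliosis",
    "subcutaneous_emphysema", "technical_issue", "undefined_sentence",
    "unfolded_aorta", "upper_lobe_blood_diversion", "volume_loss",
]

# Map each label name to its position in the list above.
LABEL_TO_ID = {name: index for index, name in enumerate(LABELS)}


def multi_hot(labels: Iterable[str]) -> List[int]:
    """Encode a list of label names as a 45-dimensional 0/1 vector."""
    vector = [0] * len(LABELS)
    for name in labels:
        # Raises KeyError on an unknown label, which surfaces schema mismatches early.
        vector[LABEL_TO_ID[name]] = 1
    return vector


# Labels taken from the example record in the Data Format section.
target = multi_hot(["apical_fibrosis", "normal", "other", "volume_loss"])
print([i for i, value in enumerate(target) if value])
```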
## Usage
```python
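import json

from huggingface_hub import hf_hub_download

# Download the JSONL file from the dataset repository (cached after the first call).
path = hf_hub_download(
    repo_id="dnamodel/xraydar-reports",
    filename="xraydar-reports.jsonl",
    repo_type="dataset",
)

# Each non-empty line of the file is one JSON record.
with open(path, encoding="utf-8") as f:
    reports = [json.loads(line) for line in f if line.strip()]

print(f"{len(reports)} reports loaded")
print(reports[0]["text"][:100])
print(reports[0]["labels"])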
```
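
Once the records are in memory, a quick look at the label distribution is often a useful first check before training. The following sketch counts how many reports carry each report-level label. The function name `label_counts` is hypothetical, and the two toy records are made up so that the block runs without downloading anything; in practice the `reports` list loaded from the JSONL file would be passed in instead.

```python
from collections import Counter
from typing import Iterable


def label_counts(reports: Iterable[dict]) -> Counter:
    """Count the number of reports in which each report-level label appears."""
    counts: Counter = Counter()
    for report in reports:
        # set() guards against a label being listed twice in the same report.
        counts.update(set(report["labels"]))
    return counts


# Hypothetical toy records that follow the documented schema.
toy_reports = [
    {"labels": ["normal"]},
    {"labels": ["normal", "other"]},
]

for label, count in label_counts(toy_reports).most_common():
    print(f"{label}\t{count}")
```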
## Citation
If you use this dataset, please cite both papers:
```bibtex
@inproceedings{zhu2024multitask,
  title={A Multi-Task Transformer Model for Fine-grained Labelling of Chest {X}-Ray Reports},
  author={Zhu, Yuanyi and Liakata, Maria and Montana, Giovanni},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics,
             Language Resources and Evaluation (LREC-COLING 2024)},
  pages={862--875},
  year={2024},
  address={Torino, Italia},
  publisher={ELRA and ICCL}
}

@article{cid2024development,
  title={Development and validation of open-source deep neural networks for comprehensive chest
         x-ray reading: a retrospective, multicentre study},
  author={Cid, Yan Digilov and Macpherson, Matt and others},
  journal={The Lancet Digital Health},
  volume={6},
  number={1},
  pages={e44--e57},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/S2589-7500(23)00218-2}
}
```
## License
For academic research and non-commercial evaluation only. See [x-raydar.info](https://x-raydar.info) for terms and conditions.
## Contact
Giovanni Montana — [g.montana@warwick.ac.uk](mailto:g.montana@warwick.ac.uk)
Provider: dnamodel



