didsr/msynth

Source: Hugging Face. Updated 2024-04-10; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/didsr/msynth
---
license: cc0-1.0
task_categories:
- image-classification
- image-segmentation
tags:
- medical
pretty_name: M-SYNTH
size_categories:
- 10K<n<100K
---

# M-SYNTH

M-SYNTH is a synthetic digital mammography (DM) dataset with four breast fibroglandular density distributions imaged using Monte Carlo x-ray simulations with the publicly available [Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE)](https://github.com/DIDSR/VICTRE) toolkit.

## Dataset Details

The dataset has the following characteristics:

* Breast density: dense, heterogeneously dense, scattered, fatty
* Mass radius (mm): 5.00, 7.00, 9.00
* Mass density: 1.0, 1.06, 1.1 (ratio of radiodensity of the mass to that of fibroglandular tissue)
* Relative dose: 20%, 40%, 60%, 80%, 100% of the clinically recommended dose for each density

<p align="center">
<img src='https://raw.githubusercontent.com/DIDSR/msynth-release/main/images/examples.png' width='700'>
</p>

### Dataset Description

- **Curated by:** [Elena Sizikova](https://esizikova.github.io/), [Niloufar Saharkhiz](https://www.linkedin.com/in/niloufar-saharkhiz/), [Diksha Sharma](https://www.linkedin.com/in/diksha-sharma-6059977/), [Miguel Lago](https://www.linkedin.com/in/milaan/), [Berkman Sahiner](https://www.linkedin.com/in/berkman-sahiner-6aa9a919/), [Jana Gut Delfino](https://www.linkedin.com/in/janadelfino/), [Aldo Badano](https://www.linkedin.com/in/aldobadano/)
- **License:** Creative Commons 1.0 Universal License (CC0)

### Dataset Sources

- **Code:** [https://github.com/DIDSR/msynth-release](https://github.com/DIDSR/msynth-release)
- **Paper:** [https://arxiv.org/pdf/2310.18494.pdf](https://arxiv.org/pdf/2310.18494.pdf)
- **Demo:** [https://github.com/DIDSR/msynth-release/tree/master/examples](https://github.com/DIDSR/msynth-release/tree/master/examples)

## Uses

M-SYNTH is intended to facilitate testing of AI with pre-computed synthetic mammography data.

### Direct Use

M-SYNTH can be used to evaluate the effect of mass size and density, breast density, and dose on AI performance in lesion detection. It can be used either to train AI models or to test pre-trained ones.

### Out-of-Scope Use

M-SYNTH cannot be used in lieu of real patient examples to make performance determinations.

## Dataset Structure

M-SYNTH is organized into a directory structure that encodes the simulation parameters. The folder

```
device_data_VICTREPhantoms_spic_[LESION_DENSITY]/[DOSE]/[BREAST_DENSITY]/2/[LESION_SIZE]/SIM/P2_[LESION_SIZE]_[BREAST_DENSITY].8337609.[PHANTOM_FILE_ID]/[PHANTOM_FILEID]/
```

contains image files imaged with the specified parameters. Note that only examples with an odd PHANTOM_FILEID contain lesions; the others do not.
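The folder layout above can be turned into structured parameters with a few lines of path handling. The following is a minimal sketch, not part of the M-SYNTH release: the function name `parse_msynth_path` and the returned key names are hypothetical, and the code assumes that every path follows the template exactly.

```python
from pathlib import PurePosixPath

# Prefix of the top-level folder; the lesion density is appended to it.
ROOT_PREFIX = "device_data_VICTREPhantoms_spic_"


def parse_msynth_path(path):
    """Split an M-SYNTH device-data folder path into its parameters.

    Hypothetical helper: assumes the path follows the documented template
    ROOT/[DOSE]/[BREAST_DENSITY]/2/[LESION_SIZE]/SIM/P2_.../[PHANTOM_FILEID].
    """
    parts = PurePosixPath(path).parts
    # Find the top-level folder wherever it sits in the path.
    start = next(i for i, p in enumerate(parts) if p.startswith(ROOT_PREFIX))
    root, dose, breast_density, _two, lesion_size, _sim, _phantom, file_id = (
        parts[start:start + 8]
    )
    phantom_file_id = int(file_id)
    return {
        "lesion_density": float(root[len(ROOT_PREFIX):]),
        "dose": dose,  # kept as the raw folder name, e.g. "1.02e10"
        "breast_density": breast_density,
        "lesion_size": float(lesion_size),
        "phantom_file_id": phantom_file_id,
        # Per the dataset card, only odd file IDs contain a lesion.
        "has_lesion": phantom_file_id % 2 == 1,
    }
```

Applied to a folder whose final component is `1`, the helper reports `has_lesion` as true; a path that does not match the template raises an error rather than returning partial values.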
```
$ tree data/device_data_VICTREPhantoms_spic_1.0/1.02e10/hetero/2/5.0/SIM/P2_5.0_hetero.8337609.1/1/
data/device_data_VICTREPhantoms_spic_1.0/1.02e10/hetero/2/5.0/SIM/P2_5.0_hetero.8337609.1/1/
├── DICOM_dm
│   └── 000.dcm
├── projection_DM1.loc
├── projection_DM1.mhd
└── projection_DM1.raw
```

Each folder contains mammogram data that can be read from .raw format (the .mhd file contains supporting data) or from DICOM (.dcm) format. Coordinates of lesions can be found in .loc files. Segmentations are stored in .raw format and can be found in data/segmentation_masks/* .

See [GitHub](https://github.com/DIDSR/msynth-release/tree/main/code) for examples of how to access the files, and [examples](https://github.com/DIDSR/msynth-release/tree/main/examples) for code to load each type of file.

## Bias, Risks, and Limitations

Simulation-based testing is constrained to the parameter variability represented in the object model and the acquisition system. There is a risk of misjudging model performance if the simulated examples do not capture the variability in real patients. Please see the paper for a full discussion of biases, risks, and limitations.

## How to use it

The msynth dataset is very large, so for most use cases it is recommended to use the streaming API of `datasets`.
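As a rough illustration of the streaming recommendation, the sketch below wraps `load_dataset(..., streaming=True)` and reads only the first few examples. The helper names `take` and `stream_msynth` are hypothetical, the `device_data` configuration is used only as an example, and the code assumes the split name matches the configuration name. Whether this repository supports streaming with your installed `datasets` version is also an assumption; script-based repositories may need extra arguments.

```python
from itertools import islice


def take(examples, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(examples, n))


def stream_msynth(config="device_data"):
    """Open one M-SYNTH configuration as a streaming (iterable) dataset.

    Hypothetical helper; nothing is downloaded until the result is iterated.
    """
    # Imported lazily so that defining the helper does not require the library.
    from datasets import load_dataset

    # Assumption: the split is named after the configuration.
    return load_dataset("didsr/msynth", config, split=config, streaming=True)
```

Calling `take(stream_msynth(), 3)` would then fetch three examples instead of materializing the whole configuration on disk.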
The msynth dataset has three configurations: 1) device_data, 2) segmentation_mask, and 3) metadata.

You can load and iterate through the dataset using the configurations with the following lines of code:

```python
from datasets import load_dataset

# Load device data for all breast densities, mass radii, mass densities, and relative doses.
# Change the configuration to 'segmentation_mask' or 'metadata' to load the
# segmentation masks or the bounds information instead.
ds_data = load_dataset("didsr/msynth", 'device_data')
print(ds_data["device_data"])

# A sample data instance
# {'Raw': '~\\.cache\\huggingface\\datasets\\downloads\\extracted\\59384cf05fc44e8c0cb23bb19e1fcd8f0c39720b282109d204a85561fe66bdb1\\SIM\\P2_5.0_fatty.8336179.1\\1\\projection_DM1.raw',
#  'mhd': '~/.cache/huggingface/datasets/downloads/extracted/59384cf05fc44e8c0cb23bb19e1fcd8f0c39720b282109d204a85561fe66bdb1/SIM/P2_5.0_fatty.8336179.1/1\\projection_DM1.mhd',
#  'loc': '~/.cache/huggingface/datasets/downloads/extracted/59384cf05fc44e8c0cb23bb19e1fcd8f0c39720b282109d204a85561fe66bdb1/SIM/P2_5.0_fatty.8336179.1/1\\projection_DM1.loc',
#  'dcm': '~/.cache/huggingface/datasets/downloads/extracted/59384cf05fc44e8c0cb23bb19e1fcd8f0c39720b282109d204a85561fe66bdb1/SIM/P2_5.0_fatty.8336179.1/1\\DICOM_dm\\000.dcm',
#  'density': 'fatty',
#  'mass_radius': 5.0}
```

The msynth dataset can also be loaded with custom breast density, mass radius, mass density, and relative dose settings:

```python
from datasets import load_dataset

# Dataset properties. Change a value to 'all' to include all values of
# breast density, mass radius, mass density, or relative dose.
config_kwargs = {
    "lesion_density": ["1.0"],
    "dose": ["20%"],
    "density": ["fatty"],
    "size": ["5.0"]
}

# Loading device data
ds_data = load_dataset("didsr/msynth", 'device_data', **config_kwargs)

# Loading segmentation masks
ds_seg = load_dataset("didsr/msynth", 'segmentation_mask', **config_kwargs)
```

The metadata can also be loaded using the `datasets` API.
An example of using metadata is given in **Demo:** [https://github.com/DIDSR/msynth-release/tree/master/examples](https://github.com/DIDSR/msynth-release/tree/master/examples)

```python
from datasets import load_dataset

# Loading metadata
ds_meta = load_dataset("didsr/msynth", 'metadata')

# A sample data instance
ds_meta['metadata'][0]

# Output
# {'fatty': '~\\.cache\\huggingface\\datasets\\downloads\\extracted\\3ea85fc6b3fcc253ac8550b5d1b21db406ca9a59ea125ff8fc63d9b754c88348\\bounds\\bounds_fatty.npy',
#  'dense': '~\\.cache\\huggingface\\datasets\\downloads\\extracted\\3ea85fc6b3fcc253ac8550b5d1b21db406ca9a59ea125ff8fc63d9b754c88348\\bounds\\bounds_dense.npy',
#  'hetero': '~\\.cache\\huggingface\\datasets\\downloads\\extracted\\3ea85fc6b3fcc253ac8550b5d1b21db406ca9a59ea125ff8fc63d9b754c88348\\bounds\\bounds_hetero.npy',
#  'scattered': '~\\.cache\\huggingface\\datasets\\downloads\\extracted\\3ea85fc6b3fcc253ac8550b5d1b21db406ca9a59ea125ff8fc63d9b754c88348\\bounds\\bounds_scattered.npy'}
```

## Citation

```
@article{sizikova2023knowledge,
  title={Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses},
  author={Sizikova, Elena and Saharkhiz, Niloufar and Sharma, Diksha and Lago, Miguel and Sahiner, Berkman and Delfino, Jana G. and Badano, Aldo},
  journal={Advances in Neural Information Processing Systems},
  volume={},
  pages={},
  year={2023}
}
```

## Related Links

1. [Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE)](https://www.fda.gov/medical-devices/science-and-research-medical-devices/victre-silico-breast-imaging-pipeline).
2. [FDA Catalog of Regulatory Science Tools to Help Assess New Medical Devices](https://www.fda.gov/medical-devices/science-and-research-medical-devices/catalog-regulatory-science-tools-help-assess-new-medical-devices).
3. A. Badano, C. G. Graff, A. Badal, D. Sharma, R. Zeng, F. W. Samuelson, S. Glick, K. J. Myers. [Evaluation of Digital Breast Tomosynthesis as Replacement of Full-Field Digital Mammography Using an In Silico Imaging Trial](http://dx.doi.org/10.1001/jamanetworkopen.2018.5474). JAMA Network Open 2018.
4. A. Badano, M. Lago, E. Sizikova, J. G. Delfino, S. Guan, M. A. Anastasio, B. Sahiner. [The stochastic digital human is now enrolling for in silico imaging trials—methods and tools for generating digital cohorts.](http://dx.doi.org/10.1088/2516-1091/ad04c0) Progress in Biomedical Engineering 2023.
5. E. Sizikova, N. Saharkhiz, D. Sharma, M. Lago, B. Sahiner, J. G. Delfino, A. Badano. [Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI](https://github.com/DIDSR/msynth-release). NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI.

M-SYNTH is a synthetic digital mammography (DM) dataset with four breast fibroglandular density distributions imaged using Monte Carlo x-ray simulations with the publicly available [Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE)](https://github.com/DIDSR/VICTRE) toolkit. The dataset has the following characteristics: breast density (dense, heterogeneously dense, scattered, fatty), mass radius (5.00mm, 7.00mm, 9.00mm), mass density (1.0, 1.06, 1.1, ratio of radiodensity of the mass to that of fibroglandular tissue), relative dose (20%, 40%, 60%, 80%, 100% of the clinically recommended dose for each density). M-SYNTH is intended to facilitate testing of AI with pre-computed synthetic mammography data.
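The four characteristics listed above span 4 × 3 × 3 × 5 = 180 parameter combinations. The sketch below enumerates that grid as dictionaries keyed the way the dataset card's `config_kwargs` example is (`lesion_density`, `dose`, `density`, `size`). Only `"1.0"`, `"20%"`, `"fatty"`, and `"5.0"` are spelled as in that example; the remaining value strings, and the name `parameter_grid`, are assumptions made for illustration.

```python
from itertools import product

# Value spellings are assumptions except "fatty", "5.0", "1.0", and "20%".
BREAST_DENSITIES = ["dense", "hetero", "scattered", "fatty"]
MASS_RADII = ["5.0", "7.0", "9.0"]
MASS_DENSITIES = ["1.0", "1.06", "1.1"]
RELATIVE_DOSES = ["20%", "40%", "60%", "80%", "100%"]


def parameter_grid():
    """Yield one keyword dictionary per combination of the four parameters."""
    for density, size, lesion_density, dose in product(
        BREAST_DENSITIES, MASS_RADII, MASS_DENSITIES, RELATIVE_DOSES
    ):
        yield {
            "lesion_density": [lesion_density],
            "dose": [dose],
            "density": [density],
            "size": [size],
        }


grid = list(parameter_grid())
print(len(grid))  # 180
```

Each dictionary could be passed as keyword arguments when loading a configuration, which makes it straightforward to sweep model performance over one parameter while holding the others fixed.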
Provider: didsr
