A Dataset for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy

DataCite Commons, updated 2025-08-04, indexed 2026-05-05
Download link:
https://rodare.hzdr.de/record/3900
Description:
<strong>How to cite us</strong><br> Wyrzykowska, Maria, Gabriel Della Maggiora, Nikita Deshpande, Ashkan Mokarian, and Artur Yakimovich. "A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy." <em>Scientific Data</em> 12, no. 1 (2025): 1-11. <pre><code>@article{wyrzykowska2025benchmark, title={A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy}, author={Wyrzykowska, Maria and Della Maggiora, Gabriel and Deshpande, Nikita and Mokarian, Ashkan and Yakimovich, Artur}, journal={Scientific Data}, volume={12}, number={1}, pages={1--11}, year={2025}, publisher={Nature Publishing Group} }</code></pre> <strong>Data sources</strong><br> The raw data used in the study can be found in the corresponding references.<br> VACV: Yakimovich A, Andriasyan V, Witte R, Wang IH, Prasad V, Suomalainen M, Greber UF. Plaque2.0-A High-Throughput Analysis Framework to Score Virus-Cell Transmission and Clonal Cell Expansion. PLoS One. 2015 Sep 28;10(9):e0138760. doi: 10.1371/journal.pone.0138760. PMID: 26413745; PMCID: PMC4587671.<br> HADV: Andriasyan V, Yakimovich A, Petkidis A, Georgi F, Witte R, Puntener D, Greber UF. Microscopy deep learning predicts virus infections and reveals the mechanics of lytic-infected cells. iScience. 2021 May 15;24(6):102543. doi: 10.1016/j.isci.2021.102543. PMID: 34151222; PMCID: PMC8192562.<br> HSV, IAV, RV: Olszewski, D., Georgi, F., Murer, L. et al. High-content, arrayed compound screens with rhinovirus, influenza A virus and herpes simplex virus infections. Sci Data 9, 610 (2022). https://doi.org/10.1038/s41597-022-01733-4<br><br> <strong>Data organisation</strong><br> For each virus (HADV, VACV, IAV, RV and HSV), we provide the processed data in a separate directory, divided into three subdirectories, `train`, `val` and `test`, which contain the proposed data split. 
Each of the subfolders contains two npy files, `x.npy` and `y.npy`: `x.npy` contains the fluorescence or brightfield signal of the cells or nuclei (both for HADV, as separate channels), and `y.npy` contains the viral signal. The data is already processed as described in the <em>Data preparation</em> section. Additionally, Cellpose masks for the test data are available in a separate `masks` directory. For each virus except VACV, there is a `test` subdirectory containing nuclei masks (`nuc.npy`). For HADV, cell masks are also available (`cell.npy`).<br><br> <strong>Data preparation</strong><br> Each VACV plaque was imaged as 9 files per channel, which need to be stitched to recreate the whole plaque. The multiview-stitcher toolbox was used for this. Stitching was first performed on the third channel, which holds the brightfield microscopy image of the samples. The parameters found for this channel were then used to stitch the remaining channels. The VACV dataset is a time-lapse, from which timesteps 100, 108 and 115 were selected to produce the data used in the experiments. Images were center-cropped to 5948x6048 to match the size of the smallest image in the dataset (rounded down to the closest multiple of 2). The data was additionally filtered by hand to remove the samples that contained only uninfected cells (C02, C07, D02, D07, E02, E07, F02, F07). The HAdV dataset is also a time-lapse, from which only the last timestep (the 49th) was selected. For the remaining datasets (HSV, IAV, RV), only the negative control data was used, selected as follows: from the data collected at the University of Zürich, only the first 2 columns of the Screen samples and only the first 12 columns of the ZPlates and prePlates samples were taken. All of the datasets were divided into training, validation and test holdouts in 0.7:0.2:0.1 ratios, using random seed 42 to ensure reproducibility. 
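A minimal sketch of a 0.7:0.2:0.1 split with seed 42 that also keeps every timestep of one sample in the same holdout, as the description requires for time-lapse data. The function name `grouped_split`, the well-style sample IDs and the use of Python's `random` module are assumptions made for illustration; the actual splitting code is in the VIRVS repository and may differ, so this sketch will not reproduce the published split.

```python
import random
from collections import defaultdict


def grouped_split(sample_ids, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split indices into train/val/test so that all indices sharing a
    sample ID (e.g. several timesteps of one well) share a holdout."""
    groups = defaultdict(list)
    for index, sample_id in enumerate(sample_ids):
        groups[sample_id].append(index)

    unique_ids = sorted(groups)              # fixed order before shuffling
    random.Random(seed).shuffle(unique_ids)  # seeded, hence repeatable

    n_train = round(ratios[0] * len(unique_ids))
    n_val = round(ratios[1] * len(unique_ids))
    parts = {
        "train": unique_ids[:n_train],
        "val": unique_ids[n_train:n_train + n_val],
        "test": unique_ids[n_train + n_val:],  # whatever remains
    }
    return {name: sorted(i for sid in ids for i in groups[sid])
            for name, ids in parts.items()}


# Hypothetical example: 10 wells, each imaged at timesteps 100, 108 and 115.
ids = [f"well{w:02d}" for w in range(10) for _ in (100, 108, 115)]
split = grouped_split(ids)
```

The ratios are applied to the number of distinct samples rather than to individual images, which is what prevents one sample from leaking across holdouts.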
For the time-lapse data, all timesteps of a given sample were kept in a single holdout, to prevent information leakage and ensure fair evaluation. All of the samples were normalised to the [-1, 1] range by subtracting the 3rd percentile, dividing by the difference between the 99.8th and 3rd percentiles, clipping to [0, 1] and scaling to [-1, 1]. For the brightfield channel of HAdV, percentiles 0.1 and 99.9 were used. These cutoff points were selected by analysing histograms of the data values, to make the best use of the available data range. The specific values used for normalisation are summarised in Figure 3 of the manuscript listed under <em>Related/alternate identifiers</em>.<br><br> To prepare the cell nuclei masks, a Cellpose model with the pre-trained cyto3 weights was used on the fluorescence channel. The diameter was set to 7 for all the datasets except HAdV, for which automatic diameter estimation was used. Cell masks were prepared using Cellpose with the pre-trained cyto3 weights and a diameter of 70, on brightfield images stacked with the fluorescence nuclei signal.<br><br> The data preparation can be reproduced by downloading the datasets and then running the scripts located in the `scripts/data_processing` directory of the [VIRVS repository](https://github.com/casus/virvs), after modifying the paths in them:<br>• for HAdV data: `preprocess_hadv.py`<br>• for VACV data: `stitch_vacv.py` + `preprocess_vacv.py`<br>• for the rest of the viruses: `preprocess_other.py`<br>• to prepare Cellpose predictions: `prepare_cellpose_preds.py` (for cells) and `prepare_cellpose_preds_nuc.py` (for nuclei)<br><br> <strong>Additional Dataset in v1.2: GFP-transgenic human coronavirus OC43 (CoV-GFP)</strong><br> This dataset comprises raw fluorescence microscopy images acquired from a 384-well control plate, half of which was infected with GFP-transgenic human coronavirus OC43 (CoV-GFP). 
The plate was imaged using two fluorescence channels: CoV-GFP to visualise viral infection, and Hoechst 33342 to stain cell nuclei. The raw images of two plates are provided in `cov_raw.zip`. On each plate, half of the wells are infected with CoV-GFP and the other half are mock-infected (no virus). Images were captured using a 4× objective on an ImageXpress Micro imaging system (Molecular Devices). The dataset was derived from a published high-throughput screening study by Murer et al. [1], aimed at identifying broad-spectrum antiviral compounds.<br><br> [1] Murer, L. et al. Identification of broad anti-coronavirus chemical agents for repurposing against SARS-CoV-2 and variants of concern. <em>Current Research in Virological Science</em>, 3, 100019 (2022).
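The percentile normalisation described under Data preparation can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function name `normalise` is hypothetical, and the percentiles are computed over whatever array is passed in, whereas the description does not say here whether the cutoffs were computed per image or per dataset (the values actually used are given in Figure 3 of the manuscript).

```python
import numpy as np


def normalise(image, p_low=3.0, p_high=99.8):
    """Rescale an image to [-1, 1] using percentile cutoffs."""
    image = np.asarray(image, dtype=np.float64)
    # Assumption: cutoffs are taken from the array passed in.
    low, high = np.percentile(image, [p_low, p_high])
    if high <= low:
        raise ValueError("percentile cutoffs coincide; image is nearly constant")
    scaled = (image - low) / (high - low)  # cutoffs map to 0 and 1
    scaled = np.clip(scaled, 0.0, 1.0)     # saturate values outside the cutoffs
    return (scaled * 2.0 - 1.0).astype(np.float32)


# Synthetic stand-in for one fluorescence image (not real data).
rng = np.random.default_rng(0)
image = rng.integers(0, 4096, size=(64, 64))
normalised = normalise(image)

# The description gives 0.1 and 99.9 for the HAdV brightfield channel.
normalised_bf = normalise(image, p_low=0.1, p_high=99.9)
```

Because the arrays in `x.npy` and `y.npy` are stated to be already processed, a step like this would only be needed when starting again from the raw data.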

Provider:
Rodare
Created:
2025-08-04