eminorhan/openorganelle-2d

Source: Hugging Face. Updated: 2026-04-06. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/eminorhan/openorganelle-2d
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: crop_name
    dtype: string
  - name: axis
    dtype: string
  - name: slice
    dtype: int32
  - name: part_id
    dtype: string
  splits:
  - name: train
    num_examples: 1001894
license: cc-by-4.0
---

# OpenOrganelle 2D

This dataset contains a large collection of 2D slices from the EM volumes on HHMI Janelia's [OpenOrganelle](https://openorganelle.janelia.org/) data repository. The dataset contains a total of ~1M *x*, *y*, and *z* slices obtained from 65 different 3D EM volumes on OpenOrganelle.

## Notes

1. The following volumes on OpenOrganelle are missing from the current repository because they were too large to process and store here: `jrc_fly-larva-1`, `jrc_fly-mb-z0419-20`, `jrc_mus-guard-hair-follicle`, `jrc_mus-liver-zon-1`, `jrc_mus-liver-zon-2`, `jrc_mus-meissner-corpuscle-2`, `jrc_mus-pacinian-corpuscle`, `jrc_zf-cardiac-1`
2. Again due to HF storage constraints, we only processed and stored every **18th** slice along each axis (*i.e.* 18x subsampling of the full data). The full data take up over 100 TB on disk. This subsampled version only takes up ~5.85 TB of disk space.
3. We also divided large slices into smaller equal-sized pieces of no more than 4096 pixels along any given axis: *e.g.* an 8192x8192 slice would be broken up into four parts of size 4096x4096 pixels each, and each part would be given a unique part id `i_j` (in this case `0_0`, `0_1`, `1_0`, and `1_1`) identifying its "part coordinates" within the larger slice.
4. The data were prepared with [this](https://github.com/eminorhan/torchtitan-segmentation/blob/master/helpers/create_slice_dataset_oo.py) preprocessing script and then pushed to the HF datasets Hub using [this](https://github.com/eminorhan/torchtitan-segmentation/blob/master/helpers/push_slice_dataset_oo.py) script. We used the highest resolution data (stored in `s0`) from all volumes.
5. The dataset rows are pre-shuffled to make the data shards roughly uniform in size.

## Usage

**Non-streaming mode:** We recommend caching the dataset on local disk if you have enough disk space (~5.85 TB). You can then load the dataset as follows:

```python
from datasets import load_dataset

ds = load_dataset("eminorhan/openorganelle-2d", split='train')
```

and inspect *e.g.* the first data row:

```python
>>> print(ds[0])
{
    'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=3535x3565 at 0xFFF93CFA52D0>,
    'crop_name': 'jrc_mus-hippocampus-1/recon-1/fibsem-uint8',
    'axis': 'z',
    'slice': 13716,
    'part_id': '4_3'
}
```

where:

* `image` contains the actual 2D slice encoded as a `PIL.Image` object.
* `crop_name` is an identifier string indicating the EM volume the slice comes from.
* `axis` indicates the axis along which the slice was taken (`x`, `y`, or `z`).
* `slice` is the slice index along the `axis`.
* `part_id` is an identifier string `i_j` indicating the part coordinates `i` and `j` of the slice if the slice was obtained by dividing a larger slice into smaller equal-sized pieces (see note 3 above). If the slice was not obtained by dividing a larger slice, `part_id` will be `0_0`.

**Streaming mode:** Alternatively, if you don't have enough disk space or if you don't want to download the full dataset to your local disk, you can load it in *streaming* mode instead and then inspect *e.g.* the first data row as follows:

```python
from datasets import load_dataset

ds = load_dataset("eminorhan/openorganelle-2d", split="train", streaming=True)
fr = next(iter(ds))
print(fr)
```

**License:** The data originally come from HHMI Janelia's [OpenOrganelle](https://www.openorganelle.org/) data portal, [released](https://www.openorganelle.org/faq#sharing) under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en) license.
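The row fields described in the Usage section can be used to pick out slices from a single volume and axis. The following is a minimal sketch, not part of the dataset's own tooling: `parse_part_id` and `select_rows` are hypothetical helper names, and the second stand-in row in the demo is made up for illustration. The helpers accept any iterable of row dictionaries, so they should also work on a dataset loaded in streaming mode.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_part_id(part_id: str) -> Tuple[int, int]:
    """Split a part id such as '4_3' into integer part coordinates (4, 3)."""
    i, j = part_id.split("_")
    return int(i), int(j)


def select_rows(rows: Iterable[Dict], crop_name: str, axis: str) -> Iterator[Dict]:
    """Lazily yield only the rows that come from one volume and one slicing axis."""
    for row in rows:
        if row["crop_name"] == crop_name and row["axis"] == axis:
            yield row


if __name__ == "__main__":
    volume = "jrc_mus-hippocampus-1/recon-1/fibsem-uint8"
    # Stand-in rows carrying the documented fields (the image field is omitted).
    # The first row mirrors the example row in the card; the second is invented.
    rows = [
        {"crop_name": volume, "axis": "z", "slice": 13716, "part_id": "4_3"},
        {"crop_name": volume, "axis": "x", "slice": 18, "part_id": "0_0"},
    ]
    for row in select_rows(rows, volume, "z"):
        print(row["slice"], parse_part_id(row["part_id"]))
```

With a streaming dataset, `select_rows(ds, volume, "z")` could be used in place of the stand-in list. Because the rows are pre-shuffled, collecting every slice of one volume this way likely means scanning the whole stream.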
**Citation:** If you use these data, please cite the following paper:

```
@article{heinrich2021whole,
  title={Whole-cell organelle segmentation in volume electron microscopy},
  author={Heinrich, Larissa and Bennett, Davis and Ackerman, David and Park, Woohyun and Bogovic, John and Eckstein, Nils and Petruncio, Alyson and Clements, Jody and Pang, Song and Xu, C Shan and others},
  journal={Nature},
  volume={599},
  number={7883},
  pages={141--146},
  year={2021},
  publisher={Nature Publishing Group UK London}
}
```

[Paper link](https://www.nature.com/articles/s41586-021-03977-3)
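The 18x subsampling and the division of large slices into parts, both described in the Notes, can be illustrated with a short sketch. This is not the linked preprocessing script. It assumes that subsampling starts at slice index 0, that parts are cut at near-equal integer boundaries, and that `i` indexes rows and `j` indexes columns. The names `kept_slice_indices`, `part_bounds`, and `part_boxes` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

MAX_SIDE = 4096  # maximum part size along any axis (from the Notes)
STEP = 18        # every 18th slice is kept (from the Notes)


def kept_slice_indices(depth: int, step: int = STEP) -> range:
    """Slice indices kept along an axis of length `depth` (assumes a start at 0)."""
    return range(0, depth, step)


def part_bounds(length: int, max_side: int = MAX_SIDE) -> List[Tuple[int, int]]:
    """Cut [0, length) into the fewest near-equal pieces no longer than max_side."""
    n = math.ceil(length / max_side)
    return [(k * length // n, (k + 1) * length // n) for k in range(n)]


def part_boxes(height: int, width: int) -> Dict[str, Tuple[int, int, int, int]]:
    """Map each part id 'i_j' to its (row_start, row_stop, col_start, col_stop) box."""
    return {
        f"{i}_{j}": (r0, r1, c0, c1)
        for i, (r0, r1) in enumerate(part_bounds(height))
        for j, (c0, c1) in enumerate(part_bounds(width))
    }


if __name__ == "__main__":
    # The 8192x8192 example from the Notes yields part ids 0_0, 0_1, 1_0, 1_1.
    for part_id, box in part_boxes(8192, 8192).items():
        print(part_id, box)
```

Under these assumptions a slice that already fits within 4096 pixels on both axes gets the single part id `0_0`, which matches the description of `part_id` for undivided slices.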
Provider:
eminorhan