eminorhan/openorganelle-2d
Source: Hugging Face. Updated: 2026-04-06. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/eminorhan/openorganelle-2d
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: crop_name
    dtype: string
  - name: axis
    dtype: string
  - name: slice
    dtype: int32
  - name: part_id
    dtype: string
  splits:
  - name: train
    num_examples: 1001894
license: cc-by-4.0
---
# OpenOrganelle 2D
This dataset contains a large collection of 2D slices from the EM volumes on HHMI Janelia's [OpenOrganelle](https://openorganelle.janelia.org/) data repository.
The dataset contains a total of ~1M *x*, *y*, and *z* slices obtained from 65 different 3D EM volumes on OpenOrganelle.
## Notes
1. The following volumes on OpenOrganelle are missing from the current repository because they were too large to process and store here:
`jrc_fly-larva-1`, `jrc_fly-mb-z0419-20`, `jrc_mus-guard-hair-follicle`, `jrc_mus-liver-zon-1`, `jrc_mus-liver-zon-2`, `jrc_mus-meissner-corpuscle-2`, `jrc_mus-pacinian-corpuscle`, `jrc_zf-cardiac-1`
2. Again, due to HF storage constraints, we only processed and stored every **18th** slice along each axis (*i.e.* 18x subsampling of the full data). The full data take up
over 100 TB on disk, whereas this subsampled version takes up only ~5.85 TB of disk space.
3. We also divided large slices into smaller equal-sized pieces of no more than 4096 pixels along any given axis: *e.g.* an 8192x8192 slice would be broken up into four parts of size
4096x4096 pixels each, and each part would be given a unique part id `i_j` (in this case `0_0`, `0_1`, `1_0`, and `1_1`) identifying its "part coordinates" within the larger slice.
4. The data were prepared with [this](https://github.com/eminorhan/torchtitan-segmentation/blob/master/helpers/create_slice_dataset_oo.py) preprocessing script
and then pushed to the HF datasets Hub using [this](https://github.com/eminorhan/torchtitan-segmentation/blob/master/helpers/push_slice_dataset_oo.py) script. We used
the highest resolution data (stored in `s0`) from all volumes.
5. The dataset rows are pre-shuffled to make the data shards roughly uniform in size.
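The tiling rule in note 3 can be sketched as follows. This is a minimal illustration and not the preprocessing script itself. It assumes that the number of parts along an axis is the smallest count that keeps every part at or below 4096 pixels, that `i` indexes rows and `j` indexes columns, and that part sizes differ by at most one pixel when an axis does not divide evenly. The names `split_points` and `part_boxes` are hypothetical.

```python
import math

MAX_SIDE = 4096  # maximum part size along any axis, as stated in note 3


def split_points(length, max_side=MAX_SIDE):
    """Return (start, stop) pairs that cut `length` pixels into near-equal parts."""
    n_parts = max(1, math.ceil(length / max_side))
    # Integer boundaries; part sizes differ by at most one pixel (assumption).
    edges = [(k * length) // n_parts for k in range(n_parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


def part_boxes(height, width, max_side=MAX_SIDE):
    """Map each part id "i_j" to its (row_start, row_stop, col_start, col_stop) box."""
    boxes = {}
    for i, (r0, r1) in enumerate(split_points(height, max_side)):
        for j, (c0, c1) in enumerate(split_points(width, max_side)):
            boxes[f"{i}_{j}"] = (r0, r1, c0, c1)
    return boxes


boxes = part_boxes(8192, 8192)
print(sorted(boxes))  # ['0_0', '0_1', '1_0', '1_1']
print(boxes["1_1"])   # (4096, 8192, 4096, 8192)
```

Under these assumptions, a slice that already fits within 4096 pixels on both axes yields a single part with id `0_0`.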
## Usage
**Non-streaming mode:** We recommend caching the dataset on local disk if you have enough disk space (~5.85 TB). You can then load the dataset as follows:
```python
from datasets import load_dataset
ds = load_dataset("eminorhan/openorganelle-2d", split='train')
```
and inspect *e.g.* the first data row:
```python
>>> print(ds[0])
{
    'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=3535x3565 at 0xFFF93CFA52D0>,
    'crop_name': 'jrc_mus-hippocampus-1/recon-1/fibsem-uint8',
    'axis': 'z',
    'slice': 13716,
    'part_id': '4_3'
}
```
where:
* `image` contains the actual 2D slice encoded as a `PIL.Image` object.
* `crop_name` is an identifier string indicating the EM volume the slice comes from.
* `axis` indicates the axis along which the slice was taken (`x`, `y`, or `z`).
* `slice` is the slice index along the `axis`.
* `part_id` is an identifier string `i_j` indicating the part coordinates `i` and `j` of the slice
if the slice was obtained by dividing a larger slice into smaller equal-sized pieces (see above). If the slice was not obtained by dividing a larger slice, `part_id` will be
`0_0`.
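Taken together, `crop_name`, `axis`, and `slice` identify the full slice a row belongs to, and `part_id` locates the row within that slice. The sketch below groups rows on that basis. It uses plain dictionaries as stand-ins for dataset rows (the `image` field is omitted), and the helper names `parse_part_id` and `group_parts` are hypothetical.

```python
from collections import defaultdict


def parse_part_id(part_id):
    """Split a part id such as "4_3" into integer part coordinates (4, 3)."""
    i, j = part_id.split("_")
    return int(i), int(j)


def group_parts(rows):
    """Group rows by the slice they belong to: (crop_name, axis, slice)."""
    groups = defaultdict(dict)
    for row in rows:
        key = (row["crop_name"], row["axis"], row["slice"])
        groups[key][parse_part_id(row["part_id"])] = row
    return dict(groups)


# Stand-in rows with the same non-image fields as the dataset.
crop = "jrc_mus-hippocampus-1/recon-1/fibsem-uint8"
rows = [
    {"crop_name": crop, "axis": "z", "slice": 13716, "part_id": "4_3"},
    {"crop_name": crop, "axis": "z", "slice": 13716, "part_id": "0_0"},
]
groups = group_parts(rows)
print(sorted(groups[(crop, "z", 13716)]))  # [(0, 0), (4, 3)]
```

Because the rows are pre-shuffled, parts of the same slice are not stored next to each other, so collecting every part of a given slice requires scanning the rows rather than reading a contiguous range.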
**Streaming mode:** Alternatively, if you don't have enough disk space or if you don't want to download the full dataset to your local disk,
you can load it in *streaming* mode instead and then inspect *e.g.* the first data row as follows:
```python
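from datasets import load_dataset

# Stream rows on demand instead of downloading the full dataset first.
ds = load_dataset("eminorhan/openorganelle-2d", split="train", streaming=True)
fr = next(iter(ds))  # first data row
print(fr)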
```
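In streaming mode, it is often useful to pull only a handful of rows that match some condition. The sketch below shows one way to do this over any iterable of row dictionaries. The helper `take_matching` and its parameters are hypothetical and not part of the `datasets` library. The stand-in rows here only mimic the dataset's non-image fields.

```python
from itertools import islice


def take_matching(rows, axis=None, crop_prefix=None, limit=5):
    """Return up to `limit` rows matching the given axis and crop-name prefix."""
    matches = (
        row for row in rows
        if (axis is None or row["axis"] == axis)
        and (crop_prefix is None or row["crop_name"].startswith(crop_prefix))
    )
    # islice stops consuming `rows` as soon as `limit` matches are found.
    return list(islice(matches, limit))


# Stand-in rows; a streaming dataset would yield dictionaries with these keys.
sample_rows = [
    {"crop_name": "jrc_mus-hippocampus-1/recon-1/fibsem-uint8", "axis": "z", "slice": 13716, "part_id": "4_3"},
    {"crop_name": "jrc_mus-hippocampus-1/recon-1/fibsem-uint8", "axis": "x", "slice": 18, "part_id": "0_0"},
    {"crop_name": "jrc_mus-hippocampus-1/recon-1/fibsem-uint8", "axis": "z", "slice": 36, "part_id": "0_1"},
]
z_rows = take_matching(sample_rows, axis="z", limit=2)
print([row["slice"] for row in z_rows])  # [13716, 36]
```

Passing the streaming dataset itself, *e.g.* `take_matching(ds, axis="z", limit=5)`, should work the same way, since iterating over it yields one dictionary per row. Rows that do not match are still fetched before being skipped.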
**License:** The data originally come from HHMI Janelia's [OpenOrganelle](https://www.openorganelle.org/) data portal [released](https://www.openorganelle.org/faq#sharing) under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en) license.
**Citation:** If you use these data, please cite the following paper:
```
@article{heinrich2021whole,
  title={Whole-cell organelle segmentation in volume electron microscopy},
  author={Heinrich, Larissa and Bennett, Davis and Ackerman, David and Park, Woohyun and Bogovic, John and Eckstein, Nils and Petruncio, Alyson and Clements, Jody and Pang, Song and Xu, C Shan and others},
  journal={Nature},
  volume={599},
  number={7883},
  pages={141--146},
  year={2021},
  publisher={Nature Publishing Group UK London}
}
```
[Paper link](https://www.nature.com/articles/s41586-021-03977-3)
Provided by:
eminorhan



