sreshwarprasad/hierarchical-gz2
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/sreshwarprasad/hierarchical-gz2
# Galaxy Zoo 2 Dataset
This directory contains the dataset used for training and evaluating the **Hierarchical Probabilistic Vote Regression model** on Galaxy Zoo 2.
---
## Dataset Source
The dataset is derived from **Galaxy Zoo 2 (GZ2)**, a large-scale citizen science project providing galaxy morphology annotations.
* Total images: ~243,000
* Labels: volunteer vote distributions across a hierarchical decision tree
The data corresponds to:
* Willett et al. (2013)
* Hart et al. (2016)
---
## Directory Structure
```text
data/
├── raw/ # Original data (Zenodo images + metadata CSVs)
├── processed/ # Final curated dataset
└── README.md
```
---
## Raw Data
The raw dataset includes:
* Galaxy images from Zenodo (`images_gz2.zip`)
* `gz2_hart16.csv` — morphological vote distributions
* `gz2_filename_mapping.csv` — mapping between image filenames and object IDs
Each galaxy is identified by its **SDSS DR7 object ID (`dr7objid`)**, which links images to labels.
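As a rough illustration of how the ID links the two CSV files, the sketch below joins mapping rows to label rows using only the standard library. The column names `objid` and `asset_id` in the mapping file are assumptions, as is the `total_votes` column in the toy data; check the actual CSV headers before relying on them. IDs are kept as strings because DR7 object IDs are 18-digit integers, which lose precision if parsed as floats.

```python
import csv
import io


def link_images_to_labels(mapping_rows, label_rows,
                          map_id_col="objid", label_id_col="dr7objid"):
    """Join filename-mapping rows to label rows on the object ID.

    Returns (matched, unmatched): `matched` maps asset_id -> label row,
    `unmatched` lists asset_ids whose object ID has no label row.
    Column names are assumptions, not confirmed by the dataset card.
    """
    # Index label rows by ID; keep IDs as strings to avoid precision loss.
    labels_by_id = {row[label_id_col].strip(): row for row in label_rows}

    matched, unmatched = {}, []
    for row in mapping_rows:
        asset_id = row["asset_id"].strip()
        label = labels_by_id.get(row[map_id_col].strip())
        if label is None:
            unmatched.append(asset_id)
        else:
            matched[asset_id] = label
    return matched, unmatched


# Tiny synthetic example (values are made up for illustration).
mapping_csv = io.StringIO(
    "objid,asset_id\n587722981742084144,1\n587722981742084999,2\n"
)
labels_csv = io.StringIO("dr7objid,total_votes\n587722981742084144,44\n")

matched, unmatched = link_images_to_labels(
    csv.DictReader(mapping_csv), csv.DictReader(labels_csv)
)
```

With the real files, the two `io.StringIO` objects would be replaced by open file handles for `gz2_filename_mapping.csv` and `gz2_hart16.csv`.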
---
## Processing Pipeline (This Work)
The raw Galaxy Zoo 2 release cannot be used for training as-is: image files are keyed by `asset_id`, while the vote tables are keyed by `dr7objid`.
This project applies the following preprocessing steps:
* Map images from `asset_id` → `dr7objid`
* Rename images for consistent indexing
* Align images with corresponding labels
* Remove missing or unmatched samples
* Filter invalid or incomplete entries
* Construct hierarchical vote distributions
* Ensure all probability vectors are valid (non-negative entries that sum to 1)
Pipeline implementation:
```bash
python3 scripts/data/convert_zenodo_images.py
python3 scripts/data/prepare_labels.py
```
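The scripts themselves are not reproduced here. As a minimal sketch of the last two steps (constructing per-question vote distributions and checking that they sum to 1), the following assumes labels start as raw vote counts grouped by decision-tree question. The question names and the choice to mark unanswered questions as `None` are illustrative assumptions, not taken from `prepare_labels.py`.

```python
import math


def counts_to_fractions(counts):
    """Normalize one question's vote counts into a probability vector.

    Returns None when the question received no votes (or has a negative
    count), so callers can treat that branch of the tree as unobserved.
    """
    total = sum(counts)
    if total <= 0 or any(c < 0 for c in counts):
        return None
    return [c / total for c in counts]


def is_valid_distribution(p, tol=1e-6):
    """True if p is a non-negative vector summing to 1 within tolerance."""
    return (
        p is not None
        and all(x >= 0 for x in p)
        and math.isclose(sum(p), 1.0, abs_tol=tol)
    )


def build_hierarchical_label(vote_counts):
    """Map each question name to its normalized vote distribution."""
    return {q: counts_to_fractions(c) for q, c in vote_counts.items()}


# Hypothetical vote counts for one galaxy (question names are illustrative).
example = {
    "t01_smooth_or_features": [30, 12, 2],
    "t02_edgeon": [1, 11],
    "t03_bar": [0, 0],  # question never reached by any volunteer
}
label = build_hierarchical_label(example)
```

Normalizing within each question, rather than across the whole tree, keeps every node's distribution valid even though deeper questions receive fewer votes.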
---
## Processed Data
The final dataset consists of:
* Cleaned and aligned galaxy images
* `labels.parquet` containing structured vote distributions
The processed dataset is used directly for training probabilistic models.
---
## Notes
* Final dataset size: **~3.3 GB (images + labels)**
* A small fraction of samples may be missing due to incomplete mappings
* Missing samples are assumed to be randomly distributed across the catalog, so their removal is not expected to bias the label distribution
The processed dataset ensures:
* consistent mapping between images and labels
* valid probability distributions
* suitability for hierarchical probabilistic modeling
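One way to spot-check the image-to-label consistency after downloading is to compare the image filenames with the label IDs. The sketch below assumes images are stored as `<dr7objid>.jpg` in a single directory; that naming scheme is a guess, since this README does not state how files are renamed.

```python
from pathlib import Path
import tempfile


def check_alignment(image_dir, label_ids, ext=".jpg"):
    """Report IDs present on only one side of the image/label pairing.

    Assumes each image file is named "<id><ext>" (a hypothetical scheme).
    """
    image_ids = {p.stem for p in Path(image_dir).glob(f"*{ext}")}
    label_ids = set(map(str, label_ids))
    return {
        "images_without_labels": sorted(image_ids - label_ids),
        "labels_without_images": sorted(label_ids - image_ids),
    }


# Self-contained demo with empty placeholder files in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("100.jpg", "101.jpg"):
        (Path(tmp) / name).touch()
    report = check_alignment(tmp, ["100", "102"])
```

For a fully aligned dataset both lists in the report should be empty.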
---
## Usage
This dataset is designed for:
* probabilistic regression over Galaxy Zoo decision trees
* predicting full vote distributions instead of single labels
* training models that preserve uncertainty and distribution structure
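To make "predicting full vote distributions" concrete, here is a small dependency-free sketch of one possible training objective: a softmax cross-entropy against the vote fractions of each question, averaged over the questions that received votes. This is a generic formulation chosen for illustration and is not necessarily the loss used by the Hierarchical Probabilistic Vote Regression model; the question keys are placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def hierarchical_cross_entropy(logits_by_question, target_by_question):
    """Mean cross-entropy between predicted and observed vote fractions.

    Questions whose target is None (no votes reached them) are skipped,
    so unobserved branches of the decision tree contribute no gradient.
    """
    total, n = 0.0, 0
    for question, target in target_by_question.items():
        if target is None:
            continue
        probs = softmax(logits_by_question[question])
        # Clamp probabilities to avoid log(0) for extreme logits.
        total += -sum(
            t * math.log(max(p, 1e-12))
            for t, p in zip(target, probs)
            if t > 0
        )
        n += 1
    return total / n if n else 0.0


# Placeholder questions: uniform logits against a 50/50 vote split.
targets = {"t01": [0.5, 0.5], "t02": None}
logits = {"t01": [0.0, 0.0], "t02": [1.0, -1.0]}
loss = hierarchical_cross_entropy(logits, targets)
```

In this toy case the only scored question has a uniform prediction against a 50/50 target, so the loss equals ln 2. A real model would compute the same quantity on tensors in a deep learning framework.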
---
## Citation
If you use this dataset, please cite:
* Willett, K. W., et al. (2013)
*Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey*
https://arxiv.org/abs/1308.3496
* Hart, R. E., et al. (2016)
*Galaxy Zoo: comparing the demographics of spiral arm number and a new method for correcting redshift bias*
https://arxiv.org/abs/1607.01019
---
## Summary
This dataset represents a cleaned and structured version of Galaxy Zoo 2, enabling reliable training of models that predict **hierarchical probability distributions** rather than discrete labels.
Provider: sreshwarprasad



