sreshwarprasad/hierarchical-gz2

Name: sreshwarprasad/hierarchical-gz2
Creator: sreshwarprasad
Published: 2026-04-15 07:58:16
License: No description available

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/sreshwarprasad/hierarchical-gz2

Resource description:

# Galaxy Zoo 2 Dataset

This directory contains the dataset used for training and evaluating the **Hierarchical Probabilistic Vote Regression model** on Galaxy Zoo 2.

---

## Dataset Source

The dataset is derived from **Galaxy Zoo 2 (GZ2)**, a large-scale citizen science project providing galaxy morphology annotations.

* Total images: ~243,000
* Labels: volunteer vote distributions across a hierarchical decision tree

The data corresponds to:

* Willett et al. (2013)
* Hart et al. (2016)

---

## Directory Structure

```text
data/
├── raw/          # Original data (Zenodo images + metadata CSVs)
├── processed/    # Final curated dataset
└── README.md
```

---

## Raw Data

The raw dataset includes:

* Galaxy images from Zenodo (`images_gz2.zip`)
* `gz2_hart16.csv`: morphological vote distributions
* `gz2_filename_mapping.csv`: mapping between image filenames and object IDs

Each galaxy is identified by a **DR7 object ID (`dr7objid`)**, which links images to labels.

---

## Processing Pipeline (This Work)

The raw Galaxy Zoo 2 dataset is not directly suitable for training. In this project, the following preprocessing steps are performed:

* Map images from `asset_id` → `dr7objid`
* Rename images for consistent indexing
* Align images with corresponding labels
* Remove missing or unmatched samples
* Filter invalid or incomplete entries
* Construct hierarchical vote distributions
* Ensure all probability vectors are valid (sum = 1)

Pipeline implementation:

```bash
python3 scripts/data/convert_zenodo_images.py
python3 scripts/data/prepare_labels.py
```

---

## Processed Data

The final dataset consists of:

* Cleaned and aligned galaxy images
* `labels.parquet` containing structured vote distributions

This dataset is directly used for training probabilistic models.
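
The preprocessing steps listed under Processing Pipeline can be illustrated with a minimal sketch. This is not the code in `scripts/data/`; it only shows the general idea of aligning images to labels by `dr7objid` and turning vote counts into per-question probability vectors. The question groups in `QUESTIONS`, the vote key names, the `<dr7objid>.jpg` naming scheme, and both function names are hypothetical, and dropping zero-vote questions is an assumption about how incomplete entries might be filtered.

```python
from typing import Dict, List, Tuple

# Hypothetical question groups: decision-tree question -> answer names.
# The real GZ2 tree has more questions and different column names.
QUESTIONS: Dict[str, List[str]] = {
    "t01": ["smooth", "features", "artifact"],
    "t02": ["edge_on_yes", "edge_on_no"],
}


def align_images_with_labels(
    mapping: Dict[str, int],
    labels: Dict[int, Dict[str, float]],
) -> Dict[int, Tuple[str, Dict[str, float]]]:
    """Join asset_id -> dr7objid with dr7objid -> votes.

    Samples without a matching label row are dropped. The output filename
    pattern "<dr7objid>.jpg" is an assumed renaming scheme.
    """
    aligned = {}
    for asset_id, objid in mapping.items():
        votes = labels.get(objid)
        if votes is None:
            continue  # unmatched sample: no label for this image
        aligned[objid] = (f"{objid}.jpg", votes)
    return aligned


def votes_to_distributions(
    votes: Dict[str, float],
    questions: Dict[str, List[str]] = QUESTIONS,
) -> Dict[str, List[float]]:
    """Convert raw vote counts into one probability vector per question.

    Questions that received no votes are omitted (an assumption), so every
    returned vector sums to 1.
    """
    distributions = {}
    for question, answers in questions.items():
        counts = [float(votes.get(f"{question}_{a}", 0.0)) for a in answers]
        total = sum(counts)
        if total <= 0:
            continue  # question never reached by volunteers
        distributions[question] = [c / total for c in counts]
    return distributions
```

Grouping the normalization by question, rather than over the whole label vector, keeps each node of the decision tree a valid distribution on its own, which is what the "sum = 1" requirement above refers to.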
---

## Notes

* Final dataset size: **~3.3 GB (images + labels)**
* A small fraction of samples may be missing due to incomplete mappings
* Missing samples are assumed to be randomly distributed

The processed dataset ensures:

* consistent mapping between images and labels
* valid probability distributions
* suitability for hierarchical probabilistic modeling

---

## Usage

This dataset is designed for:

* probabilistic regression over Galaxy Zoo decision trees
* predicting full vote distributions instead of single labels
* training models that preserve uncertainty and distribution structure

---

## Citation

If you use this dataset, please cite:

* Willett, K. W., et al. (2013) *Galaxy Zoo 2: detailed morphological classifications for galaxies* https://arxiv.org/abs/1308.3496
* Hart, R. E., et al. (2016) *Galaxy Zoo: improved debiased morphological classifications* https://arxiv.org/abs/1607.01019

---

## Summary

This dataset represents a cleaned and structured version of Galaxy Zoo 2, enabling reliable training of models that predict **hierarchical probability distributions** rather than discrete labels.
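
As an illustration of the usage described above (predicting full vote distributions instead of single labels), the following is a minimal sketch of one possible training objective: a softmax per decision-tree question, scored with cross-entropy against the volunteer vote fractions. It is not necessarily the loss used by the Hierarchical Probabilistic Vote Regression model. The function names, the flat logit layout, and the `slices` argument are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one question's answer logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_question_cross_entropy(
    logits: Sequence[float],
    targets: Sequence[float],
    slices: Sequence[Tuple[int, int]],
    eps: float = 1e-12,
) -> float:
    """Sum of cross-entropies between target vote fractions and predictions.

    `slices` holds (start, end) index pairs, one per question, so that each
    question's answers are normalized independently of the others.
    """
    loss = 0.0
    for start, end in slices:
        probs = softmax(logits[start:end])
        loss -= sum(
            t * math.log(p + eps)
            for t, p in zip(targets[start:end], probs)
        )
    return loss


# Example: two questions with 3 and 2 answers respectively.
example_slices = [(0, 3), (3, 5)]
example_logits = [0.0, 0.0, 0.0, 0.0, 0.0]   # uniform predictions
example_targets = [1.0, 0.0, 0.0, 0.5, 0.5]  # vote fractions per question
example_loss = per_question_cross_entropy(
    example_logits, example_targets, example_slices
)
```

Because the targets are vote fractions, not one-hot labels, this objective rewards a model for matching the spread of volunteer opinion, which is one way to preserve the uncertainty mentioned in the Usage section.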

Provider:

sreshwarprasad
