sreshwarprasad/hierarchical-gz2

Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/sreshwarprasad/hierarchical-gz2
Description:
# Galaxy Zoo 2 Dataset

This directory contains the dataset used for training and evaluating the **Hierarchical Probabilistic Vote Regression model** on Galaxy Zoo 2.

---

## Dataset Source

The dataset is derived from **Galaxy Zoo 2 (GZ2)**, a large-scale citizen science project providing galaxy morphology annotations.

* Total images: ~243,000
* Labels: volunteer vote distributions across a hierarchical decision tree

The data corresponds to:

* Willett et al. (2013)
* Hart et al. (2016)

---

## Directory Structure

```text
data/
├── raw/         # Original data (Zenodo images + metadata CSVs)
├── processed/   # Final curated dataset
└── README.md
```

---

## Raw Data

The raw dataset includes:

* Galaxy images from Zenodo (`images_gz2.zip`)
* `gz2_hart16.csv`: morphological vote distributions
* `gz2_filename_mapping.csv`: mapping between image filenames and object IDs

Each galaxy is identified by a **DR7 object ID (`dr7objid`)**, which links images to labels.

---

## Processing Pipeline (This Work)

The raw Galaxy Zoo 2 dataset is not directly suitable for training. In this project, the following preprocessing steps are performed:

* Map images from `asset_id` → `dr7objid`
* Rename images for consistent indexing
* Align images with corresponding labels
* Remove missing or unmatched samples
* Filter invalid or incomplete entries
* Construct hierarchical vote distributions
* Ensure all probability vectors are valid (sum = 1)

Pipeline implementation:

```bash
python3 scripts/data/convert_zenodo_images.py
python3 scripts/data/prepare_labels.py
```

---

## Processed Data

The final dataset consists of:

* Cleaned and aligned galaxy images
* `labels.parquet` containing structured vote distributions

This dataset is directly used for training probabilistic models.
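The mapping, renaming, and unmatched-sample steps listed under Processing Pipeline can be illustrated with a minimal sketch. It is not the code in `convert_zenodo_images.py`; it only assumes that the mapping CSV exposes columns named `asset_id` and `dr7objid` (the real header names may differ), that each image filename stem is its `asset_id`, and that renamed images take the form `<dr7objid>.jpg`. The helper names `build_id_map` and `align_images_with_labels` and all ID values are hypothetical. Images with no mapping entry, or whose object ID has no label row, are collected in `dropped` rather than silently discarded, so the size of the missing fraction can be inspected.

```python
import csv
import io


def build_id_map(mapping_csv):
    """Return {asset_id: dr7objid} read from a mapping CSV (file-like object).

    The column names are assumptions; adjust them to the real CSV header.
    """
    reader = csv.DictReader(mapping_csv)
    return {row["asset_id"]: row["dr7objid"] for row in reader}


def align_images_with_labels(image_names, id_map, labelled_ids):
    """Pair each image with its dr7objid and drop unmatched samples.

    image_names  -- filenames such as "1.jpg" (stem assumed to be the asset_id)
    id_map       -- {asset_id: dr7objid}
    labelled_ids -- set of dr7objid values that have a label row
    Returns ({new_name: original_name}, [dropped original names]).
    """
    aligned, dropped = {}, []
    for name in image_names:
        asset_id = name.rsplit(".", 1)[0]
        objid = id_map.get(asset_id)
        if objid is None or objid not in labelled_ids:
            dropped.append(name)  # no mapping entry, or no labels for this ID
            continue
        aligned[f"{objid}.jpg"] = name
    return aligned, dropped


# In-memory stand-in for the mapping file; the IDs are made-up examples.
mapping_csv = io.StringIO(
    "dr7objid,asset_id\n"
    "587722981736120347,1\n"
    "587722981736579107,2\n"
)
id_map = build_id_map(mapping_csv)
aligned, dropped = align_images_with_labels(
    ["1.jpg", "2.jpg", "3.jpg"],
    id_map,
    labelled_ids={"587722981736120347"},
)
print(aligned)
print(dropped)
```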

---

## Notes

* Final dataset size: **~3.3 GB (images + labels)**
* A small fraction of samples may be missing due to incomplete mappings
* Missing samples are assumed to be randomly distributed

The processed dataset ensures:

* consistent mapping between images and labels
* valid probability distributions
* suitability for hierarchical probabilistic modeling

---

## Usage

This dataset is designed for:

* probabilistic regression over Galaxy Zoo decision trees
* predicting full vote distributions instead of single labels
* training models that preserve uncertainty and distribution structure

---

## Citation

If you use this dataset, please cite:

* Willett, K. W., et al. (2013). *Galaxy Zoo 2: detailed morphological classifications for galaxies*. https://arxiv.org/abs/1308.3496
* Hart, R. E., et al. (2016). *Galaxy Zoo: improved debiased morphological classifications*. https://arxiv.org/abs/1607.01019

---

## Summary

This dataset represents a cleaned and structured version of Galaxy Zoo 2, enabling reliable training of models that predict **hierarchical probability distributions** rather than discrete labels.
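As an illustration of what "valid probability distributions" means in practice, the sketch below turns raw vote counts for each question of the decision tree into a vector that sums to 1 and checks the result. It is an assumption-laden example, not the logic of `prepare_labels.py`: the question names, the vote counts, and the helpers `normalize_votes` and `is_valid_distribution` are hypothetical, and the actual pipeline may start from vote fractions rather than counts. Questions that received no votes (common for deeper branches of the tree) return `None`, so they can be filtered or masked instead of producing a division by zero.

```python
import math


def normalize_votes(counts):
    """Convert vote counts for one question into a probability vector.

    Returns None if the counts are negative or nobody answered the question.
    """
    total = sum(counts)
    if total <= 0 or any(c < 0 for c in counts):
        return None
    return [c / total for c in counts]


def is_valid_distribution(probs, tol=1e-6):
    """Check that all entries are non-negative and sum to 1 within tol."""
    return all(p >= 0 for p in probs) and math.isclose(
        sum(probs), 1.0, abs_tol=tol
    )


# Hypothetical galaxy: question name -> vote counts per answer.
galaxy_votes = {
    "smooth_or_features": [30, 9, 1],
    "edge_on": [2, 7],
    "bar": [0, 0],  # question never reached by any volunteer
}

distributions = {q: normalize_votes(c) for q, c in galaxy_votes.items()}
for question, probs in distributions.items():
    status = "unanswered" if probs is None else is_valid_distribution(probs)
    print(question, probs, status)
```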
Provider:
sreshwarprasad