mwalmsley/gz_hubble
Source: Hugging Face. Updated 2024-06-12; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/mwalmsley/gz_hubble
Resource description:
---
annotations_creators:
- crowdsourced
license: cc-by-nc-sa-4.0
size_categories:
- 10K<n<100K
task_categories:
- image-classification
- image-feature-extraction
pretty_name: Galaxy Zoo Hubble
arxiv: 2404.02973
tags:
- galaxy zoo
- physics
- astronomy
- galaxies
- citizen science
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
- config_name: tiny
  data_files:
  - split: train
    path: tiny/train-*
  - split: test
    path: tiny/test-*
dataset_info:
- config_name: default
  features:
  - name: image
    dtype: image
  - name: id_str
    dtype: string
  - name: ra
    dtype: float64
  - name: dec
    dtype: float64
  - name: smooth-or-featured-hubble_smooth
    dtype: int32
  - name: smooth-or-featured-hubble_features
    dtype: int32
  - name: smooth-or-featured-hubble_artifact
    dtype: int32
  - name: how-rounded-hubble_completely
    dtype: int32
  - name: how-rounded-hubble_in-between
    dtype: int32
  - name: how-rounded-hubble_cigar-shaped
    dtype: int32
  - name: clumpy-appearance-hubble_yes
    dtype: int32
  - name: clumpy-appearance-hubble_no
    dtype: int32
  - name: clump-count-hubble_1
    dtype: int32
  - name: clump-count-hubble_2
    dtype: int32
  - name: clump-count-hubble_3
    dtype: int32
  - name: clump-count-hubble_4
    dtype: int32
  - name: clump-count-hubble_5-plus
    dtype: int32
  - name: clump-count-hubble_cant-tell
    dtype: int32
  - name: disk-edge-on-hubble_yes
    dtype: int32
  - name: disk-edge-on-hubble_no
    dtype: int32
  - name: bulge-shape-hubble_rounded
    dtype: int32
  - name: bulge-shape-hubble_boxy
    dtype: int32
  - name: bulge-shape-hubble_none
    dtype: int32
  - name: bar-hubble_yes
    dtype: int32
  - name: bar-hubble_no
    dtype: int32
  - name: has-spiral-arms-hubble_yes
    dtype: int32
  - name: has-spiral-arms-hubble_no
    dtype: int32
  - name: spiral-winding-hubble_tight
    dtype: int32
  - name: spiral-winding-hubble_medium
    dtype: int32
  - name: spiral-winding-hubble_loose
    dtype: int32
  - name: spiral-arm-count-hubble_1
    dtype: int32
  - name: spiral-arm-count-hubble_2
    dtype: int32
  - name: spiral-arm-count-hubble_3
    dtype: int32
  - name: spiral-arm-count-hubble_4
    dtype: int32
  - name: spiral-arm-count-hubble_5-plus
    dtype: int32
  - name: spiral-arm-count-hubble_cant-tell
    dtype: int32
  - name: bulge-size-hubble_none
    dtype: int32
  - name: bulge-size-hubble_just-noticeable
    dtype: int32
  - name: bulge-size-hubble_obvious
    dtype: int32
  - name: bulge-size-hubble_dominant
    dtype: int32
  - name: galaxy-symmetrical-hubble_yes
    dtype: int32
  - name: galaxy-symmetrical-hubble_no
    dtype: int32
  - name: clumps-embedded-larger-object-hubble_yes
    dtype: int32
  - name: clumps-embedded-larger-object-hubble_no
    dtype: int32
  splits:
  - name: train
    num_bytes: 2556400698.778
    num_examples: 77158
  - name: test
    num_bytes: 637679057.568
    num_examples: 19291
  download_size: 3196577767
  dataset_size: 3194079756.3459997
- config_name: tiny
  features:
  - name: image
    dtype: image
  - name: id_str
    dtype: string
  - name: dataset_name
    dtype: string
  - name: ra
    dtype: float64
  - name: dec
    dtype: float64
  - name: smooth-or-featured-hubble_smooth
    dtype: int32
  - name: smooth-or-featured-hubble_features
    dtype: int32
  - name: smooth-or-featured-hubble_artifact
    dtype: int32
  - name: how-rounded-hubble_completely
    dtype: int32
  - name: how-rounded-hubble_in-between
    dtype: int32
  - name: how-rounded-hubble_cigar-shaped
    dtype: int32
  - name: clumpy-appearance-hubble_yes
    dtype: int32
  - name: clumpy-appearance-hubble_no
    dtype: int32
  - name: clump-count-hubble_1
    dtype: int32
  - name: clump-count-hubble_2
    dtype: int32
  - name: clump-count-hubble_3
    dtype: int32
  - name: clump-count-hubble_4
    dtype: int32
  - name: clump-count-hubble_5-plus
    dtype: int32
  - name: clump-count-hubble_cant-tell
    dtype: int32
  - name: disk-edge-on-hubble_yes
    dtype: int32
  - name: disk-edge-on-hubble_no
    dtype: int32
  - name: bulge-shape-hubble_rounded
    dtype: int32
  - name: bulge-shape-hubble_boxy
    dtype: int32
  - name: bulge-shape-hubble_none
    dtype: int32
  - name: bar-hubble_yes
    dtype: int32
  - name: bar-hubble_no
    dtype: int32
  - name: has-spiral-arms-hubble_yes
    dtype: int32
  - name: has-spiral-arms-hubble_no
    dtype: int32
  - name: spiral-winding-hubble_tight
    dtype: int32
  - name: spiral-winding-hubble_medium
    dtype: int32
  - name: spiral-winding-hubble_loose
    dtype: int32
  - name: spiral-arm-count-hubble_1
    dtype: int32
  - name: spiral-arm-count-hubble_2
    dtype: int32
  - name: spiral-arm-count-hubble_3
    dtype: int32
  - name: spiral-arm-count-hubble_4
    dtype: int32
  - name: spiral-arm-count-hubble_5-plus
    dtype: int32
  - name: spiral-arm-count-hubble_cant-tell
    dtype: int32
  - name: bulge-size-hubble_none
    dtype: int32
  - name: bulge-size-hubble_just-noticeable
    dtype: int32
  - name: bulge-size-hubble_obvious
    dtype: int32
  - name: bulge-size-hubble_dominant
    dtype: int32
  - name: galaxy-symmetrical-hubble_yes
    dtype: int32
  - name: galaxy-symmetrical-hubble_no
    dtype: int32
  - name: clumps-embedded-larger-object-hubble_yes
    dtype: int32
  - name: clumps-embedded-larger-object-hubble_no
    dtype: int32
  splits:
  - name: train
    num_bytes: 25771950.0
    num_examples: 771
  - name: test
    num_bytes: 6348556.0
    num_examples: 192
  download_size: 32164525
  dataset_size: 32120506.0
---
# GZ Campaign Datasets
## Dataset Summary
[Galaxy Zoo](https://www.galaxyzoo.org) volunteers label telescope images of galaxies according to their visible features: spiral arms, galaxy-galaxy collisions, and so on.
These datasets share the galaxy images and volunteer labels in a machine-learning-friendly format. We use these datasets to train [our foundation models](https://arxiv.org/abs/2404.02973). We hope they'll help you too.
- **Curated by:** [Mike Walmsley](https://walmsley.dev/)
- **License:** [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). We specifically require **all models trained on these datasets to be released as source code by publication**.
## Downloading
Install the Datasets library with `pip install datasets`, then log in to your HuggingFace account with `huggingface-cli login`.
All unpublished* datasets are temporarily "gated", i.e. you must have requested and been approved for access. Galaxy Zoo team members should go to https://huggingface.co/datasets/mwalmsley/gz_hubble, click "request access", ping Mike, then wait for approval.
Gating will be removed on publication.
*Currently: the `gz_h2o` and `gz_ukidss` datasets
## Usage
```python
from datasets import load_dataset

# split='train' picks which split to load
dataset = load_dataset(
    'mwalmsley/gz_hubble',  # each dataset has a random fixed train/test split
    split='train'
    # some datasets also allow name=subset (e.g. name="tiny" for gz_evo). see the viewer for subset options
)
dataset.set_format('torch')  # your framework of choice e.g. numpy, tensorflow, jax, etc
print(dataset[0]['image'].shape)
```
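The metadata for this dataset lists a `tiny` config next to `default` (771 train and 192 test examples), which may be convenient for quick experiments. The following is a minimal sketch, assuming you are logged in and have access; the wrapper name `load_gz_hubble` and the `EXPECTED_NUM_EXAMPLES` table are hypothetical helpers, with the sizes copied from the card metadata.

```python
from typing import Any

# Split sizes copied from the dataset card metadata; handy as a sanity check.
EXPECTED_NUM_EXAMPLES = {
    "default": {"train": 77158, "test": 19291},
    "tiny": {"train": 771, "test": 192},
}


def load_gz_hubble(config: str = "tiny", split: str = "train") -> Any:
    """Load one split of one config (hypothetical convenience wrapper)."""
    if config not in EXPECTED_NUM_EXAMPLES:
        raise ValueError(f"unknown config {config!r}")
    if split not in EXPECTED_NUM_EXAMPLES[config]:
        raise ValueError(f"unknown split {split!r}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("mwalmsley/gz_hubble", name=config, split=split)
```

Calling `load_gz_hubble("tiny", "train")` downloads roughly 32 MB rather than the full 3 GB, and `len(...)` of the result can be compared against the table above.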
Then use the `dataset` object as with any other HuggingFace dataset, e.g.,
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=4, num_workers=1)
for batch in dataloader:
    print(batch.keys())
    # the image key, plus a key counting the volunteer votes for each answer
    # (e.g. smooth-or-featured-hubble_smooth)
    print(batch['image'].shape)
    break
```
You may find these HuggingFace docs useful:
- [PyTorch loading options](https://huggingface.co/docs/datasets/en/use_with_pytorch#data-loading).
- [Applying transforms/augmentations](https://huggingface.co/docs/datasets/en/image_process#apply-transforms).
- [Frameworks supported](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Dataset.set_format) by `set_format`.
## Dataset Structure
Each dataset is structured like:
```json
{
    'image': ...,  # image of a galaxy
    'smooth-or-featured-[campaign]_smooth': 4,
    'smooth-or-featured-[campaign]_featured-or-disk': 12,
    ...  # and so on for many questions and answers
}
```
Images are loaded according to your `set_format` choice above. For example, `set_format("torch")` gives a (3, 424, 424) CHW `torch.Tensor`.
The other keys are formatted like `[question]_[answer]`, where `question` is what the volunteers were asked (e.g. "smooth or featured?") and `answer` is the choice selected (e.g. "smooth"). **The values are the count of volunteers who selected each answer.**
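Since the values are raw vote counts, vote fractions have to be derived per question. The sketch below shows one way to do that for a single row, assuming the row is in the default Python format (counts are plain `int`, not tensors). The function name `vote_fractions` and the example counts are made up for illustration.

```python
from collections import defaultdict


def vote_fractions(row: dict) -> dict:
    """Turn `[question]_[answer]` vote counts into per-question fractions."""
    # Keep only vote-count columns; image, id_str, ra and dec are skipped
    # because they either lack an underscore or do not hold an int.
    counts = {k: v for k, v in row.items() if "_" in k and isinstance(v, int)}

    # Total votes per question (the part of the key before the underscore).
    totals = defaultdict(int)
    for key, count in counts.items():
        totals[key.rsplit("_", 1)[0]] += count

    fractions = {}
    for key, count in counts.items():
        total = totals[key.rsplit("_", 1)[0]]
        if total > 0:  # questions with no votes have no defined fraction
            fractions[key] = count / total
    return fractions


# Illustrative row with made-up counts (not taken from the dataset).
example_row = {
    "image": None,
    "id_str": "example_id",
    "ra": 150.1,
    "dec": 2.2,
    "smooth-or-featured-hubble_smooth": 4,
    "smooth-or-featured-hubble_features": 12,
    "smooth-or-featured-hubble_artifact": 0,
    "bar-hubble_yes": 0,
    "bar-hubble_no": 0,
}
print(vote_fractions(example_row))
```

Questions with zero total votes (here, the bar question) are left out of the result rather than given an arbitrary fraction.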
Each `question` ends with a suffix naming the Galaxy Zoo campaign in which it was asked, e.g. `smooth-or-featured-gz2`. For most datasets, all questions were asked during the same campaign. For GZ DESI, there are three campaigns (`dr12`, `dr5`, and `dr8`) with very similar questions.
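The key layout described here can be split back into its parts. This is a small sketch, assuming every label key has exactly one underscore and that the campaign tag contains no hyphen, which holds for the column names listed in this card's metadata. `LabelKey` and `parse_label_key` are hypothetical names.

```python
from typing import NamedTuple


class LabelKey(NamedTuple):
    question: str  # e.g. "spiral-arm-count"
    campaign: str  # e.g. "hubble"
    answer: str    # e.g. "5-plus"


def parse_label_key(key: str) -> LabelKey:
    """Split '[question]-[campaign]_[answer]' into its three parts.

    Raises ValueError if the key has no underscore or no hyphen.
    """
    question_and_campaign, answer = key.rsplit("_", 1)
    question, campaign = question_and_campaign.rsplit("-", 1)
    return LabelKey(question, campaign, answer)


print(parse_label_key("spiral-arm-count-hubble_5-plus"))
```

Splitting from the right keeps hyphenated question and answer names such as `spiral-arm-count` and `5-plus` intact.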
GZ Evo combines all the published datasets (currently GZ2, GZ DESI, GZ CANDELS, GZ Hubble, and GZ UKIDSS) into a single dataset aimed at multi-task learning. This is helpful for [building models that adapt to new tasks and new telescopes](https://arxiv.org/abs/2404.02973).
(we will shortly add keys for the astronomical identifiers i.e. the sky coordinates and telescope source unique ids)
## Key Limitations
Because the volunteers are answering a decision tree, the questions asked depend on the previous answers, and so each galaxy and each question can have very different total numbers of votes. This interferes with typical metrics that use aggregated labels (e.g. classification of the most voted, regression on the mean vote fraction, etc.) because we have different levels of confidence in the aggregated labels for each galaxy. We suggest a custom loss to handle this. Please see the Datasets and Benchmarks paper for more details (under review, sorry).
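To illustrate what a vote-count-aware loss might look like, here is a sketch of a multinomial negative log-likelihood for a single question. This is an assumption chosen for illustration, not necessarily the loss the authors suggest. The function name `multinomial_nll` and the example counts are hypothetical.

```python
import math
from typing import Sequence


def multinomial_nll(counts: Sequence[int], probs: Sequence[float],
                    eps: float = 1e-12) -> float:
    """Negative log-likelihood of observed vote counts for one question.

    counts: votes per answer, e.g. [smooth, features, artifact].
    probs:  predicted probability per answer (should sum to 1).
    """
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    total = sum(counts)
    # log of the multinomial coefficient N! / (k_1! ... k_n!)
    log_coeff = math.lgamma(total + 1) - sum(math.lgamma(k + 1) for k in counts)
    # sum_i k_i * log(p_i), with probabilities clipped away from zero
    log_lik = sum(k * math.log(max(p, eps)) for k, p in zip(counts, probs))
    return -(log_coeff + log_lik)


# A question nobody was asked (zero votes) contributes nothing to the loss.
print(multinomial_nll([0, 0], [0.5, 0.5]))
# Predictions matching the vote fractions score better than a uniform guess.
print(multinomial_nll([4, 12], [0.25, 0.75]) < multinomial_nll([4, 12], [0.5, 0.5]))
```

Because each term scales with the number of votes, galaxies with many votes on a question influence the loss more than galaxies with few, and unasked questions drop out without any special-casing.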
All labels are imperfect. The vote counts may not always reflect the true appearance of each galaxy. Additionally, the true appearance of each galaxy may be uncertain - even to expert astronomers.
We therefore caution against over-interpreting small changes in performance to indicate a method is "superior". **These datasets should not be used as a precise performance benchmark.**
## Citation Information
The machine-learning friendly versions of each dataset are described in a recently-submitted paper. Citation information will be added if accepted.
For each specific dataset you use, please also cite the original Galaxy Zoo data release paper (listed below) and the telescope description paper (cited therein).
### Galaxy Zoo 2
@article{10.1093/mnras/stt1458,
author = {Willett, Kyle W. and Lintott, Chris J. and Bamford, Steven P. and Masters, Karen L. and Simmons, Brooke D. and Casteels, Kevin R. V. and Edmondson, Edward M. and Fortson, Lucy F. and Kaviraj, Sugata and Keel, William C. and Melvin, Thomas and Nichol, Robert C. and Raddick, M. Jordan and Schawinski, Kevin and Simpson, Robert J. and Skibba, Ramin A. and Smith, Arfon M. and Thomas, Daniel},
title = "{Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey}",
journal = {Monthly Notices of the Royal Astronomical Society},
volume = {435},
number = {4},
pages = {2835-2860},
year = {2013},
month = {09},
issn = {0035-8711},
doi = {10.1093/mnras/stt1458},
}
### Galaxy Zoo Hubble
@article{2017MNRAS.464.4176W,
author = {Willett, Kyle W. and Galloway, Melanie A. and Bamford, Steven P. and Lintott, Chris J. and Masters, Karen L. and Scarlata, Claudia and Simmons, B.~D. and Beck, Melanie and {Cardamone}, Carolin N. and Cheung, Edmond and Edmondson, Edward M. and Fortson, Lucy F. and Griffith, Roger L. and H{\"a}u{\ss}ler, Boris and Han, Anna and Hart, Ross and Melvin, Thomas and Parrish, Michael and Schawinski, Kevin and Smethurst, R.~J. and {Smith}, Arfon M.},
title = "{Galaxy Zoo: morphological classifications for 120 000 galaxies in HST legacy imaging}",
journal = {Monthly Notices of the Royal Astronomical Society},
year = 2017,
month = feb,
volume = {464},
number = {4},
pages = {4176-4203},
doi = {10.1093/mnras/stw2568}
}
### Galaxy Zoo CANDELS
@article{10.1093/mnras/stw2587,
author = {Simmons, B. D. and Lintott, Chris and Willett, Kyle W. and Masters, Karen L. and Kartaltepe, Jeyhan S. and Häußler, Boris and Kaviraj, Sugata and Krawczyk, Coleman and Kruk, S. J. and McIntosh, Daniel H. and Smethurst, R. J. and Nichol, Robert C. and Scarlata, Claudia and Schawinski, Kevin and Conselice, Christopher J. and Almaini, Omar and Ferguson, Henry C. and Fortson, Lucy and Hartley, William and Kocevski, Dale and Koekemoer, Anton M. and Mortlock, Alice and Newman, Jeffrey A. and Bamford, Steven P. and Grogin, N. A. and Lucas, Ray A. and Hathi, Nimish P. and McGrath, Elizabeth and Peth, Michael and Pforr, Janine and Rizer, Zachary and Wuyts, Stijn and Barro, Guillermo and Bell, Eric F. and Castellano, Marco and Dahlen, Tomas and Dekel, Avishai and Ownsworth, Jamie and Faber, Sandra M. and Finkelstein, Steven L. and Fontana, Adriano and Galametz, Audrey and Grützbauch, Ruth and Koo, David and Lotz, Jennifer and Mobasher, Bahram and Mozena, Mark and Salvato, Mara and Wiklind, Tommy},
title = "{Galaxy Zoo: quantitative visual morphological classifications for 48 000 galaxies from CANDELS}",
journal = {Monthly Notices of the Royal Astronomical Society},
volume = {464},
number = {4},
pages = {4420-4447},
year = {2016},
month = {10},
doi = {10.1093/mnras/stw2587}
}
### Galaxy Zoo DESI
(two citations due to being released over two papers)
@article{10.1093/mnras/stab2093,
author = {Walmsley, Mike and Lintott, Chris and Géron, Tobias and Kruk, Sandor and Krawczyk, Coleman and Willett, Kyle W and Bamford, Steven and Kelvin, Lee S and Fortson, Lucy and Gal, Yarin and Keel, William and Masters, Karen L and Mehta, Vihang and Simmons, Brooke D and Smethurst, Rebecca and Smith, Lewis and Baeten, Elisabeth M and Macmillan, Christine},
title = "{Galaxy Zoo DECaLS: Detailed visual morphology measurements from volunteers and deep learning for 314 000 galaxies}",
journal = {Monthly Notices of the Royal Astronomical Society},
volume = {509},
number = {3},
pages = {3966-3988},
year = {2021},
month = {09},
issn = {0035-8711},
doi = {10.1093/mnras/stab2093}
}
@article{10.1093/mnras/stad2919,
author = {Walmsley, Mike and Géron, Tobias and Kruk, Sandor and Scaife, Anna M M and Lintott, Chris and Masters, Karen L and Dawson, James M and Dickinson, Hugh and Fortson, Lucy and Garland, Izzy L and Mantha, Kameswara and O’Ryan, David and Popp, Jürgen and Simmons, Brooke and Baeten, Elisabeth M and Macmillan, Christine},
title = "{Galaxy Zoo DESI: Detailed morphology measurements for 8.7M galaxies in the DESI Legacy Imaging Surveys}",
journal = {Monthly Notices of the Royal Astronomical Society},
volume = {526},
number = {3},
pages = {4768-4786},
year = {2023},
month = {09},
issn = {0035-8711},
doi = {10.1093/mnras/stad2919}
}
### Galaxy Zoo UKIDSS
Not yet published.
### Galaxy Zoo Cosmic Dawn (a.k.a. H2O)
Not yet published.
Provider:
mwalmsley
Summary of Original Information
Dataset overview
Dataset name: Galaxy Zoo Hubble
Dataset size: 10K<n<100K
Task categories:
- Image classification
- Image feature extraction
Dataset features:
- image: dtype image.
- id_str: dtype string.
- ra: dtype float64.
- dec: dtype float64.
- Multiple vote-count features, including
`smooth-or-featured-hubble_*`, `how-rounded-hubble_*`, `clumpy-appearance-hubble_*`, `clump-count-hubble_*`, `disk-edge-on-hubble_*`, `bulge-shape-hubble_*`, `bar-hubble_*`, `has-spiral-arms-hubble_*`, `spiral-winding-hubble_*`, `spiral-arm-count-hubble_*`, `bulge-size-hubble_*`, `galaxy-symmetrical-hubble_*`, and `clumps-embedded-larger-object-hubble_*`. All of these features have dtype int32.
Dataset splits:
- train: 77158 examples, 2556400698.778 bytes in total.
- test: 19291 examples, 637679057.568 bytes in total.
Download and dataset size:
- Download size: 3196577767 bytes
- Dataset size: 3194079756.3459997 bytes
License: cc-by-nc-sa-4.0
Annotation creators: crowdsourced
Label description: the value of each vote-count feature is the number of volunteers who selected that answer.
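Collected Summary
Dataset Introduction

Construction
In galaxy morphology research, obtaining labelled data at scale has long been a core challenge. The Galaxy Zoo Hubble dataset uses a citizen-science approach to combine galaxy images taken by the Hubble Space Telescope with collaborative labelling by volunteers. Construction began with the systematic collection of about 120,000 galaxy images from Hubble legacy surveys. Volunteers around the world then used the Galaxy Zoo online platform to vote on the morphological features of each galaxy, following a structured decision tree. Each image has detailed vote counts for more than ten morphological questions covering smoothness, spiral arm structure, bars, and so on, giving a machine-learning-friendly dataset that pairs images with fine-grained labels.
Characteristics
The dataset provides high-resolution galaxy images from the Hubble Space Telescope together with fine-grained morphological labels produced by large-scale volunteer voting. The labels are organized as a decision tree, ranging from overall smoothness to local structure such as the number of spiral arms or the shape of the bulge. Each label is given as a vote count, from which vote fractions can be derived, so the labels carry information about annotation uncertainty. The dataset contains tens of thousands of examples with a predefined train/test split and supports tasks such as image classification and feature extraction.
Usage
The dataset is hosted on the HuggingFace platform. Using the `datasets` library, call `load_dataset` with the dataset path `'mwalmsley/gz_hubble'` and choose the `'train'` or `'test'` split; the reduced `'tiny'` config is also available for quick prototyping. The loaded object contains the images and a series of vote-count keys named `[question]_[answer]`. Calling `set_format` converts the data to the format of common deep-learning frameworks such as PyTorch or TensorFlow, after which it can be fed to a standard data loader or image augmentation pipeline. Typical research uses include automated morphology classification, representation learning, and probabilistic modelling.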
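Background and Challenges
Background Overview
In galaxy astronomy, the high-resolution images obtained by the Hubble Space Telescope are valuable material for studying how galaxy morphology evolves. The Galaxy Zoo Hubble dataset was compiled and released by Mike Walmsley and colleagues in 2024. Its core aim is multi-dimensional morphological classification, through crowdsourced labelling, of more than 120,000 galaxies observed by Hubble. The dataset continues the tradition of the Galaxy Zoo citizen-science projects, which turn specialist astronomical questions into classification tasks the public can understand. It provides large-scale labelled data for quantitative morphology studies and serves as a resource for training foundation models in astronomy.
Current Challenges
The dataset targets the automated classification of galaxy morphology. The first challenge is the inherent uncertainty of the labelling process: because volunteers follow a decision tree, the total number of votes differs widely between galaxies, which undermines conventional metrics based on aggregated labels. The second is quality control of crowdsourced data, including noise from subjective differences between volunteers and ambiguity arising from galaxies whose morphology is itself unclear. Finally, the multi-level questions in the dataset (such as spiral arm count and galaxy symmetry) require models that can handle structured outputs and incomplete labels, which places particular demands on model design and on the choice of loss function.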
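Common Use Cases
Classic Use Case
In galaxy morphology research, the Galaxy Zoo Hubble dataset gives astronomers Hubble Space Telescope galaxy images together with crowdsourced labels. The classic use is to train and validate deep-learning models, particularly convolutional neural networks, to identify morphological features automatically, such as smooth elliptical galaxies, spiral arms, and bars. Using the vote data from many volunteers, researchers can build robust classifiers for fine-grained morphology, reducing the reliance on laborious visual classification by experts.
Practical Applications
In practice, the dataset is used in automated processing pipelines for astronomical observations. For example, in large surveys such as the Sloan Digital Sky Survey (SDSS) or the future Roman Space Telescope mission, models trained on this dataset could quickly screen and classify large numbers of galaxy images and flag unusual objects such as merging galaxies or active galactic nuclei. The dataset also supports the development of educational tools and citizen-science platforms, and can serve as a preprocessing aid for observatory data management.
Derived and Related Work
The dataset has led to a range of follow-up work, including deep-learning morphology classifiers based on convolutional neural networks (CNNs) or vision Transformer architectures. Related studies explore multi-task learning frameworks that combine Galaxy Zoo Hubble with other survey datasets (such as GZ2 and GZ CANDELS) to improve generalization across telescopes. Other work focuses on uncertainty modelling, using volunteer vote fractions to develop probabilistic classification methods for statistical inference of morphological parameters.
The content above was collected and summarized by 遇见数据集.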



