Galaxy Zoo DECaLS: Trained Representations
Indexed by NIAID Data Ecosystem on 2026-05-02.
Download link:
https://zenodo.org/record/5536995
Description:
These representations predate Zoobot 2.0; you may find better performance with the more recent models. See the Zoobot GitHub repository and HuggingFace.
Image representations are lower-dimensional summaries convenient for machine learning searches, predictions, clustering, etc.
This archive includes representations of galaxy images for subsets of DECaLS DR5 and SDSS. It also includes some further data useful for reproducing a series of practical experiments using those representations (see W+22, bottom of this page).
Representations
The representations are calculated with a CNN trained to predict volunteer answers to Galaxy Zoo DECaLS questions. The CNN was trained using the "Zoobot" code, introduced in W+21 (bottom of this page). The weights of this CNN are available via the Zoobot GitHub repository, currently under the checkpoint folder data/pretrained_models/decals_dr_trained_on_all_labelled_m0. See W+21 for details.
The most significant file is "cnn_features_decals.parquet". This file contains the representations calculated for the approx. 340k GZ DECaLS galaxies. See W+21 for a description of GZD-5. Galaxies can be crossmatched to other catalogues (e.g. the GZ DECaLS catalogue) by iauname.
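The crossmatch by iauname can be sketched as an ordinary pandas join. The example below uses small synthetic tables in place of the real files. The iauname values, the feature column names and the "redshift" column are made up for illustration and are not taken from the archive.

```python
import pandas as pd

# Stand-in for cnn_features_decals.parquet: one row per galaxy, keyed by iauname.
# The iauname values and feature column names are hypothetical.
features = pd.DataFrame({
    "iauname": ["J000000.00+000000.0", "J000001.00+000001.0", "J000002.00+000002.0"],
    "feat_0": [0.1, 0.5, 0.9],
    "feat_1": [1.2, 0.3, 0.7],
})

# Stand-in for another catalogue (e.g. GZ DECaLS); "redshift" is a hypothetical column.
catalogue = pd.DataFrame({
    "iauname": ["J000001.00+000001.0", "J000002.00+000002.0", "J000003.00+000003.0"],
    "redshift": [0.05, 0.08, 0.11],
})

# Inner join keeps only galaxies present in both tables.
# validate="one_to_one" raises if iauname is duplicated on either side.
matched = features.merge(catalogue, on="iauname", how="inner", validate="one_to_one")
print(matched)
```

An inner join silently drops galaxies that appear in only one table, so it may be worth comparing len(matched) against len(features) before going further.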
"cnn_features_gz2.parquet" contains the representations calculated by the *same* model, i.e. without retraining on labelled SDSS GZ2 images, for the approx. 240k images classified in Galaxy Zoo 2 (Willett et al. 2013). These are still fairly good (see W+22), implying the CNN can sometimes generalise well to slightly different surveys. However, they could likely be improved by using a model trained on GZ2 directly. The Zoobot code makes this straightforward. The galaxies can be cross-matched to the Galaxy Zoo 2 catalogues on the "id_str" column, which is equal to the GZ2 objid (e.g. "588018090547020096").
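One thing to watch when matching on "id_str" is the key dtype. The sketch below assumes that "id_str" is stored as a string, as the quoted example suggests, and that the GZ2 catalogue you are matching against has loaded objid as a 64-bit integer. That second point is an assumption about your copy of the catalogue, not something stated by this archive. The tables are synthetic and the "feat_0" and "label" columns are hypothetical.

```python
import pandas as pd

# Stand-in for cnn_features_gz2.parquet, with id_str held as a string.
features = pd.DataFrame({
    "id_str": ["588018090547020096", "588018090547020097"],
    "feat_0": [0.2, 0.4],  # hypothetical feature column
})

# Stand-in for a GZ2 catalogue in which objid was read as int64 (an assumption).
gz2 = pd.DataFrame({
    "objid": pd.Series([588018090547020096, 588018090547020097], dtype="int64"),
    "label": ["spiral", "smooth"],  # hypothetical column
})

# Convert the integer id to a string so both join keys share a dtype.
# Do not pass through float: float64 cannot represent 18-digit ids exactly.
gz2["id_str"] = gz2["objid"].astype(str)

matched = features.merge(gz2, on="id_str", how="inner", validate="one_to_one")
print(matched)
```

If objid was read from a CSV as a float, the ids may already be corrupted; re-reading that column as a string or integer is the safer route.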
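Confused about .parquet? Think of it as a CSV that's very fast to load. Load the files like so:
import pandas as pd
# Path to a file downloaded from this archive; needs pyarrow or fastparquet installed.
parquet_loc = "cnn_features_decals.parquet"
df = pd.read_parquet(parquet_loc)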
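Once a table is loaded, the representations can be used for the kind of search mentioned at the top of this page. The following is a minimal sketch of a cosine-similarity search with NumPy, run on a synthetic table so that it is self-contained. The "feat_" column prefix, the table size and the galaxy names are assumptions for illustration (check df.columns on the real file), and cosine similarity is just one reasonable choice of metric, not necessarily the method used in W+22.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a loaded representation table.
# The "feat_" prefix is an assumption; inspect df.columns on the real file.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 8)), columns=[f"feat_{i}" for i in range(8)])
df.insert(0, "iauname", [f"galaxy_{i}" for i in range(100)])

feature_cols = [c for c in df.columns if c.startswith("feat_")]
X = df[feature_cols].to_numpy(dtype=float)

# L2-normalise each row so that a dot product equals cosine similarity.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

query_index = 0  # row of the galaxy to search around
similarity = X_unit @ X_unit[query_index]

# Sort by descending similarity and drop the query galaxy itself.
order = np.argsort(-similarity)
order = order[order != query_index]
top = order[:5]

neighbours = df.iloc[top][["iauname"]].assign(similarity=similarity[top])
print(neighbours)
```

For the full table of several hundred thousand galaxies, a dedicated nearest-neighbour index would likely be faster than this brute-force approach.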
You might like to check zoobot.readthedocs.io for guidance on the CNN weights and a pair of ring galaxy catalogues.
References
Please cite one or both of these papers if you use this dataset. The labels and trained model come from W+21, while the representations were created in W+22.
W+21: https://arxiv.org/abs/2102.08414, Galaxy Zoo DECaLS: Detailed Visual Morphology Measurements from Volunteers and Deep Learning for 314,000 Galaxies
W+22: https://arxiv.org/abs/2110.12735, Practical Morphology Tools from Deep Supervised Representation Learning
Created: 2025-03-10



