kshitijd/astro-multimodal-570k
Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/kshitijd/astro-multimodal-570k
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
- image-classification
- time-series-forecasting
tags:
- astronomy
- astrophysics
- multimodal
- spectra
- light-curves
- images
- cross-matched
- stars
- galaxies
- agn
- quasars
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.parquet"
---
# Astro Multimodal 570k: Pre-Cross-Matched Astronomical Dataset
The first publicly available, pre-cross-matched astronomical dataset that unifies three modality types -- spectra, light curves, and images -- into a single table. **~570k objects** across three populations (stars, galaxies, AGN) are joined from **12 major surveys** into ready-to-use rows: no cross-matching required.
### Why This Dataset?
Existing multimodal astronomical datasets either provide raw survey collections that users must cross-match themselves, or cover only 1-2 modalities for a single population:
| Dataset | Objects | Modalities | Surveys | Populations | Pre-joined? |
|---------|---------|-----------|---------|-------------|-------------|
| **This dataset** | **570k** | **Spectra + Light Curves + Images** | **12** | **Stars, Galaxies, AGN** | **Yes** |
| [Multimodal Universe](https://huggingface.co/MultimodalUniverse) | 100M+ (separate) | Spectra + LC + Images | 20+ | Mixed | No (raw collections) |
| [AstroCLIP](https://arxiv.org/abs/2310.03024) | 198k | Spectra + Images | 2 | Galaxies only | Yes |
| [AstroM3](https://huggingface.co/datasets/AstroMLCore/AstroM3Dataset) | 21k | Spectra + LC + Metadata | 6 | Variable stars only | Yes |
| [DESI/HSC](https://huggingface.co/datasets/Smith42/desi_hsc_crossmatched) | 19k | Spectra + Images | 2 | Galaxies only | Yes |
This dataset is ready for multimodal representation learning, transfer learning across wavelengths, population classification, and any task that benefits from having multiple views of the same astronomical object in a single row.
---
## Quick Start
### Installation
```bash
pip install datasets numpy pandas pyarrow
# Optional for visualization:
pip install matplotlib astropy
```
**System requirements:** Loading a single shard (~5000 rows) needs ~2 GB RAM. Loading the full dataset needs ~150-200 GB RAM. For large-scale work, stream or load shards individually (see below).
### Load and explore
```python
from datasets import load_dataset
import numpy as np

# Option 1: stream rows without downloading everything (iterate only, no indexing)
ds_stream = load_dataset("kshitijd/astro-multimodal-570k", streaming=True)

# Option 2: download the full dataset (~80 GB on disk); supports ds["train"][i]
ds = load_dataset("kshitijd/astro-multimodal-570k")
```
### Get a star with an infrared spectrum
```python
row = ds["train"][0]  # indexing needs the downloaded dataset; with streaming, iterate instead
if row["population"] == "star" and row["apogee_flux"] is not None:
    flux = np.array(row["apogee_flux"], dtype=np.float32)  # (7514,) normalized IR spectrum
    flux_err = np.array(row["apogee_flux_err"], dtype=np.float32)
```
### Get a galaxy with UV + IR images
```python
for row in ds["train"]:
    if row["population"] == "galaxy" and row["galex_fuv"] is not None and row["unwise_w1"] is not None:
        fuv = np.array(row["galex_fuv"], dtype=np.float32).reshape(64, 64)  # GALEX far-UV
        w1 = np.array(row["unwise_w1"], dtype=np.float32).reshape(64, 64)   # WISE 3.4 micron
        break
```
### Plot a spectrum
```python
import matplotlib.pyplot as plt
import numpy as np

row = ds["train"][0]
if row["apogee_flux"] is not None:
    flux = np.array(row["apogee_flux"], dtype=np.float32)
    # APOGEE wavelength grid: 3 detectors, 7514 good pixels
    # Approximate wavelength range: 1.51-1.70 microns
    plt.figure(figsize=(12, 3))
    plt.plot(flux, lw=0.5)
    plt.xlabel("Pixel")
    plt.ylabel("Normalized Flux")
    plt.title(f"APOGEE Spectrum: {row['object_id']}")
    plt.tight_layout()
    plt.savefig("spectrum.png", dpi=150)
```
### Plot a light curve
```python
if row["ztf_time"] is not None:
    time = np.array(row["ztf_time"])
    mag = np.array(row["ztf_mag"])
    magerr = np.array(row["ztf_magerr"])
    band = np.array(row["ztf_band"])

    plt.figure(figsize=(10, 4))
    # One series per filter
    for b in np.unique(band):
        mask = band == b
        plt.errorbar(time[mask], mag[mask], yerr=magerr[mask], fmt='.', label=f"Band {b}", ms=3)
    plt.gca().invert_yaxis()  # brighter (smaller magnitude) at the top
    plt.xlabel("HJD")
    plt.ylabel("Magnitude")
    plt.legend()
    plt.title(f"ZTF Light Curve: {row['object_id']}")
    plt.tight_layout()
    plt.savefig("lightcurve.png", dpi=150)
```
### Plot image cutouts across wavelengths
```python
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
bands = [("galex_fuv", "GALEX FUV"), ("legacy_g", "Legacy g"),
         ("twomass_j", "2MASS J"), ("unwise_w1", "WISE W1"), ("unwise_w2", "WISE W2")]
for ax, (col, label) in zip(axes, bands):
    if row[col] is not None:
        img = np.array(row[col], dtype=np.float32).reshape(64, 64)
        ax.imshow(img, origin="lower", cmap="gray")
        ax.set_title(label)
    else:
        ax.text(0.5, 0.5, "No data", ha="center", va="center", transform=ax.transAxes)
    ax.axis("off")
plt.suptitle(f"{row['object_id']} ({row['population']})")
plt.tight_layout()
plt.savefig("cutouts.png", dpi=150)
```
### Filter by population
```python
# All stars with spectra AND images
stars_multimodal = ds["train"].filter(
    lambda x: x["population"] == "star" and x["n_spectra"] > 0 and x["n_images"] > 0
)

# AGN with light curves
agn_variable = ds["train"].filter(
    lambda x: x["population"] == "agn" and x["n_lightcurves"] > 0
)
```
### Memory-efficient loading with pandas
```python
import glob

import pandas as pd

# Load just one shard (~5000 rows, ~2 GB RAM)
df = pd.read_parquet("00000.parquet")

# Load only specific columns (much less RAM)
df = pd.read_parquet(
    "00000.parquet",
    columns=["object_id", "population", "ra", "dec", "apogee_flux", "n_spectra", "n_images"],
)

# Iterate over all shards without loading everything
for f in sorted(glob.glob("*.parquet")):
    chunk = pd.read_parquet(f)
    stars = chunk[chunk["population"] == "star"]
    # process stars...
    del chunk  # free memory
```
### Build a PyTorch dataset
```python
import glob

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class AstroDataset(Dataset):
    """Memory-efficient dataset that loads one shard at a time."""

    # Maps a modality name to the count column that must be > 0
    COUNT_COLUMNS = {"spectra": "n_spectra", "lightcurves": "n_lightcurves", "images": "n_images"}

    def __init__(self, shard_dir, population=None, require_modalities=None):
        self.files = sorted(glob.glob(f"{shard_dir}/*.parquet"))
        # Build index: (shard_idx, row_idx) for each valid object
        self.index = []
        for si, f in enumerate(self.files):
            df = pd.read_parquet(f, columns=["population", "n_spectra", "n_lightcurves", "n_images"])
            keep = np.ones(len(df), dtype=bool)
            if population:
                keep &= (df["population"] == population).to_numpy()
            for modality in require_modalities or []:
                keep &= (df[self.COUNT_COLUMNS[modality]] > 0).to_numpy()
            self.index.extend((si, int(ri)) for ri in np.flatnonzero(keep))
            del df
        self._cache_si = -1
        self._cache_df = None

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        si, ri = self.index[idx]
        if si != self._cache_si:  # reload only when the shard changes
            self._cache_df = pd.read_parquet(self.files[si])
            self._cache_si = si
        row = self._cache_df.iloc[ri]
        sample = {"object_id": row["object_id"], "population": row["population"]}
        # Spectrum
        flux = row.get("apogee_flux")
        if isinstance(flux, (list, np.ndarray)):
            sample["spectrum"] = torch.tensor(np.array(flux, dtype=np.float32))
        # Image (example: 2MASS J-band); list() lets NumPy stack the rows
        # if the cutout arrives as an array of per-row arrays
        cutout = row.get("twomass_j")
        if isinstance(cutout, (list, np.ndarray)):
            img = np.array(list(cutout), dtype=np.float32).reshape(64, 64)
            sample["image"] = torch.tensor(img).unsqueeze(0)  # (1, 64, 64)
        return sample


# Usage:
# The default collate function expects every sample to carry the same keys,
# so objects missing `apogee_flux` or `twomass_j` need a custom collate_fn.
# shuffle=True jumps between shards, so the one-shard cache is reloaded often.
dataset = AstroDataset("./shards/", population="star", require_modalities=["spectra", "images"])
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
```
---
## Dataset Summary
| Population | Count | Spectra Coverage | Light Curve Coverage | Image Coverage |
|------------|-------|-----------------|---------------------|----------------|
| Stars | 300k | 99.7% (APOGEE) | 23.2% (TESS + ZTF) | 100% (2MASS + GALEX + unWISE) |
| Galaxies | 200k | 20.5% (SDSS + DESI) | -- | 96.3% (GALEX + unWISE) |
| AGN | 100k | 100% (SDSS) | 8.5% (ZTF) | 97.1% (GALEX + unWISE) |
### Multimodal Coverage
| Population | >= 2 modality types | All 3 modality types |
|------------|--------------------|--------------------|
| Stars | 99.8% | 23.1% |
| Galaxies | 19.2% | 0% (no light curves by design) |
| AGN | 97.3% | 8.4% |
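The two columns above can be recomputed from the per-row counts. The following is a minimal sketch, assuming a "modality type" is counted as present when the corresponding `n_spectra`, `n_lightcurves`, or `n_images` value is greater than zero. The helper name `modality_types` is hypothetical and the sample rows are made up for illustration.

```python
import numpy as np

def modality_types(n_spectra, n_lightcurves, n_images):
    """Number of distinct modality types (0 to 3) present for each object."""
    counts = np.stack([
        np.asarray(n_spectra),
        np.asarray(n_lightcurves),
        np.asarray(n_images),
    ])
    return (counts > 0).sum(axis=0)

# Toy rows (made-up values, not taken from the dataset)
n_spectra = [1, 2, 0, 1]
n_lightcurves = [0, 1, 0, 0]
n_images = [5, 3, 2, 0]

k = modality_types(n_spectra, n_lightcurves, n_images)
frac_two_plus = float((k >= 2).mean())   # share with >= 2 modality types
frac_all_three = float((k == 3).mean())  # share with all 3 modality types
print(k, frac_two_plus, frac_all_three)
```

Applied per population to the real count columns, the same two fractions should correspond to the table columns, provided the assumption about how the counts are defined holds.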
### Per-Source Coverage Detail
**Stars (300k)**
| Source | Column | Coverage |
|--------|--------|----------|
| APOGEE DR17 | `apogee_flux` | 299,143 (99.7%) |
| Gaia BP/RP | `flatiron_gaia_coeff` | 147,096 (49.0%) |
| GALAH DR4 | `galah_flux` | 22,123 (7.4%) |
| TESS | `flatiron_tess_flux` | 41,823 (13.9%) |
| ZTF DR24 | `ztf_time` | 28,049 (9.3%) |
| 2MASS | `twomass_j/h/k` | 299,984+ (100%) |
| GALEX | `galex_fuv/nuv` | 168k-224k (56-75%) |
| unWISE | `unwise_w1/w2` | 22,729 (7.6%) |
**Galaxies (200k)**
| Source | Column | Coverage |
|--------|--------|----------|
| SDSS | `sdss_flux` | 18,839 (9.4%) |
| DESI (spectra) | `flatiron_desi_spectrum_flux` | 24,697 (12.3%) |
| DESI (metadata) | `flatiron_desi_z` | 150,263 (75.1%) |
| GALEX | `galex_fuv/nuv` | 191k-193k (95-96%) |
| unWISE | `unwise_w1/w2` | 24,986 (25.0%) |
**AGN (100k)**
| Source | Column | Coverage |
|--------|--------|----------|
| SDSS | `sdss_flux` | 99,995 (100%) |
| ZTF DR24 | `ztf_time` | 8,538 (8.5%) |
| GALEX | `galex_fuv/nuv` | 92k-97k (92-97%) |
| unWISE | `unwise_w1/w2` | 24,986 (25.0%) |
---
## Data Sources
### Spectra
| Source | Instrument | Wavelength | Resolution | Population | Coverage |
|--------|-----------|------------|------------|------------|----------|
| [APOGEE DR17](https://www.sdss4.org/dr17/irspec/) | APOGEE (APO + LCO) | 1.51--1.70 um (IR) | R ~ 22,500 | Stars | 299k / 300k |
| [Gaia DR3 BP/RP](https://www.cosmos.esa.int/web/gaia/dr3) | Gaia BP/RP | 330--1050 nm | R ~ 50--100 | Stars | 147k / 300k |
| [GALAH DR4](https://www.galah-survey.org/) | HERMES (AAT) | 4713--7887 A | R ~ 28,000 | Stars | 22k / 300k |
| [SDSS DR17](https://www.sdss4.org/dr17/spectro/) | BOSS / eBOSS | 3600--10400 A | R ~ 2000 | Galaxies, AGN | 119k / 300k |
| [DESI EDR](https://data.desi.lbl.gov/) | DESI | 3600--9800 A | R ~ 2000--5000 | Galaxies, AGN | 25k / 300k |
### Light Curves
| Source | Instrument | Bandpass | Cadence | Population | Coverage |
|--------|-----------|----------|---------|------------|----------|
| [TESS](https://tess.mit.edu/) | TESS | 600--1000 nm | 2--30 min | Stars, AGN | 42k / 400k |
| [ZTF DR24](https://www.ztf.caltech.edu/) | ZTF (Palomar) | g, r, i | 1--3 day | Stars, AGN | 37k / 400k |
### Images
| Source | Instrument | Bands | Pixel Scale | Cutout Size | Population | Coverage |
|--------|-----------|-------|-------------|-------------|------------|----------|
| [2MASS](https://irsa.ipac.caltech.edu/Missions/2mass.html) | 2MASS | J, H, K | 1 arcsec/px | 64 x 64 | Stars | 300k / 300k |
| [GALEX](https://galex.stsci.edu/) | GALEX | FUV, NUV | 1.5 arcsec/px | 64 x 64 | All | 452k / 570k |
| [unWISE](https://unwise.me/) | WISE | W1, W2 | 2.75 arcsec/px | 64 x 64 | All | 48k / 570k |
| [Legacy Survey](https://www.legacysurvey.org/) | DECam / Mosaic / 90Prime | g, r, z | 0.262 arcsec/px | 64 x 64 | All | 4k / 570k |
---
## Schema
### Core columns (all objects)
| Column | Type | Description |
|--------|------|-------------|
| `object_id` | string | Unique identifier (APOGEE 2MASS ID for stars, PROVABGS ID for galaxies, SDSS DR14Q name for AGN) |
| `ra` | float64 | Right ascension (degrees, J2000) |
| `dec` | float64 | Declination (degrees, J2000) |
| `population` | string | `"star"`, `"galaxy"`, or `"agn"` |
| `n_spectra` | int | Count of spectral datasets with data for this object |
| `n_lightcurves` | int | Count of light curve datasets with data |
| `n_images` | int | Count of image bands with data |
### Spectra columns
| Column | Type | Shape | Description |
|--------|------|-------|-------------|
| `apogee_flux` | list[float32] | (7514,) | APOGEE normalized flux, cropped to good detector pixels |
| `apogee_flux_err` | list[float32] | (7514,) | APOGEE flux uncertainty |
| `galah_flux` | list[float32] | variable | GALAH combined 4-band flux |
| `galah_lambda` | list[float32] | variable | GALAH wavelength array (A) |
| `sdss_flux` | list[float32] | variable | SDSS/BOSS spectral flux (10^-17 erg/s/cm^2/A) |
| `sdss_loglam` | list[float32] | variable | SDSS log10(wavelength / A) |
| `sdss_ivar` | list[float32] | variable | SDSS inverse variance |
| `flatiron_gaia_coeff` | list[float32] | (110,) | Gaia BP/RP spectral coefficients |
| `flatiron_desi_spectrum_flux` | list[float32] | variable | DESI coadded spectral flux |
| `flatiron_desi_spectrum_lambda` | list[float32] | variable | DESI wavelength array (A) |
| `flatiron_desi_spectrum_ivar` | list[float32] | variable | DESI inverse variance |
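Because `sdss_loglam` holds log10 wavelengths and `sdss_ivar` holds inverse variances, a small conversion step is usually needed before plotting or fitting. This is a minimal sketch, assuming the usual SDSS convention that `ivar == 0` marks unusable pixels is preserved in this dataset. The function name `sdss_to_linear` is hypothetical and the input values are made up.

```python
import numpy as np

def sdss_to_linear(loglam, ivar):
    """Convert log10 wavelengths to Angstroms and inverse variance to 1-sigma errors."""
    loglam = np.asarray(loglam, dtype=np.float64)
    ivar = np.asarray(ivar, dtype=np.float64)
    wavelength = 10.0 ** loglam          # Angstroms
    good = ivar > 0                      # assumed: ivar == 0 means masked pixel
    sigma = np.full(ivar.shape, np.inf)  # masked pixels get infinite uncertainty
    sigma[good] = 1.0 / np.sqrt(ivar[good])
    return wavelength, sigma, good

# Made-up values for illustration
wavelength, sigma, good = sdss_to_linear([3.6, 3.7, 4.0], [4.0, 0.0, 0.25])
print(wavelength, sigma, good)
```

The same inverse-variance conversion would apply to `flatiron_desi_spectrum_ivar`, whose wavelength column is already in Angstroms.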
### Light curve columns
| Column | Type | Shape | Description |
|--------|------|-------|-------------|
| `ztf_time` | list[float64] | variable | ZTF observation times (HJD) |
| `ztf_mag` | list[float32] | variable | ZTF PSF magnitudes |
| `ztf_magerr` | list[float32] | variable | ZTF magnitude uncertainties |
| `ztf_band` | list | variable | ZTF filter code |
| `flatiron_tess_time` | list[float64] | variable | TESS observation times (BTJD) |
| `flatiron_tess_flux` | list[float32] | variable | TESS normalized flux |
| `flatiron_tess_flux_err` | list[float32] | variable | TESS flux uncertainty |
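The two light curve sources use different time systems and units: ZTF gives magnitudes against HJD, TESS gives normalized flux against BTJD. The sketch below shows two conversions that may help when comparing them. It relies on the TESS convention BTJD = BJD - 2457000; the function names are hypothetical and the inputs are made up. HJD and BJD are not the same time system, so the sketch does not attempt to merge the two time axes.

```python
import numpy as np

BTJD_OFFSET = 2457000.0  # TESS convention: BTJD = BJD - 2457000

def tess_btjd_to_bjd(btjd):
    """Convert TESS BTJD timestamps to full BJD."""
    return np.asarray(btjd, dtype=np.float64) + BTJD_OFFSET

def mag_to_relative_flux(mag, ref_mag=None):
    """Convert magnitudes to flux relative to ref_mag (median magnitude by default)."""
    mag = np.asarray(mag, dtype=np.float64)
    if ref_mag is None:
        ref_mag = np.nanmedian(mag)
    return 10.0 ** (-0.4 * (mag - ref_mag))

# Made-up values for illustration
bjd = tess_btjd_to_bjd([1325.5, 1326.0])
rel = mag_to_relative_flux([15.0, 17.5], ref_mag=15.0)
print(bjd, rel)
```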
### Image columns
All image columns are stored as nested lists representing 64 x 64 pixel cutouts. Reconstruct with `np.array(row["col"], dtype=np.float32).reshape(64, 64)`.
| Column | Type | Description |
|--------|------|-------------|
| `twomass_j`, `twomass_h`, `twomass_k` | list[list[float32]] | 2MASS J/H/K-band cutout |
| `galex_fuv`, `galex_nuv` | list[list[float32]] | GALEX far-UV / near-UV cutout |
| `unwise_w1`, `unwise_w2` | list[list[float32]] | unWISE W1 (3.4 um) / W2 (4.6 um) cutout |
| `legacy_g`, `legacy_r`, `legacy_z` | list[list[float32]] | Legacy Survey g/r/z-band cutout |
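Since cutouts can contain NaN pixels, a display or model-input step typically needs a NaN-safe rescaling. The following is one possible sketch, not the dataset author's preprocessing: it clips to the 1st and 99th percentiles of the finite pixels and sets non-finite pixels to zero. The helper name `cutout_to_image` and the percentile choice are assumptions.

```python
import numpy as np

def cutout_to_image(cutout, size=64):
    """Return a (size, size) float32 array scaled to [0, 1]; NaN pixels become 0."""
    img = np.asarray(cutout, dtype=np.float32).reshape(size, size)
    finite = np.isfinite(img)
    if not finite.any():
        return np.zeros((size, size), dtype=np.float32)
    lo, hi = np.percentile(img[finite], [1.0, 99.0])
    if hi <= lo:  # flat image: nothing to stretch
        return np.zeros((size, size), dtype=np.float32)
    scaled = np.clip((img - lo) / (hi - lo), 0.0, 1.0)
    scaled[~finite] = 0.0
    return scaled.astype(np.float32)

# Synthetic cutout with one NaN row, for illustration only
demo = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
demo[0, :] = np.nan
image = cutout_to_image(demo.tolist())
print(image.shape, image.min(), image.max())
```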
### Metadata columns
The dataset includes ~300 metadata columns from source surveys, prefixed by survey name. Key examples:
| Column | Description |
|--------|-------------|
| `apogee_teff` | APOGEE effective temperature (K) |
| `apogee_logg` | APOGEE surface gravity (log g) |
| `flatiron_gaia_phot_g_mean_mag` | Gaia G-band apparent magnitude |
| `flatiron_gaia_parallax` | Gaia parallax (mas) |
| `flatiron_gaia_teff_gspphot` | Gaia photometric effective temperature |
| `flatiron_desi_z` | DESI spectroscopic redshift |
| `flatiron_desi_spectype` | DESI spectral classification |
| `agn_redshift` | AGN redshift from SDSS DR14Q |
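With several hundred columns, grouping names by survey prefix is a quick way to find what is available. This is a small sketch using only the prefixes that appear in this card; the real files may contain prefixes not listed here, and the helper name `group_by_survey` is hypothetical.

```python
from collections import defaultdict

# Prefixes taken from the column names shown in this card
PREFIXES = [
    "flatiron_gaia_", "flatiron_desi_", "flatiron_tess_",
    "apogee_", "galah_", "sdss_", "ztf_",
    "twomass_", "galex_", "unwise_", "legacy_", "agn_",
]

def group_by_survey(columns):
    """Group column names by survey prefix; unprefixed names go under 'core'."""
    groups = defaultdict(list)
    for name in columns:
        prefix = next((p for p in PREFIXES if name.startswith(p)), "core")
        groups[prefix.rstrip("_")].append(name)
    return dict(groups)

columns = ["object_id", "ra", "apogee_teff", "apogee_logg",
           "flatiron_gaia_parallax", "flatiron_desi_z", "agn_redshift"]
groups = group_by_survey(columns)
for survey, names in groups.items():
    print(survey, names)
```

For a local shard, the column list could come from `pyarrow.parquet.read_schema("00000.parquet").names`, which reads only the file footer rather than the data.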
---
## File Format and System Requirements
### Format
Parquet shard files, up to 5000 rows each (~120 shards total). Populations are interleaved -- filter on `population` to select types. Total on-disk size: ~80 GB compressed.
### System Requirements
| Use Case | RAM | Disk | Notes |
|----------|-----|------|-------|
| Stream one shard at a time | 2 GB | 1 GB | Recommended for most workflows |
| Load one population | 50--80 GB | 80 GB | e.g., all 300k stars |
| Load full dataset | 150--200 GB | 80 GB | Only if you have the RAM |
| HuggingFace streaming | 1 GB | 0 | No local download needed |
### Recommended workflow
For most users, **iterate shard by shard** rather than loading everything:
```python
import glob

import pandas as pd

for shard_path in sorted(glob.glob("path/to/shards/*.parquet")):
    df = pd.read_parquet(shard_path)
    # Filter, process, extract features...
    del df
```
Or use HuggingFace streaming to avoid downloading at all:
```python
from datasets import load_dataset

ds = load_dataset("kshitijd/astro-multimodal-570k", streaming=True)
for row in ds["train"]:
    # process row...
    pass
```
---
## How It Was Built
### Pipeline Overview
Built with a custom Python pipeline running on [NCSA Delta AI](https://docs.ncsa.illinois.edu/systems/deltaai/) (32 CPUs, 408 GB RAM, NVMe storage). Total runtime: ~20 hours.
### Phase 1: Catalog Construction
Three populations were defined from independent parent catalogs:
1. **Stars** (300k): Queried [APOGEE DR17](https://www.sdss4.org/dr17/irspec/) via VizieR (catalog III/284, SNR > 50), cross-matched against Gaia DR3 via CDS XMatch (1 arcsec radius, 50k-row chunks). Selected the first 300k Gaia-matched stars.
2. **Galaxies** (200k): Downloaded the [PROVABGS](https://changhoonhahn.github.io/provabgs/) seed catalog from the Flatiron Institute (60 HDF5 cells). Selected the first 200k objects.
3. **AGN** (100k): Queried the [SDSS DR14 Quasar Catalog](https://www.sdss4.org/dr14/algorithms/qso_catalog/) via VizieR (catalog VII/286). Selected the first 100k objects.
### Phase 2: Flatiron Cross-Matching (Gaia, TESS, SDSS, DESI)
For each Flatiron-hosted dataset, the pipeline downloaded HEALPix-partitioned HDF5 files, performed positional cross-matching with `astropy.coordinates.SkyCoord` (3 arcsec radius), accumulated matched data, and flushed to Parquet shards. Each HDF5 file was deleted after processing.
### Phase 3: APOGEE Spectra
APOGEE aspcapStar FITS files (73 GB) were pre-staged via Globus. The pipeline read each star's local FITS file and cropped spectra to good detector regions: `np.r_[246:3274, 3585:6080, 6344:8335]` (7514 pixels).
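The crop can be checked directly: the three slices select 3028 + 2495 + 1991 = 7514 pixels. The sketch below applies the same index expression to a dummy array. The full-grid length of 8575 pixels is an assumption about the uncropped APOGEE spectra, and `crop_apogee` is a hypothetical helper name.

```python
import numpy as np

# Index expression quoted in the pipeline description
GOOD_PIXELS = np.r_[246:3274, 3585:6080, 6344:8335]

def crop_apogee(full_flux):
    """Keep only the good detector regions of a full-length APOGEE spectrum."""
    return np.asarray(full_flux, dtype=np.float32)[GOOD_PIXELS]

full = np.ones(8575, dtype=np.float32)  # assumed length of the uncropped pixel grid
cropped = crop_apogee(full)
print(GOOD_PIXELS.size, cropped.shape)
```

Keeping `GOOD_PIXELS` around is also useful in the other direction: it maps each of the 7514 stored pixels back to its position on the original grid.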
### Phase 4: SDSS Spectra
SDSS specLite files were downloaded from the SDSS Science Archive Server for galaxies and AGN with matching plate/MJD/fiber identifiers.
### Phase 5: ZTF Light Curves
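ZTF DR24 bulk Parquet files were downloaded from IRSA field by field. Each file was read with PyArrow, observations were grouped by object ID, the grouped objects were positionally cross-matched against the star and AGN catalogs, and the matching time series were extracted.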
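The grouping step can be pictured as turning a flat table of observations into one time-sorted series per object. The following is an illustrative sketch in plain Python, not the pipeline's PyArrow code. The tuple layout, the function name `group_light_curves`, the band codes, and the values are all made up.

```python
from collections import defaultdict

def group_light_curves(observations):
    """Group (object_id, time, mag, magerr, band) tuples into per-object series."""
    curves = defaultdict(
        lambda: {"ztf_time": [], "ztf_mag": [], "ztf_magerr": [], "ztf_band": []}
    )
    # Sort by object, then by time, so each series comes out time-ordered
    for oid, t, mag, err, band in sorted(observations, key=lambda o: (o[0], o[1])):
        curve = curves[oid]
        curve["ztf_time"].append(t)
        curve["ztf_mag"].append(mag)
        curve["ztf_magerr"].append(err)
        curve["ztf_band"].append(band)
    return dict(curves)

# Made-up observations for illustration
observations = [
    ("obj1", 2458300.7, 17.2, 0.03, "r"),
    ("obj2", 2458299.1, 18.9, 0.08, "g"),
    ("obj1", 2458298.6, 17.1, 0.02, "g"),
]
curves = group_light_curves(observations)
print(curves["obj1"])
```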
### Phase 6: Images (concurrent)
Four image sources ran concurrently:
- **2MASS**: CDS hips2fits (J, H, K bands)
- **GALEX**: CDS hips2fits (FUV, NUV bands)
- **Legacy Survey**: Cutout service (g, r, z bands)
- **unWISE**: Tile download + astropy Cutout2D (W1, W2 bands)
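The card does not say how the concurrency was implemented. As one possible reading, the sketch below runs four independent fetchers in a thread pool and merges their outputs into the image column names used in the schema. The fetcher here is a placeholder that performs no network access, and `fetch_placeholder` and `fetch_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Survey name -> bands, matching the image columns in the schema
SOURCES = {
    "twomass": ["j", "h", "k"],
    "galex": ["fuv", "nuv"],
    "legacy": ["g", "r", "z"],
    "unwise": ["w1", "w2"],
}

def fetch_placeholder(source, bands):
    """Hypothetical stand-in for a per-survey cutout fetcher (no network access)."""
    return {f"{source}_{band}": None for band in bands}

def fetch_all(sources):
    """Run one fetcher per source concurrently and merge the results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(fetch_placeholder, s, b) for s, b in sources.items()]
        for future in futures:
            merged.update(future.result())
    return merged

columns = fetch_all(SOURCES)
print(sorted(columns))
```

Threads suit this kind of work when each source is dominated by network waits; whether the original pipeline used threads, processes, or separate jobs is not stated.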
### Phase 7: GALAH Spectra
GALAH DR4 spectra (504 tar files, 123 GB) were downloaded from Data Central Australia. Each star's 4-band spectra (blue, green, red, IR) were concatenated.
### Phase 8: Finalize
All modality shards were merged per population via streaming left joins on `object_id`, then deduplicated, and the modality counts (`n_spectra`, `n_lightcurves`, `n_images`) were computed.
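A left join keeps every catalog object and leaves `null` where a modality has no data. The sketch below shows that pattern in memory with pandas; the actual pipeline is described as streaming, which this toy example does not reproduce. The frames and values are made up.

```python
import pandas as pd

# Toy catalog with one duplicated object, plus two modality tables (made-up values)
catalog = pd.DataFrame({
    "object_id": ["a", "b", "c", "c"],
    "population": ["star", "star", "agn", "agn"],
})
apogee = pd.DataFrame({"object_id": ["a"], "apogee_flux": [[1.0, 0.9]]})
sdss = pd.DataFrame({"object_id": ["c"], "sdss_flux": [[2.0, 2.1]]})

# Deduplicate, then left-join each modality so unmatched objects keep nulls
merged = catalog.drop_duplicates("object_id")
for modality in (apogee, sdss):
    merged = merged.merge(modality, on="object_id", how="left")

# Count how many spectral columns are populated per object
spectra_cols = ["apogee_flux", "sdss_flux"]
merged["n_spectra"] = merged[spectra_cols].notna().sum(axis=1)
print(merged[["object_id", "n_spectra"]])
```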
### Cross-Matching
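Positional cross-matches run inside the pipeline used `astropy.coordinates.SkyCoord.match_to_catalog_sky()` with a **3 arcsecond** match radius. The one exception described above is the initial APOGEE-to-Gaia match in Phase 1, which went through CDS XMatch with a 1 arcsecond radius. Columns from sources with no match are `null`.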
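To make the radius cut concrete, here is a NumPy-only stand-in for nearest-neighbour matching on the sky. It is not the pipeline's code: astropy's `match_to_catalog_sky()` returns the nearest index and separation using a KD-tree, whereas this sketch computes all pairwise haversine separations by brute force and is only suitable for small inputs. The function name and coordinates are made up.

```python
import numpy as np

def match_within_radius(ra1, dec1, ra2, dec2, radius_arcsec=3.0):
    """For each (ra1, dec1) find the nearest (ra2, dec2); inputs in degrees."""
    ra1, dec1, ra2, dec2 = (
        np.radians(np.asarray(a, dtype=np.float64)) for a in (ra1, dec1, ra2, dec2)
    )
    # Haversine separation for every pair, shape (n1, n2)
    dra = ra1[:, None] - ra2[None, :]
    ddec = dec1[:, None] - dec2[None, :]
    h = (np.sin(ddec / 2) ** 2
         + np.cos(dec1)[:, None] * np.cos(dec2)[None, :] * np.sin(dra / 2) ** 2)
    sep = 2 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))
    idx = sep.argmin(axis=1)
    sep_arcsec = np.degrees(sep[np.arange(len(idx)), idx]) * 3600.0
    return idx, sep_arcsec, sep_arcsec <= radius_arcsec

# Made-up coordinates: the first source has a counterpart 1 arcsec away
idx, sep_arcsec, matched = match_within_radius(
    [10.0, 50.0], [20.0, -5.0],
    [10.0, 120.0], [20.0 + 1.0 / 3600.0, 0.0],
)
print(idx, sep_arcsec, matched)
```

Objects where `matched` is false would keep `null` in the corresponding columns.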
---
## Data Quality Notes
- **APOGEE spectra** are continuum-normalized. Flux values are typically 0.5--1.2.
- **Gaia BP/RP** is stored as spectral coefficients (`flatiron_gaia_coeff`, 110 values), not sampled spectra. See the [Gaia documentation](https://gea.esac.esa.int/archive/documentation/GDR3/) for reconstruction into a full spectrum.
- **TESS and ZTF light curves** have variable lengths depending on sky coverage overlap with the APOGEE/AGN footprints.
- **Image cutouts** are centered on the catalog position. Some may contain NaN pixels at edges.
- Columns with no data for a given object are `null` / `None`.
## Known Limitations
1. **Galaxy spectral coverage is 20.5%**. Most PROVABGS galaxies lack spectroscopic observations. DESI metadata is present for 75% but actual spectra for only 12%.
2. **Light curve coverage is partial** (TESS ~14%, ZTF ~9% of stars), driven by sky overlap with the APOGEE footprint.
3. **unWISE coverage is 7.6%** for stars and **Legacy Survey is < 1%** due to limited footprint overlap.
4. **Gaia BP/RP is ~49%** for stars, reflecting overlap between APOGEE targets and Flatiron-hosted HEALPix cells.
5. **No light curves for galaxies** by design -- TESS and ZTF are routed only to stars and AGN.
6. **~340 columns total**, most being Gaia and DESI metadata. Core science columns are listed in the Schema section.
---
## Citation
If you use this dataset, please cite the underlying surveys:
```bibtex
@article{abdurrouf2022,
title={The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar, and APOGEE-2 Data},
author={Abdurro'uf and others},
journal={ApJS},
volume={259},
pages={35},
year={2022}
}
@article{gaia2023,
title={Gaia Data Release 3: Summary of the content and survey properties},
author={{Gaia Collaboration}},
journal={A\&A},
volume={674},
pages={A1},
year={2023}
}
@article{desi2024,
title={The Early Data Release of the Dark Energy Spectroscopic Instrument},
author={{DESI Collaboration}},
journal={AJ},
year={2024}
}
@article{bellm2019,
title={The Zwicky Transient Facility: System Overview, Performance, and First Results},
author={Bellm, Eric C. and others},
journal={PASP},
volume={131},
pages={018002},
year={2019}
}
@article{buder2024,
title={The GALAH Survey: Data Release 4},
author={Buder, Sven and others},
journal={arXiv preprint arXiv:2409.19858},
year={2024}
}
@article{ricker2015,
title={Transiting Exoplanet Survey Satellite (TESS)},
author={Ricker, George R. and others},
journal={Journal of Astronomical Telescopes, Instruments, and Systems},
volume={1},
pages={014003},
year={2015}
}
```
## License
Released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The underlying survey data is subject to each survey's individual data use policies.
Provider:
kshitijd



