
kshitijd/astro-multimodal-570k

Hugging Face · updated 2026-03-21 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kshitijd/astro-multimodal-570k
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
- image-classification
- time-series-forecasting
tags:
- astronomy
- astrophysics
- multimodal
- spectra
- light-curves
- images
- cross-matched
- stars
- galaxies
- agn
- quasars
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.parquet"
---

# Astro Multimodal 570k: Pre-Cross-Matched Astronomical Dataset

The first publicly available, pre-cross-matched astronomical dataset that unifies three modality types -- spectra, light curves, and images -- into a single table. **~570k objects** across three populations (stars, galaxies, AGN) are joined from **12 major surveys** into ready-to-use rows: no cross-matching required.

### Why This Dataset?

Existing multimodal astronomical datasets either provide raw survey collections that users must cross-match themselves, or cover only 1-2 modalities for a single population:

| Dataset | Objects | Modalities | Surveys | Populations | Pre-joined? |
|---------|---------|-----------|---------|-------------|-------------|
| **This dataset** | **570k** | **Spectra + Light Curves + Images** | **12** | **Stars, Galaxies, AGN** | **Yes** |
| [Multimodal Universe](https://huggingface.co/MultimodalUniverse) | 100M+ (separate) | Spectra + LC + Images | 20+ | Mixed | No (raw collections) |
| [AstroCLIP](https://arxiv.org/abs/2310.03024) | 198k | Spectra + Images | 2 | Galaxies only | Yes |
| [AstroM3](https://huggingface.co/datasets/AstroMLCore/AstroM3Dataset) | 21k | Spectra + LC + Metadata | 6 | Variable stars only | Yes |
| [DESI/HSC](https://huggingface.co/datasets/Smith42/desi_hsc_crossmatched) | 19k | Spectra + Images | 2 | Galaxies only | Yes |

This dataset is ready for multimodal representation learning, transfer learning across wavelengths, population classification, and any task that benefits from having multiple views of the same astronomical object in a single row.

---

## Quick Start

### Installation

```bash
pip install datasets numpy pandas pyarrow
# Optional for visualization:
pip install matplotlib astropy
```

**System requirements:** Loading a single shard (~5000 rows) needs ~2 GB RAM. Loading the full dataset needs ~150-200 GB RAM. For large-scale work, stream or load shards individually (see below).

### Load and explore

```python
from datasets import load_dataset
import numpy as np

# Stream without downloading everything
ds = load_dataset("kshitijd/astro-multimodal-570k", streaming=True)

# Or download fully (this replaces the streaming handle above)
ds = load_dataset("kshitijd/astro-multimodal-570k")
```

### Get a star with an infrared spectrum

```python
# Integer indexing needs the full download; a streaming dataset is
# iterable only, so use next(iter(ds["train"])) there instead.
row = ds["train"][0]
if row["population"] == "star" and row["apogee_flux"] is not None:
    flux = np.array(row["apogee_flux"], dtype=np.float32)  # (7514,) normalized IR spectrum
    flux_err = np.array(row["apogee_flux_err"], dtype=np.float32)
```

### Get a galaxy with UV + IR images

```python
for row in ds["train"]:
    if row["population"] == "galaxy" and row["galex_fuv"] is not None and row["unwise_w1"] is not None:
        fuv = np.array(row["galex_fuv"], dtype=np.float32).reshape(64, 64)  # GALEX far-UV
        w1 = np.array(row["unwise_w1"], dtype=np.float32).reshape(64, 64)   # WISE 3.4 micron
        break
```

### Plot a spectrum

```python
import matplotlib.pyplot as plt
import numpy as np

row = ds["train"][0]
if row["apogee_flux"] is not None:
    flux = np.array(row["apogee_flux"], dtype=np.float32)
    # APOGEE wavelength grid: 3 detectors, 7514 good pixels
    # Approximate wavelength range: 1.51-1.70 microns
    plt.figure(figsize=(12, 3))
    plt.plot(flux, lw=0.5)
    plt.xlabel("Pixel")
    plt.ylabel("Normalized Flux")
    plt.title(f"APOGEE Spectrum: {row['object_id']}")
    plt.tight_layout()
    plt.savefig("spectrum.png", dpi=150)
```

### Plot a light curve

```python
if row["ztf_time"] is not None:
    time = np.array(row["ztf_time"])
    mag = np.array(row["ztf_mag"])
    magerr = np.array(row["ztf_magerr"])
    band = np.array(row["ztf_band"])

    plt.figure(figsize=(10, 4))
    for b in np.unique(band):
        mask = band == b
        plt.errorbar(time[mask], mag[mask], yerr=magerr[mask],
                     fmt='.', label=f"Band {b}", ms=3)
    plt.gca().invert_yaxis()  # brighter objects have smaller magnitudes
    plt.xlabel("HJD")
    plt.ylabel("Magnitude")
    plt.legend()
    plt.title(f"ZTF Light Curve: {row['object_id']}")
    plt.tight_layout()
    plt.savefig("lightcurve.png", dpi=150)
```

### Plot image cutouts across wavelengths

```python
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
bands = [("galex_fuv", "GALEX FUV"), ("legacy_g", "Legacy g"),
         ("twomass_j", "2MASS J"), ("unwise_w1", "WISE W1"),
         ("unwise_w2", "WISE W2")]
for ax, (col, label) in zip(axes, bands):
    if row[col] is not None:
        img = np.array(row[col], dtype=np.float32).reshape(64, 64)
        ax.imshow(img, origin="lower", cmap="gray")
        ax.set_title(label)
    else:
        ax.text(0.5, 0.5, "No data", ha="center", va="center",
                transform=ax.transAxes)
    ax.axis("off")
plt.suptitle(f"{row['object_id']} ({row['population']})")
plt.tight_layout()
plt.savefig("cutouts.png", dpi=150)
```

### Filter by population

```python
# All stars with spectra AND images
stars_multimodal = ds["train"].filter(
    lambda x: x["population"] == "star" and x["n_spectra"] > 0 and x["n_images"] > 0
)

# AGN with light curves
agn_variable = ds["train"].filter(
    lambda x: x["population"] == "agn" and x["n_lightcurves"] > 0
)
```

### Memory-efficient loading with pandas

```python
import glob
import pandas as pd

# Load just one shard (~5000 rows, ~2 GB RAM)
df = pd.read_parquet("00000.parquet")

# Load only specific columns (much less RAM)
df = pd.read_parquet("00000.parquet",
                     columns=["object_id", "population", "ra", "dec",
                              "apogee_flux", "n_spectra", "n_images"])

# Iterate over all shards without loading everything
for f in sorted(glob.glob("*.parquet")):
    chunk = pd.read_parquet(f)
    stars = chunk[chunk["population"] == "star"]
    # process stars...
    del chunk  # free memory
```

### Build a PyTorch dataset

```python
import glob

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class AstroDataset(Dataset):
    """Memory-efficient dataset that loads one shard at a time."""

    # Modality name -> column holding the per-object count for that modality
    COUNT_COLUMNS = {"spectra": "n_spectra",
                     "lightcurves": "n_lightcurves",
                     "images": "n_images"}

    def __init__(self, shard_dir, population=None, require_modalities=None):
        self.files = sorted(glob.glob(f"{shard_dir}/*.parquet"))
        # Build index: (shard_idx, row_idx) for each valid object.
        # Only the small filter columns are read here, not the arrays.
        self.index = []
        for si, f in enumerate(self.files):
            df = pd.read_parquet(f, columns=["population", *self.COUNT_COLUMNS.values()])
            keep = np.ones(len(df), dtype=bool)
            if population:
                keep &= (df["population"] == population).to_numpy()
            for modality in require_modalities or []:
                keep &= (df[self.COUNT_COLUMNS[modality]] > 0).to_numpy()
            self.index.extend((si, int(ri)) for ri in np.flatnonzero(keep))
        self._cache_si = -1
        self._cache_df = None

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        si, ri = self.index[idx]
        # Re-read the shard only when the requested row lives in a different one
        if si != self._cache_si:
            self._cache_df = pd.read_parquet(self.files[si])
            self._cache_si = si
        row = self._cache_df.iloc[ri]

        sample = {"object_id": row["object_id"], "population": row["population"]}

        # Spectrum
        if row.get("apogee_flux") is not None and isinstance(row["apogee_flux"], (list, np.ndarray)):
            sample["spectrum"] = torch.tensor(np.array(row["apogee_flux"], dtype=np.float32))

        # Image (example: 2MASS J-band)
        if row.get("twomass_j") is not None and isinstance(row["twomass_j"], (list, np.ndarray)):
            # pandas returns nested lists as an object array of row arrays,
            # so stack the rows before casting to float32
            img = np.stack(row["twomass_j"]).astype(np.float32).reshape(64, 64)
            sample["image"] = torch.tensor(img).unsqueeze(0)  # (1, 64, 64)

        return sample


# Usage:
dataset = AstroDataset("./shards/", population="star", require_modalities=["spectra", "images"])
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
```

---

## Dataset Summary

| Population | Count | Spectra Coverage | Light Curve Coverage | Image Coverage |
|------------|-------|-----------------|---------------------|----------------|
| Stars | 300k | 99.7% (APOGEE) | 23.2% (TESS + ZTF) | 100% (2MASS + GALEX + unWISE) |
| Galaxies | 200k | 20.5% (SDSS + DESI) | -- | 96.3% (GALEX + unWISE) |
| AGN | 100k | 100% (SDSS) | 8.5% (ZTF) | 97.1% (GALEX + unWISE) |

### Multimodal Coverage

| Population | >= 2 modality types | All 3 modality types |
|------------|--------------------|--------------------|
| Stars | 99.8% | 23.1% |
| Galaxies | 19.2% | 0% (no light curves by design) |
| AGN | 97.3% | 8.4% |

### Per-Source Coverage Detail

**Stars (300k)**

| Source | Column | Coverage |
|--------|--------|----------|
| APOGEE DR17 | `apogee_flux` | 299,143 (99.7%) |
| Gaia BP/RP | `flatiron_gaia_coeff` | 147,096 (49.0%) |
| GALAH DR4 | `galah_flux` | 22,123 (7.4%) |
| TESS | `flatiron_tess_flux` | 41,823 (13.9%) |
| ZTF DR24 | `ztf_time` | 28,049 (9.3%) |
| 2MASS | `twomass_j/h/k` | 299,984+ (100%) |
| GALEX | `galex_fuv/nuv` | 168k-224k (56-75%) |
| unWISE | `unwise_w1/w2` | 22,729 (7.6%) |

**Galaxies (200k)**

| Source | Column | Coverage |
|--------|--------|----------|
| SDSS | `sdss_flux` | 18,839 (9.4%) |
| DESI (spectra) | `flatiron_desi_spectrum_flux` | 24,697 (12.3%) |
| DESI (metadata) | `flatiron_desi_z` | 150,263 (75.1%) |
| GALEX | `galex_fuv/nuv` | 191k-193k (95-96%) |
| unWISE | `unwise_w1/w2` | 24,986 (25.0%) |

**AGN (100k)**

| Source | Column | Coverage |
|--------|--------|----------|
| SDSS | `sdss_flux` | 99,995 (100%) |
| ZTF DR24 | `ztf_time` | 8,538 (8.5%) |
| GALEX | `galex_fuv/nuv` | 92k-97k (92-97%) |
| unWISE | `unwise_w1/w2` | 24,986 (25.0%) |

---

## Data Sources

### Spectra

| Source | Instrument | Wavelength | Resolution | Population | Coverage |
|--------|-----------|------------|------------|------------|----------|
| [APOGEE DR17](https://www.sdss4.org/dr17/irspec/) | APOGEE (APO + LCO) | 1.51--1.70 µm (IR) | R ~ 22,500 | Stars | 299k / 300k |
| [Gaia DR3 BP/RP](https://www.cosmos.esa.int/web/gaia/dr3) | Gaia BP/RP | 330--1050 nm | R ~ 50--100 | Stars | 147k / 300k |
| [GALAH DR4](https://www.galah-survey.org/) | HERMES (AAT) | 4713--7887 Å | R ~ 28,000 | Stars | 22k / 300k |
| [SDSS DR17](https://www.sdss4.org/dr17/spectro/) | BOSS / eBOSS | 3600--10400 Å | R ~ 2000 | Galaxies, AGN | 119k / 300k |
| [DESI EDR](https://data.desi.lbl.gov/) | DESI | 3600--9800 Å | R ~ 2000--5000 | Galaxies, AGN | 25k / 300k |

### Light Curves

| Source | Instrument | Bandpass | Cadence | Population | Coverage |
|--------|-----------|----------|---------|------------|----------|
| [TESS](https://tess.mit.edu/) | TESS | 600--1000 nm | 2--30 min | Stars, AGN | 42k / 400k |
| [ZTF DR24](https://www.ztf.caltech.edu/) | ZTF (Palomar) | g, r, i | 1--3 day | Stars, AGN | 37k / 400k |

### Images

| Source | Instrument | Bands | Pixel Scale | Cutout Size | Population | Coverage |
|--------|-----------|-------|-------------|-------------|------------|----------|
| [2MASS](https://irsa.ipac.caltech.edu/Missions/2mass.html) | 2MASS | J, H, K | 1 arcsec/px | 64 x 64 | Stars | 300k / 300k |
| [GALEX](https://galex.stsci.edu/) | GALEX | FUV, NUV | 1.5 arcsec/px | 64 x 64 | All | 452k / 570k |
| [unWISE](https://unwise.me/) | WISE | W1, W2 | 2.75 arcsec/px | 64 x 64 | All | 48k / 570k |
| [Legacy Survey](https://www.legacysurvey.org/) | DECam / Mosaic / 90Prime | g, r, z | 0.262 arcsec/px | 64 x 64 | All | 4k / 570k |

---

## Schema

### Core columns (all objects)

| Column | Type | Description |
|--------|------|-------------|
| `object_id` | string | Unique identifier (APOGEE 2MASS ID for stars, PROVABGS ID for galaxies, SDSS DR14Q name for AGN) |
| `ra` | float64 | Right ascension (degrees, J2000) |
| `dec` | float64 | Declination (degrees, J2000) |
| `population` | string | `"star"`, `"galaxy"`, or `"agn"` |
| `n_spectra` | int | Count of spectral datasets with data for this object |
| `n_lightcurves` | int | Count of light curve datasets with data |
| `n_images` | int | Count of image bands with data |

### Spectra columns

| Column | Type | Shape | Description |
|--------|------|-------|-------------|
| `apogee_flux` | list[float32] | (7514,) | APOGEE normalized flux, cropped to good detector pixels |
| `apogee_flux_err` | list[float32] | (7514,) | APOGEE flux uncertainty |
| `galah_flux` | list[float32] | variable | GALAH combined 4-band flux |
| `galah_lambda` | list[float32] | variable | GALAH wavelength array (Å) |
| `sdss_flux` | list[float32] | variable | SDSS/BOSS spectral flux (10^-17 erg/s/cm^2/Å) |
| `sdss_loglam` | list[float32] | variable | SDSS log10(wavelength / Å) |
| `sdss_ivar` | list[float32] | variable | SDSS inverse variance |
| `flatiron_gaia_coeff` | list[float32] | (110,) | Gaia BP/RP spectral coefficients |
| `flatiron_desi_spectrum_flux` | list[float32] | variable | DESI coadded spectral flux |
| `flatiron_desi_spectrum_lambda` | list[float32] | variable | DESI wavelength array (Å) |
| `flatiron_desi_spectrum_ivar` | list[float32] | variable | DESI inverse variance |

### Light curve columns

| Column | Type | Shape | Description |
|--------|------|-------|-------------|
| `ztf_time` | list[float64] | variable | ZTF observation times (HJD) |
| `ztf_mag` | list[float32] | variable | ZTF PSF magnitudes |
| `ztf_magerr` | list[float32] | variable | ZTF magnitude uncertainties |
| `ztf_band` | list | variable | ZTF filter code |
| `flatiron_tess_time` | list[float64] | variable | TESS observation times (BTJD) |
| `flatiron_tess_flux` | list[float32] | variable | TESS normalized flux |
| `flatiron_tess_flux_err` | list[float32] | variable | TESS flux uncertainty |

### Image columns

All image columns are stored as nested lists representing 64 x 64 pixel cutouts. Reconstruct with `np.array(row["col"], dtype=np.float32).reshape(64, 64)`.
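
The array-valued columns in this schema need a little decoding before use. The following is a minimal sketch rather than part of the dataset's own tooling: `cutout_to_array` and `sdss_wavelength_and_sigma` are hypothetical helper names, filling NaN pixels with the median of the finite pixels is an assumed choice, and the SDSS helper assumes the usual convention that `ivar == 0` marks a masked pixel.

```python
import numpy as np


def cutout_to_array(cell, size=64):
    """Decode one image cell into a (size, size) float32 array, or None if null.

    Accepts nested Python lists (as returned by `datasets`), object arrays of
    row arrays (as returned by pandas), or a flat list of size*size values.
    """
    if cell is None:
        return None
    # Flatten row by row so that all three layouts are handled the same way.
    rows = [np.ravel(np.asarray(r, dtype=np.float32)) for r in cell]
    img = np.concatenate(rows).reshape(size, size)
    # Assumed policy: replace non-finite pixels with the median of finite ones.
    finite = np.isfinite(img)
    if not finite.all():
        fill = np.median(img[finite]) if finite.any() else 0.0
        img = np.where(finite, img, np.float32(fill))
    return img


def sdss_wavelength_and_sigma(loglam, ivar):
    """Convert `sdss_loglam` / `sdss_ivar` into wavelength (Angstrom) and 1-sigma error."""
    wave = 10.0 ** np.asarray(loglam, dtype=np.float64)
    ivar = np.asarray(ivar, dtype=np.float64)
    sigma = np.full(ivar.shape, np.inf)   # masked pixels get infinite uncertainty
    good = ivar > 0
    sigma[good] = 1.0 / np.sqrt(ivar[good])
    return wave, sigma
```

With a loaded row, `cutout_to_array(row["galex_fuv"])` and `sdss_wavelength_and_sigma(row["sdss_loglam"], row["sdss_ivar"])` would be the intended calls. Check for `None` on the SDSS columns first, since unmatched columns are null.
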

| Column | Type | Description |
|--------|------|-------------|
| `twomass_j`, `twomass_h`, `twomass_k` | list[list[float32]] | 2MASS J/H/K-band cutout |
| `galex_fuv`, `galex_nuv` | list[list[float32]] | GALEX far-UV / near-UV cutout |
| `unwise_w1`, `unwise_w2` | list[list[float32]] | unWISE W1 (3.4 µm) / W2 (4.6 µm) cutout |
| `legacy_g`, `legacy_r`, `legacy_z` | list[list[float32]] | Legacy Survey g/r/z-band cutout |

### Metadata columns

The dataset includes ~300 metadata columns from source surveys, prefixed by survey name. Key examples:

| Column | Description |
|--------|-------------|
| `apogee_teff` | APOGEE effective temperature (K) |
| `apogee_logg` | APOGEE surface gravity (log g) |
| `flatiron_gaia_phot_g_mean_mag` | Gaia G-band apparent magnitude |
| `flatiron_gaia_parallax` | Gaia parallax (mas) |
| `flatiron_gaia_teff_gspphot` | Gaia photometric effective temperature |
| `flatiron_desi_z` | DESI spectroscopic redshift |
| `flatiron_desi_spectype` | DESI spectral classification |
| `agn_redshift` | AGN redshift from SDSS DR14Q |

---

## File Format and System Requirements

### Format

Parquet shard files, up to 5000 rows each (~120 shards total). Populations are interleaved -- filter on `population` to select types. Total on-disk size: ~80 GB compressed.

### System Requirements

| Use Case | RAM | Disk | Notes |
|----------|-----|------|-------|
| Stream one shard at a time | 2 GB | 1 GB | Recommended for most workflows |
| Load one population | 50--80 GB | 80 GB | e.g., all 300k stars |
| Load full dataset | 150--200 GB | 80 GB | Only if you have the RAM |
| Hugging Face streaming | 1 GB | 0 | No local download needed |

### Recommended workflow

For most users, **iterate shard by shard** rather than loading everything:

```python
import glob
import pandas as pd

for shard_path in sorted(glob.glob("path/to/shards/*.parquet")):
    df = pd.read_parquet(shard_path)
    # Filter, process, extract features...
    del df
```

Or use Hugging Face streaming to avoid downloading at all:

```python
from datasets import load_dataset

ds = load_dataset("kshitijd/astro-multimodal-570k", streaming=True)
for row in ds["train"]:
    # process row...
    pass
```

---

## How It Was Built

### Pipeline Overview

Built with a custom Python pipeline running on [NCSA Delta AI](https://docs.ncsa.illinois.edu/systems/deltaai/) (32 CPUs, 408 GB RAM, NVMe storage). Total runtime: ~20 hours.

### Phase 1: Catalog Construction

Three populations were defined from independent parent catalogs:

1. **Stars** (300k): Queried [APOGEE DR17](https://www.sdss4.org/dr17/irspec/) via VizieR (catalog III/284, SNR > 50), cross-matched against Gaia DR3 via CDS XMatch (1 arcsec radius, 50k-row chunks). Selected the first 300k Gaia-matched stars.
2. **Galaxies** (200k): Downloaded the [PROVABGS](https://changhoonhahn.github.io/provabgs/) seed catalog from the Flatiron Institute (60 HDF5 cells). Selected the first 200k objects.
3. **AGN** (100k): Queried the [SDSS DR14 Quasar Catalog](https://www.sdss4.org/dr14/algorithms/qso_catalog/) via VizieR (catalog VII/286). Selected the first 100k objects.

### Phase 2: Flatiron Cross-Matching (Gaia, TESS, SDSS, DESI)

For each Flatiron-hosted dataset, the pipeline downloaded HEALPix-partitioned HDF5 files, performed positional cross-matching with `astropy.coordinates.SkyCoord` (3 arcsec radius), accumulated matched data, and flushed to Parquet shards. Each HDF5 file was deleted after processing.

### Phase 3: APOGEE Spectra

APOGEE aspcapStar FITS files (73 GB) were pre-staged via Globus. The pipeline read each star's local FITS file and cropped spectra to good detector regions: `np.r_[246:3274, 3585:6080, 6344:8335]` (7514 pixels).

### Phase 4: SDSS Spectra

SDSS specLite files were downloaded from the SDSS Science Archive Server for galaxies and AGN with matching plate/MJD/fiber identifiers.

### Phase 5: ZTF Light Curves

ZTF DR24 bulk Parquet files were obtained from IRSA.
Each field was downloaded, read with PyArrow, grouped by object ID, positionally cross-matched, and reduced to per-object time series.

### Phase 6: Images (concurrent)

Four image sources ran concurrently:

- **2MASS**: CDS hips2fits (J, H, K bands)
- **GALEX**: CDS hips2fits (FUV, NUV bands)
- **Legacy Survey**: Cutout service (g, r, z bands)
- **unWISE**: Tile download + astropy Cutout2D (W1, W2 bands)

### Phase 7: GALAH Spectra

GALAH DR4 spectra (504 tar files, 123 GB) were downloaded from Data Central Australia. Each star's 4-band spectra (blue, green, red, IR) were concatenated.

### Phase 8: Finalize

All modality shards were merged per population via streaming left joins on `object_id`, then deduplicated. Modality counts were computed last.

### Cross-Matching

Apart from the CDS XMatch step in Phase 1 (1 arcsec), all positional cross-matches used `astropy.coordinates.SkyCoord.match_to_catalog_sky()` with a **3 arcsecond** match radius. Unmatched columns are `null`.

---

## Data Quality Notes

- **APOGEE spectra** are continuum-normalized. Flux values are typically 0.5--1.2.
- **Gaia BP/RP** is stored as spectral coefficients (`flatiron_gaia_coeff`, 110 values), not sampled spectra. See the [Gaia documentation](https://gea.esac.esa.int/archive/documentation/GDR3/) for reconstruction into a full spectrum.
- **TESS and ZTF light curves** have variable lengths depending on sky coverage overlap with the APOGEE/AGN footprints.
- **Image cutouts** are centered on the catalog position. Some may contain NaN pixels at edges.
- Columns with no data for a given object are `null` / `None`.

## Known Limitations

1. **Galaxy spectral coverage is 20.5%**. Most PROVABGS galaxies lack spectroscopic observations. DESI metadata is present for 75% but actual spectra for only 12%.
2. **Light curve coverage is partial** (TESS ~14%, ZTF ~9% of stars), driven by sky overlap with the APOGEE footprint.
3. **unWISE coverage is 7.6%** for stars and **Legacy Survey is < 1%** due to limited footprint overlap.
4. **Gaia BP/RP is ~49%** for stars, reflecting overlap between APOGEE targets and Flatiron-hosted HEALPix cells.
5. **No light curves for galaxies** by design -- TESS and ZTF are routed only to stars and AGN.
6. **~340 columns total**, most being Gaia and DESI metadata. Core science columns are listed in the Schema section.

---

## Citation

If you use this dataset, please cite the underlying surveys:

```bibtex
@article{abdurrouf2022,
  title={The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar, and APOGEE-2 Data},
  author={Abdurro'uf and others},
  journal={ApJS},
  volume={259},
  pages={35},
  year={2022}
}

@article{gaia2023,
  title={Gaia Data Release 3: Summary of the content and survey properties},
  author={{Gaia Collaboration}},
  journal={A\&A},
  volume={674},
  pages={A1},
  year={2023}
}

@article{desi2024,
  title={DESI 2024 III: Baryon Acoustic Oscillations from Galaxies and Quasars},
  author={{DESI Collaboration}},
  journal={AJ},
  year={2024}
}

@article{bellm2019,
  title={The Zwicky Transient Facility: System Overview, Performance, and First Results},
  author={Bellm, Eric C. and others},
  journal={PASP},
  volume={131},
  pages={018002},
  year={2019}
}

@article{buder2024,
  title={The GALAH Survey: Data Release 4},
  author={Buder, Sven and others},
  journal={arXiv preprint arXiv:2409.19858},
  year={2024}
}

@article{ricker2015,
  title={Transiting Exoplanet Survey Satellite (TESS)},
  author={Ricker, George R. and others},
  journal={Journal of Astronomical Telescopes, Instruments, and Systems},
  volume={1},
  pages={014003},
  year={2015}
}
```

## License

Released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The underlying survey data is subject to each survey's individual data use policies.
Provider:
kshitijd