bamfai/bigbounce-anomaly-catalog
Hugging Face · updated 2026-04-22 · indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/bamfai/bigbounce-anomaly-catalog
Resource description:
---
license: cc-by-4.0
task_categories:
- feature-extraction
- tabular-classification
tags:
- astronomy
- cosmology
- anomaly-detection
- desi
- act
- neowise
- planck
- gaia
- cmb
- big-bounce
pretty_name: BigBounce Multi-Survey Anomaly Catalog
size_categories:
- 100K<n<1M
---
# BigBounce Multi-Survey Anomaly Catalog
Companion dataset for Golden (2026), *Multi-Survey Anomaly Engine for
Bounce-Cosmology Observables* (Paper 3 of the BigBounce program).
**This repository is PRIVATE until the paper lands on arXiv.** It will be
made public alongside the models and figures when the full BigBounce suite
(Papers 1–4) is released.
> ⚠️ **Path-C rebuild in flight (≈79 %, fire #158 as of 2026-04-22; 5 of 6 Path-C parquets now staged locally — SDSS-native is the sole remaining gate, landing within ~2.8 h; ≈80 % crossing imminent on next 5-batch log-print).** The
> cross-transfer anomaly sets for SDSS DR18 and LAMOST DR10, and the
> undertrained Planck CMB autoencoder, are being replaced by native
> per-survey retrains. See the *Path-C Rebuild status* section below for
> per-file supersession status. The cross-transfer files are preserved
> in-place as Paper 3 §7 before/after baseline and will never be deleted.
> Consumers who want the **final** catalog should look for the
> `_pathc_native.parquet` / `cmb_native_anomalies.parquet` /
> `neowise_pathc_anomalies.parquet` / `pathc_unique_objects.parquet`
> blocks once all 12 Path-C exit criteria close (tracked in the
> bigbounce repo's `project-context/SSOT/drive-to-100.md`).
## Path-C Rebuild status (2026-04-20, in flight)
The catalog is being rebuilt per Paper 3 §2.4 *Path-C Rebuild Methodology*
to address two systematic-contamination failure modes of the initial
cross-transfer scan (a DESI-trained BigAE applied to SDSS/LAMOST inherits
DESI-specific noise assumptions; the Planck CMB autoencoder was
undertrained without a galactic mask). The Path-C rebuild replaces the
cross-transfer scores with native-retrained, native-scored anomaly sets
on a per-survey basis, and applies an ecliptic-pole mask to NEOWISE to
remove scan-pattern artifacts. Every cross-transfer file below is
preserved alongside its native-retrained successor for the paper's
§7 *before / after native retrain* comparison; **no prior content is
deleted by the Path-C rebuild.**
| File (planned) | Path-C status | Supersedes / notes |
|---|---|---|
| `sdss_dr18_pathc_native.parquet` | **COMPLETE — CRITERION #1 CLOSED** (fires #80/#164: retrain val_loss=0.0311 gate PASS; full 1,925,279-spectrum re-score on A100 across 471 batch shards, 3,394 downloader failures = 0.18 % nominal (last-plate-stragglers pattern); only **12 sources with S>5** vs cross-transfer 77,905 = **~6500× reduction in anomaly rate** — numerical confirmation that cross-transfer was inflating SDSS anomaly rate by catalog-calibration domain shift; native score distribution median 0.0151, p99 0.2051, p99.9 0.5808, max 13.7705; top-77,905 slice at S ≥ 0.1060 landed fire #164 via `sdss_landing_close.py` atomic-close orchestrator, 3.1 MB parquet staged) | supersedes the cross-transfer `sdss_dr18_anomalies.parquet` in Paper 3 Table I; cross-transfer set preserved as §7/§`sec:pathc` before/after baseline |
| `lamost_dr10_pathc_native.parquet` | **COMPLETE — CRITERION #2 CLOSED** (fires #80/#133: retrain val_loss=0.0329 gate PASS; full 11,334,161-spectrum re-score on A100, 35.8 h wall-clock across 107 batch shards; only 2,054 sources with S>5 vs cross-transfer 43,915 = **21.4× reduction in anomaly rate** — direct numerical confirmation that the 98 % blue-excess signature was a cross-transfer catalog-calibration artifact rather than astrophysics; native score distribution median 0.0033, p99 0.461, p99.9 1.85, max 38.05; top-1 % slice n=113,342 at S≥0.4613 staged locally at 7.72 MB / 10 cols {obsid, ra, dec, objtype, z, snr, anomaly_score, rB, rR, rZ}) | supersedes the cross-transfer `lamost_dr10_anomalies.parquet` in Paper 3 Table I; cross-transfer set preserved as §7/§`sec:pathc` before/after baseline |
| `cmb_native_anomalies.parquet` | **COMPLETE — CRITERION #3 CLOSED** (fires #83/#94/#95/#96: retrain best_val=0.4437 @ epoch 99/150, ~5×10⁴ improvement over cross-transfer val_loss 22,420; injection-recovery **500/500 = 100.0 % at 5× noise** vs gate ≥ 50 % → PASS by 2× margin; 200K-patch full re-score 25.3 s on A100, top-200 score range [0.558, 0.621], file staged locally at 8 KB / 200 rows) | supersedes the cross-transfer `planck_cmb_anomalies.parquet` in Paper 3 Table I; cross-transfer set preserved as §7/§`sec:pathc` before/after baseline |
| `neowise_pathc_masked_anomalies.parquet` | **COMPLETE — CRITERION #5 CLOSED** (fires #84 + #139 + **STAGED fire #141**: `\|b_ecl\|<80°` ecliptic mask retains 419/436; the 17 rejected sources sit in the polar caps at 2.6× the uniform-null expectation. `NEOWISE_MASK_EQUIVALENCE_RATIONALE.md` formally establishes that the BigAE source-local feed-forward scorer is mathematically equivalent to pre-scoring source-catalog masking for NEOWISE's systematic profile; 97 KB staged alongside `pathc_neowise_ecliptic_summary.json` + `neowise_pathc_rejected_anomalies.parquet` audit trail) | supersedes the raw `neowise_anomalies.parquet` in the post-Path-C catalog; cross-transfer NEOWISE set preserved as §7 baseline |
| `pathc_unique_objects.parquet` + `pathc_multi_survey_matches.parquet` | **8/8 SURVEYS DONE — CRITERION #7 CLOSED** (fires #86 + #116 + #135 + #141 + **#164 final close**: 8-way astropy-KD-tree + union-find dedup at 5″ on DESI + **SDSS native top-77,905** + **LAMOST native top-1%** + Gaia + NEOWISE-masked + eROSITA + Planck + ACT DR6 → **388,693 detections → 378,480 unique physical objects** at 2.628 % compression; **637 multi-survey clusters** at 5″ (was 2 pre-SDSS), top cross-match cluster 9494 at (4.0446, 1.6023) best_score 10.02 from DESI+SDSS — massive boost from SDSS↔LAMOST spectroscopic overlap validating the native retrain. 12 MB unique-objects parquet + 38 KB multi-survey-matches parquet staged alongside `pathc_dedup_summary.json`; third latent bug in `pathc_positional_dedup.py` filename-registry caught fire #164 via baseline-cross-check arithmetic, one-line fix applied) | auto-refreshes on each run; post-SDSS-landing final 8/8 state closes criterion #7 and unblocks criterion #10 public HF push |
| `injection_recovery/*.json` | **2 SURVEYS × 2 PLANT VARIANTS DONE** (criterion #6, fires #85 + #98: SDSS + LAMOST native-checkpoint scans on 500 plants × 6 amplitudes for both emission-line (FWHM-5-bin) and continuum-dip (FWHM-80-bin) variants). Fire-#98 continuum-dip: SDSS native **gate PASS 64 %** at 5σ (vs 7.2 % emission-line, ~9× improvement) — confirms the 128-latent BigAE compresses in-manifold narrow features but not out-of-manifold broad deformations; LAMOST native 5.8 % (order-of-magnitude improvement over emission-line 0.6 %). §pathc_caveats (iv) CLOSED. | eROSITA + NEOWISE + Gaia + DESI pending (different feature extractors or on-pod data) |
Exit-criteria tracking lives in the bigbounce repo at
`project-context/SSOT/drive-to-100.md` (Phase 2 Path C block). When all 12
criteria are green, this dataset receives its final release tag and the
README header is swapped to the public-release version.
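The 5″ positional dedup behind `pathc_unique_objects.parquet` can be illustrated with a minimal sketch. This is not the `pathc_positional_dedup.py` implementation: it swaps the KD-tree for a brute-force pair loop (only sensible for a handful of rows), and the function names, tuple layout, and sample coordinates are all hypothetical.

```python
import math

ARCSEC = 1.0 / 3600.0  # degrees per arcsecond


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def dedup(detections, radius_arcsec=5.0):
    """Group detections that lie within the radius of each other, transitively.

    detections: list of (survey, ra_deg, dec_deg) tuples (hypothetical layout).
    Returns a list of clusters, each a list of indices into `detections`.
    """
    parent = list(range(len(detections)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    limit = radius_arcsec * ARCSEC
    # Brute-force O(n^2) pair search; a KD-tree replaces this loop at scale.
    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            _, ra_i, dec_i = detections[i]
            _, ra_j, dec_j = detections[j]
            if angular_sep_deg(ra_i, dec_i, ra_j, dec_j) <= limit:
                union(i, j)

    clusters = {}
    for i in range(len(detections)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# Made-up coordinates: the first two rows sit about 2.9 arcsec apart.
sample = [
    ("DESI", 150.00000, 2.00000),
    ("SDSS", 150.00080, 2.00000),
    ("Gaia", 150.10000, 2.00000),
]
clusters = dedup(sample)
multi_survey = [c for c in clusters if len({sample[i][0] for i in c}) > 1]
print(len(sample), "detections ->", len(clusters), "unique objects;",
      len(multi_survey), "multi-survey cluster(s)")
```

Union-find makes the grouping transitive: if A matches B and B matches C, all three collapse into one physical object even when A and C are more than 5″ apart. That is one plausible reading of "KD-tree + union-find dedup"; the repository script remains the reference.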
## Current contents (progressively filling from Paper 3 Table I)
| File | Rows | Survey | Paper 3 count | Status |
|---|---:|---|---:|---|
| `desi_dr1_anomalies.parquet` | 195,829 | DESI DR1 spectra | 195,829 | **Matches paper** |
| `act_dr6_anomalies.parquet` | 200 | ACT DR6 CMB patches | 200 | **Matches paper** (top-1% filter upstream) |
| `neowise_anomalies.parquet` | 436 | NEOWISE mid-IR variability | 436 | **Matches paper** (top-N by score) |
| `planck_cmb_anomalies.parquet` | 200 | Planck CMB patches | 200 | **Matches paper** (top-N by score) |
| `gaia_dr3_anomalies.parquet` | 500 | Gaia DR3 variable stars | 500 | **Matches paper** (top-N by score) |
| `cmb_native_anomalies.parquet` | 200 | Planck CMB native retrain top-200 | 200 | **PATH-C NATIVE** (fire #96 — supersedes `planck_cmb_anomalies.parquet` in Table I) |
| `lamost_dr10_pathc_native.parquet` | 113,342 | LAMOST DR10 native retrain top-1% | — | **PATH-C NATIVE** (fire #133 — CRITERION #2 CLOSED, 21.4× anomaly-rate reduction vs cross-transfer, 7.72 MB / 10 cols) |
| `neowise_pathc_masked_anomalies.parquet` | 419 | NEOWISE ecliptic-masked Path-C | 419 | **PATH-C NATIVE** (fires #84/#139/#141 — CRITERION #5 CLOSED, `\|b_ecl\|<80°` mask + equivalence rationale, 97 KB) |
| `sdss_dr18_pathc_native.parquet` | 77,905 | SDSS DR18 native retrain top-cut | — | **PATH-C NATIVE** (fire #164 — CRITERION #1 CLOSED, 1,925,279-spectrum rescore, ~6500× anomaly-rate reduction vs cross-transfer, top-cut at S ≥ 0.1060, 3.1 MB) |
| `pathc_unique_objects.parquet` | 378,480 | 8/8-survey dedup unique-physical-object table | — | **PATH-C NATIVE** (fires #135/#141/#164 — 388,693 → 378,480 at 5″ KD-tree + union-find, 12 MB; **8/8 TRUE** post-SDSS landing) |
| `pathc_multi_survey_matches.parquet` | 637 | 5″ multi-survey anomaly matches | — | **PATH-C NATIVE** (fires #135/#141/#164 — 637 clusters incl. top DESI+SDSS cluster 9494 @ (4.0446, 1.6023) S=10.02, 38 KB) |
| `pathc_dedup_summary.json` | — | Dedup provenance + per-cluster coordinates | — | **PATH-C NATIVE** (fires #135/#141 — reproducibility manifest) |
| `pathc_neowise_ecliptic_summary.json` | — | NEOWISE mask audit (retain/reject counts, pole excess) | — | **PATH-C NATIVE** (fires #84/#141 — 3.2 KB audit trail) |
| `rescore_summary.json` | — | Native CMB re-score headline stats | — | **PATH-C NATIVE** (fire #96 — median 0.437, p99 0.520, top-200 range [0.558, 0.621]) |
| `injection_recovery.json` | — | Native CMB injection-recovery gate result | — | **PATH-C NATIVE** (fire #95 — 500/500 = 100.0 % at 5× noise, gate PASS by 2× margin) |
**Coverage so far (post-landing fire #164):** 389,031 top-cut survey rows staged (195,829 DESI + 77,905 SDSS native + 500 Gaia + 436 NEOWISE raw + 419 NEOWISE-masked + 200 Planck + 200 Planck-native + 200 ACT + 113,342 LAMOST native), plus 8/8 dedup artefacts at 378,480 unique physical objects / 637 multi-survey clusters. All 6 Path-C supersession parquets are staged locally: **SDSS native top-77,905** (fire #164) + LAMOST top-1 % 113,342 (fire #133) + CMB native top-200 (fire #96) + NEOWISE-masked 419 (fire #141) + dedup unique 378,480 (fire #164) + dedup multi-survey matches 637 (fire #164). Criterion #10 (public HF push) will upload this complete 8/8 Path-C bundle via `hf_upload_pathc_sdss_landing.py`, the dedicated additive uploader from fire #162, which replaces the destructive fire-#13 `hf_upload_extend.py`.
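The dedup bookkeeping quoted above can be reproduced from the per-survey row counts on this card. This is a minimal check, assuming the eight dedup inputs are the files named in the Path-C status table and that eROSITA contributes the Paper 3 count of 298 listed in the remaining-blocks table:

```python
# Per-survey inputs to the 8-way dedup, as listed on this card.
# Assumption: eROSITA contributes the Paper 3 count of 298 rows.
dedup_inputs = {
    "DESI DR1": 195_829,
    "SDSS DR18 native top-cut": 77_905,
    "LAMOST DR10 native top-1%": 113_342,
    "Gaia DR3": 500,
    "NEOWISE masked": 419,
    "eROSITA DR1": 298,
    "Planck CMB": 200,
    "ACT DR6": 200,
}

detections = sum(dedup_inputs.values())
unique_objects = 378_480  # rows in pathc_unique_objects.parquet

merged = detections - unique_objects
compression_pct = 100.0 * merged / detections

print(f"detections  : {detections:,}")
print(f"merged away : {merged:,}")
print(f"compression : {compression_pct:.3f} %")
```

Under these assumptions the inputs sum to the 388,693 detections reported for the dedup, and the compression figure matches the quoted 2.628 %.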
Remaining blocks (tracked as `P3-HF-UPLOAD-EXTEND` in the bigbounce repo's
drive-to-100 queue) — these require a **pod regeneration** since the
2026-04-08 H200 snapshot's score-distribution files for these surveys
carry `data_source: "synthetic"` (aggregate statistics only; no
row-level anomaly table with RA/Dec/score):
| Survey | Paper 3 count | Status |
|---|---:|---|
| SDSS DR18 | 77,905 | Pod regen needed — H200 snapshot has synthetic score-dist only |
| LAMOST DR10 | 44,075 | Pod regen needed — synthetic score-dist only |
| eROSITA DR1 | 298 | Pod regen needed — synthetic score-dist only |
## Column schemas
### DESI DR1 (`desi_dr1_anomalies.parquet`)
| Column | Type | Description |
|---|---|---|
| `tid` | int | Internal pipeline ID (negative = synthetic test-time placeholder; kept for provenance) |
| `ra` | float | Right Ascension (deg, ICRS) |
| `dec` | float | Declination (deg, ICRS) |
| `score` | float | BigAE reconstruction loss (MSE over normalized flux) |
| `worst` | str | Filter with highest per-filter residual ratio (`B` · `R` · `Z`) |
| `rB` / `rR` / `rZ` | float | Residual ratio per DESI filter |
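The `worst` column can be cross-checked against the three residual-ratio columns. This is a minimal sketch, assuming `worst` is simply the filter whose ratio is largest and that ties resolve in B, R, Z order (the card does not specify tie-breaking); the helper name and the sample record are made up for illustration.

```python
def worst_filter(row):
    """Return the filter name (B, R or Z) with the largest residual ratio.

    `row` is any mapping with keys rB, rR, rZ, for example one record of
    desi_dr1_anomalies.parquet converted to a dict.
    """
    ratios = {"B": row["rB"], "R": row["rR"], "Z": row["rZ"]}
    # max() returns the first maximal key, so ties resolve in B, R, Z order.
    return max(ratios, key=ratios.get)


# Hypothetical record; the values are invented for illustration only.
example = {"tid": 1, "ra": 150.0, "dec": 2.0, "score": 0.8,
           "rB": 1.2, "rR": 3.4, "rZ": 0.9}
print(worst_filter(example))
```

On a full table loaded with pandas, the same comparison can be vectorised with `DataFrame.idxmax(axis=1)` over the three ratio columns, which returns the column label (`rB`, `rR`, `rZ`) rather than the bare filter letter.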
### ACT DR6 + Planck CMB (`act_dr6_anomalies.parquet`, `planck_cmb_anomalies.parquet`)
| Column | Type | Description |
|---|---|---|
| `patch_idx` | int | Scored-patch index |
| `ra` | float | Patch-centre Right Ascension (deg, ICRS) |
| `dec` | float | Patch-centre Declination (deg, ICRS) |
| `anomaly_score` | float | CMB-patch autoencoder reconstruction score |
### NEOWISE (`neowise_anomalies.parquet`)
22-column mid-IR variability feature vector per source, top 436 by
`anomaly_score`. Columns include `source_id`, `ra`, `dec`, `n_epochs`,
`time_span`, per-band means/std/amplitude/chi² (W1/W2), Stetson J,
inter-band color and color variance, and `anomaly_score`. The
Paper 3 novelty analysis uses RA/Dec + W1-W2 color for cross-match.
### Gaia DR3 (`gaia_dr3_anomalies.parquet`)
27-column photometric-variability feature vector per source, top 500
by `anomaly_score`. Columns include `source_id`, `ra`, `dec`,
per-band (G / BP / RP) mean / std / num_selected, range/MAD/skewness/
kurtosis of G, BP-RP color + color variance, and `anomaly_score`.
## Caveats and paper cross-references
- **ACT DR6 model under-trained** (§7.2 of Paper 3) — the ACT DR6 block
is reported for coverage but should be retrained on the full
Planck+ACT matched-filter map set before production cosmology use.
- **NEOWISE ecliptic systematic** (§3.3 and §7.2) — the raw NEOWISE
row-level table was cleaned with a galactic-plane mask prior to
scoring, but the surviving top-436 still inherited the survey's
polar-cadence-linked false-positive profile. The Path-C rebuild
(fire #84, criterion #5) adds a post-hoc ecliptic-latitude mask
`|b_ecl| < 80°` that removes 17 objects concentrated in the
10°-radius polar caps at 2.6× the uniform-null expectation
(1.52 % of sky area). The published Path-C NEOWISE set is the
419/436 = 96.1 % of the raw catalog surviving the mask; the
rejected 17 objects are preserved at
`pipelines/p3_anomaly_engine/pathc_neowise_ecliptic/` in the
bigbounce repo for auditability. Cross-match against Gaia DR3 +
AllWISE is still recommended before claiming any individual source.
- **Planck 200 (cross-transfer, superseded by Path-C)** — the
cross-transfer `planck_cmb_anomalies.parquet` represents only 19,296
scored patches with no galactic mask and val_loss 22,420 (effectively
untrained). Path-C criterion #3 (fires #83/#94/#95/#96) retrained a
native Planck CMB convolutional autoencoder with a `|b|≥20°`
galactic-plane mask on a 200K-patch Planck SMICA refresh (best_val
0.4437 at epoch 99/150, ~5×10⁴ improvement) and full-rescored the
200K-patch set; the resulting `cmb_native_anomalies.parquet` (top
200 by score, range [0.558, 0.621]) is the authoritative Path-C
Planck CMB block and supersedes the cross-transfer file for all
Paper 3 cosmological inference. The cross-transfer file is
preserved here as the §7 before/after baseline.
- **Gaia 500** — the paper's 500-row cut is top-0.1 % of the
50,000-star expanded sample, not the generic top-1 %.
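The NEOWISE mask numbers above follow from elementary spherical geometry. The sketch below converts equatorial coordinates to ecliptic latitude with the standard rotation by the obliquity, applies the `|b_ecl| < 80°` cut, and recomputes the polar-cap area fraction and excess. The function names are hypothetical and the J2000 obliquity constant is an assumption; the pipeline's own coordinate transform may differ.

```python
import math

OBLIQUITY_DEG = 23.4392911  # mean obliquity of the ecliptic at J2000 (assumed)


def ecliptic_latitude_deg(ra_deg, dec_deg):
    """Ecliptic latitude in degrees from equatorial RA/Dec in degrees."""
    ra, dec, eps = map(math.radians, (ra_deg, dec_deg, OBLIQUITY_DEG))
    sin_b = (math.sin(dec) * math.cos(eps)
             - math.cos(dec) * math.sin(eps) * math.sin(ra))
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_b))))


def passes_ecliptic_mask(ra_deg, dec_deg, max_abs_lat_deg=80.0):
    """True when the source lies outside both 10-degree ecliptic polar caps."""
    return abs(ecliptic_latitude_deg(ra_deg, dec_deg)) < max_abs_lat_deg


# Sky fraction covered by two polar caps of angular radius r: 1 - cos(r).
cap_fraction = 1.0 - math.cos(math.radians(10.0))

raw, rejected = 436, 17  # counts quoted in the NEOWISE caveat
observed_fraction = rejected / raw
excess = observed_fraction / cap_fraction

print(f"cap area fraction : {100 * cap_fraction:.2f} %")
print(f"rejected fraction : {100 * observed_fraction:.2f} %")
print(f"polar excess      : {excess:.1f}x")
print(f"retained          : {raw - rejected}/{raw} = "
      f"{100 * (raw - rejected) / raw:.1f} %")
```

The excess is the rejected fraction (17/436) divided by the area fraction of the two caps, which is how the 2.6× figure is read here.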
## Novelty classification status
SIMBAD cross-match (5″ cone; summary in
`projects/cross_survey/results/simbad_crossmatch_summary.json` in the
bigbounce repo) on the top 100 anomalies per survey: 41 % matched / 59 %
SIMBAD-novel. A deep NED pass on the top-20 SIMBAD-novel SDSS subset (P3-B,
fire #10, partial) yielded ~45 % NED-archival-identified, so Paper 3's
"58.8 % novel" headline likely over-estimates true novelty. A full NED +
VizieR reclassification is paused on NED TAP service timeouts (documented in
the bigbounce repo's `drive-to-100.md`). This card will be refreshed once the
reclassification completes.
## Citation
```
@article{Golden:2026multiSurveyAnomalies,
author = {Golden, Houston},
title = {Multi-Survey Anomaly Engine for Bounce-Cosmology Observables},
journal = {arXiv preprint},
year = {2026},
note = {paper reference will be updated post-submission}
}
```
## Paper 3 companion artifacts (not yet uploaded here)
- BigAE checkpoint (`projects/sdss-dr18/best_model_47k.pt`)
- Path-C native BigAE checkpoints (`best_sdss_native.pt` val_loss 0.0311 gate PASS + `best_lamost_native.pt` val_loss 0.0329 gate PASS, fire #80) — to be uploaded under `models/pathc_native/` after the full re-score completes (criterion #10 deliverable)
- Path-C native Planck CMB autoencoder checkpoint (`best_cmb_native.pt`, best_val 0.4437 at epoch 99/150, ~5×10⁴ improvement over cross-transfer 22,420, injection-recovery 500/500 = 100.0 % at 5× noise — **CRITERION #3 CLOSED fire #95**) — to be uploaded alongside the native Planck CMB parquet block
- Second-level latent AE for recursive anomaly detection (16-D latent)
- Emission-line finder checkpoint (4,526 redshifts · 96.9 % AGN-BPT-classified)
- Appendix-D gallery images (21 publication PDFs · 16 real DESI cutouts each)
## Data availability (from Paper 3 §9)
Canonical source for all per-survey raw scored catalogs lives in the
bigbounce repository, `pipelines/h200_results/pod_backup_20260408_full/`.
This HuggingFace dataset is the distribution surface; any discrepancy
should be resolved in favor of the repository snapshot.
## License
CC-BY-4.0 (content) + public-domain star/galaxy coordinates. Model
checkpoints, when uploaded, will carry Apache-2.0.
---
Generated by `pipelines/p3_anomaly_engine/hf_upload_*.py`. Last refresh:
drive-to-100 fire #158 (Path-C HF-rebuild README metadata sync). No parquet
content changes in this fire; README metadata sync only.
- **Headline Path-C %:** holds at **≈79 %** under the log-tail-anchored
criterion #1 convention (9.540/12 = 79.50 %, same as fire #157).
Parquets-on-disk advanced 445 → 449 (95.3 %) without a log line, because the
log prints only every 5 batches. Counting those files would move the weighted
sum to 9.548/12 = 79.57 % and cross the ≈79 % → ≈80 % rounding boundary on the
next log-print landing; this is pre-positioned here so the next fire can
bundle the site-sync with no ambiguity.
- **SDSS re-score progress (pod watchdog)** across the 4 fires since fire
#154's README refresh: log-tail batch 400/471 → 445/471 (84.9 % → **94.5 %**;
+9.6 %, +45 batches logged, +184,137 spectra scored), plus 4 further parquets
on disk past the log-print cadence for 449/471 = **95.3 %** by file count (the
effective landing-proximity signal). Rate holds at 10.5 → 10.6 spec/s (steady
state after the fire #156 network-cache burst); ETA has dropped from 7.7 h to
**2.8 h** of remaining wall-clock.
- **Interim fires** were all non-README work: #146 pod watchdog; #147 index.md
banner refresh; #148/#149/#150/#151/#152 DESI k-fold 5-stage dry-run
de-risking arc (fold-split + training + aggregator + score-driver author +
score-driver dry-run); #153 Paper 3 §pathc stability-paragraph draft authoring.
- **Staging:** 5 of 6 Path-C supersession parquets remain staged locally
(neowise-masked 97 KB, unique-objects 9.7 MB, multi-survey-matches 6.4 KB, CMB
native 8 KB, LAMOST native 7.7 MB, plus 2 provenance JSONs). SDSS-native is the
sole remaining gate before the atomic 8/8 dedup re-run and public HF push
(fire #154 state: batch 400/471 = 84.9 %, 1,635,617 / 1,928,673 scored
successfully at 10.5/s, ETA ~7.7 h on A100 pod `ktds4mkmzb7ven`).
Provider:
bamfai



