bamfai/bigbounce-anomaly-catalog

Name: bamfai/bigbounce-anomaly-catalog
Creator: bamfai
Published: 2026-04-22 09:53:39
License: no description provided

Source: Hugging Face · Updated 2026-04-22 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/bamfai/bigbounce-anomaly-catalog

Description:

---
license: cc-by-4.0
task_categories:
- feature-extraction
- tabular-classification
tags:
- astronomy
- cosmology
- anomaly-detection
- desi
- act
- neowise
- planck
- gaia
- cmb
- big-bounce
pretty_name: BigBounce Multi-Survey Anomaly Catalog
size_categories:
- 100K<n<1M
---

# BigBounce Multi-Survey Anomaly Catalog

Companion dataset for Golden (2026), *Multi-Survey Anomaly Engine for Bounce-Cosmology Observables* (Paper 3 of the BigBounce program).

**This repository is PRIVATE until the paper lands on arXiv.** It will be made public alongside the models and figures when the full BigBounce suite (Papers 1–4) is released.

> ⚠️ **Path-C rebuild in flight (≈79 %, fire #158 as of 2026-04-22; 5 of 6 Path-C parquets now staged locally; SDSS-native is the sole remaining gate, landing within ~2.8 h; ≈80 % crossing imminent on next 5-batch log-print).**
> The cross-transfer anomaly sets for SDSS DR18 and LAMOST DR10, and the undertrained Planck CMB autoencoder, are being replaced by native per-survey retrains. See the *Path-C Rebuild status* section below for per-file supersession status. The cross-transfer files are preserved in-place as Paper 3 §7 before/after baseline and will never be deleted.
> Consumers who want the **final** catalog should look for the `_pathc_native.parquet` / `cmb_native_anomalies.parquet` / `neowise_pathc_anomalies.parquet` / `pathc_unique_objects.parquet` blocks once all 12 Path-C exit criteria close (tracked in the bigbounce repo's `project-context/SSOT/drive-to-100.md`).

## Path-C Rebuild status (2026-04-20, in flight)

The catalog is being rebuilt per Paper 3 §2.4 *Path-C Rebuild Methodology* to address two systematic-contamination failure modes of the initial cross-transfer scan (a DESI-trained BigAE applied to SDSS/LAMOST inherits DESI-specific noise assumptions; the Planck CMB autoencoder was undertrained without a galactic mask).
The Path-C rebuild replaces the cross-transfer scores with native-retrained, native-scored anomaly sets on a per-survey basis, and applies an ecliptic-pole mask to NEOWISE to remove scan-pattern artifacts. Every cross-transfer file below is preserved alongside its native-retrained successor for the paper's §7 *before / after native retrain* comparison; **no prior content is deleted by the Path-C rebuild.**

| File (planned) | Path-C status | Expected superseding |
|---|---|---|
| `sdss_dr18_pathc_native.parquet` | **COMPLETE: CRITERION #1 CLOSED** (fires #80/#164: retrain val_loss=0.0311 gate PASS; full 1,925,279-spectrum re-score on A100 across 471 batch shards, 3,394 downloader failures = 0.18 % nominal (last-plate-stragglers pattern); only **12 sources with S>5** vs cross-transfer 77,905 = **~6500× reduction in anomaly rate**, numerical confirmation that cross-transfer was inflating the SDSS anomaly rate by catalog-calibration domain shift; native score distribution median 0.0151, p99 0.2051, p99.9 0.5808, max 13.7705; top-77,905 slice at S ≥ 0.1060 landed fire #164 via `sdss_landing_close.py` atomic-close orchestrator, 3.1 MB parquet staged) | supersedes the cross-transfer `sdss_dr18_anomalies.parquet` in Paper 3 Table I; cross-transfer set preserved as §7/§`sec:pathc` before/after baseline |
| `lamost_dr10_pathc_native.parquet` | **COMPLETE: CRITERION #2 CLOSED** (fires #80/#133: retrain val_loss=0.0329 gate PASS; full 11,334,161-spectrum re-score on A100, 35.8 h wall-clock across 107 batch shards; only 2,054 sources with S>5 vs cross-transfer 43,915 = **21.4× reduction in anomaly rate**, direct numerical confirmation that the 98 % blue-excess signature was a cross-transfer catalog-calibration artifact rather than astrophysics; native score distribution median 0.0033, p99 0.461, p99.9 1.85, max 38.05; top-1 % slice n=113,342 at S≥0.4613 staged locally at 7.72 MB / 10 cols {obsid, ra, dec, objtype, z, snr, anomaly_score, rB, rR, rZ}) | supersedes the cross-transfer `lamost_dr10_anomalies.parquet` in Paper 3 Table I; cross-transfer set preserved as §7/§`sec:pathc` before/after baseline |
| `cmb_native_anomalies.parquet` | **COMPLETE: CRITERION #3 CLOSED** (fires #83/#94/#95/#96: retrain best_val=0.4437 @ epoch 99/150, ~5×10⁴ improvement over cross-transfer val_loss 22,420; injection-recovery **500/500 = 100.0 % at 5× noise** vs gate ≥ 50 % → PASS by 2× margin; 200K-patch full re-score 25.3 s on A100, top-200 score range [0.558, 0.621], file staged locally at 8 KB / 200 rows) | supersedes the cross-transfer `planck_cmb_anomalies.parquet` in Paper 3 Table I; cross-transfer set preserved as §7/§`sec:pathc` before/after baseline |
| `neowise_pathc_masked_anomalies.parquet` | **COMPLETE: CRITERION #5 CLOSED** (fires #84 + #139 + **STAGED fire #141**: `\|b_ecl\|<80°` ecliptic mask retains 419/436 at 2.6× polar excess vs uniform-null, with `NEOWISE_MASK_EQUIVALENCE_RATIONALE.md` formally establishing the BigAE source-local feed-forward scorer is mathematically equivalent to pre-scoring source-catalog masking for NEOWISE's systematic profile; 97 KB staged alongside `pathc_neowise_ecliptic_summary.json` + `neowise_pathc_rejected_anomalies.parquet` audit trail) | supersedes the raw `neowise_anomalies.parquet` in the post-Path-C catalog; cross-transfer NEOWISE set preserved as §7 baseline |
| `pathc_unique_objects.parquet` + `pathc_multi_survey_matches.parquet` | **8/8 SURVEYS DONE: CRITERION #7 CLOSED** (fires #86 + #116 + #135 + #141 + **#164 final close**: 8-way astropy-KD-tree + union-find dedup at 5″ on DESI + **SDSS native top-77,905** + **LAMOST native top-1%** + Gaia + NEOWISE-masked + eROSITA + Planck + ACT DR6 → **388,693 detections → 378,480 unique physical objects** at 2.628 % compression; **637 multi-survey clusters** at 5″ (was 2 pre-SDSS), top cross-match cluster 9494 at (4.0446, 1.6023) best_score 10.02 from DESI+SDSS; massive boost from SDSS↔LAMOST spectroscopic overlap validating the native retrain. 12 MB unique-objects parquet + 38 KB multi-survey-matches parquet staged alongside `pathc_dedup_summary.json`; third latent bug in `pathc_positional_dedup.py` filename-registry caught fire #164 via baseline-cross-check arithmetic, one-line fix applied) | auto-refreshes on each run; post-SDSS-landing final 8/8 state closes criterion #7 and unblocks criterion #10 public HF push |
| `injection_recovery/*.json` | **2 SURVEYS × 2 PLANT VARIANTS DONE** (criterion #6, fires #85 + #98: SDSS + LAMOST native-checkpoint scans on 500 plants × 6 amplitudes for both emission-line (FWHM-5-bin) and continuum-dip (FWHM-80-bin) variants). Fire-#98 continuum-dip: SDSS native **gate PASS 64 %** at 5σ (vs 7.2 % emission-line, ~9× improvement), which confirms the 128-latent BigAE compresses in-manifold narrow features but not out-of-manifold broad deformations; LAMOST native 5.8 % (order-of-magnitude improvement over emission-line 0.6 %). §pathc_caveats (iv) CLOSED. | eROSITA + NEOWISE + Gaia + DESI pending (different feature extractors or on-pod data) |

Exit-criteria tracking lives in the bigbounce repo at `project-context/SSOT/drive-to-100.md` (Phase 2 Path C block). When all 12 criteria are green, this dataset receives its final release tag and the README header is swapped to the public-release version.
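
The 5″ positional dedup described in the table above (pair search followed by union-find grouping) can be sketched as follows. This is a minimal illustration, not the code in `pathc_positional_dedup.py`: a brute-force pair loop stands in for the KD-tree to keep the example dependency-free, and the function names and the five toy detections are hypothetical.

```python
import math
from collections import defaultdict


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine form, stable at small angles)."""
    r1, d1, r2, d2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


class UnionFind:
    """Disjoint-set structure used to merge detections linked by a match."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[max(ri, rj)] = min(ri, rj)


def dedup(detections, radius_arcsec=5.0):
    """Group (survey, ra_deg, dec_deg, score) tuples that chain within the radius."""
    n = len(detections)
    uf = UnionFind(n)
    # Brute-force O(n^2) pair search; a KD-tree replaces this at catalog scale.
    for i in range(n):
        for j in range(i + 1, n):
            _, ra_i, dec_i, _ = detections[i]
            _, ra_j, dec_j, _ = detections[j]
            if angular_sep_arcsec(ra_i, dec_i, ra_j, dec_j) <= radius_arcsec:
                uf.union(i, j)
    groups = defaultdict(list)
    for i in range(n):
        groups[uf.find(i)].append(detections[i])
    return list(groups.values())


# Hypothetical toy detections (not rows from the catalog).
detections = [
    ("DESI", 10.0000, 20.0000, 1.2),
    ("SDSS", 10.0005, 20.0000, 0.9),    # ~1.7 arcsec from the DESI row
    ("LAMOST", 10.0000, 20.0012, 0.7),  # ~4.3 arcsec from the DESI row
    ("Gaia", 150.0, -5.00, 3.0),
    ("NEOWISE", 150.0, -5.01, 2.0),     # 36 arcsec from the Gaia row: no match
]

clusters = dedup(detections)
multi_survey = [c for c in clusters if len({d[0] for d in c}) > 1]
compression_pct = 100.0 * (len(detections) - len(clusters)) / len(detections)
print(len(clusters), len(multi_survey), compression_pct)
```

Applying the same compression formula to the quoted totals, (388,693 − 378,480) / 388,693, reproduces the 2.628 % figure. Union-find merges transitively, so a chain of sources each within 5″ of the next ends up in one cluster even if its end points are farther apart.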
## Current contents (progressively filling from Paper 3 Table 1)

| File | Rows | Survey | Paper 3 count | Status |
|---|---:|---|---:|---|
| `desi_dr1_anomalies.parquet` | 195,829 | DESI DR1 spectra | 195,829 | **Matches paper** |
| `act_dr6_anomalies.parquet` | 200 | ACT DR6 CMB patches | 200 | **Matches paper** (top-1% filter upstream) |
| `neowise_anomalies.parquet` | 436 | NEOWISE mid-IR variability | 436 | **Matches paper** (top-N by score) |
| `planck_cmb_anomalies.parquet` | 200 | Planck CMB patches | 200 | **Matches paper** (top-N by score) |
| `gaia_dr3_anomalies.parquet` | 500 | Gaia DR3 variable stars | 500 | **Matches paper** (top-N by score) |
| `cmb_native_anomalies.parquet` | 200 | Planck CMB native retrain top-200 | 200 | **PATH-C NATIVE** (fire #96: supersedes `planck_cmb_anomalies.parquet` in Table I) |
| `lamost_dr10_pathc_native.parquet` | 113,342 | LAMOST DR10 native retrain top-1% | n/a | **PATH-C NATIVE** (fire #133: CRITERION #2 CLOSED, 21.4× anomaly-rate reduction vs cross-transfer, 7.72 MB / 10 cols) |
| `neowise_pathc_masked_anomalies.parquet` | 419 | NEOWISE ecliptic-masked Path-C | 419 | **PATH-C NATIVE** (fires #84/#139/#141: CRITERION #5 CLOSED, `\|b_ecl\|<80°` mask + equivalence rationale, 97 KB) |
| `sdss_dr18_pathc_native.parquet` | 77,905 | SDSS DR18 native retrain top-cut | n/a | **PATH-C NATIVE** (fire #164: CRITERION #1 CLOSED, 1,925,279-spectrum rescore, ~6500× anomaly-rate reduction vs cross-transfer, top-cut at S ≥ 0.1060, 3.1 MB) |
| `pathc_unique_objects.parquet` | 378,480 | 8/8-survey dedup unique-physical-object table | n/a | **PATH-C NATIVE** (fires #135/#141/#164: 388,693 → 378,480 at 5″ KD-tree + union-find, 12 MB; **8/8 TRUE** post-SDSS landing) |
| `pathc_multi_survey_matches.parquet` | 637 | 5″ multi-survey anomaly matches | n/a | **PATH-C NATIVE** (fires #135/#141/#164: 637 clusters incl. top DESI+SDSS cluster 9494 @ (4.0446, 1.6023) S=10.02, 38 KB) |
| `pathc_dedup_summary.json` | n/a | Dedup provenance + per-cluster coordinates | n/a | **PATH-C NATIVE** (fires #135/#141: reproducibility manifest) |
| `pathc_neowise_ecliptic_summary.json` | n/a | NEOWISE mask audit (retain/reject counts, pole excess) | n/a | **PATH-C NATIVE** (fires #84/#141: 3.2 KB audit trail) |
| `rescore_summary.json` | n/a | Native CMB re-score headline stats | n/a | **PATH-C NATIVE** (fire #96: median 0.437, p99 0.520, top-200 range [0.558, 0.621]) |
| `injection_recovery.json` | n/a | Native CMB injection-recovery gate result | n/a | **PATH-C NATIVE** (fire #95: 500/500 = 100.0 % at 5× noise, gate PASS by 2× margin) |

**Coverage so far (post-landing fire #164):** 389,031 top-cut survey rows staged (195,829 DESI + 77,905 SDSS native + 500 Gaia + 436 NEOWISE raw + 419 NEOWISE-masked + 200 Planck + 200 Planck-native + 200 ACT + 113,342 LAMOST native), plus 8/8 dedup artefacts at 378,480 unique physical objects / 637 multi-survey clusters.

All 6 Path-C supersession parquets are staged locally: **SDSS native top-77,905** (fire #164) + LAMOST top-1 % 113,342 (fire #133) + CMB native top-200 (fire #96) + NEOWISE-masked 419 (fire #141) + dedup unique 378,480 (fire #164) + dedup multi-survey matches 637 (fire #164). Criterion #10 (public HF push) will upload this complete 8/8 Path-C bundle via `hf_upload_pathc_sdss_landing.py` (fire #162 dedicated additive uploader, replaces the destructive fire-#13 `hf_upload_extend.py`).
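
The headline ratios quoted in the tables above can be re-derived from the raw counts this card reports. The sketch below only redoes that arithmetic. One assumption is labeled in the code: the eROSITA contribution to the dedup input is taken to be the 298-row Paper 3 count, which this card does not state explicitly.

```python
# Per-survey row counts feeding the 8-way dedup, as quoted in this card.
dedup_inputs = {
    "DESI DR1": 195_829,
    "SDSS DR18 native top-cut": 77_905,
    "LAMOST DR10 native top-1%": 113_342,
    "Gaia DR3": 500,
    "NEOWISE masked": 419,
    "eROSITA DR1": 298,  # assumption: the Paper 3 count is what entered the dedup
    "Planck CMB": 200,
    "ACT DR6": 200,
}
total_detections = sum(dedup_inputs.values())

# Anomaly-rate reduction factors: cross-transfer S>5 count / native S>5 count.
sdss_reduction = 77_905 / 12
lamost_reduction = 43_915 / 2_054

# Size of the LAMOST top-1 % slice out of the full native re-score.
lamost_top1pct = round(0.01 * 11_334_161)

# SDSS downloader failure rate over the 1,925,279 re-scored spectra.
sdss_failure_pct = 100.0 * 3_394 / 1_925_279

print(f"dedup input detections: {total_detections:,}")
print(f"SDSS reduction:   {sdss_reduction:,.0f}x")
print(f"LAMOST reduction: {lamost_reduction:.1f}x")
print(f"LAMOST top-1 % rows: {lamost_top1pct:,}")
print(f"SDSS failure rate: {sdss_failure_pct:.2f} %")
```

Under that eROSITA assumption the eight inputs sum to the 388,693 detections quoted for the dedup, and the ratios round to the ~6500×, 21.4×, 113,342 and 0.18 % figures in the tables.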
Remaining blocks (tracked as `P3-HF-UPLOAD-EXTEND` in the bigbounce repo's drive-to-100 queue) require a **pod regeneration**, since the 2026-04-08 H200 snapshot's score-distribution files for these surveys carry `data_source: "synthetic"` (aggregate statistics only; no row-level anomaly table with RA/Dec/score):

| Survey | Paper 3 count | Status |
|---|---:|---|
| SDSS DR18 | 77,905 | Pod regen needed: H200 snapshot has synthetic score-dist only |
| LAMOST DR10 | 44,075 | Pod regen needed: synthetic score-dist only |
| eROSITA DR1 | 298 | Pod regen needed: synthetic score-dist only |

## Column schemas

### DESI DR1 (`desi_dr1_anomalies.parquet`)

| Column | Type | Description |
|---|---|---|
| `tid` | int | Internal pipeline ID (negative = synthetic test-time placeholder; kept for provenance) |
| `ra` | float | Right Ascension (deg, ICRS) |
| `dec` | float | Declination (deg, ICRS) |
| `score` | float | BigAE reconstruction loss (MSE over normalized flux) |
| `worst` | str | Filter with highest per-filter residual ratio (`B` · `R` · `Z`) |
| `rB` / `rR` / `rZ` | float | Residual ratio per DESI filter |

### ACT DR6 + Planck CMB (`act_dr6_anomalies.parquet`, `planck_cmb_anomalies.parquet`)

| Column | Type | Description |
|---|---|---|
| `patch_idx` | int | Scored-patch index |
| `ra` | float | Patch-centre Right Ascension (deg, ICRS) |
| `dec` | float | Patch-centre Declination (deg, ICRS) |
| `anomaly_score` | float | CMB-patch autoencoder reconstruction score |

### NEOWISE (`neowise_anomalies.parquet`)

22-column mid-IR variability feature vector per source, top 436 by `anomaly_score`. Columns include `source_id`, `ra`, `dec`, `n_epochs`, `time_span`, per-band means/std/amplitude/chi² (W1/W2), Stetson J, inter-band color and color variance, and `anomaly_score`. The Paper 3 novelty analysis uses RA/Dec + W1-W2 color for cross-match.
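
Two derivations implied by the schemas above are the DESI `worst` column (the filter with the highest residual ratio among `rB`, `rR`, `rZ`) and the top-N-by-score cut used for the NEOWISE table. The sketch below illustrates both on hypothetical rows held as plain dictionaries. The helper names `worst_filter` and `top_n` are illustrative, and the tie-breaking behaviour (first filter in B, R, Z order wins) is an assumption the card does not specify.

```python
# Hypothetical rows following the DESI DR1 column schema (values are made up).
rows = [
    {"tid": 101, "ra": 10.1, "dec": 20.3, "score": 0.82, "rB": 1.2, "rR": 3.4, "rZ": 0.9},
    {"tid": 102, "ra": 11.7, "dec": -3.5, "score": 2.15, "rB": 5.0, "rR": 1.1, "rZ": 1.3},
    {"tid": 103, "ra": 98.4, "dec": 45.0, "score": 0.05, "rB": 0.8, "rR": 0.9, "rZ": 1.6},
    {"tid": 104, "ra": 200.2, "dec": 7.9, "score": 1.40, "rB": 1.0, "rR": 1.0, "rZ": 0.7},
]


def worst_filter(row):
    """Return the filter letter with the highest residual ratio.

    Ties resolve to the first filter in B, R, Z order (an assumption).
    """
    ratios = {"B": row["rB"], "R": row["rR"], "Z": row["rZ"]}
    return max(ratios, key=ratios.get)


def top_n(rows, n, key="score"):
    """Return the n rows with the largest value of `key`, highest first."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]


for row in rows:
    row["worst"] = worst_filter(row)

top_two = top_n(rows, 2)
print([(r["tid"], r["score"], r["worst"]) for r in top_two])
```

With the Parquet files loaded into a dataframe library, the same two steps map onto a row-wise arg-max over the three ratio columns and a sort on the score column (`score` for DESI, `anomaly_score` for the other tables).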
### Gaia DR3 (`gaia_dr3_anomalies.parquet`)

27-column photometric-variability feature vector per source, top 500 by `anomaly_score`. Columns include `source_id`, `ra`, `dec`, per-band (G / BP / RP) mean / std / num_selected, range/MAD/skewness/kurtosis of G, BP-RP color + color variance, and `anomaly_score`.

## Caveats and paper cross-references

- **ACT DR6 model under-trained** (§7.2 of Paper 3): the ACT DR6 block is reported for coverage but should be retrained on the full Planck+ACT matched-filter map set before production cosmology use.
- **NEOWISE ecliptic systematic** (§3.3 and §7.2): the raw NEOWISE row-level table was cleaned with a galactic-plane mask prior to scoring, but the surviving top-436 still inherited the survey's polar-cadence-linked false-positive profile. The Path-C rebuild (fire #84, criterion #5) adds a post-hoc ecliptic-latitude mask `|b_ecl| < 80°` that removes 17 objects concentrated in the 10°-radius polar caps at 2.6× the uniform-null expectation (1.52 % of sky area). The published Path-C NEOWISE set is the 419/436 = 96.1 % of the raw catalog surviving the mask; the rejected 17 objects are preserved at `pipelines/p3_anomaly_engine/pathc_neowise_ecliptic/` in the bigbounce repo for auditability. Cross-match against Gaia DR3 + AllWISE is still recommended before claiming any individual source.
- **Planck 200 (cross-transfer, superseded by Path-C)**: the cross-transfer `planck_cmb_anomalies.parquet` represents only 19,296 scored patches with no galactic mask and val_loss 22,420 (effectively untrained). Path-C criterion #3 (fires #83/#94/#95/#96) retrained a native Planck CMB convolutional autoencoder with a `|b|≥20°` galactic-plane mask on a 200K-patch Planck SMICA refresh (best_val 0.4437 at epoch 99/150, ~5×10⁴ improvement) and full-rescored the 200K-patch set; the resulting `cmb_native_anomalies.parquet` (top 200 by score, range [0.558, 0.621]) is the authoritative Path-C Planck CMB block and supersedes the cross-transfer file for all Paper 3 cosmological inference. The cross-transfer file is preserved here as the §7 before/after baseline.
- **Gaia 500**: the paper's 500-row cut is top-0.1 % of the 50,000-star expanded sample, not the generic top-1 %.

## Novelty classification status

SIMBAD cross-match (5″ cone, `projects/cross_survey/results/simbad_crossmatch_summary.json` in the bigbounce repo) on the top 100 anomalies per survey: 41 % matched / 59 % SIMBAD-novel. A deep NED pass on the top-20 SIMBAD-novel SDSS subset (P3-B, fire #10, partial) yielded ~45 % NED-archival-identified, so Paper 3's "58.8 % novel" headline likely over-estimates true novelty; a full NED + VizieR reclassification is paused on NED TAP service timeouts (documented in the bigbounce repo's `drive-to-100.md`). This card will be refreshed once the reclassification completes.
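
The NEOWISE polar-cap figures quoted in the caveats (1.52 % of sky area, 2.6× excess, 419/436 retained) follow from spherical geometry and can be checked with a short sketch. The ecliptic-latitude conversion below uses the standard equatorial-to-ecliptic rotation with the J2000 mean obliquity; the function names are illustrative and are not taken from the bigbounce pipeline.

```python
import math

OBLIQUITY_DEG = 23.4392911  # J2000 mean obliquity of the ecliptic


def ecliptic_latitude_deg(ra_deg, dec_deg):
    """Ecliptic latitude from equatorial coordinates (both in degrees)."""
    ra, dec, eps = (math.radians(v) for v in (ra_deg, dec_deg, OBLIQUITY_DEG))
    sin_b = (math.sin(dec) * math.cos(eps)
             - math.cos(dec) * math.sin(eps) * math.sin(ra))
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_b))))


def passes_ecliptic_mask(ra_deg, dec_deg, limit_deg=80.0):
    """True if the source survives the |b_ecl| < limit cut."""
    return abs(ecliptic_latitude_deg(ra_deg, dec_deg)) < limit_deg


# Fraction of the sphere inside the two polar caps |b_ecl| >= 80 deg:
# each cap covers (1 - sin 80 deg) / 2 of the sky, so both together cover
# 1 - sin 80 deg.
cap_fraction = 1.0 - math.sin(math.radians(80.0))

# Observed fraction of the raw top-436 falling in the caps, relative to the
# expectation for a uniform sky distribution.
rejected, total = 17, 436
excess = (rejected / total) / cap_fraction
retained_pct = 100.0 * (total - rejected) / total

print(f"cap sky fraction: {100 * cap_fraction:.2f} %")
print(f"polar excess:     {excess:.1f}x")
print(f"retained:         {retained_pct:.1f} %")

# The north ecliptic pole (RA 270 deg, Dec about +66.56 deg) is rejected,
# while a source on the ecliptic (RA 0, Dec 0) is kept.
print(passes_ecliptic_mask(270.0, 66.56), passes_ecliptic_mask(0.0, 0.0))
```

The uniform-null comparison assumes sources would otherwise be spread evenly over the sphere; any sky-coverage weighting applied in Paper 3 is not modelled here.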
## Citation

```
@article{Golden:2026multiSurveyAnomalies,
  author  = {Golden, Houston},
  title   = {Multi-Survey Anomaly Engine for Bounce-Cosmology Observables},
  journal = {arXiv preprint},
  year    = {2026},
  note    = {paper reference will be updated post-submission}
}
```

## Paper 3 companion artifacts (not yet uploaded here)

- BigAE checkpoint (`projects/sdss-dr18/best_model_47k.pt`)
- Path-C native BigAE checkpoints (`best_sdss_native.pt` val_loss 0.0311 gate PASS + `best_lamost_native.pt` val_loss 0.0329 gate PASS, fire #80), to be uploaded under `models/pathc_native/` after the full re-score completes (criterion #10 deliverable)
- Path-C native Planck CMB autoencoder checkpoint (`best_cmb_native.pt`, best_val 0.4437 at epoch 99/150, ~5×10⁴ improvement over cross-transfer 22,420, injection-recovery 500/500 = 100.0 % at 5σ; **CRITERION #3 CLOSED fire #95**), to be uploaded alongside the native Planck CMB parquet block
- Second-level latent AE for recursive anomaly detection (16-D latent)
- Emission-line finder checkpoint (4,526 redshifts · 96.9 % AGN-BPT-classified)
- Appendix-D gallery images (21 publication PDFs · 16 real DESI cutouts each)

## Data availability (from Paper 3 §9)

Canonical source for all per-survey raw scored catalogs lives in the bigbounce repository, `pipelines/h200_results/pod_backup_20260408_full/`. This HuggingFace dataset is the distribution surface; any discrepancy should be resolved in favor of the repository snapshot.

## License

CC-BY-4.0 (content) + public-domain star/galaxy coordinates. Model checkpoints when uploaded will carry Apache-2.0.

---

Generated by `pipelines/p3_anomaly_engine/hf_upload_*.py`.
Last refresh: drive-to-100 fire #158 (Path-C HF-rebuild README metadata sync). The headline Path-C % holds at **≈79 %** under the log-tail-anchored criterion #1 convention (9.540/12 = 79.50 %, same as fire #157). Parquets-on-disk advanced 445 → 449 = 95.3 % silently, below the 5-batch log-print cadence; counting them would advance the weighted sum to 9.548/12 = 79.57 % and cross the ≈79 % → ≈80 % rounding boundary cleanly on the next log-print landing. This is pre-positioned here so the next fire can bundle the site-sync with no ambiguity.

Pod-watchdog SDSS-rescore progress across the 4 fires since fire #154's README refresh: log-tail batch 400/471 → 445/471 = 84.9 % → **94.5 %** (+9.6 %, +45 batches logged, +184,137 spectra scored), plus a silent +4 parquets-on-disk past the log-print cadence to 449/471 = **95.3 %** file-count (effective landing-proximity signal). The rate holds at 10.5 → 10.6 spec/s (steady-state after the fire #156 network-cache burst), and the remaining wall-clock ETA dropped from 7.7 h to **2.8 h**.

Interim fires were all non-README work: #146 pod watchdog, #147 index.md banner refresh, #148/#149/#150/#151/#152 DESI k-fold 5-stage dry-run de-risking arc (fold-split + training + aggregator + score-driver author + score-driver dry-run), #153 Paper 3 §pathc stability-paragraph draft authoring.

5 of 6 Path-C supersession parquets remain staged locally (neowise-masked 97 KB, unique-objects 9.7 MB, multi-survey-matches 6.4 KB, CMB native 8 KB, LAMOST native 7.7 MB + 2 provenance JSONs); SDSS-native is the sole remaining gate (batch 400/471 = 84.9 %, scored 1,635,617 / 1,928,673 success @ 10.5/s, ETA ~7.7 h on A100 pod `ktds4mkmzb7ven`) before the atomic 8/8 dedup re-run + public HF push. No parquet content changes in this fire: README metadata sync only.

Provider:

bamfai
