bingbangboom/exoplanet-transit-detection
Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/bingbangboom/exoplanet-transit-detection
---
license: cc-by-4.0
task_categories:
- tabular-classification
language:
- en
tags:
- astronomy
- exoplanets
- transit-detection
- kepler
- tess
- k2
- light-curves
- time-series
- astrophysics
size_categories:
- 10K<n<100K
---
# Exoplanet Transit Detection Dataset

A multi-mission exoplanet transit dataset combining observations from NASA's
**Kepler**, **K2**, and **TESS** missions. Each row contains a pre-processed
light curve stored as five float32 arrays (one raw, four phase-folded views),
along with stellar metadata, transit parameters, and disposition labels.
The dataset includes a **NO_SIGNAL** class: stars from the Kepler stellar
catalog that never triggered a Threshold Crossing Event (TCE). These stars
have real photometric data but no detected transit signal.
> This dataset was published as a submission to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge) powered by [Adaptive Data](https://www.adaptionlabs.ai/blog/adaption-launches-adaptive-data-beta).
---
## Dataset Summary
| Statistic | Value |
|---|---|
| Total rows | 23,567 |
| Missions | Kepler, K2, TESS |
| Disposition classes | 3 (PLANET, FALSE_POSITIVE, NO_SIGNAL) |
| Light curve views per row | 5 (raw + 4 phase-folded) |
| Split ratio | 80% train / 10% val / 10% test |
| Stratification | By disposition × mission |
## Dataset Examples
Below are three randomly sampled targets demonstrating the five photometric arrays provided for every star, and how they differ across the primary target classes.
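### 1. Confirmed Planet
[Figure: the five flux arrays for a sampled PLANET target]
### 2. False Positive
[Figure: the five flux arrays for a sampled FALSE_POSITIVE target]
### 3. No Signal
[Figure: the five flux arrays for a sampled NO_SIGNAL target]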
### Class Distribution
| Disposition | Count | Percentage |
|---|---|---|
| PLANET | 9,791 | 41.5% |
| FALSE_POSITIVE | 5,936 | 25.2% |
| NO_SIGNAL | 7,840 | 33.3% |
| **Total** | **23,567** | |
### Mission Distribution
| Mission | Count |
|---|---|
| kepler | 16,054 |
| tess | 6,028 |
| k2 | 1,485 |
### Split Sizes
| Split | Rows |
|---|---|
| train | 18,853 |
| val | 2,357 |
| test | 2,357 |
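An 80/10/10 split stratified by disposition × mission can be approximated as below. This is a minimal sketch, not the splitting code actually used for the dataset: the function name `stratified_split`, the seed, and the per-stratum rounding are all assumptions made for illustration.

```python
import numpy as np

def stratified_split(disposition, mission, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign train/val/test labels within each (disposition, mission) stratum.

    Hypothetical helper; rounding is done per stratum, so overall ratios
    are only approximately 80/10/10.
    """
    rng = np.random.default_rng(seed)
    strata = np.array([f"{d}|{m}" for d, m in zip(disposition, mission)])
    split = np.empty(strata.size, dtype=object)

    for key in np.unique(strata):
        # Shuffle the row indices belonging to this stratum.
        idx = rng.permutation(np.flatnonzero(strata == key))
        n_train = int(round(fractions[0] * idx.size))
        n_val = int(round(fractions[1] * idx.size))
        split[idx[:n_train]] = "train"
        split[idx[n_train:n_train + n_val]] = "val"
        split[idx[n_train + n_val:]] = "test"  # remainder goes to test
    return split
```

Splitting inside each stratum keeps small groups (for example K2 rows) represented in every split in roughly the same proportion as in the full table.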
---
## Data Sources
All label data was retrieved from the **NASA Exoplanet Archive** through its TAP API.
All light curve data was downloaded from the **Mikulski Archive for Space
Telescopes (MAST)** with the Lightkurve Python package.
### Label Catalogs
| Catalog | API Table | Content |
|---|---|---|
| Kepler KOI Cumulative | `cumulative` | Kepler Objects of Interest with Robovetter dispositions and `koi_score` |
| TESS TOI | `toi` | TESS Objects of Interest with TFOPWG dispositions |
| K2 Planets & Candidates | `k2pandc` | K2 confirmed planets and candidates |
| Kepler Stellar | `KEPLERSTELLAR` | Full Kepler target catalog (~200k stars) for NO_SIGNAL sampling |
### Light Curve Sources
| Mission | Product | Cadence |
|---|---|---|
| Kepler | PDCSAP flux (long cadence) | ~29.4 min |
| K2 | PDCSAP flux (standard pipeline) | ~29.4 min |
| TESS | PDCSAP flux (2-minute cadence) | 2 min |
All light curves were obtained through `lightkurve.search_lightcurve()`.
When multiple quarters/sectors/campaigns were available, they were stitched
into a single continuous light curve using Lightkurve's `.stitch()` method.
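The download-and-stitch step described above might look roughly like the following. This is a sketch under stated assumptions: the helper names, the `SEARCH_KWARGS` settings per mission, and the example ID are illustrative and are not taken from the dataset's build code. Only `search_lightcurve`, `download_all`, and `stitch` are Lightkurve calls.

```python
def to_search_name(star_id: str) -> str:
    """Convert a dataset ID such as 'KIC_11446443' into 'KIC 11446443'."""
    prefix, _, number = star_id.partition("_")
    return f"{prefix} {number}"

# Assumed search settings; the card states only the product and cadence.
SEARCH_KWARGS = {
    "kepler": {"author": "Kepler", "exptime": "long"},
    "k2": {"author": "K2", "exptime": "long"},
    "tess": {"author": "SPOC", "exptime": 120},  # 2-minute cadence, in seconds
}

def fetch_stitched(star_id: str, mission: str):
    """Download all PDCSAP light curves for one target and stitch them."""
    import lightkurve as lk  # lazy import: the helpers above work without it

    search = lk.search_lightcurve(to_search_name(star_id), **SEARCH_KWARGS[mission])
    collection = search.download_all(flux_column="pdcsap_flux")
    if collection is None:  # nothing available on MAST for this target
        raise LookupError(f"no light curves found for {star_id}")
    return collection.stitch()
```

A call such as `fetch_stitched("KIC_11446443", "kepler")` needs network access to MAST; the ID here is only an example and is not guaranteed to be a row in this dataset.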
---
## Data Processing
### Light Curve Preprocessing
1. **Download**: PDCSAP flux retrieved from MAST for each target
2. **Stitching**: Multiple quarters/sectors combined into one time series
3. **Normalization**: Flux divided by its median to produce relative flux
4. **Outlier removal**: Points beyond 5-sigma from the local median removed
5. **NaN handling**: Remaining NaN values filled via linear interpolation
6. **Resampling**: Raw flux resampled to a fixed length of 20,000 points
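Steps 3 to 6 above can be sketched with NumPy alone. The card does not specify the window used for the local median, how sigma is estimated, or the exact resampling scheme, so the 101-point window, the MAD-based sigma, and index-based linear resampling below are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def preprocess_flux(flux, n_out=20_000, window=101, n_sigma=5.0):
    """Normalize, clip outliers, fill NaNs, and resample to a fixed length.

    `window` must be odd. All parameter defaults except n_out and n_sigma
    are illustrative choices, not values documented for this dataset.
    """
    flux = np.asarray(flux, dtype=np.float64)

    # Step 3: divide by the median to get relative flux.
    flux = flux / np.nanmedian(flux)

    # Step 4: flag points more than n_sigma from a running median.
    half = window // 2
    padded = np.pad(flux, half, mode="edge")
    local_median = np.nanmedian(sliding_window_view(padded, window), axis=1)
    resid = flux - local_median
    sigma = 1.4826 * np.nanmedian(np.abs(resid - np.nanmedian(resid)))
    flux = np.where(np.abs(resid) > n_sigma * sigma, np.nan, flux)

    # Step 5: fill NaNs (original gaps and removed outliers) linearly.
    idx = np.arange(flux.size)
    good = np.isfinite(flux)
    flux = np.interp(idx, idx[good], flux[good])

    # Step 6: resample onto n_out evenly spaced positions.
    grid = np.linspace(0, flux.size - 1, n_out)
    return np.interp(grid, idx, flux).astype(np.float32)
```

Because the resampling works on sample index rather than time, gaps between quarters or sectors are compressed; anyone needing true timestamps should go back to the FITS files listed in `source_urls`.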
### Phase-Folded Views
For stars with known orbital period, transit epoch, and transit duration:
| View | Length | Description |
|---|---|---|
| `flux_raw` | 20,000 | Raw (unfolded) normalized PDCSAP flux |
| `flux_global` | 201 | Phase-folded at the orbital period, full phase coverage |
| `flux_local` | 81 | Phase-folded, zoomed to ±4× transit duration around mid-transit |
| `flux_odd` | 201 | Odd-numbered transits only, phase-folded |
| `flux_even` | 201 | Even-numbered transits only, phase-folded |
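The folded views in the table can be illustrated with a short sketch. It assumes median binning (as in Shallue & Vanderburg 2018, which the card cites for its global and local views), assumes the odd and even views span the full phase, and assumes a particular transit-numbering convention; none of these details are confirmed by the card. The function names are hypothetical.

```python
import numpy as np

def fold(time, period, epoch):
    """Phase in days, centred on mid-transit, in [-period/2, period/2)."""
    return (time - epoch + 0.5 * period) % period - 0.5 * period

def binned_view(phase, flux, lo, hi, n_bins):
    """Median-bin flux over [lo, hi); empty bins get the in-range median."""
    edges = np.linspace(lo, hi, n_bins + 1)
    which = np.digitize(phase, edges) - 1  # values outside 0..n_bins-1 are out of range
    in_range = (which >= 0) & (which < n_bins)
    view = np.full(n_bins, np.median(flux[in_range]), dtype=np.float32)
    for b in np.unique(which[in_range]):
        view[b] = np.median(flux[which == b])
    return view

def make_views(time, flux, period_days, duration_hrs, epoch):
    """Build global, local, odd, and even views for one light curve."""
    phase = fold(time, period_days, epoch)
    half = 0.5 * period_days
    local_half = 4.0 * duration_hrs / 24.0  # ±4 transit durations, in days
    # Transit number relative to the epoch; parity convention is an assumption.
    cycle = np.floor((time - epoch) / period_days + 0.5).astype(int)
    odd = cycle % 2 == 1
    return {
        "flux_global": binned_view(phase, flux, -half, half, 201),
        "flux_local": binned_view(phase, flux, -local_half, local_half, 81),
        "flux_odd": binned_view(phase[odd], flux[odd], -half, half, 201),
        "flux_even": binned_view(phase[~odd], flux[~odd], -half, half, 201),
    }
```

With this layout the transit sits at the centre bin of each view (index 100 for the 201-point views, index 40 for the 81-point view), and a depth mismatch between `flux_odd` and `flux_even` is the usual signature of an eclipsing binary.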
### NO_SIGNAL Class Construction
NO_SIGNAL targets were sampled from the Kepler Stellar catalog, selecting
stars that have **no** associated KOI (Kepler Object of Interest). Sampling
was stratified by 3-hour CDPP noise floor.
NO_SIGNAL rows have:
- `disposition = "NO_SIGNAL"`, `label_source = "no_signal"`
- All transit parameters (`period_days`, `duration_hrs`, `depth_ppm`, `planet_radius_earth`, `epoch_bjd`) are **NaN** — there is no transit to characterize
- All FP flags are **NaN** — the Kepler Robovetter was not applied
- All folded views (`flux_global`, `flux_local`, `flux_odd`, `flux_even`) are **zero-filled**
- `flux_raw` contains **real photometric data** (not synthetic)
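Because zero-filled folded views carry no information, a model that consumes them may need to detect and mask such rows. The check below is a small sketch; `has_folded_views` is a hypothetical helper name, and `row` can be a `dict` or a pandas row holding the byte columns.

```python
import numpy as np

FOLDED_COLUMNS = ("flux_global", "flux_local", "flux_odd", "flux_even")

def decode(buf):
    """Rebuild a float32 array from the stored bytes."""
    return np.frombuffer(buf, dtype=np.float32)

def has_folded_views(row, columns=FOLDED_COLUMNS):
    """Return True if at least one folded view contains a non-zero sample."""
    return any(bool(decode(row[c]).any()) for c in columns)
```

For NO_SIGNAL rows this should return `False`. The `has_transit_params` column carries the same information without decoding the arrays, so this check is mainly useful as a consistency test.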
### Disposition Labels
| Disposition | Source | Description |
|---|---|---|
| PLANET | NASA archive | Confirmed planets and high-confidence candidates |
| FALSE_POSITIVE | NASA archive | Rejected candidates |
| NO_SIGNAL | Kepler Stellar catalog | Stars with no detected transit signal |
### Confidence Scores
The `confidence_score` column provides a continuous 0-1 soft label:
| Source | Score Assignment |
|---|---|
| Kepler CONFIRMED | 1.0 |
| Kepler Robovetter | Raw `koi_score` (0-1 from automated vetting) |
| TESS TFOPWG Planetary Candidate | 0.9 |
| TESS Ambiguous (APC) | 0.5 |
| TESS False Positive | 0.0 |
| K2 CONFIRMED | 1.0 |
| K2 CANDIDATE | 0.7 |
| NO_SIGNAL | 1.0 (high confidence of no signal) |
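Note that the score means different things across rows: for archive-labelled rows it behaves like a planet likelihood (a TESS false positive gets 0.0), while for NO_SIGNAL rows 1.0 expresses confidence in the NO_SIGNAL label itself. One possible way to turn it into a three-class soft target is sketched below. This mapping is an interpretation, not something the card defines, and `soft_target` is a hypothetical helper.

```python
import numpy as np

CLASSES = ("PLANET", "FALSE_POSITIVE", "NO_SIGNAL")

def soft_target(disposition, confidence_score):
    """Return a probability vector ordered as CLASSES.

    Assumed reading: for PLANET/FALSE_POSITIVE rows the score is treated as
    P(planet); NO_SIGNAL rows get all mass on NO_SIGNAL.
    """
    if disposition == "NO_SIGNAL":
        return np.array([0.0, 0.0, 1.0], dtype=np.float32)
    if np.isnan(confidence_score):
        # No score available: fall back to the hard label.
        p = 1.0 if disposition == "PLANET" else 0.0
    else:
        p = float(np.clip(confidence_score, 0.0, 1.0))
    return np.array([p, 1.0 - p, 0.0], dtype=np.float32)
```

Feeding `confidence_score` directly into a loss as "probability of planet" without a step like this would treat every NO_SIGNAL row as a certain planet.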
### False Positive Flags (Kepler only)
Four binary flags from the Kepler Robovetter indicate the reason a KOI
was classified as a false positive. These are only populated for Kepler
targets and are **NaN** for TESS, K2, and NO_SIGNAL rows (the Robovetter
was never applied to those targets).
| Flag | Meaning |
|---|---|
| `fp_flag_not_transit` | Not transit-like |
| `fp_flag_stellar` | Stellar eclipse |
| `fp_flag_centroid` | Centroid offset |
| `fp_flag_ephemeris` | Ephemeris contamination |
---
## Schema
### Loading Light Curves
```python
import numpy as np
import pandas as pd
train = pd.read_parquet("train.parquet")
row = train.iloc[0]
# Reconstruct float32 arrays from bytes
flux_raw = np.frombuffer(row["flux_raw"], dtype=np.float32) # (20000,)
flux_global = np.frombuffer(row["flux_global"], dtype=np.float32) # (201,)
flux_local = np.frombuffer(row["flux_local"], dtype=np.float32) # (81,)
flux_odd = np.frombuffer(row["flux_odd"], dtype=np.float32) # (201,)
flux_even = np.frombuffer(row["flux_even"], dtype=np.float32) # (201,)
label = row["disposition"] # "PLANET" / "FALSE_POSITIVE" / "NO_SIGNAL"
confidence = row["confidence_score"] # 0.0-1.0 soft label
```
### Full Column Reference
#### Identity & Classification
| Column | Type | Description |
|---|---|---|
| `star_id` | str | Unique star identifier with mission prefix (KIC_, TIC_, EPIC_) |
| `mission` | str | Source mission: `kepler`, `tess`, or `k2` |
| `cadence` | str | Observation cadence: `long` (Kepler/K2) or `2min` (TESS) |
| `disposition` | str | Label: `PLANET`, `FALSE_POSITIVE`, or `NO_SIGNAL` |
| `label_source` | str | Provenance of the label (see Disposition Labels above) |
| `confidence_score` | float32 | Continuous 0-1 soft label (see Confidence Scores above) |
| `has_transit_params` | bool | True if period, duration, and epoch are all available |
#### False Positive Flags (Kepler Only)
| Column | Type | Description |
|---|---|---|
| `fp_flag_not_transit` | float32 | Not transit-like (0 or 1; NaN for TESS/K2/NO_SIGNAL) |
| `fp_flag_stellar` | float32 | Stellar eclipse (0 or 1; NaN for TESS/K2/NO_SIGNAL) |
| `fp_flag_centroid` | float32 | Centroid offset (0 or 1; NaN for TESS/K2/NO_SIGNAL) |
| `fp_flag_ephemeris` | float32 | Ephemeris contamination (0 or 1; NaN for TESS/K2/NO_SIGNAL) |
| `disposition_disputed` | int8 | Disposition changed across Kepler quarterly releases (0 or 1) |
#### Light Curve Arrays
| Column | Type | Description |
|---|---|---|
| `flux_raw` | bytes | float32 array (20000,) -- raw normalized PDCSAP, NOT phase-folded |
| `flux_global` | bytes | float32 array (201,) -- phase-folded global view |
| `flux_local` | bytes | float32 array (81,) -- phase-folded local zoom around transit |
| `flux_odd` | bytes | float32 array (201,) -- odd-numbered transits, phase-folded |
| `flux_even` | bytes | float32 array (201,) -- even-numbered transits, phase-folded |
#### Stellar Parameters
| Column | Type | Description |
|---|---|---|
| `teff` | float32 | Effective temperature (K) |
| `logg` | float32 | Surface gravity (log g) |
| `radius` | float32 | Stellar radius (solar radii) |
| `mass` | float32 | Stellar mass (solar masses) |
| `metallicity` | float32 | Metallicity [Fe/H] |
| `kepmag` | float32 | Kepler magnitude (Kepler/K2) or TESS magnitude (TESS) |
| `cdpp_3hr` | float32 | 3-hour Combined Differential Photometric Precision (ppm) |
| `n_planets_in_system` | int16 | Number of planet candidates around this star |
#### Transit Parameters
| Column | Type | Description |
|---|---|---|
| `period_days` | float32 | Orbital period in days (NaN for NO_SIGNAL) |
| `duration_hrs` | float32 | Transit duration in hours (NaN for NO_SIGNAL) |
| `depth_ppm` | float32 | Transit depth in parts per million (NaN for NO_SIGNAL) |
| `planet_radius_earth` | float32 | Estimated planet radius in Earth radii (NaN for NO_SIGNAL) |
| `epoch_bjd` | float32 | Mid-transit epoch in mission-native time system (NaN for NO_SIGNAL) |
#### Provenance
| Column | Type | Description |
|---|---|---|
| `ra` | float32 | Right ascension (degrees, J2000) |
| `dec` | float32 | Declination (degrees, J2000) |
| `kepoi_name` | str | KOI or TOI identifier from the archive |
| `source_urls` | str | Comma-separated URLs to the original FITS files on MAST |
| `flux_raw_len` | int | Length of flux_raw array (20000) |
| `flux_global_len` | int | Length of flux_global array (201) |
| `flux_local_len` | int | Length of flux_local array (81) |
| `flux_folded_len` | int | Length of flux_odd and flux_even arrays (201) |
### Time Systems for `epoch_bjd`
| Mission | Time System | Definition |
|---|---|---|
| Kepler | BKJD | BJD - 2454833.0 |
| K2 | BKJD | BJD - 2454833.0 |
| TESS | BTJD | BJD - 2457000.0 |
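Converting `epoch_bjd` to full BJD only requires adding the mission offset from the table. The sketch below uses a hypothetical helper, `to_bjd`; the one detail that matters is casting to 64-bit float first, because float32 cannot resolve fractions of a day at magnitudes around 2.45 million.

```python
import numpy as np

# Offsets from the table above (days).
BJD_OFFSET = {"kepler": 2454833.0, "k2": 2454833.0, "tess": 2457000.0}

def to_bjd(epoch, mission):
    """Convert a mission-native epoch (BKJD or BTJD) to full BJD as float64."""
    return np.float64(epoch) + BJD_OFFSET[mission]
```

For example, a Kepler epoch of 131.5 BKJD corresponds to BJD 2454964.5.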
---
## Rows with Incomplete Transit Parameters
15,325 rows have `has_transit_params = True` (complete period, duration,
and epoch -- producing valid phase-folded views).
8,242 rows have `has_transit_params = False`. These fall into two
categories:
1. **NO_SIGNAL stars** (7,840 rows): No transit parameters
exist because no transit was detected. Folded views are zero-filled by design.
2. **PLANET/FALSE_POSITIVE with incomplete catalog data** (~402 rows): The NASA
archive lacks one or more of period, duration, or epoch for these entries.
This primarily affects K2 candidates missing transit duration (~320 rows)
and TESS mono-transit candidates missing orbital period (~82 rows). Their
folded views are zero-filled, but `flux_raw` contains valid photometry.
---
## Notes
1. **One row per star per mission:** Multi-planet systems are represented by
their first catalog entry only. Additional planets' transits appear as
noise in the folded views. The `n_planets_in_system` column indicates
multiplicity.
2. **No stellar variability detrending:** Phase-folded views are built from
stitched and normalized flux without additional flattening.
3. **TESS temporal coverage:** TESS targets observed in only 1-2 sectors have
very few transits, resulting in noisier folded views compared to Kepler's
4-year baseline. The `flux_raw` for these targets is also shorter in
effective duration.
4. **Incomplete transit parameters:** ~402 PLANET/FALSE_POSITIVE rows have
incomplete transit parameters in the NASA archive. See the section above
on "Rows with Incomplete Transit Parameters" for details.
5. **NO_SIGNAL class is Kepler-only:** All NO_SIGNAL targets are drawn from
the Kepler stellar catalog. There is no TESS or K2 equivalent in this
dataset.
6. **Metadata NaN rates:** Some stellar parameter columns have significant
NaN rates due to incomplete catalog coverage:
| Column | NaN Rate |
|---|---|
| `teff` | 3.8% |
| `logg` | 7.1% |
| `radius` | 3.2% |
| `mass` | 37.0% |
| `metallicity` | 34.0% |
| `cdpp_3hr` | 45.0% |
7. **Failed downloads:** 1,024 of the 24,591 targets in the combined target
list could not be downloaded from MAST (no FITS files available) and are
excluded from the final dataset.
8. **Light curve resampling:** All `flux_raw` arrays are resampled to a fixed
length of 20,000 points. Stars with fewer raw observations are
upsampled (linear interpolation); stars with more are downsampled. The
original FITS files can be accessed via the `source_urls` column for
full-resolution data.
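Given the NaN rates listed in note 6, stellar metadata usually needs explicit handling before it reaches a model. One common approach, shown here as a sketch rather than a recommendation from the dataset author, is to fill with the median and keep a missingness mask as an extra feature. `fill_with_mask` is a hypothetical helper.

```python
import numpy as np

def fill_with_mask(values):
    """Replace NaNs with the column median and return (filled, missing_mask)."""
    values = np.asarray(values, dtype=np.float32)
    missing = np.isnan(values)
    # Fall back to 0.0 if the whole column is missing.
    fill = np.nanmedian(values) if (~missing).any() else np.float32(0.0)
    filled = np.where(missing, fill, values).astype(np.float32)
    return filled, missing
```

To avoid leaking information between splits, the fill value would normally be computed on the training split and reused for validation and test.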
---
## License
This dataset is released under **CC-BY-4.0**.
The underlying observational data is from NASA's Kepler, K2, and TESS missions,
made publicly available through the Mikulski Archive for Space Telescopes (MAST)
and the NASA Exoplanet Archive. Original data products are in the public domain
as works of the U.S. Government.
---
## Citation
If you use this dataset, please cite it along with the original data sources:
```bibtex
@dataset{exoplanet_transit_detection_dataset,
title = {Exoplanet Transit Detection Dataset},
year = {2026},
note = {Multi-mission dataset (Kepler, K2, TESS) with NO_SIGNAL class for transit detection},
license = {CC-BY-4.0},
}
```
### Acknowledgments
This dataset makes use of data from the following sources:
- **NASA Exoplanet Archive**: Operated by Caltech under contract with NASA
under the Exoplanet Exploration Program.
DOI: [10.26133/NEA4](https://doi.org/10.26133/NEA4)
- **Lightkurve**: A Python package for Kepler, K2, and TESS data analysis.
Lightkurve Collaboration (2018). *Lightkurve: Kepler and TESS time series
analysis in Python.* Astrophysics Source Code Library, [ascl:1812.013](https://ascl.net/1812.013)
- **Kepler Robovetter**: Dispositions and koi_score values from the Kepler
Robovetter. Thompson, S.E., et al. (2018). *Planetary Candidates Observed
by Kepler. VIII.* ApJS, 235, 38. DOI: [10.3847/1538-4365/aab4f9](https://doi.org/10.3847/1538-4365/aab4f9)
- **Phase-folding methodology**: The global and local view preprocessing
approach follows Shallue, C.J. & Vanderburg, A. (2018). *Identifying
Exoplanets with a Neural Network.* AJ, 155, 94. DOI: [10.3847/1538-3881/aa9e09](https://doi.org/10.3847/1538-3881/aa9e09)
- **TESS**: Ricker, G.R., et al. (2015). *Transiting Exoplanet Survey
Satellite (TESS).* Journal of Astronomical Telescopes, Instruments, and
Systems, 1(1), 014003. DOI: [10.1117/1.JATIS.1.1.014003](https://doi.org/10.1117/1.JATIS.1.1.014003)
- **K2 Mission**: Howell, S.B., et al. (2014). *The K2 Mission: Characterization
and Early Results.* PASP, 126, 398. DOI: [10.1086/676406](https://doi.org/10.1086/676406)
---
## Contact
For questions or issues related to this dataset, please open a discussion on the repository.
Provider: bingbangboom
## Collected Summary
### Dataset Introduction

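#### Construction
High-quality datasets are essential for training and validating machine learning models in exoplanet detection. This dataset combines light curves from NASA's Kepler, K2, and TESS missions and was built through a systematic pipeline. Labels come from the NASA Exoplanet Archive and light curves from the Mikulski Archive for Space Telescopes (MAST). PDCSAP flux was downloaded with the Lightkurve Python package, then stitched, normalized, cleaned of outliers, and resampled to a fixed length. A NO_SIGNAL class was added by selecting stars from the Kepler stellar catalog that never triggered a Threshold Crossing Event, which provides real negative examples. The final dataset contains 23,567 rows, each with one raw and four phase-folded light curve views, plus stellar metadata and classification labels.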
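#### Usage
The dataset is stored in Parquet format and can be loaded directly with pandas. Light curve arrays are stored as bytes and must be reconstructed as float32 arrays with NumPy's `frombuffer`. Each row contains an identifier, the source mission, a classification label, a confidence score, and five light curve arrays. Users can train a three-class classifier on the `disposition` field or use `confidence_score` as a soft label for probabilistic prediction. Stellar parameters such as effective temperature and surface gravity can support multimodal learning, and the phase-folded views are well suited to convolutional or attention-based models. Some rows lack transit parameters or have NaN metadata, so check the `has_transit_params` flag and handle those rows accordingly. The dataset card provides the full column reference and the time-system definitions needed to interpret the data correctly.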
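### Background and Challenges
#### Background Overview
The Exoplanet Transit Detection Dataset was published in 2026 as a submission to the Uncharted Data Challenge, which is powered by Adaptive Data. It combines observations from NASA's Kepler, K2, and TESS missions. Its central research question is how to use machine learning to identify exoplanet transit signals in stellar light curves automatically and to separate real planets from false positives. It adds a NO_SIGNAL class, made up of stars that never triggered a Threshold Crossing Event, as a negative baseline for model training. As a multi-mission dataset it supports the development of transit detection algorithms and offers exoplanet research a standardized, reasonably large training resource.
#### Current Challenges
The dataset addresses a core problem in transit detection: isolating faint periodic transit signals from noisy light curves and distinguishing real planets from false positives caused by stellar activity, instrumental effects, or astrophysical mimics. Building it raised several difficulties. First, data from missions with different observing parameters (such as time baseline and photometric precision) had to be combined, preprocessed consistently, and phase-folded. Second, the NO_SIGNAL class required selecting targets with no detected transit from a very large stellar catalog while keeping their light curves real and representative. Third, missing transit parameters for some targets, incomplete stellar metadata, and differences in label schemes and confidence scores between missions all complicate data consistency and quality control.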
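### Common Scenarios
#### Typical Use Cases
The transit method, a core technique in exoplanet detection, infers the presence of a planet from periodic dips in a star's brightness. This dataset combines observations from Kepler, K2, and TESS and provides the raw light curve plus four phase-folded views, so it can serve as a benchmark for training and evaluating automated transit detection models. With these multiple views, researchers can build deep learning classifiers that separate genuine planet signals, astrophysical false positives, and stars with no signal, which makes screening planet candidates from large volumes of time-series data more efficient.
#### Academic Problems Addressed
The dataset targets two core difficulties in exoplanet detection: high false positive rates and heavy noise contamination. Its NO_SIGNAL class, consisting of real light curves of Kepler catalog stars that never triggered a Threshold Crossing Event, supplies negative examples that help reduce class imbalance and overfitting. The continuous confidence scores and the detailed false positive flags let researchers quantify detection uncertainty and examine the astrophysical causes of false alarms, supporting a shift in transit vetting from binary classification toward probabilistic, interpretable analysis.
#### Practical Applications
In practice the dataset can support automated exoplanet survey pipelines. For example, a convolutional neural network trained on it could be applied to the data stream of an active mission such as TESS to screen new light curves in batches and flag high-confidence planet candidates quickly, shortening the time from data acquisition to discovery. Its standardized preprocessing and multi-mission data fusion could also serve as a reusable template for larger future surveys such as PLATO.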
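### Recent Research
#### Latest Research Directions
In exoplanet detection, multi-mission light curve datasets are encouraging wider use of machine learning. This dataset combines observations from Kepler, K2, and TESS and adds a NO_SIGNAL class that gives models real negative examples, which can improve the robustness of transit identification. Current work focuses on deep neural networks that process raw time series and phase-folded views together with stellar parameters and transit features, in order to classify planet candidates and reject false positives automatically. As new facilities such as the James Webb Space Telescope come into operation, standardized datasets of this kind may help build more general discovery algorithms and speed the search for potentially habitable worlds.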
The summary above was collected and generated by 遇见数据集.



