petchthwr/ICNDelay
Source: Hugging Face · Updated 2026-04-14 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/petchthwr/ICNDelay
Resource description:
---
license: mit
language:
- en
size_categories:
- 100K<n<1M
---
# ICNDelay: Multimodal Language–Time-Series Flight Delay Regression Dataset for Incheon International Airport
This dataset provides a structured, monthly collection of air traffic management scenarios for flight delay analysis and prediction.
Each scenario aggregates aircraft trajectory information, operational context, and structured flight attributes into a unified representation suitable for machine learning research.
The data are organized by month to support temporal generalization studies and controlled evaluation protocols. Trajectories are represented as sequential state features, while scenario-level metadata and delay labels enable supervised learning and multimodal modeling.
The dataset is designed to facilitate research in delay regression and data-centric evaluation under realistic operational conditions within complex airspace environments.
## Flight Plan Features
Flight schedule data for 2022 arrivals at Incheon International Airport were obtained from the [Airportal (항공정보포털시스템)](https://www.airportal.go.kr/) website.
The features include scheduled arrival time, airline, identifiers, and airport information.
Airport coordinates were used to compute the great-circle distance and classify haul type based on Wragg’s aviation dictionary.
Aircraft type and registration were inferred from typical route operations, and the wake turbulence category was assigned according to ICAO Doc 4444.
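
The great-circle distance and haul classification described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the haversine formula on a spherical Earth with a mean radius of 6371.0088 km, and the function names and haul thresholds are hypothetical. The thresholds are passed in by the caller because the card does not state the values taken from Wragg's dictionary.

```python
import math

# Mean Earth radius in km (assumption; the dataset card does not state the value used).
EARTH_RADIUS_KM = 6371.0088


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def haul_type(distance_km, short_max_km, medium_max_km):
    """Classify a route by distance; both thresholds are caller-supplied placeholders."""
    if distance_km <= short_max_km:
        return "short"
    if distance_km <= medium_max_km:
        return "medium"
    return "long"
```

A call such as `haul_type(great_circle_km(lat_origin, lon_origin, lat_icn, lon_icn), 1500, 4000)` would label a route, where `1500` and `4000` are illustrative values only and should be replaced by the definitions actually used for the dataset.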
## Textual Data (Aeronautical Texts)
Flight Information, METAR, TAF, and NOTAM reports are provided in aeronautical coded text formats.
- **P_f**: The prompt is constructed from flight plan features using one of ten predefined templates selected at random to introduce variability.
The current timestamp *t* is prepended to the text.
- **P_m**: METARs for the year 2022 were downloaded from [Ogimet](https://www.ogimet.com/).
For a given time *t*, the most recent METAR released within the preceding 30-minute update interval is selected.
- **P_t**: TAFs for the year 2022 were downloaded from [Ogimet](https://www.ogimet.com/).
TAFs are typically issued every six hours and their validity periods may overlap; therefore, the most recently issued TAF that is valid at time *t* is selected.
- **P_n**: NOTAMs active in 2022 were collected from the [AIM Korea (항공정보통합관리)](https://aim.koca.go.kr/aim/) website in their original coded format.
As multiple NOTAMs may be concurrently active, all notices valid at time *t* are included.
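
The selection rules for P_m, P_t, and P_n can be sketched as below. This is a hedged illustration under stated assumptions: reports are held in plain tuples whose layouts are hypothetical (`(issued, text)` for METARs, `(issued, valid_from, valid_to, text)` for TAFs, `(valid_from, valid_to, text)` for NOTAMs), and the exact boundary handling in the authors' pipeline is not documented on the card.

```python
from datetime import datetime, timedelta


def latest_metar(metars, t, window=timedelta(minutes=30)):
    """Return the text of the most recent METAR issued in (t - window, t], or None."""
    candidates = [(issued, text) for issued, text in metars if t - window < issued <= t]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


def latest_valid_taf(tafs, t):
    """Return the most recently issued TAF whose validity period covers t, or None."""
    candidates = [
        (issued, text)
        for issued, valid_from, valid_to, text in tafs
        if issued <= t and valid_from <= t < valid_to
    ]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


def active_notams(notams, t):
    """Return the text of every NOTAM whose validity interval contains t."""
    return [text for valid_from, valid_to, text in notams if valid_from <= t <= valid_to]
```

Keeping the three rules as separate functions mirrors the text: METAR and TAF selection each return a single report, while NOTAM selection returns all concurrently valid notices.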
## Time-series Data (Aircraft Trajectories)
ADS-B recordings were sourced from the [OpenSky](https://opensky-network.org/) database.
Data from 2022 were queried using the flight identification numbers of flights that departed from or arrived at the airport, based on the schedule provided by the [Airportal (항공정보포털시스템)](https://www.airportal.go.kr/) website.
Positional states, including latitude, longitude, and altitude, were extracted from ADS-B data.
Positions were converted to ENU coordinates centered at the airport and resampled at 5-second intervals without enforcing a fixed trajectory length.
Positional states were normalized by 120 km. Additional features include directional vectors and polar representations.
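
The preprocessing chain above (geodetic position to ENU, 5-second resampling, scaling by 120 km) can be sketched with the standard library alone. The sketch assumes the WGS-84 ellipsoid, linear interpolation between samples, strictly increasing timestamps, and that all three ENU components are divided by the same 120 km scale; none of these details are confirmed by the card, and all function names are hypothetical.

```python
import math

WGS84_A = 6378137.0                    # semi-major axis [m]
WGS84_F = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)   # first eccentricity squared


def geodetic_to_ecef(lat_deg, lon_deg, alt_m):
    """Convert geodetic coordinates (degrees, metres) to ECEF metres."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + alt_m) * math.cos(lat) * math.cos(lon)
    y = (n + alt_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + alt_m) * math.sin(lat)
    return x, y, z


def geodetic_to_enu(lat_deg, lon_deg, alt_m, ref_lat_deg, ref_lon_deg, ref_alt_m=0.0):
    """Express a point in the local East-North-Up frame centred at the reference."""
    x, y, z = geodetic_to_ecef(lat_deg, lon_deg, alt_m)
    x0, y0, z0 = geodetic_to_ecef(ref_lat_deg, ref_lon_deg, ref_alt_m)
    dx, dy, dz = x - x0, y - y0, z - z0
    lat0, lon0 = math.radians(ref_lat_deg), math.radians(ref_lon_deg)
    east = -math.sin(lon0) * dx + math.cos(lon0) * dy
    north = (-math.sin(lat0) * math.cos(lon0) * dx
             - math.sin(lat0) * math.sin(lon0) * dy
             + math.cos(lat0) * dz)
    up = (math.cos(lat0) * math.cos(lon0) * dx
          + math.cos(lat0) * math.sin(lon0) * dy
          + math.sin(lat0) * dz)
    return east, north, up


def resample(times_s, values, step_s=5.0):
    """Linearly interpolate a scalar series onto a regular grid starting at times_s[0]."""
    if len(times_s) < 2:
        return list(values)
    out, j = [], 0
    n_steps = int((times_s[-1] - times_s[0]) // step_s)
    for k in range(n_steps + 1):
        t = times_s[0] + k * step_s
        # Advance to the segment [times_s[j], times_s[j + 1]] that contains t.
        while j < len(times_s) - 2 and times_s[j + 1] < t:
            j += 1
        t0, t1 = times_s[j], times_s[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out


def normalize(enu_m, scale_m=120_000.0):
    """Scale an (east, north, up) tuple in metres by the 120 km constant."""
    return tuple(c / scale_m for c in enu_m)
```

Because the output grid runs from the first to the last observed timestamp, the resampled length varies per flight, which is consistent with the statement that no fixed trajectory length is enforced.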
- **X_f**: The focusing trajectory corresponds to flight *i*, containing its sequential states from the first ADS-B transmission within the airspace up to the current timestamp *t*.
It represents both the historical and the current operational state of the focusing aircraft.
- **X_a**: Active trajectories refer to the trajectories of other aircraft operating in the airspace at time *t*, excluding the focusing flight *i*.
These trajectories reflect current airspace conditions and traffic congestion.
Each active trajectory includes states from the time the aircraft departed or first entered the terminal maneuvering area (TMA) up to time *t*.
- **X_p**: Prior trajectories are constructed using the earliest timestamp among all active trajectories as a reference point.
They consist of completed flights that were active at the reference time, covering their entire operation within the airspace.
Although completed, these flights remain informative: they capture traffic patterns that influence the focusing and active flights, and they indirectly reflect ground traffic.
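
The grouping into active and prior trajectories can be sketched from the definitions above. This is an interpretation rather than the authors' implementation: each flight is reduced to a hypothetical `(entry_time, exit_time)` interval in the airspace, a flight is treated as active when `entry_time <= t < exit_time`, and the reference time is taken as the earliest entry among the active flights only.

```python
def split_trajectories(intervals, focus_id, t):
    """
    Split flights into active and prior sets relative to time t.

    intervals: dict mapping flight id -> (entry_time, exit_time), with times
               given as comparable values (e.g. seconds since epoch).
    Returns (active_ids, prior_ids); the focusing flight is in neither list.
    """
    # Active: other flights currently inside the airspace at time t.
    active = [
        fid for fid, (start, end) in intervals.items()
        if fid != focus_id and start <= t < end
    ]
    if not active:
        return active, []

    # Reference point: earliest entry time among the active trajectories.
    t_ref = min(intervals[fid][0] for fid in active)

    # Prior: flights already completed by t that were in the airspace at t_ref.
    prior = [
        fid for fid, (start, end) in intervals.items()
        if fid != focus_id and end <= t and start <= t_ref <= end
    ]
    return active, prior
```

In this reading, active trajectories would then be truncated at *t*, while prior trajectories keep their full extent within the airspace.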
The temporal relationship among the focusing, active, and prior trajectories is illustrated below.

## Data Annotation
For each flight *i*, the ground-truth post-terminal duration (y_dt) is defined as the time difference between the aircraft’s first entry into Incheon Airport’s airspace and its actual arrival time.
The entry time is derived from [OpenSky](https://opensky-network.org/) ADS-B data, while the arrival time is obtained from official [Airportal (항공정보포털시스템)](https://www.airportal.go.kr/) records.
Thus, y_dt is computed as (actual arrival time - airspace entry time).
The ground-truth arrival delay (y_delay) is obtained directly from the official [Airportal (항공정보포털시스템)](https://www.airportal.go.kr/) records and reflects the recorded arrival delay for each flight.
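
The y_dt definition above amounts to a single timestamp subtraction, sketched here. The function name is hypothetical and the use of seconds as the unit is an assumption, since the card does not state the unit of the stored labels.

```python
from datetime import datetime


def post_terminal_duration_s(entry_time, actual_arrival_time):
    """y_dt = actual arrival time - airspace entry time, returned in seconds (assumed unit)."""
    return (actual_arrival_time - entry_time).total_seconds()


# Hypothetical timestamps, for illustration only.
y_dt = post_terminal_duration_s(
    entry_time=datetime(2022, 1, 1, 10, 0, 0),
    actual_arrival_time=datetime(2022, 1, 1, 10, 25, 30),
)
```

Since the two timestamps come from different sources, both must be expressed in the same time zone before subtracting.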
## Data Statistics

## Example Usage
This function loads the ICNDelay monthly multimodal flight delay regression dataset from Hugging Face and converts each row into a standardized scenario dictionary.
```python
import numpy as np
from datasets import load_dataset


def load_ICNDelay(repo_id, hf_token, month, split="train", month_col="month"):
    """
    Load a monthly dataset from HF as a list of dict.

    Returns:
        List[dict] where each dict has keys:
        i, t, label, F_f, P_f, P_m, P_t, P_n, X_f, X_a, X_p
    """
    ds = load_dataset(repo_id, split=split, token=hf_token)  # Load dataset
    ds = ds.filter(lambda x: int(x[month_col]) == int(month))  # Filter by month
    output = []
    for row in ds:
        item = {}
        item["i"] = row.get("i")  # Flight ID
        item["t"] = row.get("t")  # Current Time
        item["label"] = {"y_dt": row.get("y_dt"), "y_delay": row.get("y_delay")}  # Regression Labels
        item["F_f"] = row.get("F_f")  # Tabular Flight Information Features
        item["P_f"] = row.get("P_f")  # Flight Information Prompt
        item["P_m"] = row.get("P_m")  # METAR Information Prompt
        item["P_t"] = row.get("P_t")  # TAF Information Prompt
        item["P_n"] = row.get("P_n")  # NOTAM Prompt
        item["X_f"] = np.asarray(row.get("X_f"))  # Focusing Trajectory Data (Shape: T_f, 9)
        # Active Trajectory Data (Shape: N_a, T_a, 9); None when no active traffic
        item["X_a"] = None if row.get("X_a") in (None, []) else np.asarray(row.get("X_a"))
        # Prior Trajectory Data (Shape: N_p, T_p, 9); None when no prior traffic
        item["X_p"] = None if row.get("X_p") in (None, []) else np.asarray(row.get("X_p"))
        output.append(item)
    return output


data = load_ICNDelay(
    repo_id="petchthwr/ICNDelay",
    hf_token="YOUR HUGGINGFACE TOKEN",
    month=1,
)
```
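
Once scenarios are loaded, the four text fields can be combined into a single language input. The helper below is a hypothetical sketch, not part of the dataset's tooling: it assumes each of `P_f`, `P_m`, `P_t`, and `P_n` is either a string, a list of strings (for example several NOTAMs), or missing, and it simply joins whatever is present in that order.

```python
def build_text_context(scenario, keys=("P_f", "P_m", "P_t", "P_n"), sep="\n"):
    """Join the available text prompts of one scenario dict into a single string."""
    parts = []
    for key in keys:
        value = scenario.get(key)
        if value is None:
            continue  # Field missing for this scenario
        if isinstance(value, (list, tuple)):
            # Assumed case: several notices stored as a list of strings
            parts.extend(str(v) for v in value if v)
        elif str(value).strip():
            parts.append(str(value))
    return sep.join(parts)
```

With the list returned by `load_ICNDelay`, `build_text_context(data[0])` would produce the combined text for the first scenario; the ordering and separator are design choices left to the user.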
## Citation
Please cite our work if you use any of the datasets shared here:
```bibtex
@dataset{ICNDelay2026,
title={ICNDelay: Multimodal Language–Time-Series Flight Delay Regression Dataset for Incheon International Airport},
author={Phisannupawong, Thaweerath and Damanik, Joshua Julian and Choi, Han-Lim},
year={2026},
note={https://huggingface.co/datasets/petchthwr/ICNDelay}
}
@misc{LLM4Delay2026,
title={LLM4Delay: Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation},
author={Thaweerath Phisannupawong and Joshua Julian Damanik and Han-Lim Choi},
year={2026},
eprint={2510.23636},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.23636},
}
```
Provider:
petchthwr



