Haxxsh/gdgc-datathon-data

Name: Haxxsh/gdgc-datathon-data
Creator: Haxxsh
Published: 2025-12-06 05:15:13
License: MIT (per the dataset card below)

Source: Hugging Face. Updated 2025-12-06; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/Haxxsh/gdgc-datathon-data

Description:

---
license: mit
task_categories:
- tabular-regression
tags:
- motorsport
- formula-racing
- lap-time
- tabular
- gdgc-datathon
language:
- en
size_categories:
- 100K<n<1M
---

# GDGC Datathon 2025 - Formula Racing Lap Time Dataset

Dataset for predicting Formula racing lap times, used in the GDGC Datathon 2025 competition.

## Dataset Description

This dataset contains historical Formula racing data with features covering circuits, weather conditions, rider/driver performance, and race configurations. The goal is to predict `Lap_Time_Seconds`.

### Dataset Summary

| Split | Samples | Size |
|-------|---------|------|
| Train | 734,002 | 124 MB |
| Test | 195,001 | 51 MB |

## Dataset Structure

### Files

```
data/
├── train.csv   # Training data with target variable
└── test.csv    # Test data for predictions
```

### Features

| Column | Description | Type |
|--------|-------------|------|
| `id` | Unique identifier | int |
| `Unique ID` | Alternative unique ID | int |
| `Rider_ID` | Rider/driver identifier | int |
| `Formula_category_x` | Racing formula category | categorical |
| `Len_Circuit_inkm` | Circuit length in kilometers | float |
| `Laps` | Number of laps in the race | int |
| `Start_Position` | Starting grid position | int |
| `Formula_Avg_Speed_kmh` | Average speed in km/h | float |
| `Formula_Track_Condition` | Track condition rating | categorical |
| `Humidity_%` | Humidity percentage | float |
| `Tire_Compound` | Type of tire compound used | categorical |
| `Penalty` | Penalty time/status | float |
| `Champ_Points` | Championship points | float |
| `Champ_Position` | Championship standing position | int |
| `Session` | Race session type | categorical |
| `race_year` | Year of the race | int |
| `seq` | Sequence number | int |
| `position` | Final position | int |
| `points` | Points earned | float |
| `Formula_shortname` | Short name of formula | categorical |
| `circuit_name` | Name of the circuit | categorical |
| `Corners_in_Lap` | Number of corners per lap | int |
| `Tire_Degradation_Factor_per_Lap` | Tire degradation rate | float |
| `Pit_Stop_Duration_Seconds` | Pit stop time in seconds | float |
| `Ambient_Temperature_Celsius` | Air temperature | float |
| `Track_Temperature_Celsius` | Track surface temperature | float |
| `weather` | Weather condition | categorical |
| `track` | Track identifier | categorical |
| `air` | Air condition metric | float |
| `ground` | Ground condition metric | float |
| `starts` | Number of race starts | int |
| `finishes` | Number of race finishes | int |
| `with_points` | Races finished with points | int |
| `podiums` | Number of podium finishes | int |
| `wins` | Number of wins | int |
| `Lap_Time_Seconds` | **Target variable** - Lap time in seconds | float |

### Target Variable Statistics

| Metric | Value |
|--------|-------|
| Count | 734,002 |
| Mean | 89.997 s |
| Std | 11.532 s |
| Min | 70.001 s |
| 25% | 79.989 s |
| 50% (Median) | 89.970 s |
| 75% | 99.914 s |
| Max | 109.999 s |

The target distribution is **nearly symmetric** with mean ≈ median, indicating no significant skew.
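The symmetry claim can be sanity-checked from the published summary statistics alone, without downloading the data. The sketch below is illustrative rather than part of the original card: it computes a quartile-based (Bowley) skewness and, as an additional assumption-driven check, compares the reported standard deviation with that of a uniform distribution over the observed min/max range. The helper names `bowley_skewness` and `uniform_std` are hypothetical.

```python
import math

# Summary statistics copied from the table above (all in seconds).
stats = {
    "mean": 89.997,
    "std": 11.532,
    "min": 70.001,
    "q1": 79.989,
    "median": 89.970,
    "q3": 99.914,
    "max": 109.999,
}


def bowley_skewness(q1, median, q3):
    """Quartile-based skewness in [-1, 1]; 0 means the quartiles are symmetric."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def uniform_std(low, high):
    """Standard deviation of a continuous uniform distribution on [low, high]."""
    return (high - low) / math.sqrt(12)


skew = bowley_skewness(stats["q1"], stats["median"], stats["q3"])
ref_std = uniform_std(stats["min"], stats["max"])

print(f"Bowley skewness: {skew:.4f}")
print(f"Uniform-reference std: {ref_std:.3f} s (reported: {stats['std']:.3f} s)")
```

A skewness close to zero supports the symmetry statement. If the two standard deviations come out close, that hints the target may be spread roughly evenly over its range, but only the full data (for example a histogram of `Lap_Time_Seconds`) can confirm this.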
## Usage

### Loading with Pandas

```python
import pandas as pd

# Load training data
# (adjust the paths if your copies live in a subdirectory such as data/)
train_df = pd.read_csv("train.csv")
print(f"Training samples: {len(train_df)}")

# Load test data
test_df = pd.read_csv("test.csv")
print(f"Test samples: {len(test_df)}")

# Separate features and target
X = train_df.drop(columns=["Lap_Time_Seconds", "id"])
y = train_df["Lap_Time_Seconds"]
```

### Loading from Hugging Face

```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Download files
# (`filename` is relative to the repository root; adjust it if the CSVs
# are stored in a subdirectory)
train_path = hf_hub_download(
    repo_id="Haxxsh/gdgc-datathon-data",
    filename="train.csv",
    repo_type="dataset"
)
test_path = hf_hub_download(
    repo_id="Haxxsh/gdgc-datathon-data",
    filename="test.csv",
    repo_type="dataset"
)

# Load into pandas
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
```

### With Datasets Library

```python
from datasets import load_dataset

dataset = load_dataset("Haxxsh/gdgc-datathon-data")
```

## Trained Models

Pre-trained models for this dataset are available at:

- **Models:** [Haxxsh/gdgc-datathon-models](https://huggingface.co/Haxxsh/gdgc-datathon-models)
- **Training Code:** [ezylopx5/DATATHON](https://github.com/ezylopx5/DATATHON)

## Evaluation Metric

The primary evaluation metric is **RMSE** (Root Mean Squared Error):

```python
from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```

## Data Preprocessing Tips

1. **Handle categorical features:** Use label encoding or one-hot encoding for columns like `weather`, `circuit_name`, `Tire_Compound`
2. **Feature scaling:** Normalize numerical features for models that are sensitive to feature scale
3. **Missing values:** Check for and handle any missing values appropriately
4. **Feature engineering:** Consider creating interaction features or aggregations

## License

MIT License

## Citation

```bibtex
@dataset{gdgc-datathon-2025-data,
  author = {Haxxsh},
  title = {GDGC Datathon 2025 - Formula Racing Lap Time Dataset},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Haxxsh/gdgc-datathon-data}
}
```

## Acknowledgments

- GDGC Datathon 2025 organizers
- Formula racing data providers
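As a companion to the RMSE metric described in this card, the following dependency-free sketch shows how a constant "predict the training mean" baseline can be scored. It is an illustration, not the organizers' evaluation code: the function names `rmse` and `mean_baseline` are hypothetical, and the lap times are made-up toy values, not rows from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline(y_train, n_predictions):
    """Predict the training-set mean for every sample."""
    mean_value = sum(y_train) / len(y_train)
    return [mean_value] * n_predictions


# Toy lap times in seconds (illustrative values only).
toy_train = [72.5, 81.0, 90.2, 98.7, 107.1]
toy_valid = [75.0, 88.4, 101.3]

baseline_preds = mean_baseline(toy_train, len(toy_valid))
baseline_rmse = rmse(toy_valid, baseline_preds)
print(f"Mean-baseline RMSE on toy data: {baseline_rmse:.3f} s")
```

On the real training set, a constant-mean predictor scores an RMSE roughly equal to the reported standard deviation of the target (about 11.5 s), so any trained model should be judged against that reference point.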

Provider:

Haxxsh
