
mayacheruty/Horse-Race-Prediction-EDA

Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/mayacheruty/Horse-Race-Prediction-EDA
Resource description:
---
title: Horse Race Prediction EDA
emoji: 🏇
license: mit
tags:
- tabular-data
- eda
- python
- horse-racing
- sports-analytics
- data-visualization
- data-cleaning
- betting-markets
- exploratory-data-analysis
configs:
- config_name: default
  data_files:
  - split: train
    path: data/horse_racing_cleaned.csv
---

# 🏇 Horse Race Prediction: Exploratory Data Analysis (EDA)

## Project Walkthrough

> ### **[Click here to watch the walkthrough video (Loom)](https://www.loom.com/share/8c20a3bba8d64478ab2d8d9caf015a57)**
> *If the video does not load, you can watch it via the link above.*

<video src="https://huggingface.co/datasets/mayacheruty/Horse-Race-Prediction-EDA/resolve/main/walkthrough.mp4" controls="controls" style="max-width: 720px;"></video>

## 📌 Project Overview

This project presents a comprehensive Exploratory Data Analysis (EDA) of a 2019 horse racing dataset containing over **171,849 records**. The goal was to identify the primary biological, professional, and market factors that determine a winning performance.

## Main Research Question

> **What are the primary determinants of a winning performance in horse racing, and to what extent do professional ratings and market expectations align with actual outcomes?**

---

## The Data Cleaning & Engineering Process

Real-world data is rarely ready for analysis. To ensure high-quality insights, I implemented a robust data cleaning pipeline focused on preserving statistical integrity.

### Step-by-Step Transformation

#### 1. Strategic Imputation (Handling Missing Values)

* **The Challenge:** Key metrics like `RPR` and `OR` contained missing values.
* **The Solution:** Applied **Mean Imputation**, filling gaps with the column average. This maintained the dataset's volume (170k+ records) without shifting each column's mean. For the `saddle` column, a placeholder of `0` was used.

#### 2. Market Normalization (Feature Engineering)

* **The Challenge:** Raw `decimalPrice` was stored as probabilities (0 to 1), which is difficult to interpret.
* **The Solution:** Engineered `actual_odds` using the inverse transformation: $\text{Actual Odds} = 1/\text{decimalPrice}$.
* **The Result:** This allowed for grouping horses into realistic "Price Buckets" (e.g., Favorites vs. Longshots).

---

## Key Visualizations & Findings

### 1. Win vs. Loss Baseline (Class Imbalance)

![Win vs Loss](png/winsVSlosses.png)

**Key Findings:** Win Rate Baseline ↑

Our initial analysis revealed a significant class imbalance within the dataset. Only 10% of the horses in the records secured a victory (`res_win = 1`), while the remaining 90% did not.

Significance:
> This 10% win rate serves as our primary benchmark. Throughout the EDA, any category (such as specific age groups or trainers) that exceeds this 10% threshold is identified as having a "competitive advantage" or a higher-than-average probability of success.

**Follow-up question:** Does a horse's age significantly impact its probability of winning?

### 2. Population Distribution (Age)

![Age Distribution](png/percentagebyage.png)

**Key Findings:** Population Distribution by Age ↑

This graph shows the "demographics" of our dataset.

Peak Participation: Horse racing is clearly a sport for the young. Horses aged 3 and 4 make up the largest portion of the data (over 40%). We see a consistent decline in the number of horses as they age. By age 10, the representation in the dataset becomes very thin.

Strategic Insight: Understanding this distribution is crucial because it tells us that our dataset is "unbalanced" in terms of age. We have a massive amount of information on young horses, but very limited data on veterans.

**Follow-up question:** Does a horse's age significantly impact its probability of winning?

### 3. Win Probability by Horse Age

![Age vs Win Rate](png/winprobyage.png)

**Key Findings:** Age vs. Performance ↑

In this step, we analyzed how a horse's age directly relates to its success rate.

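
A minimal sketch of how such a per-age win rate could be computed with pandas follows. It is an illustration, not the author's code: the age column name `age` is an assumption (the write-up does not name it), `res_win` is taken from the text above, and the table is made-up toy data rather than the real dataset. Reporting the runner count next to each rate makes thinly populated age groups easy to spot.

```python
import pandas as pd


def win_rate_by_age(
    df: pd.DataFrame, age_col: str = "age", win_col: str = "res_win"
) -> pd.DataFrame:
    """Summarise runners, wins and win rate per age group.

    `age_col` is a hypothetical column name; `win_col` follows the
    `res_win` flag mentioned in the write-up (1 = win, 0 = no win).
    """
    # Overall win rate across all rows, used as the comparison baseline.
    baseline = df[win_col].mean()

    summary = (
        df.groupby(age_col)[win_col]
        .agg(runners="count", wins="sum", win_rate="mean")
        .reset_index()
    )
    # Flag age groups whose win rate exceeds the overall baseline.
    summary["above_baseline"] = summary["win_rate"] > baseline
    return summary


# Toy data only (NOT the real dataset): one 15-year-old runner that won.
toy = pd.DataFrame(
    {
        "age": [2, 2, 3, 3, 3, 4, 4, 4, 4, 15],
        "res_win": [1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    }
)

summary = win_rate_by_age(toy)
print(summary)
```

In the toy table the single 15-year-old runner produces a win rate of 100% from one observation, which is why the `runners` column is worth reading alongside `win_rate`.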

The Performance Peak: Horses aged 2 and 3 are the clear top performers, with win rates consistently staying above our 10% baseline (the red dashed line). This suggests that physical peak and "freshness" are significant advantages.

The Steady Decline: After age 4, we observe a general downward trend. As horses get older, their win probability gradually decreases, staying mostly below the average.

The Age 15 Outlier: We notice a sudden spike at age 15. However, this is classic statistical noise caused by a very small sample size. Since there are so few 15-year-old horses in our data, a single win creates the illusion of a high success rate, but it does not represent a real trend for older horses.

**Follow-up question:** How accurately do betting odds (`decimalPrice`) predict the actual outcome of a race?

### 4. Market Wisdom: Win Rate by Betting Odds

![Market Wisdom](png/marketwisdom.png)

**Key Findings:** Market Wisdom & Price Efficiency ↑

In this step, we tested whether the betting market is "smart" at predicting winners.

The Transformation: Since our data was stored as probabilities (0-1), we applied a "reverse engineering" formula (`1/decimalPrice`) to restore the original betting odds. This allowed us to group horses into realistic price ranges (e.g., Favorites vs. Longshots).

The results show a consistent downward trend: the longer the odds, the lower the win rate. Horses in the 1-2 range (Favorites) have a massive 55.7% win rate, while horses in the 100+ range almost never win (0.2%).

Insight: This suggests that the market is highly efficient. The "price" given to a horse is a very accurate indicator of its actual performance. If we want to build a prediction model later, this "Actual Odds" feature will likely be one of the most important variables.

**Follow-up question:** Do specific trainers exhibit a statistically higher win rate compared to the market average?

### 5. Trainer Impact (Top 10)

![Trainer Impact](png/top10trainers.png)

**Key Findings:** The Human Factor (Trainer Impact) ↑

After analyzing the horse and the market, we looked at the "brain" behind the race: the Trainer.

Focusing on Experience: We filtered the data to look only at the Top 10 most active trainers (those with the highest volume of races). This ensures our results are based on consistent performance rather than lucky one-time wins.

Performance Machines: The results show that some trainers significantly outperform the field. For example, W P Mullins has a win rate of 21.0%, which is more than double the overall market average of 10%.

The Professional Edge: All of the top 10 trainers stayed above the red dashed line (the 10% baseline). This indicates that a trainer's expertise is a critical variable in predicting a race's outcome.

**Follow-up question (Feature Correlation):** Which professional metrics (e.g., RPR, OR) show the strongest correlation with race victories?

### 6. Correlation Heatmap

![Correlation Heatmap](png/heatmap.png)

**Key Findings:** The heatmap reveals that **RPR (Racing Post Rating)** has the strongest positive correlation (**0.42**) with winning results, followed by Official Rating (OR).

---

## Final Conclusion

Based on this analysis, the "blueprint" for the most likely winner consists of:

1. **Age:** 2-3 years old (Physical Peak).
2. **Professional Rating:** High RPR score.
3. **Trainer:** Top 10 High-Volume trainer.
4. **Market:** Low betting odds (High market confidence).

**Created by Maya Cheruty**
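
As a closing illustration, the cleaning and feature-engineering steps described in the walkthrough (mean imputation for `RPR` and `OR`, a `0` placeholder for `saddle`, and `actual_odds = 1/decimalPrice`) could be sketched in pandas roughly as follows. This is a hedged sketch, not the author's pipeline: the function name, the toy rows, and every bucket edge other than the 1-2 and 100+ ranges mentioned in the text are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps described above."""
    out = df.copy()

    # Mean imputation: fill missing ratings with each column's own mean.
    # This keeps the column mean unchanged, although it reduces variance.
    for col in ("RPR", "OR"):
        out[col] = out[col].fillna(out[col].mean())

    # Placeholder value for a missing saddle number, as stated in the text.
    out["saddle"] = out["saddle"].fillna(0)

    # decimalPrice is stored as a probability in (0, 1]; invert it to get odds.
    out["actual_odds"] = 1.0 / out["decimalPrice"]

    # ASSUMED bucket edges: only "1-2" and "100+" appear in the write-up.
    edges = [1, 2, 5, 10, 20, 50, 100, np.inf]
    labels = ["1-2", "2-5", "5-10", "10-20", "20-50", "50-100", "100+"]
    out["price_bucket"] = pd.cut(
        out["actual_odds"], bins=edges, labels=labels, include_lowest=True
    )
    return out


# Toy rows only (NOT the real dataset).
toy = pd.DataFrame(
    {
        "RPR": [100.0, np.nan, 80.0],
        "OR": [90.0, 70.0, np.nan],
        "saddle": [1.0, np.nan, 3.0],
        "decimalPrice": [0.5, 0.25, 0.008],
    }
)

cleaned = clean_and_engineer(toy)
print(cleaned)
```

On the real data, the same function could be applied to a frame loaded from `data/horse_racing_cleaned.csv`, provided the column names match those assumed here.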
Provider:
mayacheruty