roirani80/fbref-xg-analysis-2024-2025
Source: Hugging Face (updated 2026-04-10, indexed 2026-04-12)
Download link: https://hf-mirror.com/datasets/roirani80/fbref-xg-analysis-2024-2025
<video src="https://huggingface.co/datasets/roirani80/fbref-xg-analysis-2024-2025/resolve/main/presentation.mp4" controls="controls"
style="max-width: 720px;"></video>
# ⚽ FBRef Football Player Performance — xG Analysis 2024-2025
**Author:** Roy Irani
---
## 📌 Research Question
> **"Can we predict whether a football player outperforms or underperforms their expected goals (xG)?"**
In modern football analytics, **Expected Goals (xG)** measures the quality of a shot based on factors such as distance, angle, and assist type. A player who scores *more* goals than their xG is considered a **clinical finisher** — they beat the statistical model. This analysis explores what separates those players from the rest, and whether we can predict this trait from performance statistics alone.
---
## 📂 Dataset
| Property | Value |
|---|---|
| **Source** | [FBRef](https://fbref.com) — Football Reference |
| **Season** | 2024–2025 |
| **Leagues** | Premier League, La Liga, Bundesliga, Serie A, Ligue 1 |
| **Original rows** | 2,273 players |
| **After cleaning** | 1,510 players |
| **Features** | 38 columns |
| **Original dataset** | [alaa1234ah/fbref_football_player_performance_2024-2025](https://huggingface.co/datasets/alaa1234ah/fbref_football_player_performance_2024-2025) |
### Key Features
- `Player`, `Nation`, `Position`, `Age`
- `Goals`, `Assists`, `xG`, `npxG`, `xAG`
- `Progressive Carries`, `Progressive Passes`, `Progressive Receives`
- `Goals Per 90`, `xG Per 90`, `Assists Per 90`
- `Minutes`, `Matches Played`, `Starts`
- *(engineered)* `xG_Diff` — Goals minus xG
- *(engineered)* `xG_Overperformer` — Binary target: 1 = beats xG, 0 = does not
- *(engineered)* `npxG_Diff` — Penalty-corrected version of xG_Diff
---
## 🧹 Data Cleaning
Real-world data is never perfect. Every issue was identified and fixed before any analysis began.
| Issue | Action | Justification |
|---|---|---|
| `Unnamed: 0` column | Dropped | Index artifact from CSV export — carries no information |
| `Minutes` stored as string `"1,234"` | Converted to integer | Required for all numerical operations |
| Missing values | Checked — none found | Dataset was complete |
| Duplicate players | Identified and removed | Players transferred mid-season appeared twice |
| Players with 0 Goals AND 0 xG | Filtered out | No attacking contribution — not meaningful for this research question |
| Position codes (DF, MT, AT, GB) | Mapped to full names | Improves readability across all charts and tables |
**Result:** 2,273 → 1,510 players × 38 features
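
The cleaning steps in the table above could be sketched roughly as follows. This is a minimal illustration on an invented four-row frame, not the notebook's code: the de-duplication key (`Player` only) and the full position names in `POSITION_NAMES` are assumptions, and the helper `clean` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the raw export; all values are invented for illustration.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Player": ["A", "B", "B", "C"],
    "Position": ["AT", "MT", "MT", "GB"],
    "Minutes": ["1,234", "900", "900", "2,700"],
    "Goals": [5, 2, 2, 0],
    "xG": [4.1, 2.6, 2.6, 0.0],
})

# Assumed meaning of the position codes listed in the table.
POSITION_NAMES = {"DF": "Defender", "MT": "Midfielder", "AT": "Attacker", "GB": "Goalkeeper"}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the cleaning steps in order."""
    out = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # "1,234" -> 1234
    out["Minutes"] = out["Minutes"].astype(str).str.replace(",", "", regex=False).astype(int)
    # Assumption: one row per player name; the notebook may use a different key.
    out = out.drop_duplicates(subset="Player", keep="first")
    # Drop players with neither goals nor xG.
    out = out[~((out["Goals"] == 0) & (out["xG"] == 0))].copy()
    # Unknown codes are left unchanged.
    out["Position"] = out["Position"].map(POSITION_NAMES).fillna(out["Position"])
    return out.reset_index(drop=True)


cleaned = clean(raw)
```
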
---
## ⚙️ Feature Engineering
xG (Expected Goals) is the model-estimated probability that a shot results in a goal, summed over all of a player's shots. A player who scores more goals than their xG is beating the model and is treated here as a **"clinical finisher"**.
Three new features were engineered:
```python
df["xG_Diff"] = df["Goals"] - df["xG"]
df["xG_Overperformer"] = (df["Goals"] > df["xG"]).astype(int)
df["npxG_Diff"] = df["Non-Penalty Goals"] - df["npxG"]
```
- **`xG_Diff`** — Continuous measure of how many goals above/below expectation
- **`xG_Overperformer`** — Binary target variable: `1` if Goals > xG, `0` otherwise
- **`npxG_Diff`** — Penalty-corrected version to isolate open-play finishing quality
**Class balance:** ~43% overperformers, ~57% underperformers
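
As a small worked example, the three features and the class balance can be computed on a toy frame. The numbers are invented and only the column names come from this card. Note that a player with Goals exactly equal to xG falls into class `0`, because the comparison is strict.

```python
import pandas as pd

# Invented values; P1 scores several penalties, P4 exactly matches xG.
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Goals": [10, 3, 0, 7],
    "Non-Penalty Goals": [6, 3, 0, 7],
    "xG": [8.0, 4.5, 1.0, 7.0],
    "npxG": [5.0, 4.5, 1.0, 7.0],
})

toy["xG_Diff"] = toy["Goals"] - toy["xG"]
toy["xG_Overperformer"] = (toy["Goals"] > toy["xG"]).astype(int)
toy["npxG_Diff"] = toy["Non-Penalty Goals"] - toy["npxG"]

# Share of each class (index 0 = does not beat xG, 1 = beats xG).
class_balance = toy["xG_Overperformer"].value_counts(normalize=True)
```
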
---
## 📊 Descriptive Statistics
Key metrics were summarized and compared between the two groups.
| Stat | Underperformer | Overperformer |
|---|---|---|
| Goals (mean) | Lower | Higher |
| Goals Per 90 (mean) | Lower | Higher |
| xG Per 90 (mean) | Higher | Lower |
| Minutes (median) | Lower | Higher |
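
A group comparison like the one in the table could be produced with a pandas `groupby`. The sketch below uses invented numbers and assumed output column names (`goals_mean` and so on); it only illustrates the shape of the computation, not the actual values behind the table.

```python
import pandas as pd

# Invented values for five players.
toy = pd.DataFrame({
    "xG_Overperformer": [1, 1, 0, 0, 0],
    "Goals": [12, 6, 3, 1, 2],
    "Goals Per 90": [0.55, 0.30, 0.12, 0.05, 0.10],
    "xG Per 90": [0.40, 0.22, 0.25, 0.15, 0.20],
    "Minutes": [2400, 1800, 1500, 600, 900],
})

# One row per group, one column per summary statistic.
summary = (
    toy.groupby("xG_Overperformer")
    .agg(
        goals_mean=("Goals", "mean"),
        goals_per90_mean=("Goals Per 90", "mean"),
        xg_per90_mean=("xG Per 90", "mean"),
        minutes_median=("Minutes", "median"),
    )
    .rename(index={0: "Underperformer", 1: "Overperformer"})
)
```
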
---
## 🔥 Correlation Heatmap
How all numeric variables relate to each other. Strong correlations with `xG_Diff` highlight which stats are most associated with clinical finishing.
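
The correlation matrix behind such a heatmap can be computed with `DataFrame.corr`. The sketch below runs on synthetic data (goals drawn as Poisson counts around a random xG), so its values say nothing about the real dataset; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic players: xG is random, goals scatter around xG.
xg = rng.gamma(shape=2.0, scale=2.0, size=n)
toy = pd.DataFrame({
    "xG": xg,
    "Goals": rng.poisson(xg),
    "Age": rng.integers(18, 36, size=n),
})
toy["xG_Diff"] = toy["Goals"] - toy["xG"]

# Pearson correlation between every pair of numeric columns.
corr = toy.select_dtypes("number").corr()

# Columns ranked by absolute correlation with xG_Diff.
xg_diff_corr = corr["xG_Diff"].drop("xG_Diff").sort_values(key=abs, ascending=False)
```

If seaborn is available, `seaborn.heatmap(corr, annot=True)` is one common way to draw the matrix.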

---
## 🔍 Outlier Detection & Handling
The IQR (Interquartile Range) method was used to identify extreme values. Any value below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is flagged as an outlier.
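
The IQR rule can be sketched as a small helper. The function name `iqr_outlier_mask` and the sample values are hypothetical; the fences follow the Q1 − 1.5×IQR and Q3 + 1.5×IQR definition given above.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented goal tallies: nine ordinary players and one prolific scorer.
goals = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 5, 27], name="Goals")
mask = iqr_outlier_mask(goals)
flagged = goals[mask]  # flagged for inspection, not removed
```

Here Q1 = 1.25 and Q3 = 3.75, so the upper fence is 7.5 and only the 27-goal player is flagged.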

**Decision: All outliers were retained.**
> In football data, extreme values (e.g. 25+ goals, 3,000+ minutes) represent elite players such as top scorers and ever-present starters. These are legitimate data points — not measurement errors. Removing them would distort the analysis of high-performing players, which is central to the research question.
---
## 📖 EDA Story — Questions & Answers
---
### Q1: What does xG over/underperformance look like across the dataset?

**Answer:**
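The xG_Diff distribution is approximately bell-shaped and centered slightly below zero, meaning the average player scores slightly *fewer* goals than their xG predicts. This is plausible: goals come in whole numbers while xG accumulates in fractions, so many players with modest xG finish on zero or one goal and land just below expectation, even though xG models are well calibrated across large samples.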
The scatter plot reveals a strong linear relationship between xG and actual Goals, but with significant spread above and below the diagonal line — confirming that real, meaningful variation in finishing efficiency exists across the dataset.
**43% of players outperform their xG** — these are our target clinical finishers.
---
### Q2: Which position produces the most clinical finishers?

**Answer:**
Attackers have the highest rate of xG overperformance, which is expected — they take the most shots and develop specialized finishing technique over their careers. However, the gap between positions is smaller than intuition might suggest, indicating that clinical finishing is not exclusively an attacker trait. Goalkeepers who score from set pieces dramatically outperform their xG when they do score, which skews their group upward.
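
A per-position overperformance rate could be computed as below. The data is invented; the `players` count is included because very small groups (goalkeepers, for example) can produce rates that look extreme.

```python
import pandas as pd

# Invented labels for seven players.
toy = pd.DataFrame({
    "Position": ["Attacker", "Attacker", "Attacker",
                 "Midfielder", "Midfielder", "Defender", "Defender"],
    "xG_Overperformer": [1, 1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column is the share of overperformers in each group.
rates = (
    toy.groupby("Position")["xG_Overperformer"]
    .agg(rate="mean", players="count")
    .sort_values("rate", ascending=False)
)
```
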
---
### Q3: Does age influence clinical finishing ability?

**Answer:**
Age shows a weak correlation with xG overperformance. The violin plots reveal similar age distributions between the two groups, suggesting age alone does not determine finishing efficiency. However, the trend line in the scatter plot reveals a slight pattern — experienced players in their late 20s and early 30s tend to outperform their xG marginally more, possibly reflecting developed technique and composure under pressure.
---
### Q4: Do clinical finishers play more minutes?

**Answer:**
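Overperformers tend to accumulate more minutes on average. This is consistent with the idea that managers reward clinical finishing with more playing time: a player who consistently scores more than expected earns a place in the starting lineup. The association is not proof of causation, however. Players with few minutes accumulate little xG and most often score zero, which places them in the underperformer group by construction. With that caveat, the finding has an economic interpretation: finishing efficiency is a scarce and valued skill in the transfer market.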
---
### Q5: Which statistics best separate overperformers from underperformers?

**Answer:**
Goals Per 90 shows the strongest positive correlation with xG overperformance, while xG Per 90 shows a negative correlation — overperformers score more than their shot quality alone would predict. Progressive Carries and Progressive Passes show moderate positive correlations, suggesting that players who move the ball forward aggressively tend to create and convert better opportunities.
---
### Q6: Who are the biggest overperformers and underperformers?

**Answer:**
The top overperformers are players who scored significantly more goals than their shot quality warranted — true clinical finishers. The top underperformers are often high-volume shooters whose finishing let them down relative to the quality of chances they created. This chart provides the most human, interpretable result: real names attached to the statistical findings.
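
Ranking players by `xG_Diff` is a short pandas operation. The sketch below uses invented players and `top_n = 2`; the repository's chart shows the top 10.

```python
import pandas as pd

# Invented players and totals.
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4", "P5"],
    "Goals": [15, 4, 9, 2, 11],
    "xG": [10.5, 7.2, 9.1, 1.0, 13.0],
})
toy["xG_Diff"] = toy["Goals"] - toy["xG"]

top_n = 2  # assumption for the toy data
top_over = toy.nlargest(top_n, "xG_Diff")[["Player", "xG_Diff"]]
top_under = toy.nsmallest(top_n, "xG_Diff")[["Player", "xG_Diff"]]
```
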
---
## 🎁 Bonus Analysis: Does Removing Penalties Change Who the True Clinical Finishers Are?
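Penalty kicks can distort xG overperformance: each penalty carries a fixed xG of ~0.76 regardless of who takes it, close to the ~75% average conversion rate. A player who converts 10 of 10 penalties therefore gains about 10 − 7.6 = +2.4 in xG_Diff without being a better finisher from open play.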

**Finding:**
Some players who appear as strong overperformers in raw xG_Diff are significantly driven by penalties. After correcting for penalties (using npxG_Diff), their ranking drops — revealing that their open-play finishing is closer to average. True clinical finishers maintain their overperformance even after penalty correction.
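
The ranking shift can be illustrated by ranking the same players on both metrics. The three players below are invented; `PenaltyTaker` scores 8 penalties, which is why the raw and penalty-corrected differences diverge.

```python
import pandas as pd

# Invented season totals.
toy = pd.DataFrame({
    "Player": ["PenaltyTaker", "OpenPlay", "Average"],
    "Goals": [15, 12, 8],
    "Non-Penalty Goals": [7, 12, 8],
    "xG": [11.0, 8.5, 8.2],
    "npxG": [4.9, 8.5, 8.2],
})
toy["xG_Diff"] = toy["Goals"] - toy["xG"]
toy["npxG_Diff"] = toy["Non-Penalty Goals"] - toy["npxG"]

# Rank 1 = largest overperformance on each metric.
toy["rank_raw"] = toy["xG_Diff"].rank(ascending=False).astype(int)
toy["rank_np"] = toy["npxG_Diff"].rank(ascending=False).astype(int)

# Positive values mean the player drops after penalty correction.
toy["rank_change"] = toy["rank_np"] - toy["rank_raw"]
```
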
---
## 🤖 Classification Results
Having identified the key patterns through EDA, a **Random Forest classifier** was built to predict whether a player is an xG overperformer. A Random Forest builds many decision trees and combines their votes, like asking 200 experts (one per tree) and taking the majority opinion.
| Setting | Value |
|---|---|
| Model | Random Forest (200 trees) |
| Features | Age, Goals Per 90, xG Per 90, npxG Per 90, Progressive Carries, Progressive Passes, Minutes, Assists Per 90, Position |
| Train/Test split | 80% / 20% |
| Class weighting | Balanced |
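
The configuration in the table could be set up in scikit-learn roughly as follows. This is a sketch on synthetic data with a subset of the listed features: the one-hot encoding of `Position`, the stratified split and `random_state=42` are assumptions, and the resulting accuracy says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in for a few of the listed features.
X = pd.DataFrame({
    "Age": rng.integers(18, 38, size=n),
    "Goals Per 90": rng.gamma(2.0, 0.1, size=n),
    "xG Per 90": rng.gamma(2.0, 0.1, size=n),
    "Minutes": rng.integers(90, 3400, size=n),
    "Position": rng.choice(["Attacker", "Midfielder", "Defender"], size=n),
})
# Synthetic target built from the rate columns, for illustration only.
y = (X["Goals Per 90"] > X["xG Per 90"]).astype(int)

# Assumption: the categorical Position column is one-hot encoded.
X_encoded = pd.get_dummies(X, columns=["Position"])

# 80% / 20% split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```
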
### Feature Importance — What Drives xG Overperformance?
Feature importance tells us which statistics the model relied on most when making its predictions. This directly answers the research question.

**Goals Per 90** and **xG Per 90** are the most important predictors, confirming the EDA findings. The model relies on the gap between actual and expected scoring rate, which mirrors the definition of the target (Goals > xG); together these two features come close to encoding the label directly.
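
Feature importances can be read from a fitted forest through its `feature_importances_` attribute. The sketch below fits a model on synthetic data with two informative columns and one noise column (`Age`), so the ranking it produces is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features; Age carries no signal about the target here.
X = pd.DataFrame({
    "Goals Per 90": rng.gamma(2.0, 0.1, size=n),
    "xG Per 90": rng.gamma(2.0, 0.1, size=n),
    "Age": rng.integers(18, 38, size=n).astype(float),
})
y = (X["Goals Per 90"] > X["xG Per 90"]).astype(int)

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
).fit(X, y)

# Impurity-based importances, normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
```

Impurity-based importances can favour features with many distinct values; `sklearn.inspection.permutation_importance` is an alternative worth comparing against.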
### Confusion Matrix

The model performs reasonably well on both classes. Setting `class_weight="balanced"` weights each class inversely to its frequency, which discourages the model from simply predicting the majority class.
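
A confusion matrix and per-class metrics can be obtained from scikit-learn as sketched below. The label vectors are invented (ten players) and the class names passed to `target_names` are assumptions; no real model output is shown.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented ground truth and predictions (1 = overperformer).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Precision, recall and F1 for each class as formatted text.
report = classification_report(
    y_true, y_pred, target_names=["Not overperformer", "Overperformer"]
)
```

In this toy case the matrix holds 5 true negatives, 1 false positive, 1 false negative and 3 true positives.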
---
## 💡 Key Insights & Conclusions
1. **43% of players beat their xG** — clinical finishing is a minority skill, not the norm
2. **Attackers lead** in xG overperformance rate, but the gap between positions is smaller than expected
3. **Age is a weak predictor** — finishing efficiency is not simply a product of experience
4. **Overperformers play more minutes** — managers reward clinical finishing with playing time
5. **Penalties inflate overperformance** — true finishing ability is better captured by non-penalty metrics
6. **Goals Per 90 is the top classifier feature** — the model confirms what the EDA showed visually
> The data tells a clear story: xG overperformance is a real, measurable, and partially predictable trait. It is driven by shooting efficiency per 90 minutes rather than total volume, and it is rewarded with playing time. Identifying these players early has direct implications for transfer market valuation and squad planning.
---
## 🛠️ How to Reproduce
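```python
from datasets import load_dataset
import pandas as pd

# Repository ID as shown on this dataset's Hugging Face page.
ds = load_dataset("roirani80/fbref-xg-analysis-2024-2025")

# Convert the default "train" split to a pandas DataFrame.
df = ds["train"].to_pandas()
```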
---
## 📁 Repository Contents
| File | Description |
|---|---|
| `fbref_cleaned_with_features.csv` | Cleaned dataset with engineered features |
| `Assignment_1_EDA_&_Dataset_Roy_Irani.ipynb` | Full analysis notebook |
| `presentation.mp4` | Video walkthrough |
| `plot_correlation_heatmap.png` | Correlation matrix heatmap |
| `plot_outliers_boxplot.png` | Outlier detection box plots |
| `plot_q1_xg_distribution.png` | xG_Diff histogram + Goals vs xG scatter |
| `plot_q2_position_overperformers.png` | Overperformers by position |
| `plot_q3_age_clinical_finishing.png` | Age analysis — violin + scatter |
| `plot_q4_minutes_played.png` | Minutes played by group |
| `plot_q5_differentiating_stats.png` | Key differentiating statistics |
| `plot_q6_top_players.png` | Top 10 over/underperformers |
| `plot_bonus_penalty_correction.png` | Penalty-corrected analysis |
| `plot_feature_importance.png` | Random Forest feature importance |
| `plot_confusion_matrix.png` | Classification confusion matrix |
Provided by: roirani80