Davichick/Credit-score-analysis

Name: Davichick/Credit-score-analysis
Creator: Davichick
Published: 2026-04-08 11:05:01
License: No description provided

Hugging Face, updated 2026-04-08, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Davichick/Credit-score-analysis


Resource description:

---
task_categories:
- tabular-classification
language:
- en
pretty_name: Credit Score Classification EDA
configs:
- config_name: default
  data_files:
  - split: train
    path: "Credit_Score.csv"
---

# Credit Score Classification - Exploratory Data Analysis (EDA)

<div align="center">
  <video controls="controls" style="max-width: 720px; width: 100%;">
    <source src="https://huggingface.co/datasets/Davichick/Credit-score-analysis/resolve/main/Video.mp4" type="video/mp4">
    Your browser does not support the video tag.
  </video>
</div>

---

## Project Overview

In this project, I performed a comprehensive Exploratory Data Analysis (EDA) on a Credit Score dataset. My main goal was to transform a messy, real-world dataset into a clean, analyzed resource and to uncover the key financial behaviors that determine a person's credit score: Good, Standard, or Poor.

## Dataset Description

The dataset consists of 100,000 rows and 28 features. It provides a wide view of a customer's financial life, from basic demographics to complex banking history.

* **Demographics:** Age, Occupation.
* **Financial Metrics:** Annual Income, Monthly Balance, Outstanding Debt.
* **Banking History:** Number of Bank Accounts, Number of Loans, Interest Rates.
* **Target Variable:** Credit_Score.

---

## Phase 1: Data Cleaning & Preprocessing

This dataset was intentionally designed to be messy, mimicking real-world data entry errors. I followed a chronological process to clean it:

1. **Fixing Mixed Data Types:** I identified columns such as Age and Annual Income that were incorrectly stored as strings (object) because of special characters like underscores (_). I removed these characters and converted the columns to numeric values.
2. **Handling Placeholders:** I replaced categorical "placeholders" like "_______" or " _ " with the label **Unknown**, which let me keep the data rows without making false assumptions.
3. **Outlier Detection & Capping Strategy:** I discovered extreme outliers, such as an age of 8,000 or interest rates over 5,000%.

**Why I chose Capping with the Median:** I decided to handle these outliers by capping them, that is, replacing extreme, unrealistic values with the **Median** of the column. I chose this approach because the median is a robust measure of central tendency: it represents the true middle of the data and isn't pulled away by big numbers the way the average (mean) would be. By capping instead of deleting rows, I preserved the 100,000-row size of my dataset while ensuring that these errors didn't affect my graphs or lead to wrong conclusions. It allowed me to neutralize the noise without losing valuable information in other columns of the same row.

---

## Phase 2: Exploratory Data Analysis (EDA)

Once the data was clean, I moved to the visualization stage to tell the story of the data.

### Target and Demographics

I started my visualization process by looking at the distribution of the **Credit Score**, which is my target variable. This chart shows exactly how many people fall into each category: "Good", "Standard", or "Poor". It is important to understand this balance now, because this is the specific value I want to predict later in the project.

![Credit Score Distribution](Credit%20Score%20Distribution.png)

When I analyzed the **Age Distribution**, the graph appeared to end around age 60. Even though the cleaned data includes people as old as 100, the concentration of young adults in their 20s and 30s is so high that the bars for seniors are too small to be visible on this scale. This gave me a clear understanding that the dataset represents a primarily young demographic.

![Age Distribution](Age%20Distribution.png)

### Financial Insights

For the **Annual Income** graph, I decided to filter the view to show only people earning under 250,000. I did this so I could actually see the shape and distribution of the data.
If I hadn't filtered the chart, the few extremely rich people with millions in income would have forced the rest of the data into one tiny, unreadable line, making it impossible for me to see any patterns.

![Annual Income Filtered](Annual%20Income%20Filtered.png)

### Interest Rate vs Credit Score

I created a boxplot to compare Interest Rates against Credit Scores, and I found a very interesting result. People with a "Poor" credit score clearly pay much higher interest rates than those with a "Good" score. This confirms a strong relationship between these two variables, and it helps me understand one of the main factors associated with a person's credit rating.

![Interest Rate vs Credit Score](Interest%20Rate%20vs%20Credit%20Score.png)

### Heatmap (Data Correlation)

I generated this heatmap to visualize the correlations between all the numeric features in my data. The colors help me spot patterns instantly: red squares indicate a strong positive relationship, while blue squares show a negative one. This tool is essential for me because it highlights which variables, like interest rates or number of loans, are most strongly correlated with each other.

![Correlation Heatmap](Correlation%20Heatmap.png)

---

## Phase 3: Research Question & Findings

**My Research Question:** "What are the key financial behaviors that separate a Good credit customer from a Poor one?"

### 1. The Occupational Factor

I wanted to investigate whether a person's job title has a real impact on their credit score. To answer this, I used a bar chart to compare the number of "Good", "Standard", and "Poor" credit ratings within different professions. I found that the credit score distribution is almost identical across all professions. This indicates that **financial habits** are far more important than a **job title**.

![Occupation Analysis](Occupation%20Analysis.png)

### 2. The Delay Threshold

Do delayed payments affect a person's credit score? I used a boxplot to find out how many delayed payments it takes before a person's credit score drops to "Poor". This type of chart suits the question because it shows the median number of delays and the spread of the data for each credit category. I discovered a clear tipping point: customers with a "Good" rating rarely have more than 8-10 delayed payments. Once a customer crosses 15-17 delays, they almost certainly fall into the "Poor" category.

![Delayed Payments Analysis](Delayed%20Payments%20Analysis.png)

### 3. The Debt Burden

How does outstanding debt affect a person's credit score compared to the overall average? Finally, I analyzed Outstanding Debt to see how it relates to the credit score compared to the general population. I chose a KDE plot because it visualizes the density of the data, making it easy to see where most people in each group are clustered financially. I also added a red dashed line to represent the overall average debt of the entire dataset.

I compared **Outstanding Debt** against the **Overall Average (1426.22)**. The blue "Good" curve peaks far to the left of the average line, while the green "Poor" curve is shifted heavily to the right. This is the "Debt Trap" in visual form.

![Debt Comparison KDE](Debt%20Comparison%20KDE.png)

---

## Final Conclusion & Summary

This project has been a complete journey from raw, messy data to actionable financial insights.

**My primary conclusion** is that a high credit score is not a result of high income or a prestigious job. Instead, it is a reflection of **financial discipline**. My analysis shows that the two most critical red flags for a credit score are carrying an **Outstanding Debt** higher than the population average and allowing **Delayed Payments** to exceed a count of 10.
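
The two red flags named above can be checked per credit class with a short pandas summary. The following is only a sketch on toy rows invented for illustration, not the analysis code behind the charts; the column names `Outstanding_Debt` and `Num_of_Delayed_Payment` are assumptions and may differ from the headers in `Credit_Score.csv`.

```python
import pandas as pd

# Toy rows invented for illustration; they are NOT taken from Credit_Score.csv.
# Column names are assumptions based on the features described in this README.
df = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Poor", "Poor", "Poor"],
    "Outstanding_Debt": [300.0, 900.0, 1200.0, 2500.0, 1900.0, 1000.0],
    "Num_of_Delayed_Payment": [2, 8, 12, 18, 16, 9],
})

# Overall average debt across all customers (the "red dashed line" idea).
overall_mean_debt = df["Outstanding_Debt"].mean()

# Flag each row: debt above the overall mean, and more than 10 delayed payments.
flags = df.assign(
    debt_above_mean=df["Outstanding_Debt"] > overall_mean_debt,
    delays_above_10=df["Num_of_Delayed_Payment"] > 10,
)

# The mean of a boolean column is the share of rows where the flag is True,
# so this gives the fraction of each credit class that trips each red flag.
summary = flags.groupby("Credit_Score")[["debt_above_mean", "delays_above_10"]].mean()
print(summary)
```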
By cleaning the data with the capping method, I was able to create a stable and reliable dataset. This means that any future work on this data will be more accurate and won't be distorted by the original data entry errors. The "Good" customer profile is now clearly defined: low debt, few delays, and consistent payment behavior.

## Libraries and DS Used

* **Pandas & NumPy:** For data manipulation, cleaning, and capping.
* **Matplotlib & Seaborn:** For professional-grade visualizations and statistical plotting.
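
The three cleaning steps from Phase 1 (stripping stray underscores, labelling placeholders as Unknown, and replacing out-of-range values with the column median) can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the original notebook code: the helper names, the toy rows, the column names, and the 0-100 age bounds are all hypothetical. Note that replacing an out-of-range value with the median is what the README calls capping; it differs from clipping a value to a boundary.

```python
import pandas as pd

# Toy rows invented for illustration; they are NOT taken from Credit_Score.csv.
# Column names are assumptions based on the features described in this README.
raw = pd.DataFrame({
    "Age": ["23", "28_", "8000", "35", "41_"],
    "Annual_Income": ["19114.12", "34847.84_", "143162.64", "30689.89", "35547.71_"],
    "Occupation": ["Scientist", "_______", "Teacher", "Engineer", " _ "],
})


def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Strip stray underscores, then convert to numbers (unparseable -> NaN)."""
    stripped = series.astype(str).str.replace("_", "", regex=False).str.strip()
    return pd.to_numeric(stripped, errors="coerce")


def replace_placeholders(series: pd.Series, label: str = "Unknown") -> pd.Series:
    """Replace entries made only of underscores and whitespace with a label."""
    is_placeholder = series.astype(str).str.fullmatch(r"[\s_]+")
    return series.mask(is_placeholder, label)


def replace_outliers_with_median(series: pd.Series, lower: float, upper: float) -> pd.Series:
    """Replace values outside [lower, upper] with the column median; keep NaN as-is."""
    keep = series.between(lower, upper) | series.isna()
    return series.where(keep, series.median())


clean = raw.copy()
clean["Age"] = to_numeric_clean(clean["Age"])
clean["Annual_Income"] = to_numeric_clean(clean["Annual_Income"])
clean["Occupation"] = replace_placeholders(clean["Occupation"])
# The 0-100 bounds are an assumed plausibility range, not a value from the source.
clean["Age"] = replace_outliers_with_median(clean["Age"], lower=0, upper=100)
print(clean)
```

On the toy rows, the age of 8000 is replaced by the column median while every row is kept, which mirrors the reason given above for preferring this approach over deleting rows.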

Provider:

Davichick
