
itaimorag/video-games

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/itaimorag/video-games
Resource description:
---
license: mit
---

# Video Game Sales: Exploratory Data Analysis (EDA)

## 1. Objective

The purpose of this project is to analyze historical video game sales data to identify the key factors that influence commercial success. Using Exploratory Data Analysis (EDA) techniques, this project examines the relationship between game genres, critical reception, and global sales performance.

A comprehensive exploratory analysis was conducted across the dataset. The complete analysis, including all data visualizations, data preprocessing, and supporting Python code, is available in the accompanying Jupyter Notebook (`Assignment_1_EDA_&_Dataset_new.ipynb`).

## 2. Dataset Overview

This dataset contains historical sales data of video games, alongside professional critic and aggregate user scores. Each observation represents a single video game release with associated categorical and numerical attributes.

### Key Feature Categories:

* **Game Identifiers:** Name, Platform, Year of Release, Publisher.
* **Categorical Groupings:** Genre (e.g., Action, Sports, RPG).
* **Financial Metrics:** North American, European, Japanese, and Global Sales (measured in millions of units).
* **Reception Metrics:**
  * `Critic_Score`: Aggregate score compiled by professional staff (out of 100).
  * `User_Score`: Aggregate score from the general gaming public (out of 10).

## 3. Methodology

The analysis follows a structured, decision-driven pipeline designed to uncover patterns in video game consumer behavior while maintaining data integrity.

### 3.1 Data Cleaning & Invalid Values

* **Future Dates:** Initial checks revealed games with release years listed beyond 2016 (e.g., 2017, 2020). Because the dataset was compiled in 2016, these were removed to ensure historical accuracy and prevent predictive bias.
* **Handling 'tbd':** Values marked as "tbd" (To Be Determined) in the `User_Score` column were converted to `NaN`. The column was then cast to a float datatype to allow for statistical calculations.
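
The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the column name `Year_of_Release`, the helper `clean_games`, and the choice to keep rows whose year is missing are all assumptions; only `User_Score`, the "tbd" marker, and the 2016 cutoff come from the description above.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real file; all values are illustrative only.
toy = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Year_of_Release": [2008, 2017, 2020, 2015],  # assumed column name
        "User_Score": ["8.0", "tbd", "7.5", "tbd"],
    }
)

CUTOFF_YEAR = 2016  # the dataset is described as compiled in 2016


def clean_games(df: pd.DataFrame, cutoff_year: int = CUTOFF_YEAR) -> pd.DataFrame:
    """Drop releases dated after the cutoff and make User_Score numeric.

    Hypothetical helper written for this sketch.
    """
    year = df["Year_of_Release"]
    # Keep rows at or before the cutoff; rows with a missing year are kept
    # as well (an assumption, the source does not say how they were handled).
    out = df[year.isna() | (year <= cutoff_year)].copy()
    # Replace the literal string "tbd" with NaN, then cast the column to float.
    scores = out["User_Score"]
    out["User_Score"] = scores.mask(scores == "tbd").astype(float)
    return out.reset_index(drop=True)


cleaned = clean_games(toy)
print(cleaned)
```

Using `mask` targets only the literal "tbd" marker, so any other unexpected string would raise during the float cast instead of being silently turned into `NaN`.
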
### 3.2 Missing Values Treatment

Although there are significant missing values in the scoring columns (`Critic_Score`, `User_Score`), they were intentionally left as `NaN`.

* **Rationale:** Imputing these scores (e.g., using a mean or median) would artificially inflate or deflate the perceived quality of games, severely distorting the actual distribution of game ratings and introducing unwanted bias into future predictive models.

### 3.3 Outlier Treatment

Outliers were evaluated visually using Box Plots.

* **Rationale:** The dataset contains extreme sales outliers. These are not data errors but valid reflections of the hit-driven nature of the video game market. Extreme sales outliers (such as *Wii Sports*) were therefore retained to preserve real-world industry variance.

## 4. Exploratory Data Analysis (EDA)

### 4.1 Feature Correlations

#### What is the mathematical relationship between the different numerical features?

![Correlation_Heatmap](https://cdn-uploads.huggingface.co/production/uploads/69bfc562669dad5fb4dd1772/PNOdNHPZHWAWB94jEoxhi.png)

* **Observation:** The correlation matrix reveals the linear relationships between numeric variables. There is a noticeable positive correlation between `Critic_Score` and sales metrics, particularly `Global_Sales`. Categorical data such as `Genre` and `Platform` must also be factored in.

### 4.2 Historic Genre Profitability

#### What are the most profitable video game genres of all time?

![Total_Global_Sales_by_Genre_(1980_2016)](https://cdn-uploads.huggingface.co/production/uploads/69bfc562669dad5fb4dd1772/4lp-ilNCKI9gB6gJxYGL_.png)

* **Observation:** The Action and Sports genres heavily dominate the total global sales charts from 1980 to 2016, selling significantly more units than Strategy or Puzzle games.

### 4.3 Genre Popularity Over Time

#### How has the popularity of top genres shifted across the years?
![Global_Sales_Trend_of_Top_5_Genres_(1995_2016)](https://cdn-uploads.huggingface.co/production/uploads/69bfc562669dad5fb4dd1772/_kpTvp85S7p1_itc9LL5y.png)

* **Observation:** While Action and Sports have the highest historical sales, their popularity fluctuates. The line chart shows massive peaks for certain genres during specific timeframes (e.g., the rise of Shooters in the late 2000s).

## 5. Methodological Insight & Machine Learning Preprocessing

Prior to utilizing this dataset for predictive modeling (e.g., predicting `Global_Sales` based on scores), **feature scaling is mandatory**. The numerical features exist on vastly different scales:

* `Global_Sales` (measured in millions of units)
* `Critic_Score` (scored out of 100)
* `User_Score` (scored out of 10)

Applying a `StandardScaler` or `MinMaxScaler` will be necessary to prevent features with larger numerical magnitudes from improperly dominating distance-based machine learning algorithms.

## 6. Final Conclusion

This analysis demonstrates that a video game's commercial success is not driven by isolated variables, but rather by the interaction between its genre, the era of its release, and its critical reception. The industry is highly hit-driven, meaning risk models must account for extreme outliers rather than discarding them. Overall, the findings emphasize that effective sales forecasting requires combining multiple dimensions of data rather than relying on a single variable like a review score.

## 7. Limitations

* **Timeframe:** The dataset's tracking effectively ends in late 2016. It does not account for the massive shift to purely digital distribution or the rise of modern live-service games.
* **Missing Historic Data:** Many retro games (pre-2000) have `NaN` for critic scores, as modern review aggregation websites did not exist to log them.
* **Causality:** The analysis is based on observational data and does not establish a causal link between high scores and high sales.

## 8. Author

itay morag
Provider:
itaimorag