SteamGames ML OPS is one of the final integrative projects of the Henry Data Science Bootcamp. The project revolves around the SteamGames datasets, which cover information about games, users and developers from the Steam platform. The main goal of this project was to transform the raw data into a clean, well-organized and optimized format. The data was then analyzed and used to train a machine learning model for a recommendation system, and it is served through specific query endpoints of an API.
Here you can access my Steam Games Queries API
- API/: Contains files related to the implementation of the API endpoints for data consumption. For a detailed description of the API and its implementation, please refer to the API README.
- datasets/: Divided into two subfolders:
  - processed/
    - games.parquet: Processed data about games.
    - items.parquet: Processed data about items.
    - reviews.parquet: Processed data about reviews.
    - users.parquet: Processed data about users.
  - raw/
    - Diccionario de Datos STEAM.xlsx: Steam dataset dictionary.
    - steam_games.json.gz: Raw data about Steam games.
    - user_reviews.json.gz: Raw data about user reviews.
    - users_items.json.gz: Raw data about user items.

  The raw subfolder includes the Steam datasets provided by Henry for analysis, while the processed subfolder contains the resulting DataFrames after the ETL (Extract, Transform, Load) process.
- .gitignore: Specifies untracked files and directories that Git should ignore.
- EDA.ipynb: Jupyter Notebook for Exploratory Data Analysis.
- ETL.ipynb: Jupyter Notebook for the Extract, Transform, Load (ETL) process.
- ML_Model.ipynb: Jupyter Notebook for developing the machine learning model.
- requirements.txt: Specifies project dependencies.
This ETL (Extract, Transform, Load) script focuses on processing and transforming the raw data, performing sentiment analysis and web scraping along the way. The process is divided into three main sections: User_Reviews, User_Items and Games.
The raw user reviews data is loaded from the compressed JSON file ('user_reviews.json.gz'). The file is read line by line, and each line is parsed from a Python-literal string into a dictionary using ast.literal_eval. The resulting dictionaries are stored in a list, and a DataFrame is created from this list.
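A minimal sketch of this loading step (the helper name and the tiny demo record are illustrative; only the line-by-line ast.literal_eval approach comes from the ETL):

```python
import ast
import gzip
import os
import tempfile

import pandas as pd

def load_literal_gz(path: str) -> pd.DataFrame:
    """Read a gzip-compressed file with one Python-literal dict per line."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            # Lines use Python-style dicts (single quotes), so json.loads
            # would fail; ast.literal_eval parses them safely.
            rows.append(ast.literal_eval(line))
    return pd.DataFrame(rows)

# Tiny demo with a temporary file standing in for user_reviews.json.gz
tmp = tempfile.NamedTemporaryFile(suffix=".json.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt", encoding="utf-8") as fh:
    fh.write("{'user_id': 'u1', 'reviews': [{'item_id': '10'}]}\n")
df_demo = load_literal_gz(tmp.name)
os.unlink(tmp.name)
```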
- The 'reviews' column is exploded to create a new DataFrame, reviews_df, expanding the nested structure.
- The 'user_id' column is concatenated with reviews_df to form the final processed DataFrame, df_reviews.
- Duplicate rows are removed, and non-relevant columns ('funny', 'last_edited', 'helpful', 0) are dropped.
- Records with at least one null value are identified and displayed for review.
- Records with null values in all specific columns ('posted', 'item_id', 'recommend', 'review') are dropped.
- The 'posted' column is converted to a string data type.
- The 'item_id' column is converted to an integer data type.
- The 'recommend' column is converted to a boolean data type.
- A new column, 'year_posted,' is created by extracting the year from the 'posted' column.
- Missing values in 'year_posted' are filled using the mean year calculated for each 'item_id' and the overall mean.
- The 'year_posted' column is converted to integer values.
- The original 'posted' column is dropped.
- Sentiment analysis is conducted on the 'review' column using the TextBlob library.
- A new column, 'sentiment_analysis,' is created to store sentiment scores (0 for negative, 1 for neutral, 2 for positive).
- The original 'review' column is dropped.
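The sentiment step can be sketched as below. The polarity cut-offs are an assumption, not the notebook's exact thresholds; with TextBlob (as used in the ETL) the polarity score comes from `TextBlob(review_text).sentiment.polarity`:

```python
def polarity_to_label(polarity: float, neutral_band: float = 0.1) -> int:
    """Map a polarity score in [-1, 1] to the ETL's sentiment labels:
    0 = negative, 1 = neutral, 2 = positive.

    The neutral_band width is a hypothetical threshold for illustration.
    """
    if polarity < -neutral_band:
        return 0
    if polarity > neutral_band:
        return 2
    return 1

# In the ETL this would be applied per review, e.g.:
#   from textblob import TextBlob
#   df_reviews["sentiment_analysis"] = df_reviews["review"].apply(
#       lambda t: polarity_to_label(TextBlob(t).sentiment.polarity))
```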
The final processed DataFrame from the User_Reviews section is exported to the 'datasets/processed' directory in Parquet format for further analysis.
This section of the ETL process focuses on handling the raw user/items data. The script extracts, transforms, and loads the data to create two processed datasets: users.parquet and items.parquet.

The raw user items data is loaded from the compressed JSON file ('users_items.json.gz'). Each line is converted to a dictionary using ast.literal_eval, and the resulting dictionaries are stored in a list.
- A DataFrame, df_users_items, is created from the list of dictionaries.
- The 'steam_id' column is converted to integers.
- Duplicate rows based on all columns except the last one are removed.
- Records where 'items_count' is '0' (users who don't own any games) are filtered and removed.
- The 'items' column, which contains lists of dictionaries, is transformed into a separate DataFrame, df_items.
- The 'user_id' column is replicated according to the 'items_count' value.
- The 'user_id_replicated' column is added to the 'df_items' DataFrame.
- The 'item_id' column is converted to integers.
- Columns are renamed for clarity.
- The unnecessary 'playtime_2weeks' column is removed from the 'df_items' DataFrame.
- The 'items' column is dropped from the df_users_items DataFrame.
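The items-flattening steps can be sketched with toy data as follows; the column names come from the ETL description, while the toy records and the explode/json_normalize combination are assumptions about how the notebook implements it:

```python
import pandas as pd

# Toy stand-in for the raw user/items structure
df_users_items = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "items_count": [2, 1],
    "items": [
        [{"item_id": "10", "item_name": "A", "playtime_forever": 5},
         {"item_id": "20", "item_name": "B", "playtime_forever": 0}],
        [{"item_id": "30", "item_name": "C", "playtime_forever": 7}],
    ],
})

# Flatten the nested 'items' lists into their own DataFrame ...
df_items = pd.json_normalize(df_users_items["items"].explode().tolist())
# ... and replicate 'user_id' once per owned item, as 'items_count' dictates
df_items["user_id_replicated"] = (
    df_users_items["user_id"].repeat(df_users_items["items_count"]).to_numpy()
)
df_items["item_id"] = df_items["item_id"].astype(int)
```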
The df_users DataFrame provides information about users, including their unique identifier, the count of items they own, their Steam identifier, and the associated user URL. The df_items DataFrame, on the other hand, contains details about the items, such as the associated user identifier, the unique item identifier, the item name, and the playtime in minutes.
Two processed datasets, users.parquet and items.parquet, are exported to the 'datasets/processed' directory in Parquet format.
This section of the ETL process is dedicated to handling the raw games data. The script extracts, transforms, and loads the data to create a processed dataset named games.parquet.
The raw games data is loaded from the compressed JSON file ('steam_games.json.gz'). Each line, representing a game, is converted from a JSON string to a Python dictionary using the json.loads function. The resulting dictionaries are stored in a list, games_row.
- A DataFrame, df_games, is created from the list of dictionaries.
- Rows with all missing values are dropped.
- The 'release_date' column is converted to datetime format, and the 'release_year' column is extracted.
- Unnecessary columns ('specs', 'early_access', 'price') are removed.
- Missing values in 'title', 'developer', 'genre' and 'release_year' are filled using web scraping with a custom WebScraper class.
- Missing values in 'developer' are imputed with values from 'publisher'.
- Missing values in 'title' are imputed with values from 'app_name'.
- Missing values in 'genres' are imputed with the most relevant values from 'tags'.
- Rows with missing values in the 'id' column are removed.
- Missing values in 'genres' are imputed with the string 'unknown'.
- Missing values in 'developer' are imputed with the string 'unknown'.
- Missing values in 'release_year' are imputed with the rounded mean of existing values.
- The most relevant genre for each record is determined based on genre frequency distribution.
- Columns are renamed for clarity.
- The 'item_id' column is converted to integers.
- The 'release_year' column is converted to integers.
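The "most relevant genre" step can be sketched like this; reading "most relevant" as the globally most frequent genre among a record's genres is an assumption about the notebook's logic, and the toy data is illustrative:

```python
import pandas as pd

# Toy 'genres' column: each record holds a list of genres
genres = pd.Series([["Action", "Indie"], ["Indie"], ["Action", "RPG"]])

# Global frequency distribution over all genre occurrences
freq = genres.explode().value_counts()

def most_relevant(genre_list):
    # Pick the genre in this record with the highest global frequency
    # (ties resolve to the first-listed genre)
    return max(genre_list, key=lambda g: freq.get(g, 0))

main_genre = genres.apply(most_relevant)
```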
The processed dataset, games.parquet, is exported to the 'datasets/processed' directory in Parquet format.
This exploratory data analysis (EDA) focuses on the datasets resulting from the ETL process: games.parquet, items.parquet, reviews.parquet, and users.parquet. The primary goal is to identify key features for the development of a game recommendation model. The analysis then merges the necessary features from the different datasets and optimizes object types to improve memory-usage efficiency. The end objective is to deploy a RESTful API housing all the data needed both to power the recommendation model and to serve specific queries against the datasets.
Overall, the EDA script systematically addresses each dataset, applying techniques that optimize memory usage while retaining crucial information. A more detailed description of the analysis is available in the EDA notebook.
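A minimal sketch of the kind of dtype optimization the EDA applies, with hypothetical data; the exact columns and conversions in the notebook may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "developer": ["valve", "valve", "cdpr", "valve"],
    "release_year": [2004, 2011, 2015, 2007],
})

before = df.memory_usage(deep=True).sum()
# Low-cardinality object columns compress well as 'category' ...
df["developer"] = df["developer"].astype("category")
# ... and integer columns can be downcast to the smallest fitting type
df["release_year"] = pd.to_numeric(df["release_year"], downcast="integer")
after = df.memory_usage(deep=True).sum()
```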
The ML_Model.ipynb notebook implements a game recommendation system using machine learning techniques. It is designed to provide users with recommendations for games similar to a specified one, employing a K-nearest neighbors approach after transforming the input data.
The script is divided into several key sections:
1. Importing Libraries
The required libraries for the project are imported, including Pandas for data manipulation, Scikit-learn for machine learning tools, and Joblib for model persistence.
2. Importing Games Dataset
The script reads the processed games dataset and performs initial data preprocessing, including dropping unnecessary columns and optimizing data types for memory efficiency.
3. Exporting API Dataset
The preprocessed games dataset is exported in Parquet format for use in the API.
4. TF-IDF Vectorization
The script creates TF-IDF vectorizers for the 'developer,' 'genre,' and 'tags' columns, along with a transformer for the 'release_year' column. These components are combined using Scikit-learn's ColumnTransformer.
5. Creating KNN Model
A K-nearest neighbors model is created using the cosine distance metric and a brute-force algorithm.
6. Creating Preprocessing and KNN Model Pipeline
The TF-IDF vectorizers and KNN model are combined into a Scikit-learn pipeline for ease of use.
7. Fitting Pipeline to the DataFrame
The pipeline is fitted to the preprocessed games DataFrame.
8. Saving Model
The trained pipeline is saved using Joblib for later use in the API.
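Steps 4 through 8 can be sketched as follows. The column names come from the README; the toy data and the MinMaxScaler choice for 'release_year' are assumptions, and persisting with joblib is shown only as a comment:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the preprocessed games DataFrame
df = pd.DataFrame({
    "developer": ["valve", "valve", "cdpr"],
    "genre": ["action", "action rpg", "rpg"],
    "tags": ["shooter", "shooter coop", "story"],
    "release_year": [2004, 2011, 2015],
})

# Step 4: TF-IDF per text column plus a numeric transformer for the year
preprocessor = ColumnTransformer([
    ("developer", TfidfVectorizer(), "developer"),
    ("genre", TfidfVectorizer(), "genre"),
    ("tags", TfidfVectorizer(), "tags"),
    ("year", MinMaxScaler(), ["release_year"]),
])

# Steps 5-6: cosine-distance brute-force KNN, combined into one pipeline
pipe = Pipeline([
    ("preprocess", preprocessor),
    ("knn", NearestNeighbors(metric="cosine", algorithm="brute")),
])

# Step 7: fit; step 8 would persist with joblib.dump(pipe, "model.joblib")
pipe.fit(df)

# Querying: transform a row, then ask the KNN step for its neighbors
X = pipe.named_steps["preprocess"].transform(df)
dist, idx = pipe.named_steps["knn"].kneighbors(X[0:1], n_neighbors=2)
```

A row's nearest neighbor is itself (cosine distance 0); the remaining neighbors are the recommendation candidates.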
9. Function to Get Recommendations
A function, game_recommendations, is defined to provide game recommendations based on a specified game (item_id) using the K-nearest neighbors approach.
10. Application Example
An example is provided to demonstrate how to use the recommendation function with a specific item_id, resulting in a list of recommended titles.
To use the recommendation system:
- Ensure that the required libraries are installed (see requirements.txt).
- Run the script in a Jupyter notebook or Python environment.
- Utilize the game_recommendations function with a specific item_id to get recommendations.
item_id: A 6-digit integer representing the game for which recommendations are sought.
Application example:

```python
game_recommendations(643980)
```

This will output a list of 5 recommended titles similar to the specified game:

```python
[{'1': 'Brief Karate Foolish'},
 {'2': 'Nightside Demo'},
 {'3': "Defender's Quest: Valley of the Forgotten (DX edition)"},
 {'4': 'Labyrinth - Starter Pack'},
 {'5': 'MINDNIGHT'}]
```
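A sketch of what such a function can look like. The toy df_games, the feature vectors, and the skip-self logic are hypothetical stand-ins for the fitted pipeline in ML_Model.ipynb; only the output shape mirrors the example above:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-ins for the real games data and feature matrix
df_games = pd.DataFrame({
    "item_id": [100, 200, 300, 400],
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
})
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(features)

def game_recommendations(item_id: int, n: int = 5) -> list:
    """Return the n titles most similar to item_id, skipping the game itself."""
    row = df_games.index[df_games["item_id"] == item_id][0]
    n = min(n, len(df_games) - 1)
    # Ask for one extra neighbor, since the nearest one is the game itself
    _, idx = knn.kneighbors([features[row]], n_neighbors=n + 1)
    neighbors = [i for i in idx[0] if i != row][:n]
    return [{str(rank + 1): df_games.loc[i, "title"]}
            for rank, i in enumerate(neighbors)]
```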
Notes:
The script assumes a Jupyter environment for execution.
The dataset path and other configurations can be adjusted as needed.
Feel free to explore and adapt the script to meet your specific requirements.