This project focuses on building a comprehensive Extract, Transform, and Load (ETL) pipeline leveraging the robust functionalities of Microsoft Azure. The pipeline is designed to efficiently fetch data from Azure Blob Storage, perform the necessary transformations using Azure Databricks and Azure Data Factory, store the results in Azure Data Lake Storage, and finally load the analyzed data into a SQL layer for further use. This end-to-end solution aims to streamline the data processing workflow and provide actionable insights from the transformed data.
The dataset used in this project contains information for over 45,000 movies with 26 million ratings! Since this is primarily a data engineering project, we focus on only one of the datasets, "movies-metadata", which contains information such as budget, revenue, release dates, languages, and production countries and companies.
We perform language-based analysis on the movie dataset, focusing on entries whose original language is non-numeric. We generate a summary DataFrame that includes the number of movies, average budget, average revenue, and average popularity for each language. This analysis helps in understanding the impact of the original language on various movie metrics.
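A minimal PySpark sketch of this step is shown below. It assumes the raw CSV has already been read into `movies_df` and that the columns are named `original_language`, `budget`, `revenue`, and `popularity` (names taken from the movies-metadata file; adjust to your schema).

```python
from pyspark.sql import functions as F

# Keep only rows whose original_language is present and not purely numeric
# (the raw CSV contains a few malformed rows where numbers leak into this column).
lang_df = movies_df.filter(
    F.col("original_language").isNotNull()
    & ~F.col("original_language").rlike("^[0-9.]+$")
)

# Summary per language: movie count plus average budget, revenue, and popularity.
language_summary = (
    lang_df.groupBy("original_language")
    .agg(
        F.count("*").alias("num_movies"),
        F.avg(F.col("budget").cast("double")).alias("avg_budget"),
        F.avg(F.col("revenue").cast("double")).alias("avg_revenue"),
        F.avg(F.col("popularity").cast("double")).alias("avg_popularity"),
    )
    .orderBy(F.desc("num_movies"))
)
```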
We analyze movies released after 1995. We calculate the number of movies released per year and perform a genre analysis to identify the top 5 genres for each year. The final output summarizes the number of movies released and the top 5 genres for each year after 1995.
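The sketch below outlines one way to do this in PySpark. It assumes a `release_date` column and that the JSON-like `genres` column has already been parsed into an array of genre names called `genre_names`; that column name is illustrative, not part of the raw file.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Movies released after 1995, with the release year extracted.
recent = (
    movies_df
    .withColumn("release_year", F.year(F.to_date("release_date")))
    .filter(F.col("release_year") > 1995)
)

# Number of movies released per year.
movies_per_year = recent.groupBy("release_year").agg(F.count("*").alias("num_movies"))

# Explode the parsed genre names and count occurrences per (year, genre).
genre_counts = (
    recent
    .withColumn("genre", F.explode("genre_names"))
    .groupBy("release_year", "genre")
    .agg(F.count("*").alias("genre_count"))
)

# Rank genres within each year and keep the top 5.
w = Window.partitionBy("release_year").orderBy(F.desc("genre_count"))
top5_genres = (
    genre_counts
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") <= 5)
)

# Combine the yearly counts with the top-5 genre lists into one summary.
yearly_summary = movies_per_year.join(
    top5_genres.groupBy("release_year").agg(F.collect_list("genre").alias("top_5_genres")),
    on="release_year",
)
```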
We conduct a comprehensive analysis of movie genres in the dataset. We calculate the distribution of genres and aggregate the average budget, revenue, rating, and popularity for each genre. The final output summarizes the average budget, revenue, rating, popularity, and number of movies for each genre, providing insights into the characteristics and trends of different movie genres.
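A sketch of this aggregation follows, under the same assumptions as above (a parsed `genre_names` array column) and using `vote_average` as the rating column.

```python
from pyspark.sql import functions as F

# One row per (movie, genre), assuming genre_names was parsed from the raw genres column.
exploded = movies_df.withColumn("genre", F.explode("genre_names"))

# Average budget, revenue, rating, and popularity per genre, plus movie counts,
# ordered by how common each genre is.
genre_summary = (
    exploded.groupBy("genre")
    .agg(
        F.avg(F.col("budget").cast("double")).alias("avg_budget"),
        F.avg(F.col("revenue").cast("double")).alias("avg_revenue"),
        F.avg(F.col("vote_average").cast("double")).alias("avg_rating"),
        F.avg(F.col("popularity").cast("double")).alias("avg_popularity"),
        F.count("*").alias("num_movies"),
    )
    .orderBy(F.desc("num_movies"))
)
```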
- Microsoft Azure subscription
- Azure Blob Storage: Object storage service for storing large amounts of unstructured data
- Azure Data Lake Gen2 Storage: Scalable storage for big data analytics
- Azure Databricks: Analytics platform based on Apache Spark for big data processing and machine learning
- Azure SQL Server: Fully managed relational database service with SQL Server engine
- Azure SQL Database: Managed relational database with SQL Server compatibility
- Azure Data Factory: Data integration service that enables you to create, schedule, and orchestrate ETL and ELT workflows. ADF provides a visually intuitive interface to build data-driven workflows for orchestrating data movement and transforming data at scale
- Extract Data: Retrieve CSV data from Azure Blob Storage for processing
- Transform Data: Utilize Python and PySpark on Azure Databricks to analyze and process the data, storing the results in Azure Data Lake Storage Gen2 (see the condensed PySpark sketch after this list)
- Load Data: Transfer the processed data into an Azure SQL database to establish a reporting layer for dashboard creation
- Automation: Construct end-to-end pipelines in Azure Data Factory to automate the data flow from extraction to reporting layers
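As a rough picture of how the extract, transform, and load-to-lake stages map onto the Databricks notebook, here is a condensed sketch. The storage account and container names are placeholders, and the transformation shown is just the per-language summary from above.

```python
from pyspark.sql import functions as F

storage_account = "<storage-account>"   # placeholder
blob_container = "movies-input"         # placeholder
adls_container = "processed"            # placeholder

# Extract: read the raw CSV from Blob Storage
# (the account key is set via spark.conf; see the Databricks setup section below).
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net/movies_metadata.csv")
)

# Transform: e.g. the per-language summary described earlier.
summary_df = (
    raw_df.groupBy("original_language")
    .agg(
        F.count("*").alias("num_movies"),
        F.avg(F.col("revenue").cast("double")).alias("avg_revenue"),
    )
)

# Load (analytics layer): write Parquet to Azure Data Lake Storage Gen2.
summary_df.write.mode("overwrite").parquet(
    f"abfss://{adls_container}@{storage_account}.dfs.core.windows.net/language_summary"
)
```

The final copy from the Data Lake into Azure SQL is then handled by the Data Factory pipeline described in the setup steps below.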
- Create a container in your Azure Blob Storage account
- Upload your input data files to the container
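If you prefer scripting this setup over using the portal, a small sketch with the azure-storage-blob Python SDK is shown below; the container name, file name, and connection string are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Connection string from the storage account's "Access keys" blade (placeholder).
conn_str = "<blob-storage-connection-string>"

service = BlobServiceClient.from_connection_string(conn_str)
container = service.create_container("movies-input")  # input container for the raw CSV

# Upload the raw dataset file into the container.
with open("movies_metadata.csv", "rb") as data:
    container.upload_blob(name="movies_metadata.csv", data=data, overwrite=True)
```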
- Create a file system in your Azure Data Lake Storage account
- Create a directory in the file system to store the processed Parquet files
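The same setup can be scripted with the azure-storage-file-datalake SDK; the file system and directory names below are placeholders matching this project's layout.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Connection string for the Data Lake Gen2 storage account (placeholder).
conn_str = "<adls-gen2-connection-string>"

service = DataLakeServiceClient.from_connection_string(conn_str)

# File system (container) for processed output, plus a directory for the Parquet files.
fs = service.create_file_system(file_system="processed")
fs.create_directory("movie-insights")
```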
- Create and configure a cluster with the necessary dependencies and settings for Python and PySpark.
- In this project, I created a cluster with the configuration: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
- Create a new notebook in your Azure Databricks workspace
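One common way to let the notebook read from Blob Storage and write to Data Lake Gen2 is to set the storage account key in the Spark configuration, as sketched below. The account name and secret-scope/key names are placeholders; OAuth with a service principal is an equally valid alternative.

```python
storage_account = "<storage-account>"  # placeholder

# Account key pulled from a Databricks secret scope (placeholder scope and key names).
account_key = dbutils.secrets.get(scope="etl-secrets", key="storage-account-key")

# Enables the wasbs:// (Blob) and abfss:// (ADLS Gen2) paths used in the notebook.
spark.conf.set(f"fs.azure.account.key.{storage_account}.blob.core.windows.net", account_key)
spark.conf.set(f"fs.azure.account.key.{storage_account}.dfs.core.windows.net", account_key)
```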
- Create a SQL database in Azure
- Create the necessary tables in the database to store the processed data
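The tables can be created from the portal's query editor or scripted; a minimal pyodbc sketch for one of the summary tables is shown below, with an illustrative schema and placeholder connection details.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server-name>.database.windows.net;"
    "DATABASE=<database-name>;"
    "UID=<user>;PWD=<password>"
)

# Example reporting table for the per-language summary (illustrative columns).
conn.cursor().execute("""
    CREATE TABLE language_summary (
        original_language NVARCHAR(10),
        num_movies        INT,
        avg_budget        FLOAT,
        avg_revenue       FLOAT,
        avg_popularity    FLOAT
    )
""")
conn.commit()
conn.close()
```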
- Create a new pipeline in your Azure Data Factory instance
- Create a linked service for each of the Azure services used. I created linked services for: Azure Blob Storage, Azure Data Lake Gen2 Storage, Azure Databricks, and Azure SQL
- Add activities to the pipeline for each step of the workflow:
  - Data processing in Azure Databricks
  - Data copy from Azure Data Lake Storage to Azure SQL Database
- Running the pipeline:
  - Trigger the pipeline run in Azure Data Factory
  - Monitor the pipeline run to ensure that each step completes successfully
By the end of it all, we have a fully automated ETL pipeline that fetches the data from Blob Storage, transforms it to generate the necessary insights, and loads the results into the reporting layer :)