Skip to content

Data Engineering project. Data modeling and ETL pipeline development for a small postrgres database. (Udacity Data Engineering Nanodegree program project)

Notifications You must be signed in to change notification settings

caseywhorton/Sparkify-Data-Model-ETL

Repository files navigation

Photo by Weston MacKinnon on Unsplash

Sparkify Songplay Data Extract Transform Load (ETL)

Introduction

This project is intended to create a postgres database for Sparkify platform song and artist data and combine that information with Sparkify songplay data to create a descriptive set of records of Sparkify platform song plays.

Table of Contents

  • Installation
  • Usage
  • Database Design
  • ETL Process
  • Files
    • data (folder)
      • log_data
      • song_data
    • create_tables.py
    • etl.py
    • sql_queries.py
    • etl.ipynb
    • test.ipynb

Installation

Python 3

pip install pyscopg2

Usage

From the command line, run these commands:

$ python create_tables.py

$ python etl.py

Database Design

The Sparkify database consists of only 5 separate tables linked by primary and foreign keys. Below are descriptions of the tables and schema.

Table: songplay

Contains data to show the songs listened to by Sparkify platform users. Also shows the time and location of when and where the user listened to the song.

Type Column Type
PK songplay_id int
FK start_time timestamp
FK user_id int
null level varchar
FK song_id varchar
FK artist_id varchar
null session_id varchar
null location varchar
null user_agent varchar

Table: user

Contains data on the Sparkify platform user.

Type Column Type
PK user_id int
null first_name varchar
null last_name varchar
null gender varchar
null level varchar

Table: song

Contains song data.

Type Column Type
PK song_id varchar
null title varchar
FK artist_id varchar
null year int
null duration decimal

Table: artist

Contains artist data.

Type Column Type
PK artist_id varchar
null name varchar
null location varchar
null latitude varchar
null longitude varchar

Table: time

Contains timestamp data from song listens as well as other time related.

Type Column Type
PK start_time timestamp
null hour int
null day int
null week int
null month int
null year int
null weekday int

ETL Process

song_data => insert song data into song table song_data => insert artist data into artist data

log_data => extract timestamp and time information from songplay, insert into time table => find artist_id and song_id from artist and song tables => insert artist, song, and songplay data into songplay table

Files

data (folder)

  • log_data
    • json records of Sparkify song plays
  • song_data
    • json records of Sparkify song and artist information

create_tables.py

Python module that drops tables if they exist and creates them in the database using queries from sql_queries.

etl.py

Python module that moves songplay, song and artist data from the song_data and log_data folders and inserts data into the songplay table.

etl.ipynb

Notebook for developing etl.py module.

sql_queries.py

Python module that contains all SQL queries for table creation and inserts.

test.ipynb

Notebook for testing etl.py module.

About

Data Engineering project. Data modeling and ETL pipeline development for a small postrgres database. (Udacity Data Engineering Nanodegree program project)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published