Sparkify offers a music streaming service through desktop and hand-held devices.
To enable Sparkify to analyse data collected though their music streaming applications, Sparkify wish to establish
a relational analytical database to gain insight from the songs played by their user-base.
- [ data ] Directory structure with song and log file data, as JSON format.
- [ create_tables.py ] (Python 3 script): Creates sparkify database and necessary database tables.
- [ etl.py ] (Python 3 script): Main data processing script; Song and log file Extract, Transform and Load functions.
- [ sql_queries.py ] (Python 3 script): Table creation and insert SQL statements used by etl.py
- [ etl.ipynb ] (Jupyter notebook): ETL development
- [ test.ipynb ] (Jupyter notebook): Database querying statements after ETL
(ensure a local PostgreSQL database server is running and configured with the following default database;
dbname=studentdb, user=student, password=student)
-
Download project data and Python scripts, as listed above, to a local directory.
-
Open your system CLI and change directory to where the project files are saved.
C:\users\username>cd C:\users\username\path\to\project
-
Run first Python script to create the sparkify database and table schema... create_tables.py;
C:\users\username>cd C:\users\username\path\to\project>python3 create_tables.py
-
Run second python script to process the JSON files and populate database tables... etl.py;
C:\users\username>cd C:\users\username\path\to\project>python3 etl.py
2No. datasets are available for ingest to the Sparkify analytical database, which are required to carry out relevant
song play analysis.
Song data resides in JSON format, with each file containing metadata about a specific song, and the song's artist.
Within Sparkify's file storage, song files are partitioned by the first three letters of each song's track ID.
Filepath example...
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
TRAABJL12903CDCF1A.json song file content...
{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
User activity logs, collected via the Sparkify music streaming applications, also resides in JSON format.
Each file represents a single day and contains information about each user and their session details for that day.
Within Sparkify's file storage, log files are partitioned by the month and year.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
2018-11-12-events.json log file content...
{
"artist":null,
"auth":"Logged In",
"firstName":"Celeste",
"gender":"F",
"itemInSession":0,
"lastName":"Williams",
"length":null,
"level":"free",
"location":"Klamath Falls, OR",
"method":"GET",
"page":"Home",
"registration":1541077528796.0,
"sessionId":438,
"song":null,
"status":200,
"ts":1541990217796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64)<br>
AppleWebKit\/537.36 (KHTML, like Gecko)<br>
Chrome\/37.0.2062.103 Safari\/537.36\"",
"userId":"53"
}
The proposed analytical database shall be optimised for song play analysis and shall be structured around a snowflake
schema design, consisting of a fact and various dimension tables.
Table Name | Description |
---|---|
songplays | Fact Table; Log data associated with song plays |
users | Dimension Table; Registered application users |
songs | Dimension Table; Songs in music database |
artists | Dimension Table; Artists in music database |
time | Dimension Table; Timestamps of songplays records, broken down into specific units |
Column name | Data type | Column description |
---|---|---|
songplay_id | SERIAL | PRIMARY KEY |
start_time | TIMESTAMP | NOT NULL |
user_id | VARCHAR | NOT NULL |
level | VARCHAR | NOT NULL |
song_id | VARCHAR | |
artist_id | VARCHAR | |
session_id | INT | NOT NULL |
location | VARCHAR | |
user_agent | VARCHAR |
Sample...
songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
---|---|---|---|---|---|---|---|---|
5449 | 2018-11-21 21:56:47.796000 | 15 | paid | SOZCTXZ12AB0182364 | AR5KOSW1187FB35FF4 | 818 | Chicago-Naperville-Elgin, IL-IN-WI | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36" |
........................................................................................................................
Column name | Data type | Column description |
---|---|---|
user_id | INT | PRIMARY KEY |
first_name | VARCHAR | NOT NULL |
last_name | VARCHAR | NOT NULL |
gender | VARCHAR | |
level | VARCHAR |
Sample...
user_id | first_name | last_name | gender | level |
---|---|---|---|---|
15 | Lily | Koch | F | paid |
........................................................................................................................
Column name | Data type | Column description |
---|---|---|
song_id | VARCHAR | PRIMARY KEY |
title | VARCHAR | NOT NULL |
artist_id | VARCHAR | |
year | VARCHAR | |
duration | FLOAT |
Sample...
song_id | title | artist_id | year | duration |
---|---|---|---|---|
SOZCTXZ12AB0182364 | Setanta matins | AR5KOSW1187FB35FF4 | 0 | 269.58322 |
........................................................................................................................
Column name | Data type | Column description |
---|---|---|
artist_id | VARCHAR | PRIMARY KEY |
name | VARCHAR | NOT NULL |
location | VARCHAR | |
latitude | VARCHAR | |
longitude | VARCHAR |
Sample...
artist_id | name | location | latitude | longitude |
---|---|---|---|---|
AR5KOSW1187FB35FF4 | Elena | Dubai UAE | 49.80388 | 15.47491 |
........................................................................................................................
Column name | Data type | Column description |
---|---|---|
start_time | TIMESTAMP | NOT NULL |
hour | INT | NOT NULL |
day | INT | NOT NULL |
week | INT | NOT NULL |
month | INT | NOT NULL |
year | INT | NOT NULL |
weekday | INT | NOT NULL |
Sample...
start_time | hour | day | week | month | year | weekday |
---|---|---|---|---|---|---|
2018-11-21 21:56:47.796000 | 21 | 21 | 47 | 11 | 2018 | 2 |
........................................................................................................................