
Music Artist Predictor

General Assembly | Data Science Immersive | Capstone Project


A supervised machine learning model that predicts the music artist of a track from the song's Spotify audio features and Genius lyrics.


Problem Statement

Can we predict a music artist from their songs' Spotify audio features and lyrics?

Goals

  1. Create a multinomial classification model that predicts the music artist of a song, given the song's Spotify audio features and lyrics.
  2. Determine which features contribute most to an artist's identity: is their style driven by the lyrics or by the music itself?
  3. Evaluate the model by testing it on unseen data and producing a classification report, confusion matrix, ROC curves and other evaluation metrics.

Methodology

  1. Select artists from a Spotify Dataset published on Kaggle.

  2. Retrieve lyrics for the artists' songs through the Genius API wrapper lyricsgenius by johnwmillr.

  3. Establish SQL database and merge data.

  4. Clean data of missing values, duplicates and other anomalies.

  5. Perform sentiment analysis and other NLP analyses to derive NLP-based continuous features.

  6. Prepare data for modelling: train-test split (80/20), CountVectorizer/TfidfVectorizer, standardisation.

  7. Select a model by testing different classifiers.

  8. Optimise model.

  9. Evaluate the model with accuracy, precision and recall.

  10. Outline results, limitations and future steps.

Python Libraries

Pandas, NumPy, scikit-learn, textacy, lexical-diversity, vaderSentiment, Matplotlib, Seaborn, Postgres, regex (re)

Data Collection

Data was collected from the Spotify Track Dataset sourced from Kaggle and from the Genius lyrics website using the lyricsgenius wrapper by johnwmillr, which utilises the Genius API and BeautifulSoup web scraping.
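
A minimal sketch of the lyrics retrieval step, assuming a valid Genius API token; the song and artist shown are just examples:

```python
import lyricsgenius

# Assumes a valid Genius API client access token.
genius = lyricsgenius.Genius("YOUR_GENIUS_API_TOKEN")
genius.remove_section_headers = False  # markers like [Chorus] are stripped later with regex

song = genius.search_song("Lonely Boy", "The Black Keys")
if song is not None:
    print(song.lyrics[:200])
```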

Data Preparation

Data Cleaning & Processing

  • Removing duplicate entries

  • Adjusting artist name spellings in the Genius data to match the Spotify spelling

  • Removing entries that credit a featuring artist alongside the primary artist

  • Removing artists which aren't part of the target classes

  • Some of the processing steps involved in feature engineering of the lexical diversity (^), textacy stats (^^) or sentiment analysis (^^^) features, or in the CountVectorization (!) or TfidfVectorization (!!) of the lyrics data:

    1. Removing elements such as [Chorus] or [Intro] added by the Genius website - utilising regex (all; see the sketch after this list)

    2. Conversion to lowercase - utilising Python (!, !!)

    3. Parsing - performed by textacy or CountV/TfidfV (all)

    4. Tokenization - performed by textacy or scikit-learn's CountV and TfidfV classes (all)

    5. Lemmatization - performed by lexical diversity and VADER (^, ^^^)
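
A minimal sketch of cleaning steps 1 and 2 above; the helper name clean_lyrics is hypothetical:

```python
import re

def clean_lyrics(raw: str) -> str:
    """Remove Genius section markers such as [Chorus] or [Intro], then lowercase."""
    text = re.sub(r"\[.*?\]", " ", raw)       # step 1: strip [Chorus], [Intro], [Verse 1], ...
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    return text.lower()                       # step 2: lowercase for CountV/TfidfV
```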


Features Engineered (excl. any CountVectorizer/TfidfVectorizer corpus)

  • Lexical diversity features Type-Token Ratio (TTR) and Measure of Textual Lexical Diversity (MTLD) were engineered with the lexical-diversity library

  • NLP features number of sentences (n_sentences), word count (word_count), character count (character_count), number of syllables (n_syllables), unique word count (unique_word_count), number of long words (n_long_words), number of monosyllable words (n_monosyllable_words) and number of polysyllable words (n_polysyllable_words) were created using the textacy library

  • Sentiment analysis and creation of the features vader_compound, vader_pos, vader_neu, vader_neg, objectivity_score and pos_vs_neg was performed using the vaderSentiment library (see the sketch below).
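
A hedged sketch of the sentiment and lexical diversity feature engineering, assuming the vaderSentiment and lexical-diversity packages; the helper name is illustrative and the column names follow the data dictionary below:

```python
from lexical_diversity import lex_div as ld
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_and_diversity_features(lyrics: str) -> dict:
    scores = analyzer.polarity_scores(lyrics)  # keys: neg, neu, pos, compound
    tokens = ld.tokenize(lyrics)               # unlemmatised tokens (classical TTR, see below)
    return {
        "vader_compound": scores["compound"],
        "vader_pos": scores["pos"],
        "vader_neu": scores["neu"],
        "vader_neg": scores["neg"],
        "objectivity_score": 1.0 - (scores["pos"] + scores["neg"]),
        "pos_vs_neg": scores["pos"] - scores["neg"],
        "TTR": ld.ttr(tokens),
        "MTLD": ld.mtld(tokens),
    }
```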

| Variable | Description | Data Type |
| --- | --- | --- |
| artist_name | The artist or band name of the song. | string object |
| track_name | The title of the song. | string object |
| track_id | The Spotify ID, which at the time the dataset was collected functioned as the URI but might not match current URIs and IDs. | string object |
| popularity | Songs that are played a lot now have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. The value may lag actual popularity by a few days; it is not updated in real time. | int64 |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. | float64 |
| danceability | How suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. 0.0 is least danceable and 1.0 is most danceable. | float64 |
| duration_ms | The duration of the track in milliseconds. | int64 |
| energy | A perceptual measure of intensity and activity from 0.0 to 1.0. Energetic tracks typically feel fast, loud, and noisy: death metal has high energy, while a Bach prelude scores low. Contributing features include dynamic range, perceived loudness, timbre, onset rate, and general entropy. | float64 |
| instrumentalness | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental; rap or spoken word tracks are clearly "vocal". The closer the value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, with higher confidence as the value approaches 1.0. | float64 |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. | string object |
| liveness | Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live; a value above 0.8 provides a strong likelihood that the track is live. | float64 |
| loudness | The overall loudness of a track in decibels (dB), averaged across the entire track; useful for comparing the relative loudness of tracks. Values typically range between -60 and 0 dB. | float64 |
| mode | The modality (major or minor) of a track, i.e. the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0. | string object |
| speechiness | Detects the presence of spoken words. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0. Values above 0.66 describe tracks probably made entirely of spoken words; values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered (e.g. rap); values below 0.33 most likely represent music and other non-speech-like tracks. | float64 |
| tempo | The overall estimated tempo of a track in beats per minute (BPM), derived directly from the average beat duration. | float64 |
| time_signature | An estimated overall time signature (meter): how many beats are in each bar (or measure). | string object |
| valence | A measure from 0.0 to 1.0 of the musical positiveness conveyed by a track. High-valence tracks sound more positive (happy, cheerful, euphoric); low-valence tracks sound more negative (sad, depressed, angry). | float64 |
| release_year | The date the track was released. | string object (to be converted to datetime) |
| spotify_uri | Genius' record of the Spotify URI for a track. | string object |
| lyrics | Lyrics web-scraped from the genius.com website. | string object |
| lyrics_processed | Lyrics processed for NLP analysis. | string object |
| n_sentences | Number of sentences in the track, extracted by textacy. | int64 |
| word_count | Number of words in the track, including numbers and stop words but excluding punctuation, extracted by textacy. | int64 |
| character_count | Character count per track, extracted by textacy. | int64 |
| n_syllables | Number of syllables per track, extracted by textacy. | int64 |
| unique_word_count | Number of unique (lower-cased) words in the track, extracted by textacy. | int64 |
| n_long_words | Number of words in the track with 7 or more characters, extracted by textacy. | int64 |
| n_monosyllable_words | Number of words in the track with exactly 1 syllable. | int64 |
| n_polysyllable_words | Number of words with 3 or more syllables, extracted by textacy. Note: since this excludes words with exactly 2 syllables, n_monosyllable_words + n_polysyllable_words usually does not equal word_count. | int64 |
| vader_compound | The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, then normalised to between -1 (most extreme negative) and +1 (most extreme positive). The most useful metric for a single unidimensional measure of sentiment; calling it a 'normalized, weighted composite score' is accurate. | float64 |
| vader_neg | Proportion of the text that is negative (negative sentiment corresponds to a compound score <= -0.05). The pos, neu, and neg scores are ratios of the text falling in each category and should sum to roughly 1 (up to floating-point error). | float64 |
| vader_neu | Proportion of the text that is neutral (-0.05 < compound score < 0.05). | float64 |
| vader_pos | Proportion of the text that is positive (positive sentiment corresponds to a compound score >= 0.05). | float64 |
| objectivity_score | objectivity = 1 - (vader_pos + vader_neg). | float64 |
| pos_vs_neg | pos_vs_neg = vader_pos - vader_neg. | float64 |
| TTR | Type-token ratio: the total number of unique words (types) divided by the total number of words (tokens) in a given segment of language. Computed the classical way, without lemmatisation, using the lexical-diversity library. | float64 |
| MTLD | Measure of Textual Lexical Diversity, based on McCarthy and Jarvis (2010), computed with the lexical-diversity library. The higher the score, the more lexically complex the text. | float64 |

Exploratory Data Analysis

Please refer to the Exploratory Data Analysis Notebook and my Tableau Public profile for further insights. This is a summarised EDA.

Class imbalance

The dataset on which the original modelling was performed contains 1312 rows and 39 classes (music artists); their representation in the dataset is illustrated here in percentages. The majority class, which determines the baseline of this dataset, is The Black Keys: they occupy 5.793% of the dataset, which translates to a baseline of 0.058. The graph illustrates a class imbalance which might influence the predictability of the different classes. (Graph created with Tableau Desktop)
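
The baseline can be reproduced as the majority-class share; a sketch assuming the merged dataset lives in a pandas DataFrame (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("merged_tracks.csv")  # hypothetical export of the merged dataset
class_shares = df["artist_name"].value_counts(normalize=True)
print(class_shares.head(1))                      # majority class: The Black Keys, ~5.79%
print("baseline:", round(class_shares.max(), 3)) # ~0.058
```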

Correlation

This correlation matrix illustrates the correlation between all continuous variables in the 1312-row dataset (a similar pattern was observed in the 1615-row dataset). As expected, there is a strong correlation within the text statistics and also within the VADER sentiment scores, so high collinearity is expected between those features. It is also interesting to note a slight correlation between some of the music feature variables, such as popularity, danceability or duration_ms, and some of the text statistics, such as word_count or unique_word_count. (Graph created with Tableau Desktop)
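
A minimal sketch of the equivalent seaborn heatmap, reusing df from the sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes("number").corr()  # continuous variables only
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```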

Sentiment Analysis

VaderSentiment Scores filtered on Mean Vader Compound of each Music Artist

The two graphs illustrate the most positive and most negative artists based on their VADER compound scores in the 1312-row dataset (filtered on vader_compound). Positive artists usually have higher average VADER positive scores than negative scores, while negative artists usually exhibit the reverse. VADER neutral scores are always the highest in any text, as the neutral category captures unbiased words such as 'the', 'and' or 'is'. (Graph created with Tableau Desktop)
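
The per-artist ranking behind these graphs can be approximated with a groupby, again assuming df holds the engineered features:

```python
mean_sentiment = df.groupby("artist_name")["vader_compound"].mean().sort_values()
print(mean_sentiment.head())  # most negative artists
print(mean_sentiment.tail())  # most positive artists
```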

Vader compound in Relation to Spotify's Musical Valence

Comparing the mean VADER compound scores of individual artists with their mean valence scores suggests that artists with positive VADER compound scores mostly have high musical valence (mostly above 0.4), while artists with negative VADER compound scores mostly have valence scores below 0.5. As Spotify's valence is only measured from 0 to 1, it is important to note that this data representation is not standardised, which would allow a clearer comparison. (Graph created with Tableau Desktop)

Please see the Exploratory Data Analysis Notebook for full seaborn/matplotlib visualisations or visit my Tableau Public profile.

Model Selection

Baseline: 0.058

  • Modelling was performed on the previously engineered features and the lyrics, which were processed either with scikit-learn's CountVectorizer (CountV), taking word counts, or with TfidfVectorizer (TfidfV), where rare words are weighted more heavily than common words.

  • For both CountV and TfidfV, a corpus of 1000 features with n-grams in a range of 1 to 3 was created. Stop words were removed, but words were not stemmed or lemmatised, due to the use of slang in many tracks.

  • Engineered features and the CountV/TfidfV matrices were standardised using scikit-learn's StandardScaler() class (see the sketch after this list).

  • Modelling on PCA-transformed data was performed but yielded low scores and was therefore excluded from the testing process.
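
A sketch of this preparation, assuming lyrics_processed, the engineered numeric features and artist_name sit in one DataFrame; the stratified split, random_state and scikit-learn's built-in English stop-word list are assumptions:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_text = df["lyrics_processed"]
X_num = df.select_dtypes("number")  # the engineered continuous features
y = df["artist_name"]

X_text_tr, X_text_te, X_num_tr, X_num_te, y_train, y_test = train_test_split(
    X_text, X_num, y, test_size=0.2, stratify=y, random_state=42
)

# Corpus of 1000 features, n-grams of 1 to 3, stop words removed
tfidf = TfidfVectorizer(max_features=1000, ngram_range=(1, 3), stop_words="english")
scaler = StandardScaler(with_mean=False)  # sparse matrices cannot be mean-centred

X_train = scaler.fit_transform(hstack([tfidf.fit_transform(X_text_tr), X_num_tr.to_numpy()]))
X_test = scaler.transform(hstack([tfidf.transform(X_text_te), X_num_te.to_numpy()]))
```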

Preliminary Model Testing

First, only four models (KNN, Logistic Regression, Random Forest, and Support Vector Machines) were tested with GridSearchCV, using only a few parameters per model, on the engineered features and the lyrics processed with either scikit-learn's CountVectorizer or TfidfVectorizer.
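
A hedged sketch of that sweep with illustrative parameter grids (the actual grids live in the project notebooks), reusing X_train and y_train from the preparation sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

candidates = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [5, 15, 25]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "Random Forest": (RandomForestClassifier(random_state=42), {"n_estimators": [100, 300]}),
    "SVM": (SVC(), {"C": [0.1, 1.0, 10.0]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, n_jobs=-1)  # default scoring: accuracy
    search.fit(X_train, y_train)
    print(f"{name}: best CV score = {search.best_score_:.3f}")
```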


Preliminary findings

  • Model testing on the features and CountVectorized lyrics indicated that the Logistic Regression and Random Forest models yielded high cross-validation scores of 0.302 and 0.433, respectively.

  • Model testing on the features and TfidfVectorized lyrics indicated that the Logistic Regression and Random Forest models yielded high cross-validation scores of 0.399 and 0.398, respectively.

Large Scale Model Testing

The large-scale model training was performed with the TfidfV-processed lyrics and the other engineered features (see Data Preparation).

Following a thorough GridSearchCV procedure with multiple classification models (see table below), the Random Forest Classifier was found to perform best out of the 16 models tested.

Model Optimisation

  • Original CV score: 0.433
  • Each optimisation step was run with another GridSearchCV in case the preferred hyperparameters shifted (see the sketch below).
  1. Changed class weight to balanced - CV score: 0.455
  2. Reduced the features to the 500 most important ones - CV score: 0.459
  3. Enriched the data from 1312 rows to 1615 rows (baseline: 0.0724, balanced class weight, and features reduced to the top 200) - CV score: 0.508
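
A sketch of steps 1 to 3, assuming the matrices from the preparation sketch; the importance-based cut-off mirrors the reduction to the top 200 features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Step 1: rebalance classes via the class_weight flag
rf = RandomForestClassifier(class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)

# Steps 2-3: keep only the top-k features by impurity-based importance
top_k = np.argsort(rf.feature_importances_)[::-1][:200]
cv_score = cross_val_score(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    X_train[:, top_k], y_train, cv=5,
).mean()
print(f"CV score on reduced features: {cv_score:.3f}")
```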

Results

GridSearchCV: Random Forest

  • Achieved an optimal cross-validation score of 0.433.

  • Feature importance showed that popularity, loudness, duration_ms, acousticness, unique_word_count and energy, among others, had the highest impact on artist prediction.


Optimization of Random Forest Model:

  • Achieved optimal cross-validation score of 0.508.

  • Feature importance indicated popularity, loudness, acousticness and unique_word_count (etc.) to have the highest impact on artist prediction.

  • It is interesting to note that the feature types influencing class prediction seem to rank in the order: music features, lexical diversity, text statistics, sentiment, and then frequency of specific words.


Evaluation:

  • Accuracy score indicates that the model is able to predict the correct class in about 50% of the cases.

  • Macro precision score indicates the ratio of correctly predicted positive observations to the total predicted positive observations. About 54% of the positively predicted artists are correctly predicted.

  • Macro recall/sensitivity score outlines that 45% of true positive observations were detected as positive; the model fails to detect the remaining 55%.

  • The F1 score, the harmonic mean of precision and recall, was determined to be 45%.

  • Classes with a higher representation achieve higher accuracy, precision and recall scores. This is a strong indication that the model has a bias.

  • To evaluate the predictive performance of the random forest model, ROC curves were plotted for each class, along with the model's micro- and macro-average ROC curves. The areas under the curves (AUCs) illustrated that the model has a strong ability to distinguish between the different classes.

  • Precision-recall curves plot the positive predictive value (PPV, y-axis) against the true positive rate (TPR, x-axis). Plotting the micro-average precision-recall curve of this multiclass model produced an AUC of 0.54, suggesting that the model can only achieve high recall at low precision.
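
A sketch of these evaluation steps, assuming best_rf is the optimised Random Forest and X_test/y_test come from the preparation sketch:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = best_rf.predict(X_test)
print(classification_report(y_test, y_pred))  # per-class precision, recall and F1
print(confusion_matrix(y_test, y_pred))

# One-vs-rest ROC AUC from predicted class probabilities
# (assumes every trained class also appears in the test split)
proba = best_rf.predict_proba(X_test)
print("macro ROC AUC:", roc_auc_score(y_test, proba, multi_class="ovr", average="macro"))
```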



Limitations:

  • The Kaggle dataset pre-determines the choice of songs included per artist

  • The Genius servers time out frequently, slowing lyrics collection

  • Small dataset

  • Some artists have only a few songs

  • The model can only be applied to the artists it was trained on, not to unseen artists

  • Slight class imbalance favoured majority-class predictions

  • Risk of overfitting on the training data


Random Forest Classifier: Pros and Cons

Pros

  • Decorrelates trees (relative to bagged trees), which is especially useful when features are highly correlated
  • Reduced variance in comparison to regular decision tree
  • Has the ability to address class imbalance by using the balanced class weight flag
  • Scales to large datasets

Cons

  • Not as easy to visually interpret
  • Long computation time when used in GridSearch
  • Tends to overfit the training data, despite often being claimed not to be susceptible to this

Future Steps

  • Model NLP and numeric features separately and later join them together
  • Collect more data to increase statistical significance, enrich dataset and produce a stronger model
  • Perform statistical testing on features before modelling to exclude insignificant features (e.g. with statsmodels)
