A project carried out by students of the course "Data Mining: Advanced topics and applications" at the University of Pisa, Department of Computer Science.
For the instructions we were given, please refer to the project guidelines.
Our gratitude goes to the many people who explored this dataset before us and to its creators: Michael Defferrard, Kirell Benzi, Pierre Vandergheynst and Xavier Bresson. We use the music and the metadata under the MIT and CC BY 4.0 licenses respectively. Please refer to the dataset's repository for more information.
This project, its code and its report are authored by Marianna Abbattista, Fabio Michele Russo and Saverio Telera, students of Data Science.
There are two interfaces to the data: `music.py` (for working with time series of waveforms) and `utils.py` (for working with song metadata). `music.py` defines a class, `MusicDB()`, while `utils.py` defines several functions. We use these to ensure smooth, efficient loading and consistent preprocessing. If you want to explore the data yourself with our loading and preprocessing techniques, all you need is either `music.py` or `utils.py`, through the objects and functions described here.
```python
from music import MusicDB

musi = MusicDB()
print(musi.df.info())
```
`music.py` provides three objects, all contained in the class `MusicDB()`:

- `df`: a `pandas.DataFrame` whose number of features is determined by our main song (data/music/000/000002.mp3) and which has one row per song in the small dataset (8000), minus 3 songs dropped for performance reasons. Its final shape is therefore (7997, 2699).
- `feat`: a dataframe of metadata on each time series. It holds only the genre (textual form) and the encoded genre (numeric integer form), but any other interesting metadata on our time series can be added as a column.
- `sax`: a dataframe of SAX-ed time series, with our best parameters of 130 segments and 20 symbols.
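To illustrate what a SAX representation looks like, here is a rough, self-contained sketch of one simple variant of the idea (piecewise segment means followed by equal-frequency binning, on toy sine data). The helper name and the binning strategy are ours for illustration; the project's actual transform and its 130-segment / 20-symbol parameters live in `music.py`:

```python
import numpy as np
import pandas as pd

def sax_sketch(series, n_segments, n_symbols):
    # PAA step: split the series into n_segments chunks and keep each mean
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    paa = np.array([seg.mean() for seg in segments])
    # z-normalise, then cut into (roughly) equal-frequency symbol bins
    z = (paa - paa.mean()) / paa.std()
    return pd.qcut(z, q=n_symbols, labels=False, duplicates="drop")

# toy waveform: a sine the same length as one of our series
symbols = sax_sketch(np.sin(np.linspace(0, 20, 2699)), n_segments=10, n_symbols=4)
print(len(symbols), symbols.min(), symbols.max())
```

Each series is thus reduced from thousands of samples to a short string of symbol indexes, which is what makes distance computations over many songs tractable.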
```python
import utils
```

`utils.py` provides three functions: `load_tracks_xyz()`, `load_tracks()`, `load_small_tracks()`.
```python
utils.load_tracks_xyz(
    filepath="data/tracks.csv",
    splits=3,
    buckets="basic",
    dummies=True,
    fill=True,
    outliers=True,
    extractclass=None,
    small=False,
)
```
Parameter usage is the same as `utils.load_tracks()`, below, except for these differences:
Returns a dict of three `pd.DataFrame` built from tracks.csv; the dict keys are "train", "vali", "test".

If `extractclass=some_column`, the function returns a dict of six items with keys ["train_x", "train_y", "vali_x", "vali_y", "test_x", "test_y"]. Each of the three `_x` versions is a `pd.DataFrame` containing all the attributes except some_column; each of the three `_y` versions is a `pd.Series` containing just some_column. The correct row indexes are retained in all.
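The `_x`/`_y` split can be pictured with a toy frame (hypothetical column names, not tracks.csv): the target column becomes a `pd.Series`, the rest stays a `pd.DataFrame`, and both keep the original row index.

```python
import pandas as pd

# toy stand-in for one of the splits
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [0, 1, 0]},
                  index=[10, 20, 30])

y = df["target"]               # pd.Series: just the extracted column
X = df.drop(columns="target")  # pd.DataFrame: all remaining attributes
print(X.index.equals(y.index))  # row indexes line up across _x and _y
```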
Returns a dict of two `pd.DataFrame` from tracks.csv, with keys ["train", "test"].

If `extractclass=some_column`, returns a dict with keys ["train_x", "train_y", "test_x", "test_y"].
Same usage as above, but returns only the "small" dataset with 10 features + (album, type).
```python
import utils

dfs = utils.load_tracks_xyz()
print(dfs['train'].info())
print(dfs['vali'].info())
print(dfs['test'].info())

train = dfs['train']
print(train.info())

dfs = utils.load_tracks_xyz(extractclass=("track", "listens"))
print(dfs['train_x'].info())
print(dfs['train_y'].describe())  # Series, contains only ("track", "listens")
```
```python
utils.load_tracks(
    filepath="data/tracks.csv",
    buckets="basic",
    dummies=True,
    fill=True,
    outliers=True,
)
```
- `filepath`: should only be changed if you put your files inside subfolders.
- `buckets` ("basic", "continuous", "discrete"): "basic" is the discretization we use for everything; "discrete" is only for methods that prefer discrete attributes (like decision trees); "continuous" is only for methods that prefer continuous attributes (like KNN).
- `dummies`: makes dummies out of columns with coverage <10%, plus a few more special cases hard-coded in `utils.dummy_maker()`.
- `fill`: will fill missing values, but so far it only deletes rows that contain outliers.
- `outliers`: if True, it must be used with `fill=True`; removes the outliers determined by abod1072.csv.
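The effect of the `buckets` and `dummies` options can be sketched with plain pandas (toy values, toy bin edges and a hypothetical column name; the project's real bucket edges and the <10% coverage rule live in `utils.py`):

```python
import pandas as pd

# discretization in the spirit of buckets= : numeric values -> coarse bins
listens = pd.Series([10, 500, 20_000, 3])
coarse = pd.cut(listens, bins=[0, 100, 1000, float("inf")],
                labels=["low", "mid", "high"])
print(list(coarse))  # ['low', 'mid', 'high', 'low']

# dummy coding in the spirit of dummies= : one 0/1 column per category
genre = pd.Series(["rock", "jazz", "rock"], name="genre")
print(list(pd.get_dummies(genre, prefix="genre").columns))  # ['genre_jazz', 'genre_rock']
```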
Returns a single `pd.DataFrame`.
```python
utils.load_small_tracks(
    filepath="data/tracks.csv",
    buckets="basic",
    dummies=True,
    fill=True,
    outliers=True,
)
```
Exact same usage as `utils.load_tracks()`, but returns only the 10 features + (album, type) we selected.
| Module | Filename |
|---|---|
| 1 Starting classification | basic_method_type.py |
| 1 Anomaly detection | outliers_method_type.py |
| 1 Imbalanced learning | imbalance_method_type.py |
| 2 Advanced Classification | advcl_method_type.py |
| 2 Regression | regression_method_type.py |
| 3 Time Series | ts_method_type.py |
(_type is optional for all files)
Example:
- imbalance_KNN_over.py
- advcl_SVM_linear.py
- regression_linear.py
Formatting conventions for the report:

- italics only for column names
- bold for other things that should stand out, NOT column names
- ++comments for the three of us, enclosed between two plus signs++
- *album* in italics only, without apostrophes ' or quotation marks "
- (artist, album): double-index column, always in italics, enclosed in parentheses
- criterion=gini in bold without quotes, or discursively: "for this we used gini and …"
- use this: … which is different from three single dots ...