This repository hosts the Last.fm Dataset - 1K users under the same license and terms as in the official README, copied for convenience below.
The dataset has been preprocessed and hosted for easier use with the standard PyData set of tools. In addition, a smaller, developer-friendly subset is hosted for education, quality assurance tasks, and experimentation. See the releases section of this repository to download.
Release Files
- userid-timestamp-artid-artname-traid-traname.tsv.zip (original ~1,000-user event-level dataset from lastfm-dataset-1K.tar.gz)
- userid-profile.tsv.zip (original ~1,000-user profile dataset from lastfm-dataset-1K.tar.gz)
- README.txt (original README from lastfm-dataset-1K.tar.gz, see below as well)
- lastfm-dataset-1k.snappy.parquet (processed userid-timestamp-artid-artname-traid-traname.tsv.zip)
- lastfm-dataset-50.snappy.parquet (processed userid-timestamp-artid-artname-traid-traname.tsv.zip with 50 users sampled)
The preprocessing done in the preprocessing.ipynb notebook consisted of the following steps.
- Load userid-timestamp-artid-artname-traid-traname.tsv.zip as a pandas DataFrame
- Remove malformed rows
- Convert timestamp string to a proper UTC datetime object
- Sort records by user_id and timestamp
- Save original and sampled dataset as single snappy compressed parquet files.
The column headers were renamed to be user_id, timestamp, artist_id, artist_name, track_id, track_name.
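The steps above can be approximated with a few lines of pandas. This is a minimal sketch, not the contents of preprocessing.ipynb: the function names `preprocess` and `sample_users`, the rule used to decide which rows count as malformed, and the random sampling of users are all assumptions made for illustration. A tiny in-memory sample stands in for the real file.

```python
import csv
import io

import pandas as pd

COLUMNS = ["user_id", "timestamp", "artist_id", "artist_name", "track_id", "track_name"]


def preprocess(source):
    """Load the raw event TSV (path or file-like object) and clean it."""
    df = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=COLUMNS,
        dtype=str,
        quoting=csv.QUOTE_NONE,  # artist and track names may contain quote characters
        on_bad_lines="skip",     # skip lines the parser rejects
    )
    # Unparseable timestamps become NaT so they can be dropped below.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True, errors="coerce")
    # Assumption: "malformed" means no user id or no usable timestamp.
    df = df.dropna(subset=["user_id", "timestamp"])
    return df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)


def sample_users(df, n_users=50, seed=0):
    """Keep all events of n_users randomly chosen users (hypothetical rule)."""
    users = pd.Series(df["user_id"].unique()).sample(n=n_users, random_state=seed)
    return df[df["user_id"].isin(users)]


# Tiny in-memory stand-in for the real TSV; the last row has a broken timestamp.
raw = io.StringIO(
    "user_000002\t2009-04-08T01:57:47Z\t\tArtist A\t\tSong 1\n"
    "user_000002\t2009-04-08T01:53:56Z\t\tArtist A\t\tSong 2\n"
    "user_000001\t2009-04-07T10:00:00Z\t\tArtist B\t\tSong 3\n"
    "user_000001\tnot-a-timestamp\t\tArtist B\t\tSong 4\n"
)
events = preprocess(raw)
print(events[["user_id", "timestamp", "track_name"]])
```

Passing the path of the zipped TSV instead of the in-memory sample should also work, since `read_csv` infers compression from the file extension when the archive holds a single file. The cleaned frame could then be written with `events.to_parquet(path, compression="snappy")`, which requires a parquet engine such as pyarrow to be installed.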
Version 1.0, May 2010
This dataset contains (user, timestamp, artist, song) tuples collected from the Last.fm API, using the user.getRecentTracks() method.
This dataset represents the complete listening history (up to May 5th, 2009) of nearly 1,000 users.
Filename | MD5 |
---|---|
userid-timestamp-artid-artname-traid-traname.tsv | 64747b21563e3d2aa95751e0ddc46b68 |
userid-profile.tsv | c53608b6b445db201098c1489ea497df |
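To check an extracted file against the checksums in the table, something like the following sketch can be used. It relies only on the standard library; the helper names `md5_of_file` and `verify` are illustrative, and the local path passed in is up to the caller.

```python
import hashlib

# Checksums copied from the table above.
EXPECTED_MD5 = {
    "userid-timestamp-artid-artname-traid-traname.tsv": "64747b21563e3d2aa95751e0ddc46b68",
    "userid-profile.tsv": "c53608b6b445db201098c1489ea497df",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so the large event file never sits in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Return True if the file at `path` matches the published MD5 for `filename`."""
    return md5_of_file(path) == EXPECTED_MD5[filename]
```

For example, `verify("userid-profile.tsv", "userid-profile.tsv")` would be expected to return True for an intact copy of the extracted profile file. Note that the checksums apply to the extracted .tsv files, not to the .zip archives.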
Element | Statistic |
---|---|
Total Lines | 19,150,868 |
Unique Users | 992 |
Artists with MBID | 107,528 |
Artists without MBID | 69,420 |
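Statistics of this kind can be recomputed from the event records in a single pass. The sketch below is an illustration only: the function name `summarize` is hypothetical, and the counting rule (artists with an MBID counted by distinct MBID, artists without one counted by distinct name) is an assumption that may not match how the figures in the table were produced.

```python
def summarize(events):
    """Count lines, users, and artists from (user_id, artist_id, artist_name) tuples."""
    users = set()
    artists_with_mbid = set()
    artists_without_mbid = set()
    total_lines = 0
    for user_id, artist_id, artist_name in events:
        total_lines += 1
        users.add(user_id)
        if artist_id:
            artists_with_mbid.add(artist_id)
        else:
            # No MusicBrainz id available, so fall back to the artist name.
            artists_without_mbid.add(artist_name)
    return {
        "total_lines": total_lines,
        "unique_users": len(users),
        "artists_with_mbid": len(artists_with_mbid),
        "artists_without_mbid": len(artists_without_mbid),
    }
```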
The data is formatted one entry per line as follows (tab separated, \t):
userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name
userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again?
user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead
...
user_000639 \t m \t Mexico \t Apr 27, 2005
...
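The two line formats above can be parsed with the standard library alone. This is a minimal sketch under a few assumptions: the function names are hypothetical, empty fields are mapped to None, the signup date is assumed to follow the "Apr 27, 2005" pattern shown in the example, and profile lines are assumed to carry all five tab-separated fields even when some are empty.

```python
from datetime import datetime, timezone


def parse_event(line):
    """Parse one line of the event file into a dict."""
    user_id, timestamp, artist_id, artist_name, track_id, track_name = (
        line.rstrip("\n").split("\t")
    )
    return {
        "user_id": user_id,
        # Timestamps end in "Z", so they are interpreted as UTC.
        "timestamp": datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        ),
        "artist_id": artist_id or None,  # MusicBrainz id may be empty
        "artist_name": artist_name,
        "track_id": track_id or None,    # MusicBrainz id may be empty
        "track_name": track_name,
    }


def parse_profile(line):
    """Parse one line of the profile file into a dict."""
    user_id, gender, age, country, signup = line.rstrip("\n").split("\t")
    return {
        "user_id": user_id,
        "gender": gender or None,
        "age": int(age) if age else None,
        "country": country or None,
        # Month abbreviations are parsed using the current (default C) locale.
        "signup": datetime.strptime(signup, "%b %d, %Y").date() if signup else None,
    }
```

Splitting on the tab character directly, rather than going through a CSV reader with quoting enabled, avoids misreading quote characters that appear in artist or track names.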
The data contained in lastfm-dataset-1K.tar.gz is distributed with permission of Last.fm.
The data is made available for non-commercial use.
Those interested in using the data or web services in a commercial context should contact partners [at] last [dot] fm.
For more information see Last.fm terms of service.
Thanks to Last.fm for providing access to this data via their web services.
Special thanks to Norman Casagrande.
When using this dataset you must reference the Last.fm webpage.
Optionally (not mandatory at all!), you can cite Chapter 3 of this book:
@book{Celma:Springer2010,
author = {Celma, O.},
title = {{Music Recommendation and Discovery in the Long Tail}},
publisher = {Springer},
year = {2010}
}
This data was collected by Òscar Celma @ MTG/UPF