Official Repository for EvalRS @ KDD 2023, the Second Edition of the workshop on well-rounded evaluation of recommender systems.
The repository is stable and ready to be used for tutorials, educational purposes, research, benchmarks of any kind, hackathons etc.
If you are reading this after the event on Aug. 7th, 2023, do not stop scrolling: all the materials are available to the community and always will be!
We hosted a social event after the workshop for networking, awards and fun times with the community (some pictures from the EvalRS day can be found here).
This is the official repository for EvalRS @ KDD 2023: a Well-Rounded Evaluation of Recommender Systems.
Aside from papers and talks, we will host a hackathon, where participants will pursue innovative projects for the rounded evaluation of recommender systems. The aim of the hackathon is to evaluate recommender systems across a set of important dimensions (accuracy being one of them) through a principled and re-usable set of abstractions, as provided by RecList 🚀.
At the end of the workshop, we will sponsor a social event for teams to finalize their projects, mingle with like-minded practitioners, and receive the monetary prizes for the best papers and projects.
This repository provides an open dataset and all the tools necessary to participate in the hackathon: everything will go back to the community as open-source contributions. Please refer to the appropriate sections below to know how to get the dataset and run the evaluation loop properly.
Check the EvalRS website for the official timeline.
This hackathon is based on the LFM-1b Dataset, Corpus of Music Listening Events for Music Recommendation. The use case is a typical user-item recommendation scenario: at prediction time, we get a set of users and, for each user, our model recommends a set of songs to listen to, based on historical data on previous music consumption.
We picked LFM as it suits the spirit and the goal of this event: in particular, thanks to rich meta-data on users, the dataset allows us to test recommender systems along many non-obvious dimensions, on top of standard Information Retrieval metrics (for the philosophy behind behavioral testing, please refer to the original RecList paper).
Importantly, the dataset of this workshop is a new, augmented version of the one used last year at CIKM: to provide richer item meta-data, we extended the LFM-1b dataset with content-based features and user-provided labels from the WASABI dataset (see below).
When you run the evaluation loop below, the code will automatically download a chosen subset of the LFM dataset, ready to be used (the code will download it only the first time you run it). There are three main objects available from the provided evaluation class:
Users: a collection of users and available meta-data, including patterns of consumption, demographics etc. In the Data Challenge scenario, the user Id is the query item for the model, which is asked to recommend songs to the user.
Tracks: a collection of tracks and available meta-data. In the Data Challenge scenario, tracks are the target items for the model, i.e. the collection to choose from when the model needs to provide recommendations.
Historical Interactions: a collection of interactions between users and tracks, that is, listening events, which should be used by your model to build the recommender system for the Data Challenge.
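For a quick, hands-on feel of how the three objects fit together, here is a minimal pandas sketch. The dataframes below are toy stand-ins and the column names are assumptions for illustration only: the real objects (and their schema) are exposed by the evaluation class in this repository.

```python
import pandas as pd

# Toy stand-ins for the three objects: in practice they are returned by the
# evaluation class after the dataset download (column names here are assumptions).
users = pd.DataFrame({"user_id": [1, 2], "country": ["IT", "US"]})
tracks = pd.DataFrame({"track_id": [10, 11, 12], "artist": ["A", "B", "C"]})
interactions = pd.DataFrame(
    {"user_id": [1, 1, 2], "track_id": [10, 11, 12], "timestamp": [1000, 1001, 1002]}
)

# Typical exploration: how many listening events per user...
events_per_user = interactions.groupby("user_id").size()
print(events_per_user)

# ...and interactions enriched with track and user meta-data via joins.
enriched = interactions.merge(tracks, on="track_id", how="left").merge(users, on="user_id")
print(enriched.head())
```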
To enrich track-related metadata, four additional objects are provided, holding features derived from the WASABI dataset:
Social and Emotion Tags: a collection of social tags and emotion tags collected on last.fm, together with a weight that expresses how much they have been used for a given song.
Topics: a collection of 60-dimensional sparse descriptors representing the topic distribution of an LDA topic model trained on English lyrics (the model is available here).
Song Embeddings: 768-dimensional SentenceBERT embeddings of song lyrics, computed with the all-mpnet-base-v2 pretrained model. For each track with available lyrics (47% of the total number of unique songs), both an embedding computed on the full song and the concatenation of embeddings computed on individual verses are available.
NOTE that verse embeddings are quite large (~35GB), so they are stored as multiple parquet files, split by the initial letter of the band name (see an example of how to load the embeddings here). If you want to use them in your model, you can download them manually from the following links: [ 3 5 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ].
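Since the shards are split by the initial letter of the band name, one possible pattern is to download only the letters you need and stack them with pandas. This is just a sketch: the local file names below are placeholders for the files downloaded from the links above, and the actual schema is documented in the linked example.

```python
import pandas as pd

# Placeholder paths: one parquet shard per initial letter of the band name,
# downloaded manually from the links above (actual file names may differ).
shard_paths = ["verse_embeddings_A.parquet", "verse_embeddings_B.parquet"]

# Load only the shards you actually need and stack them into one DataFrame,
# instead of pulling the full ~35GB of verse embeddings into memory at once.
verse_embeddings = pd.concat(
    (pd.read_parquet(path) for path in shard_paths), ignore_index=True
)
print(verse_embeddings.shape)
```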
For in-depth explanations of the code and the template scripts, see the instructions below and check the provided examples and tutorials in the `notebooks` folder. For instance, the EDA notebook showcases some important features of the dataset and provides a starting point for exploring the problem - e.g. the picture below shows music consumption by hour of day:
For information on how the original datasets were built and what meta-data are available, please refer to these papers: LFM-1b, WASABI.
You can refer to our colab notebooks to start playing with the dataset and understand how to run a first, very simple model, with RecList.
Download the repo and set up a virtual environment. NOTE: the code has been developed and tested with Python 3.9; please use the same version for reproducibility.
```bash
git clone https://github.com/RecList/evalRS-KDD-2023
cd evalRS-KDD-2023
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Now you can run the provided sample script to test the evaluation loop with random predictions (note that you can use the `example_model` notebook if you prefer a notebook interface):
```bash
cd evaluation
python eval.py
```
Now that the loop is set up correctly, it is time to test a real model!
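Conceptually, a model for this challenge is just an object that, given a list of user ids, returns a ranked list of track ids per user. The sketch below shows that general shape with a random baseline; the exact interface expected by the evaluation loop is defined in `eval.py` and the `example_model` notebook, so treat the method and column names here as assumptions.

```python
import random
import pandas as pd

class MyRandomModel:
    """A random baseline, for illustration only: the real interface expected by
    the evaluation loop lives in this repo's evaluation code (names are assumptions)."""

    def __init__(self, track_ids, k: int = 100):
        self.track_ids = list(track_ids)
        self.k = k  # number of recommendations per user

    def predict(self, user_ids) -> pd.DataFrame:
        # One row per user, k columns holding the recommended track ids.
        rows = [random.sample(self.track_ids, self.k) for _ in user_ids]
        preds = pd.DataFrame(rows, index=user_ids)
        preds.index.name = "user_id"
        return preds
```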
If you wish to use Comet or Neptune to automatically track the results of running RecList, you can do so by passing an additional parameter (`comet` or `neptune`) to our sample script:

```bash
python eval.py comet
python eval.py neptune
```
At loading time, the script will load environment variables from a local `.env` file and use them automatically to configure remote logging. You can create your own `.env` file starting from the provided `local.env` template and filling it with your secrets.
Please make sure the relevant Python packages from your tracking provider are installed in the environment.
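Under the hood, this typically amounts to reading the `.env` file and exporting its keys into the process environment. A minimal sketch assuming the `python-dotenv` package (the key names below are placeholders; check the `local.env` template for the ones actually used):

```python
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

# Read key/value pairs from the local .env file into the process environment.
load_dotenv(".env")

# Placeholder key names: see the local.env template for the real ones.
comet_api_key = os.environ.get("COMET_API_KEY")
neptune_api_token = os.environ.get("NEPTUNE_API_TOKEN")
```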
We also provide the saved prediction files from the baseline models trained on the EvalRS dataset with the Merlin tutorial.
You can use those prediction files to run the evaluation directly with RecList using this notebook.
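As a schematic example of what can be done with those files outside the notebook (paths, file formats, and column names below are assumptions for illustration), you can load the predictions with pandas and compute a quick hit rate against held-out listening events:

```python
import pandas as pd

# Placeholder files: a predictions file with one row per (user_id, track_id)
# recommendation, and a held-out interactions file with the ground-truth listens.
preds = pd.read_parquet("baseline_predictions.parquet")          # columns: user_id, track_id
ground_truth = pd.read_parquet("held_out_interactions.parquet")  # columns: user_id, track_id

# Hit rate: fraction of test users for whom at least one recommended track
# appears in their held-out listening events.
hits = preds.merge(ground_truth, on=["user_id", "track_id"], how="inner")
hit_rate = hits["user_id"].nunique() / ground_truth["user_id"].nunique()
print(f"hit rate: {hit_rate:.3f}")
```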
We will ask participants to come up with a contribution for the rounded evaluation of recommender systems, based on the dataset and tools available in this repository. Contribution details will be intentionally left open-ended, as we would like participants to engage different angles of the problem on a shared set of resources.
Examples could be operationalizing important notions of robustness, applying and discussing metric definitions from literature, quantifying the trade-off between privacy and accuracy, and so on. The hackathon is a unique opportunity to live and breathe the workshop themes, increase chances of multi-disciplinary collaboration, network and discover related work by peers, and contribute valuable materials back to the community.
- The hackathon will start during the workshop and continue at the social gathering.
- You do not need a paper in the workshop to participate: if you attend in person, you need to register for the KDD workshop; if you're remote (see below), reach out to us directly.
- Remote teams that are not able to join KDD in person can participate in the hackathon if they are willing to operate during the workshop hours: please send a message to claudio dot pomo at poliba dot it and fede at stanford dot edu if you're interested in participating remotely.
- Teams can start working on their project before KDD, provided they also work during the event and engage the other participants during the workshop.
- The only dataset that can be used is the one provided with this repository (you can, of course, augment it if you see fit). Given the open-ended nature of the challenge, we are happy for participants to choose whatever tools they desire: for example, you can bring your own model or use the ones we provide if the point you are making is independent of any modeling choice. Please note that if you focus on offline, code-based evaluation, re-using and extending the provided RecList earns bonus points, as our goal is to progressively standardize testing through a common library.
- The deliverables for each team are two: 1) a public GitHub repository with an open-source license, containing whatever has been used for the project (e.g. code, materials, slides, charts); 2) an elevator pitch (video duration must be less than 3 minutes) explaining the project with any narrative device (e.g. a demo, a demo and some slides, animations): motivation, execution, learnings, and why it is cool.
- Based on the materials submitted and the elevator pitch, the organizers will determine the winners at their sole discretion and award the prizes at the social event the evening of Aug. 7th.
We invite participants to come up with interesting and original projects related to the well-rounded evaluation of recommender systems. As suggestions and inspiration, we list a few themes / possibilities to get teams started:
- Did you publish a paper on RecSys evaluation or did you like one accepted to EvalRS? Can you extend the official RecList to include the new methods and draw new insights about our dataset?
- Did you recently train a new RecSys model and want to compare the new architecture vs an industry standard Merlin model using RecList?
- Are you interested in data visualization / dataset exploration? Can you find where in the user (item) space the Merlin model we provide tends to underperform?
- How much do latent-space metrics, such as "being less wrong", change when the underlying space is built with song2vec vs., for example, content embeddings from lyrics? (See the sketch below for the intuition behind "being less wrong".)
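For intuition on the last point, "being less wrong" (from the RecList paper) rewards misses that land close to the ground-truth item in some embedding space. A minimal sketch with made-up vectors (the metric actually implemented in the provided RecList may differ in its details):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 0.0 means identical direction, 2.0 means opposite direction.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d embeddings: in a real setup these could come from song2vec
# or from the lyrics embeddings described above.
truth = np.array([1.0, 0.0, 0.0])
miss_close = np.array([0.9, 0.1, 0.0])  # wrong prediction, but semantically close
miss_far = np.array([0.0, 0.0, 1.0])    # wrong prediction, and far away

# A "less wrong" model is one whose misses sit closer to the ground truth on average.
print(cosine_distance(truth, miss_close))  # small distance -> less wrong
print(cosine_distance(truth, miss_far))    # large distance -> more wrong
```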
Thanks to our generous sponsors, the following prizes are awarded (at the sole discretion of the committee):
- a winner prize, 2000 USD, for the best hackathon project;
- a runner-up prize, 500 USD, for the second best hackathon project;
- a best paper award prize of 500 USD;
- a best student paper award prize of 500 USD.
After the event, the prizes were awarded as follows:
- the GrubHub team won the prize for the best hackathon project;
- the Hinge + Rubber Ducky Labs team won the prize for the second best hackathon project;
- the best paper award went to the paper by Noble et al.;
- the best student paper award went to the paper by Petr Kasalický et al.
This event focuses on building in the open, and adding lasting artifacts to the community. EvalRS @ KDD 2023 is a collaboration between practitioners from industry and academia, who joined forces to make it happen:
- Federico Bianchi, Stanford
- Patrick John Chia, Coveo
- Jacopo Tagliabue, NYU / Bauplan
- Claudio Pomo, Politecnico di Bari
- Gabriel de Souza P. Moreira, NVIDIA
- Ciro Greco, Bauplan
- Davide Eynard, mozilla.ai
- Fahd Husain, mozilla.ai
- Aaron Gonzales, mozilla.ai
This Hackathon and the related social event are possible thanks to the generous support of these awesome folks. Make sure to add a star to our library and check them out!
The proceedings are available on CEUR.
| Authors | Title |
|---|---|
| Noble et al. | Realistic but Non-Identifiable Synthetic User Data Generation (Best Paper) |
| Malitesta et al. | Disentangling the Performance Puzzle of Multimodal-aware Recommender Systems |
| Kasalický et al. | Bridging Offline-Online Evaluation with a Time-dependent and Popularity Bias-free Offline Metric for Recommenders (Best Student Paper) |
| Selman et al. | Evaluating Recommendation Systems Using the Power of Embeddings |
| Singh et al. | Metric@CustomerN: Evaluating Metrics at a Customer Level in E-Commerce |
If you find the materials from the workshop useful in your work, please cite the original WebConf contribution and the workshop paper.
RecList
```bibtex
@inproceedings{10.1145/3487553.3524215,
  author = {Chia, Patrick John and Tagliabue, Jacopo and Bianchi, Federico and He, Chloe and Ko, Brian},
  title = {Beyond NDCG: Behavioral Testing of Recommender Systems with RecList},
  year = {2022},
  isbn = {9781450391306},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3487553.3524215},
  doi = {10.1145/3487553.3524215},
  pages = {99--104},
  numpages = {6},
  keywords = {recommender systems, open source, behavioral testing},
  location = {Virtual Event, Lyon, France},
  series = {WWW '22 Companion}
}
```
EvalRS
```bibtex
@misc{https://doi.org/10.48550/arXiv.2304.07145,
  doi = {10.48550/ARXIV.2304.07145},
  url = {https://arxiv.org/abs/2304.07145},
  author = {Federico Bianchi and Patrick John Chia and Ciro Greco and Claudio Pomo and Gabriel Moreira and Davide Eynard and Fahd Husain and Jacopo Tagliabue},
  title = {EvalRS 2023. Well-Rounded Recommender Systems For Real-World Deployments},
  publisher = {arXiv},
  year = {2023},
  copyright = {Creative Commons Attribution 4.0 International}
}
```