Project page: https://delftdata.github.io/valentine/

This is the main repository containing the framework used in the paper *Valentine: Evaluating Matching Techniques for Dataset Discovery*. The data generator and the framework's output and visualizations live in the following repositories:

- Data generator: valentine-generator
- Paper results and visualizations: valentine-paper-results

The datasets used for the experiments in Valentine can be found in the datasets-archive.
## Installation

The following instructions have been tested on a newly created Ubuntu 18.04 LTS VM. If you prefer to run the entire suite with Docker, skip this section and the Run experiments section and go directly to the Run with docker section.
- Clone the repository to your machine:

  ```
  git clone https://github.com/delftdata/valentine-suite
  ```

- To install all the dependencies required by the suite, run the `install-dependencies.sh` script.

  NOTE: This script installs programs and therefore requires `sudo` rights in some parts.

After these two steps, the framework should not require anything more in terms of dependencies.
## Run experiments

- Download the data from the datasets-archive and put it into a folder called `data` at the project root level.
- Set the grid-search configuration that you want to run for all the algorithms in the file `algorithm_configurations.json` (see the illustrative configuration sketch after this list).
- Activate the conda environment created in the installation phase:

  ```
  conda activate valentine-suite
  ```

  Then run the `generate_configuration_files.py` script:

  ```
  python generate_configuration_files.py
  ```

  This will create all the configuration files that specify a schema matching job (i.e., run a specific method with specific parameters on a specific dataset).

  NOTE: If your system does not find conda, you might need to run `source ~/.bashrc` first.
- To run the schema matching jobs in parallel, run the `run_experiments.sh` script:

  ```
  ./run_experiments.sh {method_name} {number_of_parallel_jobs}
  ```

  e.g., to run 40 Cupid jobs concurrently, run `./run_experiments.sh Cupid 40` (this requires a 40-CPU VM to run smoothly). The output will be written to the `output` folder at the project root level.
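For reference, a grid-search configuration conceptually maps each algorithm to lists of parameter values to sweep. The snippet below is a purely illustrative sketch of such a file: the structure is an assumption, and the parameter names (`leaf_w_struct`, `th_accept`, `coeff_policy`, ...) are borrowed from the respective algorithm implementations, so consult the actual `algorithm_configurations.json` in the repository for the real format.

```json
{
  "Cupid": {
    "leaf_w_struct": [0.2, 0.4],
    "w_struct": [0.4, 0.6],
    "th_accept": [0.5, 0.7]
  },
  "SimilarityFlooding": {
    "coeff_policy": ["inverse_average", "inverse_product"]
  }
}
```

Each combination of parameter values then becomes one schema matching job once `generate_configuration_files.py` expands the grid.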
## Run with docker

The entire suite is also available as a Docker image named `kpsarakis/valentine-suite:1.0`. The steps to run it with Docker are the following:
- Run the following command:

  ```
  sudo docker run --privileged=true -it -v /var/run/docker.sock:/var/run/docker.sock kpsarakis/valentine-suite:1.0
  ```

  This will download the image and start a shell in a container that contains the Valentine suite.
- Activate the conda environment: `conda activate valentine-suite`

- Go into the folder of the suite: `cd /home/valentine-benchmark`
- Now you can run the suite with the data used in the paper *Valentine: Evaluating Matching Techniques for Dataset Discovery*:

  ```
  ./run_experiments.sh {method_name} {number_of_parallel_jobs}
  ```

  e.g., to run 40 Cupid jobs concurrently, run `./run_experiments.sh Cupid 40` (this requires a 40-CPU VM to run smoothly). The output will be written to the `output` folder at the project root level, i.e. `/home/valentine-benchmark/output`.
## Extending Valentine

Since Valentine is an experiment suite, it is designed to be extended with more schema matching methods. To extend Valentine with such a method, please follow the wiki guide on how to do so.
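To give a feel for the shape of such an extension, here is a minimal sketch of a matcher class. Everything in it (the `MyMatcher` name, the `get_matches` method, the toy `Table`/`Column` types) is an illustrative assumption, not the suite's actual interface; the wiki guide documents the real extension points.

```python
# Illustrative sketch only -- NOT the suite's actual interface.
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str


@dataclass
class Table:
    columns: list = field(default_factory=list)


class MyMatcher:
    def __init__(self, threshold=0.5):
        # Hyperparameters exposed here could then be swept through
        # algorithm_configurations.json.
        self.threshold = threshold

    def get_matches(self, source, target):
        """Return {(source_column, target_column): score} for every pair
        whose similarity reaches the threshold."""
        matches = {}
        for s_col in source.columns:
            for t_col in target.columns:
                score = self.similarity(s_col, t_col)
                if score >= self.threshold:
                    matches[(s_col.name, t_col.name)] = score
        return matches

    def similarity(self, s_col, t_col):
        # Placeholder: exact (case-insensitive) name equality. A real
        # method would use schema and/or instance information.
        return float(s_col.name.lower() == t_col.name.lower())


source = Table([Column("id"), Column("author_name")])
target = Table([Column("ID"), Column("writer")])
print(MyMatcher().get_matches(source, target))  # {('id', 'ID'): 1.0}
```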
## Project structure

- `algorithms`: Module containing all the algorithms implemented in the suite
  - `coma`: Python wrapper around COMA 3.0 Community edition
  - `cupid`: Python implementation of the paper *Generic Schema Matching with Cupid*
  - `distribution_based`: Python implementation of the paper *Automatic Discovery of Attributes in Relational Databases*
  - `embdi`: Code of EmbDI provided by the authors in their GitLab repository
  - `jaccard_levenshtein`: Baseline that uses the Jaccard similarity between columns to assess their correspondence score, enhanced by Levenshtein distance (see the first sketch after this list)
  - `sem_prop`: Code of Seeping Semantics provided in Aurum
  - `similarity_flooding`: Python implementation of the paper *Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching*
  - `discrete_quality`: Python implementation of the Discrete Join Quality metric
  - `continuous_quality`: Python implementation of the Continuous Join Quality metric
  - `ml_quality`: Python implementation of the Predicted Join Quality, based on a trained MLP model and attribute profiles
- `data_loader`: Module used to load the relational data coming from the valentine-generator
- `metrics`: Module containing the metrics that the framework supports (e.g., Precision and Recall; see the second sketch after this list)
- `utils`: Module containing some utility functions used throughout the framework
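To illustrate the idea behind the `jaccard_levenshtein` baseline mentioned above, here is a minimal sketch (not the suite's actual implementation; the function names and the `max_dist` threshold are assumptions): two values count as equal when their Levenshtein distance is small, and the Jaccard similarity of two columns is computed over this relaxed equality.

```python
# Illustrative sketch of a Jaccard similarity relaxed by Levenshtein distance.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def relaxed_jaccard(col_a, col_b, max_dist=1):
    """Jaccard similarity where two values match if their edit distance
    is at most max_dist."""
    set_a, set_b = set(col_a), set(col_b)
    matched = sum(1 for va in set_a
                  if any(levenshtein(va, vb) <= max_dist for vb in set_b))
    union = len(set_a) + len(set_b) - matched
    return matched / union if union else 0.0


print(relaxed_jaccard(["alice", "bob", "carol"],
                      ["alicia", "bob", "karol"], max_dist=2))  # 1.0
```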
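And for the `metrics` module, this is how precision and recall are typically defined for schema matching output (an illustrative sketch against hypothetical column pairs, not the module's actual code):

```python
# Illustrative: precision/recall of predicted vs. ground-truth column matches.

def precision_recall(predicted, ground_truth):
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


predicted = {("authors.name", "writers.full_name"),
             ("authors.id", "writers.wid")}
ground_truth = {("authors.name", "writers.full_name")}
print(precision_recall(predicted, ground_truth))  # (0.5, 1.0)
```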
## Cite Valentine

```
@inproceedings{koutras2021valentine,
  title     = {Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author    = {Christos Koutras and George Siachamis and Andra Ionescu and Kyriakos Psarakis and Jerry Brons and Marios Fragkoulis and Christoph Lofi and Angela Bonifati and Asterios Katsifodimos},
  booktitle = {37th IEEE International Conference on Data Engineering, ICDE 2021},
  pages     = {1--12},
  publisher = {IEEE},
  year      = {2021}
}
```