Valentine: Evaluating Matching Techniques for Dataset Discovery

Project page: https://delftdata.github.io/valentine/

This is the main repository containing the framework used in the paper Valentine: Evaluating Matching Techniques for Dataset Discovery. The data generator and the framework's output and visualizations are in the following repositories:

Data generator: valentine-generator

Paper results and visualizations: valentine-paper-results

The datasets used for experiments in Valentine can be found in the datasets-archive.

Installation instructions

The following instructions have been tested on a newly created Ubuntu 18.04 LTS VM. If you prefer to run the entire suite on Docker, skip this section and the Run experiments section and go directly to the Run with docker section.

Clone the repo to your machine using git: git clone https://github.com/delftdata/valentine-suite
To install all the dependencies required by the suite, run the install-dependencies.sh script.

NOTE: This script installs programs and therefore requires sudo rights for some steps.

After these two steps, the framework should not require anything more regarding dependencies.

Run experiments

Download the data from the datasets-archive and put it in a folder called data at the project root level.
Set the grid-search configuration that you want to run for all the algorithms in the file algorithm_configurations.json.
Activate the conda environment created in the installation phase with conda activate valentine-suite, then run python generate_configuration_files.py. This creates all the configuration files, each of which specifies one schema matching job (a specific method run with specific parameters on a specific dataset).

NOTE: If your system does not find conda, you might need to run source ~/.bashrc

To run the schema matching jobs in parallel, run ./run_experiments.sh {method_name} {number_of_parallel_jobs}. For example, to run 40 Cupid jobs concurrently, run ./run_experiments.sh Cupid 40 (this requires a VM with 40 CPUs to run smoothly). The output is written to the output folder at the project root level.
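The grid-search step above can be pictured as a Cartesian product over each algorithm's parameter lists, with one job configuration per combination. The sketch below only illustrates that idea. The JSON layout, the parameter names (th_accept, leaf_w_struct, threshold) and the helper expand_grid are assumptions made for the example, not the actual format of algorithm_configurations.json or the logic of generate_configuration_files.py.

```python
import itertools
import json

# Hypothetical grid: the real algorithm_configurations.json may be laid out differently.
GRID_JSON = """
{
  "Cupid": {"th_accept": [0.5, 0.7], "leaf_w_struct": [0.2, 0.4, 0.6]},
  "JaccardLevenshtein": {"threshold": [0.8]}
}
"""


def expand_grid(grid):
    """Yield one job configuration per combination of parameter values."""
    for method, params in grid.items():
        names = sorted(params)  # fixed order so the output is reproducible
        for values in itertools.product(*(params[name] for name in names)):
            yield {"method": method, "params": dict(zip(names, values))}


jobs = list(expand_grid(json.loads(GRID_JSON)))
for job in jobs:
    print(json.dumps(job, sort_keys=True))
```

With this made-up grid, Cupid contributes 2 x 3 combinations and the baseline contributes one, so the number of jobs grows multiplicatively with every parameter list that is added.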

Run with docker

The entire suite is also available as a Docker image named kpsarakis/valentine-suite:1.0. The steps to run with Docker are the following:

Run sudo docker run --privileged=true -it -v /var/run/docker.sock:/var/run/docker.sock kpsarakis/valentine-suite:1.0 to download the image and start a shell in a container that holds the Valentine suite.
Activate the conda environment by running conda activate valentine-suite
Go into the folder of the suite using cd /home/valentine-benchmark
You can now run the suite with the data used in the paper Valentine: Evaluating Matching Techniques for Dataset Discovery by running ./run_experiments.sh {method_name} {number_of_parallel_jobs}. For example, to run 40 Cupid jobs concurrently, run ./run_experiments.sh Cupid 40 (this requires a VM with 40 CPUs to run smoothly). The output is written to the output folder at the project root level, i.e. /home/valentine-benchmark/output.
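The {number_of_parallel_jobs} argument caps how many jobs run at the same time. As a rough illustration of that pattern (not a description of how run_experiments.sh is implemented), the sketch below runs a list of stand-in jobs as separate processes while keeping at most a fixed number alive at once. The functions run_job and run_all, the job names, and the placeholder command are all hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(config_name):
    """Run one stand-in job in its own process and return its exit code."""
    # Placeholder command: a real job would invoke the suite's runner instead.
    cmd = [sys.executable, "-c", f"print('finished {config_name}')"]
    return subprocess.run(cmd, capture_output=True, text=True).returncode


def run_all(config_names, parallel_jobs):
    """Run every job, keeping at most `parallel_jobs` processes alive at once."""
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        # map() preserves input order, so exit codes line up with config_names.
        return list(pool.map(run_job, config_names))


if __name__ == "__main__":
    configs = [f"cupid_job_{i}" for i in range(8)]  # hypothetical job names
    print(run_all(configs, parallel_jobs=4))
```

Threads are enough here because each worker only waits on a child process. The CPU-bound work happens in the child processes, which is why the number of concurrent jobs should roughly match the number of available CPUs.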

Integrate new methods

Since Valentine is an experiment suite, it is designed to be extended with more schema matching methods. To extend Valentine with such methods, please visit the following wiki guide on how to do so.

Project structure

algorithms Module containing all the implemented algorithms in the suite.
- coma Python wrapper around COMA 3.0 Community edition
- cupid Contains the python implementation of the paper Generic Schema Matching with Cupid
- distribution_based Contains the python implementation of the paper Automatic Discovery of Attributes in Relational Databases
- embdi Contains the code of EmbDI provided by the authors in their GitLab repository
- jaccard_levenshtein Contains a baseline that uses Jaccard Similarity between columns to assess their correspondence score, enhanced by Levenshtein Distance
- sem_prop Contains the code of Seeping Semantics provided in Aurum
- similarity_flooding Contains the python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
- discrete_quality Contains the python implementation of Discrete Join Quality metric
- continuous_quality Contains the python implementation of Continuous Join Quality metric
- ml_quality Contains the python implementation of the Predicted Join Quality, based on a trained MLP model and attribute profiles
data_loader Module used to load the relational data coming from the valentine-generator
metrics Module containing the metrics that the framework supports (e.g. Precision, Recall, ...)
utils Module containing some utility functions used throughout the framework
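To make the jaccard_levenshtein baseline listed above more concrete, here is a minimal sketch of one way to combine the two measures: two values count as equal when their normalized Levenshtein distance is below a threshold, and the Jaccard ratio is computed over those fuzzy matches. This is one plausible reading of the description and not the repository's implementation. The function names, the max_norm_dist parameter, and the sample columns are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def values_match(a, b, max_norm_dist):
    """True when the edit distance, scaled by the longer length, is small enough."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= max_norm_dist


def fuzzy_jaccard(col_a, col_b, max_norm_dist=0.2):
    """Jaccard similarity of two columns where near-identical values count as shared."""
    set_a, set_b = set(col_a), set(col_b)
    if not set_a and not set_b:
        return 0.0
    hits_a = sum(any(values_match(a, b, max_norm_dist) for b in set_b) for a in set_a)
    hits_b = sum(any(values_match(a, b, max_norm_dist) for a in set_a) for b in set_b)
    shared = min(hits_a, hits_b)  # taking the minimum keeps the score within [0, 1]
    return shared / (len(set_a) + len(set_b) - shared)


cities = ["Amsterdam", "Rotterdam", "Delft"]
towns = ["Amsterdam", "Roterdam", "Utrecht"]  # "Roterdam" is a deliberate typo
score = fuzzy_jaccard(cities, towns)
print(f"fuzzy Jaccard score: {score:.2f}")
```

An exact Jaccard similarity would treat "Rotterdam" and "Roterdam" as different values. The fuzzy variant tolerates such small spelling differences at the cost of comparing every pair of distinct values, which is quadratic in the number of distinct values per column.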

Cite Valentine

@inproceedings{koutras2021valentine,
  title     = {Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author    = {Christos Koutras and George Siachamis and Andra Ionescu and Kyriakos Psarakis and Jerry Brons and Marios Fragkoulis and Christoph Lofi and Angela Bonifati and Asterios Katsifodimos},
  booktitle = {37th IEEE International Conference on Data Engineering, ICDE 2021},
  pages     = {1--12},
  publisher = {IEEE},
  year      = {2021}
}
