This is a list of (Fuzzy) Data Matching software. The software in this list is open source and/or freely available.
The term data matching is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Data matching has two applications: (1) to match data across multiple datasets and (2) to match data within a dataset. See the Wikipedia page about data matching for more information.
Similar terms: record linkage, data matching, deduplication, fuzzy matching, entity resolution
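The two applications can be illustrated with a minimal, library-agnostic sketch. It uses only the Python standard library (`difflib.SequenceMatcher`) as the similarity measure. The function names, the sample names and the `0.85` threshold are assumptions made for illustration and are not taken from any of the tools listed below.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(left, right, threshold=0.85):
    """Match across two datasets: compare every left record with every right record."""
    return [(a, b) for a, b in product(left, right) if similarity(a, b) >= threshold]


def deduplicate(records, threshold=0.85):
    """Match within one dataset: compare every unordered pair of records once."""
    return [(a, b) for a, b in combinations(records, 2) if similarity(a, b) >= threshold]


# Hypothetical sample data.
customers = ["Jonathan Smith", "Maria Garcia", "Wei Chen"]
subscribers = ["Jonathon Smith", "Maria Garcia", "Peter Novak"]

print(link(customers, subscribers))
print(deduplicate(customers + ["Maria Garcai"]))
```

Both functions compare all possible pairs, which is only feasible for small inputs. Real tools typically reduce the number of comparisons first.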
The table below gives a compact overview of the properties of data matching software. The properties evaluated are Application Programming Interface (API), Graphical User Interface (GUI), Linking, Deduplication, Supervised Learning, Unsupervised Learning and Active Learning.
Software | API | GUI | Link | Dedup | Supervised Learning | Unsupervised Learning | Active Learning |
---|---|---|---|---|---|---|---|
Dedupe | Python | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
fastLink | R | ❌ | ✅ | ❔ | ❌ | ✅ | ❌ |
FEBRL | Python | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
FRIL | Java | ✅ | ✅ | ❌ | ❔ | ✅ | ❌ |
FuzzyMatcher | Python | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
JedAI | Java | ✅ | ✅ | ❔ | ✅ | ❔ | ❔ |
PRIL | SQL | ✅ | ✅ | ❔ | ❔ | ❔ | ❔ |
Python Record Linkage Toolkit | Python | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
RecordLinkage (R) | R | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
RELAIS | ❌ | ✅ | ✅ | ❔ | ❔ | ✅ | ❌ |
ReMaDDer | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
The Link King | ❌ | ✅ | ✅ | ✅ | ❔ | ✅ | ❌ |
✅ Yes/Implemented ❌ No/Not implemented ❔ Unknown
This section describes data matching software. The software is alphabetically ordered.
Dedupe is a Python library for fuzzy matching, deduplication and entity
resolution on structured data. The library uses active learning to
match record pairs, which is useful when no labelled training data is available.
Dedupe has a companion tool, csvdedupe, for deduplicating CSV files from the
command line.
Dedupeio also offers commercial products for data matching. [source
code] MIT
Python
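To show the idea behind active learning for record pairs, here is a minimal sketch of uncertainty sampling with a one-dimensional threshold classifier. It is not Dedupe's API or its actual learning procedure. The names `active_learn_threshold` and `ask_label`, the pair identifiers and the scores are all hypothetical, and the human reviewer is simulated by a lookup in a set.

```python
def active_learn_threshold(scores, ask_label, n_queries=4):
    """Learn a match threshold by querying the most uncertain pairs.

    scores: dict mapping a pair id to a similarity score in [0, 1].
    ask_label: callable returning True (match) or False (non-match) for a
               pair id; it stands in for the human reviewer.
    """
    labeled = {}
    threshold = 0.5  # initial guess before any labels are available
    for _ in range(n_queries):
        unlabeled = [pair for pair in scores if pair not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: ask about the pair closest to the boundary.
        query = min(unlabeled, key=lambda pair: abs(scores[pair] - threshold))
        labeled[query] = ask_label(query)
        # Re-fit: put the threshold halfway between the labelled classes.
        match_scores = [scores[p] for p, is_match in labeled.items() if is_match]
        other_scores = [scores[p] for p, is_match in labeled.items() if not is_match]
        low = max(other_scores, default=0.0)
        high = min(match_scores, default=1.0)
        threshold = (low + high) / 2
    return threshold, labeled


# Hypothetical candidate pairs with precomputed similarity scores.
scores = {"p1": 0.95, "p2": 0.81, "p3": 0.74, "p4": 0.62, "p5": 0.48, "p6": 0.20}
true_matches = {"p1", "p2", "p3"}  # what the simulated reviewer would answer

threshold, labeled = active_learn_threshold(scores, lambda pair: pair in true_matches)
print(threshold, labeled)
```

The point of the design is that the reviewer is only asked about borderline pairs, so a usable decision boundary can be found with few labels and no training data prepared in advance.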
Implements a Fellegi-Sunter probabilistic record linkage model that allows for
missing data and the inclusion of auxiliary information. This includes
functionalities to conduct a merge of two datasets under the Fellegi-Sunter
model using the Expectation-Maximization algorithm. fastLink is a programming
API written in R. (Enamorado, Fifield & Imai,
2017) [source
code] GPL-3.0
R
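The Fellegi-Sunter model with EM estimation can be sketched in a few lines of plain Python. This is a simplified illustration and not fastLink's implementation: it assumes binary agree/disagree comparisons, conditional independence between fields, and no missing data or auxiliary information. The function names, starting values and sample comparison patterns are assumptions.

```python
import math


def clip(x, eps=1e-6):
    """Keep a probability away from 0 and 1 so ratios and logs stay finite."""
    return min(max(x, eps), 1.0 - eps)


def pattern_likelihood(gamma, probs):
    """P(pattern | class) assuming the fields are conditionally independent."""
    out = 1.0
    for agrees, q in zip(gamma, probs):
        out *= q if agrees else 1.0 - q
    return out


def fit_fellegi_sunter(patterns, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate the match share p and the per-field m- and u-probabilities."""
    n_fields = len(patterns[0])
    m = [m_init] * n_fields  # P(field agrees | pair is a match)
    u = [u_init] * n_fields  # P(field agrees | pair is a non-match)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for gamma in patterns:
            pm = p * pattern_likelihood(gamma, m)
            pu = (1.0 - p) * pattern_likelihood(gamma, u)
            post.append(pm / (pm + pu))
        # M-step: re-estimate the parameters from the posteriors.
        total_m = sum(post)
        total_u = len(patterns) - total_m
        for k in range(n_fields):
            agree_m = sum(w for w, g in zip(post, patterns) if g[k])
            agree_u = sum(1.0 - w for w, g in zip(post, patterns) if g[k])
            m[k] = clip(agree_m / total_m)
            u[k] = clip(agree_u / total_u)
        p = clip(total_m / len(patterns))
    return p, m, u


def match_weight(gamma, m, u):
    """Sum of log2 likelihood ratios; higher means more likely a match."""
    return sum(
        math.log2(mk / uk) if agrees else math.log2((1.0 - mk) / (1.0 - uk))
        for agrees, mk, uk in zip(gamma, m, u)
    )


# Hypothetical comparison patterns for (name, birth year, city); 1 = agree.
patterns = (
    [(1, 1, 1)] * 8 + [(1, 1, 0)] * 2 + [(0, 0, 0)] * 60
    + [(0, 0, 1)] * 20 + [(0, 1, 0)] * 8 + [(1, 0, 0)] * 2
)
p, m, u = fit_fellegi_sunter(patterns)
print(p, m, u, match_weight((1, 1, 1), m, u))
```

Pairs are then classified by comparing their weight against one or two cut-off values, which have to be chosen separately.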
Febrl (Freely Extensible Biomedical Record Linkage) is a training tool
suitable for users to learn and experiment with record linkage techniques, as
well as for practitioners to conduct linkages with data sets containing up to
several hundred thousand records. Febrl is a data matching tool with a large
number of algorithms implemented and offers a Python programming interface as
well as a simple GUI. Febrl does not offer unsupervised or active learning
algorithms. The software is no longer actively maintained. (Christen,
2008) [source
code] Python
FRIL (Fine-grained Records Integration and Linkage tool) is a free tool that
enables record linkage through a GUI. The tool implements automatic weight
estimation through the EM algorithm and offers several techniques to make
record pairs. FRIL was developed by Emory University and is no longer
maintained. [source code] Java
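One common technique for making record pairs is blocking: only records that share a blocking key are paired, which avoids comparing every record with every other record. The sketch below shows the general idea in Python. It is not FRIL's implementation, and the `block_pairs` function, the sample records and the choice of key are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Group records by a blocking key and pair up records within each block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[key(record)].append(record_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Hypothetical records keyed by record id.
records = {
    1: {"surname": "Smith", "year": 1980},
    2: {"surname": "Smyth", "year": 1980},
    3: {"surname": "Jones", "year": 1975},
    4: {"surname": "Johnson", "year": 1975},
    5: {"surname": "Brown", "year": 1990},
}


def initial_and_year(record):
    """Blocking key: first letter of the surname plus the birth year."""
    return (record["surname"][0].upper(), record["year"])


candidate_pairs = block_pairs(records, initial_and_year)
all_pairs = list(combinations(records, 2))
print(len(candidate_pairs), "candidate pairs instead of", len(all_pairs))
```

The trade-off is that true matches whose blocking keys differ, for example because of a typo in the first letter, are never compared.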
A Python package that allows the user to fuzzy match two pandas dataframes
based on one or more fields in common. The functionality is limited at the
moment. MIT
Python
Java gEneric DAta Integration (JedAI) Toolkit is an entity resolution tool
developed by a group of universities. JedAI offers a graphical user interface.
[source code] Apache License 2.0
Java
PRIL (Point-of-contact Interactive Record Linkage) is a record linkage program
with a GUI. PRIL can be used to link datasets about individuals. (Rentsch CT,
Kabudula CW, Catlett J et al.,
2017) [source
code] MIT
SQLPL
The Python Record Linkage Toolkit is a library to link records in or between
data sources. The toolkit provides most of the tools needed for record linkage
and deduplication. The package is developed for research and the linking of
small or medium sized files. [source
code] GPL-3.0
Python
Package written in R that provides functions for linking and de-duplicating
data sets. Both supervised and unsupervised classification algorithms are
available. Record pairs can be compared with a limited set of algorithms. The
package is published on CRAN. GPL-3.0
R
RELAIS (REcord Linkage At IStat) is a toolkit providing a set of techniques
for dealing with record linkage projects. IStat is the main producer of
official statistics in Italy. EUPL-1.1
R/Java
ReMaDDer is free, unsupervised fuzzy data matching software with a GUI. ReMaDDer can perform fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of human clerical review. NOTE: The software is free, but not open source, and requires an internet connection to work.
The Link King’s graphical user interface (GUI) makes record linkage and
unduplication easy for beginning and advanced users. The software requires a
SAS license. SAS
A record linkage tool, developed by the US Census Bureau, for matching a very large file against a moderate-size file. Several papers are available about this program (BigMatch, 2007).
Do you know an open source and/or free data matching tool? Please open an issue or submit a pull request. The same holds for missing or incomplete information.
This project was initiated by the author of the Python Record Linkage Toolkit, @J535D165. The aim is to provide a list and comparison of data matching software.
This list is licensed under CC-BY-SA 3.0.