└── README.md

/README.md:
--------------------------------------------------------------------------------
# Data Matching software

- [Overview](#overview)
- [Software](#software)
- [Outdated](#outdated-no-longer-available)
- [Contributing](#contributing)

This is a list of (Fuzzy) Data Matching software. The software in this list is
FOSS (Free and open-source software).

The term data matching is used to indicate the procedure of bringing together
information from two or more records that are believed to belong to the same
entity. Data matching has two applications: (1) to match data across multiple
datasets (linkage) and (2) to match data within a dataset (deduplication).
See the [Wikipedia page](https://en.wikipedia.org/wiki/Record_linkage) about
data matching for more information.

*Similar terms:* record linkage, data matching, deduplication, fuzzy matching,
entity resolution

## Overview

The table below gives a compact overview of data matching software properties.
The properties evaluated are [Application Programming Interface
(API)](https://en.wikipedia.org/wiki/Application_programming_interface),
[Graphical User Interface
(GUI)](https://en.wikipedia.org/wiki/Graphical_user_interface), Linking,
Deduplication, [Supervised
Learning](https://en.wikipedia.org/wiki/Supervised_learning), [Unsupervised
Learning](https://en.wikipedia.org/wiki/Unsupervised_learning) and [Active
Learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)).

| Software | API | GUI | Link | Dedup | Supervised Learning | Unsupervised Learning | Active Learning |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| [AtyImo](#atyimo) | PySpark | :x: | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
| [Dedupe](#dedupe) | Python | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: |
| [dirty-cat](#dirty-cat) | Python | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: |
| [fastLink](#fastlink) | R | :x: | :white_check_mark: | :grey_question: | :x: | :white_check_mark: | :x: |
| [FEBRL](#febrl) | Python | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
| [FRIL](#fril) | Java | :white_check_mark: | :white_check_mark: | :x: | :grey_question: | :white_check_mark: | :x: |
| [FuzzyMatcher](#fuzzymatcher) | Python | :x: | :white_check_mark: | :x: | :x: | :white_check_mark: | :x: |
| [hlink](#hlink) | PySpark | :x: | :white_check_mark: | :grey_question: | :x: | :x: | :x: |
| [JedAI](#jedai) | Java | :white_check_mark: | :white_check_mark: | :grey_question: | :white_check_mark: | :grey_question: | :grey_question: |
| [PRIL](#pril) | SQL | :x: | :white_check_mark: | :grey_question: | :grey_question: | :grey_question: | :grey_question: |
| [Python Record Linkage Toolkit](#python-record-linkage-toolkit) | Python | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: |
| [RecordLinkage (R)](#recordlinkage-r) | R | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: |
| [Reclin2](#reclin2) | R | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :x: |
| [RELAIS](#relais) | :x: | :white_check_mark: | :white_check_mark: | :grey_question: | :grey_question: | :white_check_mark: | :x: |
| [ReMaDDer](#remadder) | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :x: |
| [RLTK](#rltk) | Python | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :x: |
| [Splink](#splink) | Python | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: |
| [Zingg](#zingg) | Python | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: | :x: |

:white_check_mark: Yes/Implemented
:x: No/Not implemented
:grey_question: Unknown

## Software

This section describes **data matching** software. The software is
ordered alphabetically.

#### [AtyImo](https://github.com/pierrepita/atyimo)

AtyImo implements a mixture of deterministic and probabilistic routines for data
linkage. It was initially developed in 2013 as a linkage tool supporting a joint
Brazil–U.K. project that aimed to build a large population-based cohort with data
from more than 100 million participants and to produce disease-specific data to
facilitate diverse epidemiological research studies.

| | |
|---|---|
| License | ![GitHub](https://img.shields.io/github/license/pierrepita/atyimo) |
| Language | `Python` `Spark` |
| Latest release | NA |
| Downloads per month | |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/pierrepita/atyimo.svg?style=social&label=Star)](https://github.com/pierrepita/atyimo) |

#### [Dedupe](https://github.com/dedupeio/dedupe)

Dedupe is a Python library for fuzzy matching, deduplication and entity
resolution on structured data. The library makes use of active learning to
match record pairs. Active learning is useful in cases without training data.
Dedupe has a companion command-line tool for deduplicating CSV files,
[csvdedupe](https://github.com/dedupeio/csvdedupe).
Dedupeio also offers commercial products for data matching.

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/dedupe) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dedupe) |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/dedupe.svg)](https://pypi.python.org/pypi/dedupe/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/dedupe) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/dedupeio/dedupe.svg?style=social&label=Star)](https://github.com/dedupeio/dedupe) |

#### [dirty-cat](https://github.com/dirty-cat/dirty_cat)

[dirty-cat](https://dirty-cat.github.io/) is an open-source Python package that facilitates machine learning with dirty data: it is robust to morphological variants, such as typos. Some of the currently supported features are fuzzy joining of tables on dirty numerical, string or mixed-type columns, and deduplicating and encoding dirty categorical variables for ML. [This example](https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html) illustrates why to use dirty-cat encoders rather than OneHotEncoder on dirty data, and [this one](https://dirty-cat.github.io/stable/auto_examples/04_fuzzy_joining_and_FeatureAugmenter.html) shows how to join multiple dirty tables for ML.
The transformers ([TableVectorizer](https://dirty-cat.github.io/stable/generated/dirty_cat.TableVectorizer.html), [FeatureAugmenter](https://dirty-cat.github.io/stable/generated/dirty_cat.FeatureAugmenter.html)) are scikit-learn compatible and can easily be introduced into ML pipelines.
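
To make the idea of fuzzy joining on dirty string keys concrete, here is a minimal sketch that uses only Python's standard library. It is not dirty-cat's implementation or API: the helper names `similarity` and `naive_fuzzy_join`, the `cutoff` value and the sample city names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_fuzzy_join(left, right, cutoff=0.8):
    """Pair each left key with its most similar right key (hypothetical helper).

    A left key is paired with None when no right key reaches `cutoff`.
    """
    pairs = []
    for l_key in left:
        # Pick the right-hand key with the highest similarity, if there is any.
        best = max(right, key=lambda r_key: similarity(l_key, r_key), default=None)
        if best is not None and similarity(l_key, best) >= cutoff:
            pairs.append((l_key, best))
        else:
            pairs.append((l_key, None))
    return pairs


# Made-up sample data: the left table contains a typo ("Roterdam").
left = ["Amsterdam", "Roterdam", "Utrecht"]
right = ["Rotterdam", "Amsterdam", "Groningen"]
print(naive_fuzzy_join(left, right))
# [('Amsterdam', 'Amsterdam'), ('Roterdam', 'Rotterdam'), ('Utrecht', None)]
```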

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/dirty_cat) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dirty_cat) |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/dirty_cat.svg)](https://pypi.python.org/pypi/dirty_cat/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/dirty_cat) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/dirty-cat/dirty_cat.svg?style=social&label=Star)](https://github.com/dirty-cat/dirty_cat) |

#### [fastLink](https://cran.r-project.org/web/packages/fastLink/index.html)

fastLink implements a Fellegi-Sunter probabilistic record linkage model that
allows for missing data and the inclusion of auxiliary information. This includes
functionality to merge two datasets under the Fellegi-Sunter
model using the Expectation-Maximization algorithm. fastLink is a programming
API written in R.
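
As a rough illustration of how a Fellegi-Sunter model scores a record pair, the sketch below sums log2 likelihood-ratio weights over the compared fields. It is written in Python rather than R and is not fastLink's code: the function name `match_weight` and the m and u probabilities are made up for the example, and the Expectation-Maximization step that would normally estimate those probabilities from data is omitted.

```python
import math


def match_weight(agreements, m, u):
    """Sum the log2 likelihood-ratio weights over the compared fields.

    agreements[i] is True when field i agrees for the record pair;
    m[i] = P(field agrees | records match),
    u[i] = P(field agrees | records do not match).
    """
    total = 0.0
    for agrees, m_i, u_i in zip(agreements, m, u):
        if agrees:
            total += math.log2(m_i / u_i)  # agreement weight (positive when m > u)
        else:
            total += math.log2((1 - m_i) / (1 - u_i))  # disagreement weight
    return total


# Illustrative (made-up) probabilities for three fields: name, birth year, city.
m = [0.95, 0.90, 0.80]
u = [0.01, 0.05, 0.20]

print(match_weight([True, True, True], m, u))     # all fields agree: highest weight
print(match_weight([True, False, True], m, u))    # one disagreement lowers the weight
print(match_weight([False, False, False], m, u))  # all fields disagree: negative weight
```

Pairs with a total weight above a chosen upper threshold would be treated as links and pairs below a lower threshold as non-links; the thresholds themselves are a modelling choice.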
([Enamorado, Fifield & Imai,
2017](http://imai.princeton.edu/research/files/linkage.pdf)) [[source
code]](https://github.com/kosukeimai/fastLink)

| | |
|---|---|
| License | ![CRAN/METACRAN](https://img.shields.io/cran/l/fastLink) |
| Language | `R` |
| Latest release | [![CRAN](https://img.shields.io/cran/v/fastLink.svg)](https://cran.r-project.org/web/packages/fastLink/index.html) |
| Downloads per month | [![metacran downloads](https://cranlogs.r-pkg.org/badges/last-month/fastLink)](https://cran.r-project.org/package=fastLink) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/kosukeimai/fastLink.svg?style=social&label=Star)](https://github.com/kosukeimai/fastLink) |

#### [FEBRL](https://sourceforge.net/projects/febrl/)

Febrl (Freely Extensible Biomedical Record Linkage) is a training tool
suitable for users to learn and experiment with record linkage techniques, as
well as for practitioners to conduct linkages with data sets containing up to
several hundred thousand records. Febrl is a data matching tool with a large
number of algorithms implemented and offers a Python programming interface as
well as a simple GUI. Febrl does not offer unsupervised or active learning
algorithms. The software is no longer actively maintained. ([Christen,
2008](http://crpit.com/confpapers/CRPITV80Christen.pdf)) [[source
code]](https://sourceforge.net/projects/febrl/)

| | |
|---|---|
| License | Custom |
| Language | `Python` |
| Latest release | |
| Downloads per month | |
| GitHub stars | |

#### [FRIL](http://fril.sourceforge.net/)

FRIL (Fine-grained Records Integration and Linkage tool) is a free tool that
enables record linkage through a GUI.
The tool implements automatic weight
estimation through the EM algorithm and offers several techniques to make
record pairs. FRIL was developed at Emory University and is no longer
maintained. [[source code]](http://fril.sourceforge.net/download.html)

| | |
|---|---|
| License | Custom |
| Language | `Java` |
| Latest release | |
| Downloads per month | |
| GitHub stars | |

#### [FuzzyMatcher](https://pypi.python.org/pypi/fuzzymatcher)

A Python package that allows the user to fuzzy match two pandas dataframes
based on one or more fields in common. The functionality is limited at the
moment. [[source code]](https://github.com/RobinL/fuzzymatcher)

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/fuzzymatcher) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/fuzzymatcher) |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/fuzzymatcher.svg)](https://pypi.python.org/pypi/fuzzymatcher/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/fuzzymatcher) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/RobinL/fuzzymatcher.svg?style=social&label=Star)](https://github.com/RobinL/fuzzymatcher) |

#### [hlink](https://pypi.python.org/pypi/hlink)

A Python package designed to link two datasets. The primary use case was linking demographics in the Household -> Person hierarchical structure; however, it can also be used to link generic datasets by skipping the household linking tasks. It allows for probabilistic and deterministic record linkage.
[[source code]](https://github.com/ipums/hlink)

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/hlink) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hlink) |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/hlink.svg)](https://pypi.python.org/pypi/hlink/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/hlink) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/ipums/hlink?style=social&label=Star)](https://github.com/ipums/hlink) |

#### [JedAI](http://jedai.scify.org/)

The Java gEneric DAta Integration (JedAI) Toolkit is an entity resolution tool
developed by a group of universities. JedAI offers a Graphical User Interface.
[[source code]](https://github.com/scify/JedAIToolkit)

| | |
|---|---|
| License | ![GitHub](https://img.shields.io/github/license/scify/JedAIToolkit) |
| Language | `Java` |
| Latest release | |
| Downloads per month | |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/scify/JedAIToolkit.svg?style=social&label=Star)](https://github.com/scify/JedAIToolkit) |

#### [PRIL](https://github.com/LSHTM-ALPHAnetwork/PIRL_RecordLinkageSoftware)

PRIL (Point-of-contact Interactive Record Linkage) is a record linkage program
with a GUI. PRIL can be used to link datasets about individuals.
([Rentsch CT,
Kabudula CW, Catlett J et al.,
2017](https://gatesopenresearch.org/articles/1-8/v1)) [[source
code]](https://github.com/LSHTM-ALPHAnetwork/PIRL_RecordLinkageSoftware)

| | |
|---|---|
| License | ![GitHub](https://img.shields.io/github/license/LSHTM-ALPHAnetwork/PIRL_RecordLinkageSoftware) |
| Language | `SQLPL` |
| Latest release | |
| Downloads per month | |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/LSHTM-ALPHAnetwork/PIRL_RecordLinkageSoftware.svg?style=social&label=Star)](https://github.com/LSHTM-ALPHAnetwork/PIRL_RecordLinkageSoftware) |

#### [Python Record Linkage Toolkit](https://github.com/J535D165/recordlinkage)

The Python Record Linkage Toolkit is a library to link records in or between
data sources. The toolkit provides most of the tools needed for record linkage
and deduplication. The package is developed for research and the linking of
small or medium-sized files.

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/recordlinkage) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/recordlinkage) |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/recordlinkage.svg)](https://pypi.python.org/pypi/recordlinkage/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/recordlinkage) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/J535D165/recordlinkage.svg?style=social&label=Star)](https://github.com/J535D165/recordlinkage) |

#### [RecordLinkage (R)](https://cran.r-project.org/web/packages/RecordLinkage/index.html)

Package written in R that provides functions for linking and de-duplicating
data sets. Both supervised and unsupervised classification algorithms are
available.
Record pairs can be compared with a limited set of algorithms. The
package is published on CRAN.

| | |
|---|---|
| License | ![CRAN/METACRAN](https://img.shields.io/cran/l/RecordLinkage) |
| Language | `R` |
| Latest release | [![CRAN](https://img.shields.io/cran/v/RecordLinkage.svg)](https://cran.r-project.org/web/packages/RecordLinkage/index.html) |
| Downloads per month | [![metacran downloads](https://cranlogs.r-pkg.org/badges/last-month/RecordLinkage)](https://cran.r-project.org/package=RecordLinkage) |
| GitHub stars | |

#### [Reclin2](https://github.com/djvanderlaan/reclin2)

Package written in R that provides functions for linking data sets. The framework offers
the option to compute the weights of the Fellegi-Sunter model. It does not implement an
unsupervised algorithm to predict the cutoff. The
package is published on CRAN. Formerly https://github.com/djvanderlaan/reclin.

| | |
|---|---|
| License | ![CRAN/METACRAN](https://img.shields.io/cran/l/reclin2) |
| Language | `R` |
| Latest release | [![CRAN](https://img.shields.io/cran/v/reclin2.svg)](https://cran.r-project.org/web/packages/reclin2/index.html) |
| Downloads per month | [![metacran downloads](https://cranlogs.r-pkg.org/badges/last-month/reclin2)](https://cran.r-project.org/package=reclin2) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/djvanderlaan/reclin2.svg?style=social&label=Star)](https://github.com/djvanderlaan/reclin2) |

#### [RELAIS](https://www.istat.it/en/methods-and-tools/methods-and-it-tools/process/processing-tools/relais)

RELAIS (REcord Linkage At IStat) is a toolkit providing a set of techniques
for dealing with record linkage projects. IStat is the main producer of
official statistics in Italy.

| | |
|---|---|
| License | `EUPL-1.1` |
| Language | `R/Java` |
| Latest release | |
| Downloads per month | |
| GitHub stars | |

#### [ReMaDDer](http://remadder.findmysoft.com/)

ReMaDDer is free, unsupervised fuzzy data matching software with a GUI.
ReMaDDer can perform fully automatic fuzzy record matching without
human expert intervention, while attaining the accuracy of human clerical review.
NOTE: The software is free, but not open source, and it requires an internet
connection to work.

| | |
|---|---|
| License | |
| Language | |
| Latest release | |
| Downloads per month | |
| GitHub stars | |

#### [RLTK](https://github.com/usc-isi-i2/rltk)

The Record Linkage ToolKit (RLTK) is a general-purpose open-source record
linkage package. The toolkit provides the full pipeline needed for record linkage
and deduplication.

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/rltk) |
| Language | `Python` |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/rltk.svg)](https://pypi.python.org/pypi/rltk/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/rltk) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/usc-isi-i2/rltk.svg?style=social&label=Star)](https://github.com/usc-isi-i2/rltk) |

#### [Splink](https://github.com/moj-analytical-services/splink)

Splink is a Python package for probabilistic record linkage at scale.
It supports multiple backends to execute linkage jobs, including DuckDB,
Apache Spark and AWS Athena. It is able to perform linking and deduplication of very large datasets
of tens of millions of records with runtimes of less than an hour, including
the clustering of results using connected components.
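
The connected-components clustering mentioned above can be sketched in a few lines of plain Python. This is a generic union-find illustration, not Splink's implementation: the function name `connected_components` and the sample record ids are hypothetical, and the input is assumed to be the list of record pairs that were already accepted as matches.

```python
def connected_components(pairs):
    """Group record ids into clusters.

    Ids that are joined by any chain of matched pairs end up in the same cluster.
    """
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if unseen.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in parent:
        clusters.setdefault(find(record), set()).add(record)
    # Sort for a deterministic result.
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical matched pairs: a-b and b-c imply that a, b and c are one entity.
matches = [("a", "b"), ("b", "c"), ("d", "e")]
print(connected_components(matches))
# [['a', 'b', 'c'], ['d', 'e']]
```

Note that this treats matching as transitive, so a single false-positive pair can merge two otherwise unrelated clusters.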
Splink also includes interactive tools
to support the lifecycle of a linking project, from exploratory analysis through to
diagnostics and quality assurance. [[source
code]](https://github.com/moj-analytical-services/splink)

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/splink) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/splink) |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/splink.svg)](https://pypi.python.org/pypi/splink/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/splink) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/moj-analytical-services/splink.svg?style=social&label=Star)](https://github.com/moj-analytical-services/splink) |

#### [Zingg](https://github.com/zinggAI/zingg)

[Zingg](https://zingg.ai) is an open-source ML-based tool for entity resolution with which analytics engineers and data scientists can quickly integrate data silos and build unified views at scale. Zingg can connect to disparate data sources: local and cloud file systems in any format, enterprise applications, and relational, NoSQL and cloud databases and warehouses. It scales to large volumes of data, and you can define domain-specific functions to improve matching.
Zingg supports English as well as Chinese, Thai, Japanese, Hindi and other languages. It also has a very active [Slack community](https://join.slack.com/t/zinggai/shared_invite/zt-w7zlcnol-vEuqU9m~Q56kLLUVxRgpOA) where people around the globe help each other and share their views.

| | |
|---|---|
| License | ![PyPI - License](https://img.shields.io/pypi/l/zingg) |
| Language | ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/zingg) `Spark` |
| Latest release | [![PyPI](https://img.shields.io/pypi/v/zingg.svg)](https://pypi.python.org/pypi/zingg/) |
| Downloads per month | ![PyPI - Downloads](https://img.shields.io/pypi/dm/zingg) |
| GitHub stars | [![GitHub stars](https://img.shields.io/github/stars/zinggAI/zingg.svg?style=social&label=Star)](https://github.com/zinggAI/zingg) |

## Outdated/ no longer available

#### BigMatch (by USA census)

A record linkage tool, developed by the U.S. Census Bureau, for matching a very
large file against a moderate-size file. There are several papers
available about this program ([BigMatch,
2007](https://www.census.gov/srd/papers/pdf/rrc2007-01.pdf)).

#### [The Link King](http://the-link-king.party/)

The Link King’s graphical user interface (GUI) makes record linkage and
unduplication easy for beginning and advanced users. The software requires a
SAS license. `SAS`

## Contributing

Do you know an open source and/or free data matching tool? Please open an
issue or submit a pull request. The same holds for missing or incomplete
information.

This project was initiated by @J535D165, the author of the [Python Record Linkage
Toolkit](https://github.com/J535D165/recordlinkage). The aim is to
provide a list and comparison of data matching software.

This list is licensed under [CC-BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/).

--------------------------------------------------------------------------------