The number of files and databases in digital format (customer lists, patient lists, census data, and so on) keeps growing. Record linkage is useful in many applications that need to combine information from two or more files. It can also be used for data cleaning, to find duplicate records within a single file. Research on this topic dates back to the 1960s. Because the problem is difficult both computationally and in terms of data quality, research is ongoing and open problems remain. This page points to some of this research, with information about books, papers, and available software.
- Books
- Data Quality and Record Linkage Techniques, Thomas N. Herzog, Fritz J. Scheuren, William E. Winkler, 2007, XIV, 234 p., Softcover, ISBN: 978-0-387-69502-0
- Papers
- Overview of Record Linkage and Current Research Directions, William E. Winkler, Statistical Research Division, U.S. Census Bureau PDF
- Duplicate Record Detection: A Survey, A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 1 (2007), pp. 1-16. PDF
- Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service, Mohamed G. Elfeky, Thanaa M. Ghanem, Vassilios S. Verykios, Ahmed R. Huwait, Ahmed K. Elmagarmid PDF
- A Comparison of String Distance Metrics for Name-Matching Tasks, William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg, In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web PDF
- Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, Peter Christen, Proceedings of the ACM SIGKDD 2008 conference, Las Vegas, August 2008 PDF
- Towards Automated Record Linkage, Karl Goiser and Peter Christen, In Proceedings of the Fifth Australasian Data Mining Conference (AusDM2006), Sydney, November 2006. PDF
- A Comparison of Personal Name Matching: Techniques and Practical Issues, Peter Christen, In Proceedings of the Workshop on Mining Complex Data (MCD) held at the IEEE International Conference on Data Mining (ICDM), Hong Kong, December 2006. PDF
- Software
- Febrl - record linkage software
- D-Dupe - de-duplication software
- FRIL - record linkage software
- The Link King - record linkage software
- SecondString - software for approximate string matching
- SimMetrics - software for computing similarity measures
- Other Research Groups