Record Linkage

Record linkage is the process of identifying and linking records across several files or databases that refer to the same entities. It is also referred to as data cleaning, de-duplication (when applied to a single file or database), object identification, approximate matching or approximate joins, fuzzy matching, and entity resolution. A formal description of the record linkage problem can be found here.
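As an informal illustration of the approximate matching mentioned above, the following is a minimal sketch in Python. It is not taken from any of the works listed on this page. It assumes records are plain name strings, uses the standard library's `difflib.SequenceMatcher` as the similarity measure, and relies on a hypothetical threshold of 0.85; the function names `similarity` and `link_records` are also made up for this example. Real record linkage systems typically compare several fields and use blocking to avoid comparing every pair.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings,
    ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_records(file_a, file_b, threshold=0.85):
    """Pair each record in file_a with its most similar record in file_b,
    keeping the pair only if the score reaches the (assumed) threshold."""
    links = []
    for rec_a in file_a:
        # Exhaustive comparison: fine for a sketch, too slow for large files.
        best = max(file_b, key=lambda rec_b: similarity(rec_a, rec_b), default=None)
        if best is not None and similarity(rec_a, best) >= threshold:
            links.append((rec_a, best))
    return links


if __name__ == "__main__":
    customers = ["Jon Smith", "Wei Chen"]
    patients = ["John Smith", "Alice Brown", "wei chen"]
    for a, b in link_records(customers, patients):
        print(f"{a!r} <-> {b!r}")
```

Running the same function with one list passed as both arguments (and skipping identical positions) would turn this into a simple de-duplication check on a single file.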
The number of files and databases in digital format, such as customer lists, patient lists, and census data, is continuously increasing. Record linkage is useful in applications that need to combine information from two or more files. It can also be used for data cleaning, to find duplicates within a single file. Research on this topic dates back to the 1960s. Because the problem is difficult both computationally and in terms of data quality, research is ongoing and open problems remain. This page points to some of this research, providing information about books, papers and available software.
  • Papers
    • Overview of Record Linkage and Current Research Directions, William E. Winkler, Statistical Research Division, U.S. Census Bureau PDF
    • Duplicate Record Detection: A Survey, A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 1 (2007), pp. 1-16. PDF
    • Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service, Mohamed G. Elfeky, Thanaa M. Ghanem, Vassilios S. Verykios, Ahmed R. Huwait, Ahmed K. Elmagarmid PDF
    • A Comparison of String Distance Metrics for Name-Matching Tasks, William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg, In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web PDF
    • Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, Peter Christen, Proceedings of the ACM SIGKDD 2008 conference, Las Vegas, August 2008 PDF
    • Towards Automated Record Linkage, Karl Goiser and Peter Christen, In Proceedings of the Fifth Australasian Data Mining Conference (AusDM2006), Sydney, November 2006. PDF
    • A Comparison of Personal Name Matching: Techniques and Practical Issues, Peter Christen, In Proceedings of the Workshop on Mining Complex Data (MCD) held at the IEEE International Conference on Data Mining (ICDM), Hong Kong, December 2006. PDF
This page is a work in progress. If you are aware of any entry that should be included, please let us know at lantonie AT uoguelph DOT ca.