The Use of Density-Based Spatial Clustering of Application With Noise (DBSCAN) for Record Linkage in An Observational HIV Cohort

Victor Olago; Lina Bartels; Tafadzwa Dhokotera; Julia Bohlius; Matthias Egger; Elvira Singh; Mazvita Sengayi

doi:10.23889/ijpds.v5i5.1422

International Journal of Population Data Science (Dec 2020)

Affiliations

Victor Olago
Lina Bartels: Institute of Social and Preventive Medicine (ISPM), University of Bern, Switzerland
Tafadzwa Dhokotera: Institute of Social and Preventive Medicine (ISPM), University of Bern, Switzerland
Julia Bohlius: Institute of Social and Preventive Medicine (ISPM), University of Bern, Switzerland
Matthias Egger: Institute of Social and Preventive Medicine (ISPM), University of Bern, Switzerland. Centre for Infectious Disease Epidemiology and Research (CIDER), School of Public Health and Family Medicine, University of Cape Town, South Africa
Elvira Singh: National Health Laboratory Service (NHLS), National Cancer Registry (NCR), Johannesburg, South Africa. Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of Witwatersrand, Johannesburg, South Africa
Mazvita Sengayi: National Health Laboratory Service (NHLS), National Cancer Registry (NCR), Johannesburg, South Africa. Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of Witwatersrand, Johannesburg, South Africa

DOI: https://doi.org/10.23889/ijpds.v5i5.1422
Journal volume & issue: Vol. 5, no. 5

Abstract

Introduction

The South African HIV Cancer Match (SAM) study is a probabilistic record linkage study that creates an HIV cohort from laboratory records of the National Health Laboratory Service (NHLS). This cohort was linked to the pathology-based South African National Cancer Registry to establish cancer incidence among the HIV-positive population in South Africa. As the number of HIV records increases, there is a need for more efficient ways of deduplicating these large datasets. In this work, we used clustering to perform big-data deduplication.

Objectives and Approach

Our objective was to use DBSCAN as a clustering algorithm, together with a bi-gram word analyser, to perform big-data deduplication in resource-limited settings. We used HIV-related laboratory records from across South Africa, collated in the NHLS Corporate Data Warehouse for the period 2004-2014. The approach involved data pre-processing, deterministic deduplication, n-gram generation, feature generation using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, clustering with DBSCAN, and assigning cluster labels to records that potentially belonged to the same person. We assessed the quality of deduplication on records with national identification numbers by calculating precision, recall and f-measure.

Results

We had 51,563,127 HIV-related laboratory records. Deterministic deduplication reduced these to 20,387,819 deduplicated patient records. DBSCAN clustering further reduced this to 14,849,524 patient record clusters. In this final dataset, 3,355,544 (22.60%) patients had a negative HIV test, 11,316,937 (76.21%) had evidence of HIV infection, and for 177,043 (1.19%) the HIV status could not be determined. The precision, recall and f-measure based on 1,865,445 records with national identification numbers were 0.96, 0.94 and 0.95, respectively.

Conclusion / Implications

Our study demonstrated that DBSCAN clustering is an effective way of deduplicating big datasets in resource-limited settings. It enabled refinement of an HIV observational database by accurately linking test records that potentially belonged to the same person. The methodology creates opportunities for easy data profiling to inform public health decision making.
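
The pipeline described in the abstract (n-gram generation, TF-IDF features, DBSCAN clustering, cluster labels as candidate person identifiers) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The abstract does not give the record fields, the distance metric, or the DBSCAN parameters, so the sample records, the character bi-grams taken within each word, the cosine distance, the smoothed IDF formula, `eps = 0.3` and `min_samples = 2` are all assumptions made for illustration. The f-measure is assumed here to be the harmonic mean of precision and recall (F1).

```python
import math
from collections import Counter

def bigrams(text):
    """Character bi-grams taken within each space-padded word (an assumption)."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        grams.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams

def tfidf_vectors(records):
    """L2-normalised sparse TF-IDF vectors, one dict per record."""
    docs = [Counter(bigrams(r)) for r in records]
    n = len(docs)
    df = Counter(g for d in docs for g in d)  # document frequency of each bi-gram
    vectors = []
    for d in docs:
        # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + n) / (1 + df[g])) + 1) for g, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity for two L2-normalised sparse vectors."""
    return 1.0 - sum(w * b.get(g, 0.0) for g, w in a.items())

def dbscan(vectors, eps, min_samples):
    """Brute-force DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical records (name, date of birth, sex); not taken from the study data.
records = [
    "thabo mokoena 19800312 m",
    "thabo mokena 19800312 m",    # spelling variant of the first record
    "lerato dlamini 19751102 f",
    "lerato dhlamini 19751102 f", # spelling variant of the third record
    "sipho nkosi 19900621 m",     # no near-duplicate
]
labels = dbscan(tfidf_vectors(records), eps=0.3, min_samples=2)
print(labels)
print(round(f_measure(0.96, 0.94), 2))
```

In this sketch, records that share a non-negative label would be treated as belonging to the same person, and records labelled -1 as having no near-duplicate. The brute-force neighbour search is quadratic in the number of records, so a dataset of the size reported in the abstract would need an indexed or blocked neighbour search, for example through an established library implementation.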

Published in International Journal of Population Data Science

ISSN: 2399-4908 (Online)
Publisher: Swansea University
Country of publisher: United Kingdom
LCC subjects: Social Sciences: Economic theory. Demography: Demography. Population. Vital events
Website: https://ijpds.org