Ligo: An Open Source Application for the Management and Execution of Administrative Data Linkage

Greg Lawrance; Raphael Parra Hernandez; Khalegh Mamakani; Suraiya Khan; Brent Hills; Harold Yip; Caelan Marrville

doi:10.23889/ijpds.v3i4.749

International Journal of Population Data Science (Aug 2018)

Affiliations

Greg Lawrance: DataBC, Integrated Data Division, Jobs, Trade & Technology, Government of BC
Raphael Parra Hernandez: Integrated Data Office, Ministry of Jobs, Trade & Technology, Government of BC
Khalegh Mamakani: Integrated Data Office, Ministry of Jobs, Trade & Technology, Government of BC
Suraiya Khan: Integrated Data Office, Ministry of Jobs, Trade & Technology, Government of BC
Brent Hills: Population Data BC, University of British Columbia
Harold Yip: Population Data BC, University of British Columbia
Caelan Marrville: Integrated Data Office, Ministry of Jobs, Trade & Technology, Government of BC

DOI: https://doi.org/10.23889/ijpds.v3i4.749
Journal volume & issue: Vol. 3, no. 4

Abstract

Introduction: Ligo is an open source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process.

Objectives and Approach: The linking application has two primary functions: identifying common entities within datasets [de-duplication] and identifying common entities between datasets [linking]. The application is being built from the ground up in a partnership between the Province of British Columbia’s Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner.

Results: Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after.

Conclusion/Implications: Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
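
The two functions named in the abstract, de-duplication and linking, together with the idea of pluggable comparison algorithms, can be illustrated with a minimal Python sketch. This is not Ligo's code or API: the registry `COMPARATORS`, the `register` decorator, the functions `deduplicate` and `link`, and the sample field names are all hypothetical, and the sketch covers only a simple deterministic rule (every listed field must agree).

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Tuple[str, str]  # (field name, comparator name)

# Hypothetical plugin registry: comparison functions looked up by name.
COMPARATORS: Dict[str, Callable[[str, str], bool]] = {}


def register(name: str):
    """Decorator that adds a comparison function to the registry."""
    def decorator(fn: Callable[[str, str], bool]):
        COMPARATORS[name] = fn
        return fn
    return decorator


@register("exact")
def exact(a: str, b: str) -> bool:
    return a == b


@register("case_insensitive")
def case_insensitive(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


def records_match(r1: Record, r2: Record, rules: List[Rule]) -> bool:
    """Deterministic rule: every (field, comparator) pair must agree."""
    return all(COMPARATORS[cmp](r1[field], r2[field]) for field, cmp in rules)


def deduplicate(records: List[Record], rules: List[Rule]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that match within one dataset."""
    return [
        (i, j)
        for i in range(len(records))
        for j in range(i + 1, len(records))
        if records_match(records[i], records[j], rules)
    ]


def link(left: List[Record], right: List[Record], rules: List[Rule]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where left[i] matches right[j]."""
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if records_match(a, b, rules)
    ]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    registry_a = [
        {"surname": "Smith", "dob": "1980-01-02"},
        {"surname": "smith ", "dob": "1980-01-02"},
        {"surname": "Jones", "dob": "1975-05-30"},
    ]
    registry_b = [
        {"surname": "JONES", "dob": "1975-05-30"},
        {"surname": "Smith", "dob": "1981-01-02"},
    ]
    rules = [("surname", "case_insensitive"), ("dob", "exact")]
    print(deduplicate(registry_a, rules))       # [(0, 1)]
    print(link(registry_a, registry_b, rules))  # [(2, 0)]
```

Adding a new comparison method in this sketch only requires decorating another function with `register`, which is one simple way a plugin-style design could work. The all-pairs comparison is quadratic and is used here only for clarity.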

Published in International Journal of Population Data Science

ISSN: 2399-4908 (Online)
Publisher: Swansea University
Country of publisher: United Kingdom
LCC subjects: Social Sciences: Economic theory. Demography: Demography. Population. Vital events
Website: https://ijpds.org
