OmniMatch: A Large Language Model-Based Data Linkage Tool

Xiaowei Xu; Xingqiao Wang; Vivek Gunasekaran; Jonathan White; Anup Mathur

doi:10.23889/ijpds.v9i5.2588

International Journal of Population Data Science (Sep 2024)

Affiliations

Xiaowei Xu: University of Arkansas at Little Rock
Xingqiao Wang: University of Arkansas at Little Rock
Vivek Gunasekaran: University of Arkansas at Little Rock
Jonathan White: US Census Bureau
Anup Mathur: US Census Bureau

DOI: https://doi.org/10.23889/ijpds.v9i5.2588
Journal volume & issue: Vol. 9, no. 5

Abstract

Large Language Models, such as OpenAI’s GPT, have demonstrated remarkable success in various applications by generating human-like text. However, a critical question remains: how can we effectively leverage these large language models for data linkage? In response to this challenge, we introduce OmniMatch, a novel data linkage tool designed to address several key issues in data linkage:

• Cross-Domain Data Linkage: OmniMatch tackles the complexities of linking data across different domains without retraining or updates.
• Cross-Lingual Data Linkage: It extends its capabilities to handle data in multiple languages and hybrid languages.
• Data Quality Challenges: OmniMatch addresses inconsistent data formats and typical data quality issues, including noise, missing values, typos, and errors.

Our approach customizes an open-source large language model called Llama 2 from Meta. By doing so, we achieve outstanding performance in handling the aforementioned challenges. Notably, OmniMatch offers a specific advantage: it can be installed on-premises, ensuring a safe and trustworthy application without the hallucinations and other vulnerabilities associated with foundational large language models. We systematically evaluate OmniMatch using diverse datasets from various domains, including products, scientific publications, music, census data, and cross-lingual data. The experimental results demonstrate that OmniMatch is a universally applicable and trustworthy tool for data linkage.

Published in International Journal of Population Data Science

ISSN: 2399-4908 (Online)
Publisher: Swansea University
Country of publisher: United Kingdom
LCC subjects: Social Sciences: Economic theory. Demography: Demography. Population. Vital events
Website: https://ijpds.org
