Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption

Esha Sarkar; Eduardo Chielle; Gamze Gursoy; Oleg Mazonka; Mark Gerstein; Michail Maniatakos

doi:10.1109/ACCESS.2021.3093005

IEEE Access (Jan 2021)

Affiliations

Esha Sarkar: Tandon School of Engineering, New York University, New York, NY, USA
Eduardo Chielle: New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
Gamze Gursoy: Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
Oleg Mazonka: New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
Mark Gerstein: Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
Michail Maniatakos: Tandon School of Engineering, New York University, New York, NY, USA

Journal volume & issue: Vol. 9
pp. 93097 – 93110

Abstract

Recent advances in genome sequencing technologies provide unprecedented opportunities to understand the relationship between human genetic variation and disease. However, genotyping whole genomes from a large cohort of individuals is still cost-prohibitive. Imputation methods, which predict the genotypes of missing genetic variants, are widely used, especially for genome-wide association studies. Accurate genotype imputation requires complex statistical methods. Because the problem is both data- and compute-intensive, imputation is increasingly outsourced, raising serious privacy concerns. In this work, we investigate solutions for fast, scalable, and accurate privacy-preserving genotype imputation using Machine Learning (ML) and a standardized homomorphic encryption scheme, the Paillier cryptosystem. ML-based privacy-preserving inference has been largely optimized for computation-heavy non-linear functions in a single-output multi-class classification setting. However, genotype imputation produces a large number of multi-class outputs per genome per individual, which calls for further optimizations and/or approximations specific to this application. Here we explore the effectiveness of linear models for genotype imputation, with the aim of converting them to privacy-preserving equivalents using standardized homomorphic encryption schemes. Our results show that the performance of our privacy-preserving genotype imputation method is equivalent to that of state-of-the-art plaintext solutions, achieving a micro-averaged area under the curve (AUC) score of up to 99%, even on real-world large-scale datasets with up to 80,000 targets.
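
To illustrate why linear models pair naturally with an additively homomorphic scheme such as Paillier, the following is a minimal sketch and not the authors' implementation. It assumes a setting in which a client encrypts its known genotypes (coded 0, 1, or 2) and a server holding plaintext, integer-quantized model weights computes an encrypted linear score. The function names, the toy key size, the example weights, and the client/server split are all assumptions made for illustration; a real deployment would use a vetted library and primes of at least 1024 bits.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key generation with generator g = n + 1.

    The default primes are small Mersenne primes chosen only so the
    example runs instantly; they offer no real security.
    """
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m as (1 + m*n) * r^n mod n^2 with fresh random r."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Decrypt c and map the result back to a signed integer."""
    lam, mu = priv
    n2 = n * n
    m = (pow(c, lam, n2) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def encrypted_linear_score(n, enc_x, weights, bias):
    """Compute Enc(bias + sum(w_i * x_i)) from ciphertexts Enc(x_i).

    Multiplying ciphertexts adds the underlying plaintexts, and raising
    a ciphertext to a plaintext power w multiplies its plaintext by w.
    Negative weights are represented modulo n.
    """
    n2 = n * n
    acc = (1 + (bias % n) * n) % n2  # unrandomized encoding of the bias
    for c, w in zip(enc_x, weights):
        acc = acc * pow(c, w % n, n2) % n2
    return acc


if __name__ == "__main__":
    n, priv = keygen()
    genotypes = [0, 1, 2, 1]   # hypothetical known variants for one individual
    weights = [3, -5, 2, 7]    # hypothetical integer-quantized model weights
    bias = 4
    enc_x = [encrypt(n, g) for g in genotypes]               # client side
    enc_score = encrypted_linear_score(n, enc_x, weights, bias)  # server side
    print(decrypt(n, priv, enc_score))                       # client side: prints 10
```

Only ciphertext multiplication and exponentiation by plaintext constants are needed, so the server never sees the genotypes or the score. Any non-linear step (for example, choosing the most likely class among several such scores) would, under the assumptions of this sketch, be carried out by the client after decryption.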

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
