Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project

Dmitry Kolobkov; Satyarth Mishra Sharma; Aleksandr Medvedev; Mikhail Lebedev; Egor Kosaretskiy; Ruslan Vakhitov

doi:10.3389/fdata.2024.1266031

Frontiers in Big Data (Feb 2024)

Affiliations

Dmitry Kolobkov: GENXT, Hinxton, United Kingdom
Dmitry Kolobkov: Laboratory of Ecological Genetics, Vavilov Institute of General Genetics, Moscow, Russia
Satyarth Mishra Sharma: GENXT, Hinxton, United Kingdom
Satyarth Mishra Sharma: Center for Artificial Intelligence Technology, Skolkovo Institute of Science and Technology, Moscow, Russia
Aleksandr Medvedev: GENXT, Hinxton, United Kingdom
Aleksandr Medvedev: Center for Artificial Intelligence Technology, Skolkovo Institute of Science and Technology, Moscow, Russia
Mikhail Lebedev: GENXT, Hinxton, United Kingdom
Egor Kosaretskiy: GENXT, Hinxton, United Kingdom
Ruslan Vakhitov: GENXT, Hinxton, United Kingdom

Journal volume & issue: Vol. 7

Abstract

Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians, who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner, thus reducing the risk of data leakage. Although federated learning is increasingly used on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
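
As a rough illustration of the decentralized training idea summarized above, the following is a minimal sketch of federated averaging on a toy one-dimensional regression problem. It is not the study's method: the abstract does not name the aggregation algorithm, so federated averaging is an assumption here, and the synthetic data, `make_node`, `local_sgd`, `federated_average`, the learning rate, and the round counts are all hypothetical. The `local_steps` parameter stands in for communication frequency: more local steps per round means fewer synchronizations for the same amount of computation.

```python
import random


def local_sgd(w, b, data, lr, steps):
    """Run `steps` full-batch gradient steps of 1-D least squares on one node's data."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(nodes, rounds, local_steps, lr=0.05):
    """Federated averaging: nodes train locally, only parameters are shared and averaged."""
    w, b = 0.0, 0.0
    total = sum(len(d) for d in nodes)
    for _ in range(rounds):
        # Each node starts from the current global parameters; raw data never leaves the node.
        updates = [local_sgd(w, b, d, lr, local_steps) for d in nodes]
        # Aggregate with weights proportional to node sample size.
        w = sum(len(d) * u[0] for d, u in zip(nodes, updates)) / total
        b = sum(len(d) * u[1] for d, u in zip(nodes, updates)) / total
    return w, b


def make_node(rng, n, x_lo, x_hi):
    """Hypothetical synthetic node: y = 3x + 1 plus small noise; the x range differs per node."""
    xs = [rng.uniform(x_lo, x_hi) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.01)) for x in xs]


rng = random.Random(0)
# Three nodes with different sizes and input ranges (a crude stand-in for heterogeneity).
nodes = [
    make_node(rng, 40, -1.0, 0.0),
    make_node(rng, 60, 0.0, 1.0),
    make_node(rng, 20, -0.5, 0.5),
]

w_fed, b_fed = federated_average(nodes, rounds=300, local_steps=5)

# Centralized baseline: pool all data and run the same total number of gradient steps.
pooled = [pair for d in nodes for pair in d]
w_cen, b_cen = local_sgd(0.0, 0.0, pooled, lr=0.05, steps=300 * 5)

print(f"federated:   w={w_fed:.3f}, b={b_fed:.3f}")
print(f"centralized: w={w_cen:.3f}, b={b_cen:.3f}")
```

In this toy setting all nodes share the same underlying relationship, so the federated and centralized fits end up close to each other; real genomic data with label or population differences between nodes would be a harder case than this sketch covers.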

Published in Frontiers in Big Data

ISSN: 2624-909X (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://www.frontiersin.org/journals/big-data
