Department of Infectious Diseases, Imperial College London, London, United Kingdom; Department of Infection, Immunity and Inflammation, University College London, London, United Kingdom
Department of Statistics, University of Warwick, Coventry, United Kingdom
Kathryn Harris
Department of Microbiology, Great Ormond Street Hospital, London, United Kingdom; Department of Virology, East & South East London Pathology Partnership, Royal London Hospital, Barts Health NHS Trust, London, United Kingdom
Accurate inference of who infected whom in an infectious disease outbreak is critical for the delivery of effective infection prevention and control. The increased resolution of pathogen whole-genome sequencing has significantly improved our ability to infer transmission events. Despite this, transmission inference often remains limited by the lack of genomic variation between the source case and infected contacts. Although within-host genetic diversity is common among a wide variety of pathogens, conventional whole-genome sequencing phylogenetic approaches exclusively use consensus sequences, which consider only the most prevalent nucleotide at each position and therefore fail to capture low-frequency variation within samples. We hypothesized that including within-sample variation in a phylogenetic model would help to identify who infected whom in instances in which this was previously impossible. Using whole-genome sequences from SARS-CoV-2 multi-institutional outbreaks as an example, we show that within-sample diversity is partially maintained among repeated serial samples from the same host, that it can be transmitted between cases with known epidemiological links, and that including it improves phylogenetic inference and our understanding of who infected whom. Our technique is applicable to other infectious diseases and has immediate clinical utility in infection prevention and control.
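The difference between a consensus call and within-sample variation can be illustrated with a minimal sketch. This is not the phylogenetic model used in this work; the function name `consensus_and_minor`, the 5% frequency threshold, and the read counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def consensus_and_minor(pileup, min_freq=0.05):
    """Return the consensus base and minor variants at one genome position.

    pileup: string of bases observed in the reads covering the position.
    min_freq: hypothetical cutoff below which a variant is treated as noise.
    """
    counts = Counter(pileup)
    total = sum(counts.values())
    # A consensus sequence keeps only the most prevalent nucleotide.
    consensus, _ = counts.most_common(1)[0]
    # Within-sample variation: every other base at or above the cutoff.
    minor = {
        base: n / total
        for base, n in counts.items()
        if base != consensus and n / total >= min_freq
    }
    return consensus, minor


# Hypothetical read counts at one position in three samples.
source = consensus_and_minor("A" * 80 + "G" * 20)
contact = consensus_and_minor("A" * 90 + "G" * 10)
unrelated = consensus_and_minor("A" * 100)
```

In this toy case all three samples have the same consensus base, so a consensus-only comparison cannot separate them, whereas the shared minor `G` variant distinguishes `source` and `contact` from `unrelated`.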