Informatics in Medicine Unlocked (Jan 2021)
Novel directions in data pre-processing and genome-wide association study (GWAS) methodologies to overcome ongoing challenges
Abstract
A genome-wide association study (GWAS) is a standard population-based technique for identifying the heritable genetic basis of complex diseases by detecting correlations between trait variation and the allele frequencies of genetic markers. This article aims to help fill gaps in data pre-processing and GWAS methodology by reviewing recent techniques in both areas. Data pre-processing performed prior to a GWAS presents challenges in Hardy-Weinberg (H–W) estimation, in genotyping, and in accounting for factors such as sample structure. Recent developments towards overcoming these challenges are presented: the likelihood ratio test for H–W estimation, sequencing for genotyping, and techniques for dealing with sample structure. Because traditional statistical methods offer little insight into the data generated by high-throughput techniques, novel directions in GWAS methodology are reviewed, focusing on efficient statistical methods that remain flexible for genetic association analysis when factors such as non-random sampling or population structure are present. Despite the development of these methods, genotyping costs and an increased capacity for large dataset analysis have motivated researchers to examine tissue-specific signals. This review discusses how prospective and retrospective association analyses can be used to consider binary traits, address non-random ascertainment, and increase the capacity for large dataset analysis. Importantly, for disease susceptibility, rare variants can represent a large portion of genetic markers, and this article reviews some association methods for rare variant detection. In conclusion, the recent developments in GWAS data preparation and methodology reviewed in this article can overcome most current challenges in the field and will also address future challenges.
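To make the likelihood ratio test for H–W estimation mentioned above more concrete, the following is a minimal sketch for a single biallelic marker. It is not taken from the article: the function name `hwe_lrt` and its arguments are hypothetical, and it assumes the standard textbook form of the test, in which the statistic G = 2 Σ O ln(O/E) over the three genotype classes is compared with a chi-square distribution on one degree of freedom.

```python
import math


def hwe_lrt(n_aa, n_ab, n_bb):
    """Likelihood ratio test of Hardy-Weinberg proportions (illustrative sketch).

    n_aa, n_ab, n_bb are observed genotype counts for one biallelic marker.
    Returns (G statistic, asymptotic p-value on 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")

    # Maximum-likelihood allele frequency under the H-W null hypothesis.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # G = 2 * sum(O * ln(O / E)); classes with O == 0 contribute nothing.
    g = 0.0
    for o, e in zip(observed, expected):
        if o > 0:
            g += 2.0 * o * math.log(o / e)
    g = max(g, 0.0)  # guard against tiny negative rounding error

    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > g) = erfc(sqrt(g / 2)).
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value
```

For example, `hwe_lrt(25, 50, 25)` describes a sample in exact H–W proportions and gives a statistic of zero, whereas `hwe_lrt(50, 0, 50)` (no heterozygotes at all) gives a very large statistic and a p-value close to zero. The chi-square approximation is asymptotic, so for small samples or rare alleles an exact test is usually preferred.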