Scientific Reports (Oct 2017)
Fine population structure analysis method for genomes of many
Abstract
Abstract Fine population structure can be examined through the clustering of individuals into subpopulations. The clustering of individuals in large sequence datasets into subpopulations makes the calculation of subpopulation specific allele frequency possible, which may shed light on selection of candidate variants for rare diseases. However, as the magnitude of the data increases, computational burden becomes a challenge in fine population structure analysis. To address this issue, we propose fine population structure analysis (FIPSA), which is an individual-based non-parametric method for dissecting fine population structure. FIPSA maximizes the likelihood ratio of the contingency table of the allele counts multiplied by the group. We demonstrated that its speed and accuracy were superior to existing non-parametric methods when the simulated sample size was up to 5,000 individuals. When applied to real data, the method showed high resolution on the Human Genome Diversity Project (HGDP) East Asian dataset. FIPSA was independently validated on 11,257 human genomes. The group assignment given by FIPSA was 99.1% similar to those assigned based on supervised learning. Thus, FIPSA provides high resolution and is compatible with a real dataset of more than ten thousand individuals.