Frontiers in Microbiology (Jul 2023)
Feature selection with a genetic algorithm can help improve the distinguishing power of microbiota information in monozygotic twins' identification
Abstract
Introduction: Personal identification of monozygotic twins (MZT) has been challenging in forensic genetics. Previous research has demonstrated that microbial markers have potential value due to their specificity and long-term stability. However, those studies used the complete information of the detected microbial communities, and low-value species may have limited the performance of the resulting models.
Methods: To address this issue, we collected 80 saliva samples from 10 pairs of MZTs at four different time points and used 16S rRNA V3–V4 region sequencing to obtain microbiota information. The data formed 280 intra-individual (Self) or MZT sample pairs, which were divided into four groups based on the individual relationship and time interval, and then randomly split into training and testing sets at an 8:2 ratio. We built 12 identification models based on the time interval (≤ 1 year or ≥ 2 months), data basis (amplicon sequence variants, ASVs, or operational taxonomic units, OTUs), and distance parameter (Jaccard distance, Bray-Curtis distance, or Hellinger distance), and then improved their identification power through genetic algorithm processes. The best combination of data basis and distance parameter was selected as the final model for each of the two types of time intervals. Bayesian theory was introduced to provide a numerical indicator of the effectiveness of the evidence in practical cases.
Results: From the 80 saliva samples, 369 OTUs and 1130 ASVs were detected. After the feature selection process, ASV-Jaccard distance models were selected as the final models for the two types of time intervals. For short-interval samples, the final model completely distinguished MZT pairs from Self pairs in both the training and test sets.
Discussion: Our findings support the microbiota solution to the challenging MZT identification problem and highlight the importance of feature selection in improving model performance.
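As a rough illustration of the quantities named in the Methods, the sketch below counts sample pairs and computes the three distances on toy count vectors. It is not the authors' pipeline, and everything in it beyond the abstract is an assumption: the function names are hypothetical, the 120/160 split is only one reading that is consistent with the stated total of 280 pairs, Jaccard is taken on presence/absence, and Hellinger is computed on relative abundances with a 1/sqrt(2) scaling so that it lies in [0, 1]. The `apply_mask` helper shows how a binary feature mask, of the kind a genetic algorithm might search over, restricts the features before a distance is computed.

```python
from math import comb, sqrt


def count_pairs(n_twin_pairs=10, n_time_points=4):
    """Count Self and MZT sample pairs (assumed pairing scheme)."""
    individuals = 2 * n_twin_pairs
    # Self: any two time points of the same individual.
    self_pairs = individuals * comb(n_time_points, 2)
    # MZT: any sample of one twin against any sample of the co-twin.
    mzt_pairs = n_twin_pairs * n_time_points ** 2
    return self_pairs, mzt_pairs


def apply_mask(counts, mask):
    """Keep only the features whose mask entry is truthy."""
    return [c for c, keep in zip(counts, mask) if keep]


def jaccard(a, b):
    """Jaccard distance on presence/absence (assumed variant)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity on raw counts."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def _relative(v):
    """Convert counts to relative abundances (all zeros stay zeros)."""
    s = sum(v)
    return [x / s for x in v] if s > 0 else [0.0] * len(v)


def hellinger(a, b):
    """Hellinger distance on relative abundances, scaled to [0, 1]."""
    pa, pb = _relative(a), _relative(b)
    squared = sum((sqrt(x) - sqrt(y)) ** 2 for x, y in zip(pa, pb))
    return sqrt(squared) / sqrt(2)


if __name__ == "__main__":
    self_pairs, mzt_pairs = count_pairs()
    print(self_pairs, mzt_pairs, self_pairs + mzt_pairs)

    # Toy count vectors over three features (made-up values).
    sample_1 = [1, 0, 2]
    sample_2 = [0, 0, 3]
    mask = [1, 0, 1]  # hypothetical feature mask
    print(jaccard(apply_mask(sample_1, mask), apply_mask(sample_2, mask)))
    print(bray_curtis(sample_1, sample_2))
    print(hellinger(sample_1, sample_2))
```

Under these assumptions, a feature-selection step would score each candidate mask by how well the masked distances separate Self pairs from MZT pairs; the abstract does not specify the fitness function, so none is sketched here.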
Keywords