An approach for estimating haplotype diversity from sequences with unequal lengths

1. Genetic diversity is an essential component of biodiversity. Robust quantification methods are critical for describing the genetic diversity underlying the geographical distributions of species, especially when the sequence data are of unequal length.
2. Traditional calculations of genetic diversity require sequences of equal length. However, many homologous sequences downloaded from online repositories vary in length, which makes it difficult to quantify genetic diversity, especially haplotype diversity. We developed a new approach that is independent of sequence length by applying the same parameters used in calculating nucleotide diversity to estimate haplotype diversity. We compared this approach with calculations from the program DnaSP, and we used simulation data from terrestrial vertebrates (birds, mammals and amphibians) and Homo sapiens to validate its performance. We further applied the approach to explore global latitudinal gradients of haplotype diversity in amphibians, mammals and birds, and compared the results with those obtained by traditional methods.
3. The haplotype diversity calculated by our approach is consistent with the results from DnaSP. The simulations showed that the approach is robust and performs well on sequence data of unequal length.
4. For the datasets of terrestrial vertebrates and H. sapiens, our approach can estimate haplotype diversity from intraspecific sequences of unequal length. In contrast to the patterns obtained with traditional methods, we observed different latitudinal patterns of haplotype diversity in the northern and southern hemispheres for terrestrial vertebrates, which is consistent with the updated pattern of nucleotide diversity for mammals. The present work contributes to the development of more precise quantification methods, which may be broadly applied to assessing biogeographical patterns of genetic diversity.
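To make the contrast with equal-length calculations concrete, the sketch below puts the standard frequency-based haplotype diversity estimator, h = n/(n-1) * (1 - sum of squared haplotype frequencies), next to a pairwise formulation that compares each pair of sequences only over the sites both of them cover. The pairwise version is an illustrative assumption about how a length-independent estimate could be built from pairwise comparisons, in the way nucleotide diversity is. It is not taken from the abstract and should not be read as the authors' exact procedure. The function names, the set of missing-data symbols, and the convention that shorter sequences are padded with "-" within an alignment are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Assumed symbols for positions a sequence does not cover (hypothetical convention).
MISSING = {"-", "N", "?"}


def nei_haplotype_diversity(seqs):
    """Frequency-based estimator; assumes complete sequences of equal length."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(seqs)  # identical strings are treated as one haplotype
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def pairwise_haplotype_diversity(aligned):
    """Proportion of sequence pairs that differ over their shared sites.

    Illustrative assumption: a pair counts as 'different' if any site
    covered by both sequences carries a different base. Pairs with no
    shared site are skipped.
    """
    differing = 0
    comparable = 0
    for a, b in combinations(aligned, 2):
        shared = [(x, y) for x, y in zip(a, b)
                  if x not in MISSING and y not in MISSING]
        if not shared:
            continue
        comparable += 1
        if any(x != y for x, y in shared):
            differing += 1
    if comparable == 0:
        raise ValueError("no pair of sequences shares a covered site")
    return differing / comparable


# Equal-length input: both estimators can be applied.
equal_length = ["ACGT", "ACGT", "ACGA", "TCGA"]
h_nei = nei_haplotype_diversity(equal_length)
h_pair = pairwise_haplotype_diversity(equal_length)

# Unequal-length input, padded with '-' where a sequence is shorter.
unequal_length = ["ACGT", "ACG-", "AC--", "TCGT"]
h_unequal = pairwise_haplotype_diversity(unequal_length)
```

For complete, equal-length sequences the two functions return the same value, because the proportion of differing pairs is algebraically equal to the frequency-based formula. They diverge only when some sequences are shorter: the frequency-based version would then count a truncated sequence as a distinct haplotype, whereas the pairwise version judges it only on the sites it actually covers.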