|
Comparing inferences among datasets generated using short read sequencing may
provide insight into the concerted impacts of divergence, gene flow and selection
across organisms, but comparisons are complicated by biases introduced during
dataset assembly. Sequence similarity thresholds allow the de novo assembly of short
reads into clusters of alleles representing different loci, but the resulting datasets are
sensitive to both the similarity threshold used and to the variation naturally present
in the organism under study. Thresholds that require high sequence similarity
among reads for assembly (stringent thresholds) as well as highly variable species
may result in datasets in which divergent alleles are lost or divided into separate
loci (‘over-splitting’), whereas liberal thresholds increase the risk of paralogous
loci being combined into a single locus (‘under-splitting’). Comparisons among
datasets or species are therefore potentially biased if different similarity thresholds
are applied or if the species differ in levels of within-lineage genetic variation. We
examine the impact of a range of similarity thresholds on assembly of empirical
short read datasets from populations of four different non-model bird lineages
(species or species pairs) with different levels of genetic divergence. We find that, in
all species, stringent similarity thresholds result in fewer alleles per locus than more
liberal thresholds, which appears to be the result of high levels of over-splitting.
The frequency of putative under-splitting, conversely, is low at all thresholds.
Inferred genetic distances between individuals, gene tree depths, and estimates of
the ancestral mutation-scaled effective population size (θ) differ depending upon the
similarity threshold applied. Relative differences in inferences across species differ
even when the same threshold is applied, but may be dramatically different when
datasets assembled under different thresholds are compared. These differences not
only complicate comparisons across species, but also preclude the application of
standard mutation rates for parameter calibration. We suggest some best practices
for assembling short read data to maximize comparability, such as using more liberal
thresholds and examining the impact of different thresholds on each dataset. | |
|