Both data suggest directional mutability of VNTRs toward mid-range repeats. Sequencing of 28 isolates from a newly emerged MLVA type and its five single locus variants revealed that single nucleotide variation between isolates with up to two MLVA differences ranged from 0 to 12 single nucleotide polymorphisms SNPs.
Based on this observation and our previous findings of SNP differences of outbreak isolates, we suggest that investigation of S. Typhimurium community outbreaks should include cases of 1 repeat difference to increase sensitivity. Typhimurium and its application for epidemiological typing.
Due to their high variability and ease of detection, VNTRs are useful markers for epidemiological studies. In Australia, S. Typhimurium accounts for the majority of human Salmonella infections. Spatial and temporal clustering can be used to exclude epidemiologically unlinked cases of the same MLVA type. In our current practice, recovery of five or more geographically clustered isolates of the same MLVA type within a 4-week window period has been used as a trigger for public health investigation Sintchenko et al.
VNTRs can mutate rapidly; parallel or reverse changes can occur at the same locus leading to the same MLVA types with no common recent ancestry Wuyts et al.
MLVA types with one repeat difference are considered as the most closely related and relationships of MLVA types with multiple allele differences often do not reflect true phylogenetic relationships of the isolates Octavia and Lan, ; Lam et al. MLVA is not a suitable typing tool for long-term epidemiology.
The stepwise mutation model SMM Ohta and Kimura, was the first and simplest model of stepwise change with mutations adding or deleting a single unit, and this model can be used to describe VNTR evolution.
The model assumes a fixed rate of random mutations and insertion and deletion have equal probability of occurrence. Vogler et al. A geometric model was developed to account for the differences in the relative frequency of multiple-repeat mutations.
This model was generalized to other bacterial species Vogler et al. Similarly, Aandahl et al. However, these models did not consider the direction of change as a function of repeat number.
There is a need for better understanding of the evolutionary dynamics of repeat changes in VNTRs and the relationship between VNTR differences and genomic differences in order to infer true genetic relationships between similar MLVA types for epidemiological typing. In this study we examined the patterns of mutations of VNTR repeats from isolates recovered from human and environmental sources and also from in vitro laboratory passaging experiments and demonstrated that VNTRs mutated directionally toward repeat units of median size.
To reduce the impact of geographical diversity, isolates from patients residing only in the eastern suburbs of Sydney were sampled. Table 1. General features of S. Typhimurium sequenced in this study. Contigs were de novo assembled using the Velvet version 1. The coverage of genomes was estimated by Qualimap v2. SNP calling was performed as described previously Pang et al. A custom script was employed to extract SNPs according to the position on the reference genome.
Maximum likelihood phylogenetic trees were inferred using RAxML 7. Path-O-Gen v1. Homoplasy of phylogenetic trees was measured by maximum parsimony analysis in PAUP4. Three independent replicates of strains L and L representing the two MLVA types, and respectively, were sub-cultured in Luria-Bertani LB liquid culture with two passages per day for a total of 50 passages as previously described Wuyts et al.
Each replicate was spread onto three LB agar plates and 33 colonies from each replicate were randomly selected. MLVA was performed on these colonies as described previously Sintchenko et al.
Four pairwise comparisons were performed, including random pairwise comparisons between SNPs and different VNTR locus or repeat difference, pairwise comparisons based on the same time frame within a 1-month window and pairwise comparisons based on the lineages in the SNP tree. To characterize the relationship between multilocus VNTR and SNP evolution, we also modeled the time separating two genomes via their most recent common ancestor using the standard coalescent model with an effective population size of N for an introduction to coalescent theory, see Wakeley, The probability density of the total branch length T in a coalescent for n lineages is.
The integral in the above expression can then be rewritten as. Any confidence limit below 0 was set to 0, and each CI includes a Dunn-Sidak correction for testing two hypotheses in a single dataset. Typhimurium isolates from to For STTR9, the next most frequent alleles are 3 and 4 with 9.
For STTR3, the next most frequent allele is with There were only a few other alleles at these two loci with low frequencies Figures 1A,B. STTR3 has a composite of two types of repeats with the unit length of 27 bp and 33 bp, making it more difficult to determine the repeat's composition. Additionally these two loci have been shown to be relatively stable from both in vivo and in vitro passaging experiments Dimovski et al.
Therefore, they were excluded from further analysis. For the remaining 3 VNTR loci, the distribution of the size of repeat units largely followed a normal distribution, with the median size having the highest frequency by both number of MLVA types and the total number of isolates Figures 1C—E. Any repeats outside the mid-ranges are considered to be extreme length repeats.
The median number of repeats was 10 and the mid-range was 8—14 repeats for STTR5. We also analyzed the frequency and allele size of the top 10 MLVA types that existed over the 7 years. The predominance of mid-range repeats in the population suggests a direction evolution of the VNTRs toward mid-range repeats. Figure 1. Tendency of MLVA changes from to To confirm the directional mutability of the VNTR loci, an in vitro evolution experiment was carried out using two MLVA types and represented by L and L, respectively.
Four and three mutation events with one repeat unit deletion were observed in allele 26 and allele 15 both STTR5 , respectively Figure 2. Allele 19 in STTR10 did not exhibit the expected deletion pattern with two insertion events and only one deletion event. Table 3. Figure 2. In each repeat size, top bars depict upward mutations insertion while bottom bars correspond to downward mutations deletion.
Where observed frequencies are larger than zero, standard errors are plotted as error bars. We only examined 1 VNTR difference as one allelic change is the most frequent event and more relevant to public health investigations of point source outbreaks Dimovski et al. All isolates were from eastern Sydney so that the findings are more relevant to local epidemiology as outbreak detection by MLVA is based on spatial and temporal clustering. These isolates were recovered between 20 and days after the initial identification of the first isolate L Typhimurium isolates are shown in Figure 3A.
Figure 3. The phylogenomic relationship and evolutionary time of 28 S. Typhimurium evolved from most recent common ancestor MRCA. The isolates were connected based on the phylogeny and isolation date.
The VNTR changes of each locus for some isolates were labeled upon the lines. The phylogenetic tree was visualized and edited by FigTree v1. Typhimurium LT2. The timeline was indicated below the tree.
The number on the branches corresponds to the number of SNP difference. Isolates are labeled by their isolation date with strain number. The MLVA type was indicated with the differing locus highlighted in red. Single nucleotide differences were observed as short as 8 days apart between isolates L and L Identical isolates were found as far as days apart.
Compared with the five strains located near the root of the tree L, L, L, L, and L , the isolates showed 1—7 single nucleotide differences. All coverage values are approximate. The Genomes Phase 3 samples were released in SRA links to the data are given in Table 4. The majority of the analyses in this paper were performed on the genomes from NYGC.
The genomes from SGDP provided insight into under-represented populations. The 27 genomes duplicated in the Genomes and NYGC datasets were used to measure consistency across sequencing platforms. The cancer datasets were used to find possible changes in VNTRs in tumor tissues. The PacBio data were used for validation purposes only. In addition, centromere regions were excluded from the reference set. These filtering tools are available online in TRDB.
A singleton TR appears to be unique in the genome based on a combination of its repeat pattern and flanking sequence.
An indistinguishable TR belongs to a family of genomically dispersed TRs which share highly similar patterns and flanking sequence and may therefore produce misleading genotype calls. Simulation testing revealed that some singletons produced false positive VNTRs. To minimize this issue, an additional filtering step was added to eliminate problematic singleton loci from the reference set see Supplementary Section S2.
We assumed that genotyping was possible for a reference TR locus, given a particular read length, if the TR array length plus a minimum 10 bp flank on each side, would fit within the read.
The number of reference TR alleles that could be genotyped using each of the read lengths in our data is summarized in Supplementary Table S1. Each dataset was processed separately with VNTRseek using default parameters: minimum and maximum flanking sequence lengths of 10 and 50 bp, respectively, on each side of the array, and requiring at least two reads mapped with the same array copy number to make an allele call.
A locus was considered a VNTR if a non-reference allele was detected. To remove clear inconsistencies, for this study we filtered the VCF files to remove per sample VNTR loci with more alleles than the expected number of chromosomes.
The filtering criteria for these loci, termed multis is detailed in Supplementary Section S2. After multi filtering, a TR locus was labeled as a VNTR if any remaining allele, different from the reference, was observed in any sample.
Given these criteria, we prioritized VNTRs in genes and regulatory regions which might be of interest to researchers. Using bedtools , the reads aligning to each TR reference locus were extracted. The number of copies of the pattern in the resulting alignment was then determined and compared with the VNTRseek predictions.
A locus on an autosomal chromosome is consistent with Mendelian inheritance if the genotype of a child can be explained as one allele from the mother and one from the father. Genotype consistency was evaluated for all mother-father-child trios, i. We evaluated loci defined by several increasingly stringent criteria: both parents heterozygous, all members of the trio heterozygous, all members of the trio heterozygous and with different genotypes.
These criteria were selected to avoid false interpretations of consistency. TR loci on the X and Y chromosome of male children were also selected for evaluation when both the son and the appropriate parent had a predicted genotype.
VNTR calls were compared for each of 27 genomes that were represented twice, once in the Genomes dataset, sequenced in on an Illumina HiSeq with bp read length and once in the NYGC dataset, sequenced in on an Illumina NovaSeq with bp read length.
The two platforms have different error profiles. Because read length and coverage differed among datasets, for each pairwise comparison, we only considered loci that were genotyped in both samples and classified as VNTR in at least one. We extracted the non-reference VNTR alleles detected in at least one sample and computed consistency as the ratio of the size of the intersection set of those alleles found in both platforms over the size of their union found in either.
For alleles detected in the bp reads, we only counted those that could have been detected in the shorter bp reads. Reference alleles were excluded to avoid inflating the ratio.
Additionally, these genomes contain no related individuals and represent a wide set of populations 26 populations from five continents. Annotation based on overlap with functional genomic regions was performed for the reference TR loci. Bedtools was used to find overlaps between TR loci and the annotation features.
Any size overlap was allowed. TRs on the sex chromosomes were excluded in the background set. A total of individuals overlapped with the NYGC genomes set. When no genotype was observed for an individual, we classified the genotype as other. We did this because we assumed that the alleles were outside the detection range, given that genotypes were observed in other individuals with similar coverage. VNTR loci were retained for analysis if at least two genotypes were detected for that VNTR across all individuals at least three if other was one of the genotypes and if each genotype was observed in at least 20 individuals.
Genes were excluded from analysis if the median TPM transcripts per kilobase million expression value equaled zero. For significant eQTLs, we calculated, for reporting purposes, the maximum difference of the residual means over all pairs of genotype classes. We only considered alleles over-represented rather than both over- and under-represented because of an interest in identifying alleles that have a phenotypic effect.
Odds ratio values were log 2 transformed and P -values were adjusted for false discovery rate FDR In this section, we start with a summary and characterization of our VNTR predictions, followed by identification of commonly polymorphic VNTRs and an enrichment analysis of their association with genomic functional regions and genes sets.
We next report on the effect of VNTRs on expression of nearby genes, and then identify population-specific VNTR alleles and show that they are predictive of ancestry. We conclude this section with evidence confirming the accuracy of our predictions using several validation methods. TRs genotyped is the number of distinct TR loci genotyped across all individuals within a dataset.
All other numbers are also per dataset. Multis are TR loci genotyped in a single individual with more than the expected number of alleles. They could be artifacts or indicate copy number variation in a genomic segment.
Multis were excluded from further analysis on a per sample basis. VNTRs detected is the number of TR loci, excluding multis, with a detected allele different from the reference. Their occurrence within genes was common, totaling protein coding genes, and exons. To determine the effect of coverage and read length on genotyping, we measured two quantities: the percentage of reference singleton TRs that were genotyped, and the total number of singleton VNTRs that were detected in each genome.
Only singleton loci were considered in all further analyses. Figure 1A shows that there was a strong positive correlation between coverage and the ability to genotype TRs. A strong correlation with read length was also apparent, however, the effect was larger, primarily due to the ability of longer reads to span, and thus allow VNTRseek to detect, longer TR arrays. These results suggest that our VNTR numbers are undercounts. A TR Genotyping sensitivity. Graph shows the relationship between coverage, read length, and the percentage of TRs in the reference set that were genotyped.
Each symbol represents a single sample and specific samples are labeled. Increasing read length had the largest effect on sensitivity because many reference TR alleles could not be detected at the shorter read lengths see Supplementary Table S1. B VNTRs detected. Graph shows the relationship between coverage, read length, and the number of VNTR loci detected. Read length and coverage both had large effects. Coloring of symbols shows that population also had a strong effect, reflecting distance from the reference, which is primarily European.
C Alleles detected per locus. Each bar represents a specific number of alleles detected across all datasets. Coloring shows that proportion of loci where the reference allele was or was not observed. D Copies gained or lost. Each bar represents a specific number of copies gained or lost in non-reference VNTR alleles relative to the reference allele.
Loss was always more frequently encountered. E VNTR locus sample support. Data shown are the common loci from the sample NYGC dataset. Each bar represents the number of samples calling a locus as a VNTR. Bin size is Bar height is number of loci with that sample support.
However, detection was also positively correlated with population, which seemed likely due to the evolutionary distance of populations from the reference genome, which is primarily European Notably, within each trio, the VNTR counts were similar. This was because in these genomes, which consist of two copies of an underlying haploid genome, the single allele represented at any VNTR locus would frequently be a reference allele and so the locus would not be called as a VNTR.
Overall, VNTRseek found approximately 1. Although there were allele lengths that VNTRseek could not detect, this bias persisted even when restricting the loci to only those where gain and loss could both be observed Supplementary Figure S5.
The overabundance of VNTR copy loss may actually be an underestimate. By contrast, the reference locus needed to have a minimum of 2. Higher observed copy loss could be explained by a bias in the reference genome towards including higher copy number repeats , or by an overall mutational preference for copy loss.
High heterozygosity in human populations suggests higher genetic variability and may have beneficial effects on a range of traits associated with human health and disease Since calculating heterozygosity for VNTRs is not straightforward because of limitations on discovering alleles, especially within shorter reads , we used the percentage of detected, per-sample heterozygous VNTRs as an estimate for heterozygosity.
Interestingly, despite the previous comment, within genomes that were comparable in read length and coverage, the fraction of heterozygous loci clustered within populations Supplementary Figures S2—S4 , with African genomes generally having more heterozygous calls and East Asians fewer. This result is consistent with previous findings of population differences in SNP heterozygosity among Yoruban and Ashkenazi Jewish individuals with respect to European individuals , , and suggests higher genomic diversity among African genomes, as has been previously noted Extreme loss of heterozygosity in small variants has previously been reported in these samples by Illumina Basespace with the number of heterozygous small variants in HC being four times lower in the tumor tissue compared to the normal.
Additionally, in both tumors a large number of loci exhibited loss of both alleles in comparison to the normal tissue Supplementary Table S2. Given that the coverage for the tumor samples was significantly higher than for the normal tissue, it is unlikely that these observations were due to artifacts. Also, the tumor samples did not show a higher percentage of filtered multi VNTRs too many alleles than the normal samples 1.
Knowledge of gene associations with somatic tumor mutations VNTR alleles present in a tumor, but not normal tissue could be useful as indicators of cancer prognosis and for therapy. TRIM24 has been associated with prognosis in breast cancer — and over-expression of DUSP4 has been shown to improve the outcome of chemotherapy and overall survival , A total of common VNTRs overlapped with protein coding genes including exons.
Interestingly, increasing the threshold for common VNTRs did not reduce the number dramatically Supplementary Figure S6 , suggesting that these VNTRs have not occurred randomly, but rather have undergone natural selection. Our reference TR set comprised only 0. These results suggest that VNTR alleles may affect gene regulation in multiple tissues. To detect association between VNTR genotypes and expression of nearby genes, we paired VNTRs to any gene within 10 kb and after removing genes with low expression and controlling for confounders, applied a one-way ANOVA test to determine if there was a significant difference between the average gene expression levels for the VNTR genotypes.
Three of the top genes are shown in Figure 2. METTL23 is known to function as a regulator in the transcriptional pathway for human cognition and has been associated with mental retardation and intellectual disability JMJD6 is associated with pancreatitis and tumorgenesis , However, individuals had genotypes outside of our detection range which likely represented longer expansions and these individuals showed higher expression of this gene. More examples are given in Supplementary Figures S26—S Gene expression differences and VNTR genotype.
Shown are violin plots of gene expression values log 2 normalized TPM for three genes which displayed significant differential expression when samples were partitioned by VNTR allele genotype. Additional examples are shown in Supplementary Figures S26—S Genotype is indicated in labels on the X-axis and numbers refer to copies gained or lost relative to the reference allele. Number of samples in each partition is shown in parenthesis. In these examples, the effect size for at least one genotype class was significant.
We further investigated whether VNTR alleles are population-specific and whether they can be used to predict ancestry. Understanding the occurrence of population-specific VNTR alleles will be useful when controlling for population effects in GWAS, and more generally in interpreting gene expression differences among people of different ancestry. We found that the first, second, fourth, and fifth principal components PCs separated the super populations as shown in Figure 3.
Each PC captured a small fraction of the variation in the dataset, suggesting that there was substantial variation between individuals from the same population.
PCA was performed to reduce the dimensions of the data. PC3 not shown captured batch effects due to differences in coverage.
Some American sub-populations proved hardest to separate, likely due to ancestry mixing. The first PC separated Africans, suggesting furthest evolutionary distance. The second PC separated East Asians. The third PC captured coverage bias. The American population had a sub-population of Puerto Ricans that clustered with the Iberian Spanish population, suggesting mixed ancestry These loci overlapped with genes and 51 coding exons. Africans had the highest number of population-specific alleles , followed by East Asians 65 , while Americans had the lowest 13 , suggesting more mixed ancestry.
We observed 63 loci that had a population-specific allele in each population. Each dot represents an allele in one sample. Samples are separated vertically by super-population. Dots are jiggered in a rectangular area to reduce overlap. Population-specific alleles show up as bands over-represented in one population.
Note that the allele bias towards pattern copy loss relative to the reference allele is apparent and that at one locus second from left the reference allele was the population-specific allele since almost no reference alleles were observed in the four other populations. The details of these seven loci are given in Supplementary Table S To show the reliability of our results, we experimentally validated VNTR predictions at 13 loci in the three related AJ genomes, and also compared VNTRseek predictions to alleles experimentally validated in the literature.
We additionally used accurate long reads on one genome HG to find evidence of the predicted alleles. Separately, we showed the consistency of our predictions in two ways: first, we looked at inheritance consistency among four trios mother, father, child , and second, we compared result for genomes sequenced on two different platforms. In the remaining case, two predicted alleles were separated by only 15 nucleotides and could not be distinguished.
At two loci, other bands were also observed. In one, all three family members contained an allele outside the detectable range of VNTRseek longer than the reads. In the other, one allele that was detectable was missed in two family members see Table 3 for a summary of results, Supplementary Table S6 for details of the experiment, and Supplementary Figures S13—S17 for gel images.
All but one of the 66 bands predicted by VNTRseek were validated. Out of the original 17 VNTR loci experimentally validated in that paper, four were not included in our reference set and for one, the matching TR could not be determined. In total, 11 out of 16 detectable alleles were correctly predicted, four were not found in the NA sample with sufficient read size bp , and one was incorrectly predicted in the HG sample and not found in the other two Supplementary Table S7.
In all cases, only a handful of loci were inconsistent Supplementary Table S9. In , the Genomes Phase 3 sequenced 30 genomes using Illumina HiSeq at read length bp. Note, however, that read coverage was not the same for both datasets, causing variation in statistical power. The current study represents, to our knowledge, the largest analysis of human whole genome sequencing data to detect copy number variable tandem repeats VNTRs and greatly expands the growing information on this class of genetic variation.
The TRs genotyped consisted of some minisatellites occupying the mid-range of pattern sizes, from seven to bp. When considering the largest dataset in our study individuals , we found that, on average, each genome exhibited non-reference alleles at VNTR loci and among those, were common VNTRs.
In addition to their widespread occurrence, further evidence of minisatellite VNTR importance can be seen in the enrichment of these loci in genes and gene regulatory regions promoters, transcription factor binding sites, DNAse hypersensitive sites, and CpG islands. Our entire set of VNTRs overlapped with protein coding genes and exons.
The common VNTRs occurred within or were proximal to over protein coding genes, including overlapping with exons. Biological function enrichment among these genes included neuron development and differentiation, and behavior. These observations are consistent with the finding that VNTR expansions in humans compared to primates are associated with gain of cognitive abilities 24 , and possible involvement of VNTRS with many neurodegenerative diseases and behavioral disorders The overabundance of VNTR proximity to genes suggests that variability at these loci could affect gene expression and indeed, we observed that the expression levels of genes were significantly correlated with the presence of specific VNTR alleles in lymphoblastoid cell lines of individuals.
In 25 , expression levels in 46 tissues for genes were tested and eQTLs were found. Similarly, in 26 21 VNTRs were found associated with expression in 38 genes. However, their VNTR alleles were larger and did not overlap the ones we tested in this study.
These findings are suggestive, but more study is required, both to determine if there is more evidence of tissue specific gene expression variation associated with VNTR genotypes 25 , 26 , 87 and if such correlational differences can be definitively tied to actions associated with specific VNTR alleles such as regulator binding affinity changes in regulatory regions. For more elaborate studies such as these, it will be essential that for each sample used to measure gene expression, the raw whole genome sequencing data be available, so that specialized software programs, such as VNTRseek can be used to determine VNTR genotype.
However, it is well known that hidden differences can lead to misinterpretation of GWAS results, and care is particularly important when those differences are tied to human ancestry. Relevant to this, we have determined that of the common VNTR loci contain alleles showing significant population specificity and that these loci intersect with genes. Population-specific alleles also have the potential for use in tracing early human migration.
We have shown through principal component analysis with common VNTR alleles that super-populations are easily separated. Further, we have constructed a decision tree based on common VNTR alleles that obtains nearly perfect classification of individuals at the super population level.
It will be interesting to see whether, with more information, classification can be refined further to encompass specific sub-populations, whether a minimal minisatellite VNTR set can be established for high accuracy population classification, and whether VNTR alleles can be used to estimate mixed ancestry as is done now with SNP haplotyping.
This is true because VNTRseek requires that the tandem array fit within a read. Longer reads will help, but the abundance of high-coverage. Alternate methods exist 87 , 95 , but these have not reported an ability to handle macrosatellite VNTRs where the arrays and patterns are hundreds to thousands of base pairs long. For this range of the tandem repeat spectrum, new tools must be developed. Another limitation comes from use of the Tandem Repeats Finder, which requires that the array contain at least 1.
Previous studies on VNTR prevalence in the human genome have been limited to a subset of minisatellites inside the transcriptome and a limited number of genomes. Future research can be expected to further enhance our understanding of this important class of genomic variation.
The reference TR set files, output VCF files, and the pre-processed data files along with the code to create figures and tables are published at: DOI Funding for open access charge: University funds. Treangen T. Repetitive DNA and next-generation sequencing: computational challenges and solutions.
Google Scholar. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. Lim K. Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Richard G. Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Analysis of a VNTR locus by Southern hybridization most commonly results in a two-band pattern, comprised of a band inherited from each parent. A one-band pattern can occur if the size of the two parental bands are the same or nearly the same.
For our simple example of three different alleles designated A, B, and C illustrated above, six unique DNA profiles are possible. When profiles from a single VNTR locus from unrelated individuals are compared, the profiles are normally different.
0コメント