Scripts used for profiling levels of heterozygosity over a genome sequence
Estimation of heterozygosity The availability of shotgun sequencing reads at high coverage, drawn randomly from both chromosomes from each homologous pair, offers the opportunity to estimate patterns of heterozygosity across the genome. First, poor-quality sequence read pairs were removed using TrimGalore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) with the –q 30 and –paired options and then the remaining read pairs from the 300-bp library were aligned against the Pe. belbahrii genome assembly using BWA-mem (Li and Durbin 2009). Multi-mapping reads were eliminated by using samtools view (Li et al. 2009) with the –q 1 option. For each single-nucleotide position in this alignment of reads against the genome assembly, we counted the frequency of each of the four possible bases in the reads aligned at that site. Thereby, for each site in the genome, we were able to estimate whether it was likely to be homozygous (most common base close to 100% frequency and second-most common close to zero) or heterozygous (close to 50% most common base and for second-most common base). Using R (R Development Core Team 2013), we plotted a histogram describing the frequency distribution of relative abundance of the most abundant base and second-most abundant base at each position in the genome. If the genome is highly homozygous, then the histogram would be expected to display a single peak close to 100% abundance for the most common base and a single peak close to zero for the second-most abundant. However, heterozygous sites would contribute to a second peak close to 50% abundance of the most common base and for the second-most abundant base. We also took a sliding-window approach to identify regions of high and low heterozygosity relative to the average over the whole genome. Each non-overlapping window of 5 k.b. was examined for heterozygous sites, i.e. sites where the abundance of the most common base was between 48 and 52%. The density of heterozygous sites was defined as the number of heterozygous sites per 5-kb window. Density of heterozygous sites was plotted along with depth of coverage against genomic position, using R. The R scripts used for these analyses are available from https://github.com/davidjstudholme/heterozygosity.