Inference of Natural Selection from Interspersed Genomically coHerent elemenTs

Human Polymorphism and Divergence Data

This website contains information on the human polymorphism and divergence data used by INSIGHT. More details are available in the Supplementary Materials of Gronau et al. (Mol Biol Evol, 2013). More information on INSIGHT can be found here.

Contents

   1. Polymorphism data
   2. Outgroup sequence data and ancestral priors
   3. Filters
   4. Putative neutral sites
   5. Genomic blocks

1. Polymorphism data

Human polymorphism sequence data was obtained using the complete high-quality genome sequences of 54 unrelated individuals taken from the 69 public genomes from Complete Genomics. The 54 unrelated individuals, spanning 11 distinct human populations, were identified by eliminating 13 individuals from the 17-member CEPH pedigree (all but the four grandparents) and the child in each of the two trios (see Figure). Genotype calls for these individuals were extracted from the masterVar files downloaded from the Complete Genomics FTP site in December, 2011. We considered variants designated as “SNPs” or “length-preserving substitutions” in the masterVar files, and recorded positions at which Complete Genomics could not confidently assign a variant call for subsequent masking (see filters). All other positions were assumed to be homozygous for the reference allele (according to UCSC hg19 / human genome build 37).

The polymorphism data was summarized by recording, for each position in hg19, the allele count for each of the four bases (A, C, G, and T) across the 54 × 2 = 108 chromosomes. Sites with more than two observed alleles were masked, as were sites masked in at least one of the 54 individuals (see filters).
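The per-site summary described above can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_site` and the genotype encoding (a pair of bases per individual, with `None` for a missing call) are assumptions made here, not the format of the masterVar files or of the INSIGHT input.

```python
from collections import Counter

BASES = "ACGT"

def summarize_site(genotypes):
    """Return (A, C, G, T) allele counts for one site, or None if the site is masked.

    `genotypes` holds one (allele1, allele2) tuple per individual; None marks an
    individual without a confident call (an encoding assumed for this sketch).
    """
    # Mask the site if any individual lacks a confident call.
    if any(g is None for g in genotypes):
        return None
    # Count every allele on every chromosome (two per individual).
    counts = Counter(allele for g in genotypes for allele in g)
    # Mask sites with more than two observed alleles.
    if len(counts) > 2:
        return None
    return tuple(counts.get(b, 0) for b in BASES)
```

For 54 individuals the counts of an unmasked site sum to 108, one per chromosome.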

69 human individuals from 11 distinct populations whose genomes have been sequenced to high coverage by Complete Genomics. 54 unrelated individuals (highlighted in yellow) were used to assess sequence polymorphism for humans. Taken from the Complete Genomics documentation.

2. Outgroup sequence data and ancestral priors

Divergence was inferred using three primate outgroup genomes: chimpanzee (panTro2), orangutan (ponAbe2), and rhesus macaque (rheMac2). We used the alignments of these genomes to the human reference (hg19) downloaded from the UCSC Genome Browser, and for each position in hg19, we recorded the aligned base from each of the three nonhuman primates, or an indication that no syntenic alignment was available at that position (see filters).

A prior distribution for the ancestral state (Z) was computed for all non-filtered sites in hg19, by assuming a phylogeny estimated from four-fold degenerate sites (see Figure), and applying the postprob.msa function in RPHAST. The human (hg19) sequence was masked in this computation, so that the computed distribution corresponds to the distribution over the bases in the ancestral sequence (Z) given the chimpanzee, orangutan, and rhesus macaque sequences, which is used as a prior distribution (P(Z|O)) by the INSIGHT model.

The phylogeny assumed when estimating divergence rates (λ) and prior probabilities for the ancestral states (Z_i). Branch lengths are given in expected number of differences between haploid chromosomes per kb. The phylogeny was inferred using four-fold degenerate sites, and was downloaded from the UCSC Genome Browser.

3. Filters

Our analysis is restricted to the autosomes (chromosomes 1-22), and within autosomes we applied various filters to reduce the impact of technical errors from alignment, sequencing, genotype inference, and genome assembly. Our filters included repetitive sequences (simple repeats), recent transposable elements, recent segmental duplications, CpG site pairs, regions not showing conserved synteny with outgroup genomes, and regions found in the “black list” filter reported by Dunham et al. (2012). CpG site pairs (prone to hypermutability) were identified as position pairs having a “CG” dinucleotide in any of the human samples or the outgroup genomes. As a further caution, we excluded position pairs with C* in an outgroup and *G in human, to avoid potential ancestral CpGs. Non-syntenic regions and gaps in the outgroup alignment were masked (by “N”s) individually in each outgroup genome. This uncertainty was incorporated when estimating the prior distribution over the ancestral sequence (Z, see above). Sites with missing genotypes in any of the 54 individual human genome sequences were masked out completely (see above), as well as sites with more than two observed alleles in the human population data. We additionally filtered out recombination hotspots to ensure a more coherent genealogical background across the genomic blocks (see Section 5 below).
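The two CpG rules described above can be illustrated with a minimal Python sketch. It assumes that all sequences are equal-length strings in the same (hg19) coordinates, and it reads "C* in an outgroup and *G in human" as "C in any outgroup and G at the next position in any human sample". The function name `cpg_mask` and that interpretation are assumptions of this sketch, not a description of the pipeline actually used.

```python
def cpg_mask(human_seqs, outgroup_seqs):
    """Return the sorted 0-based positions masked as potential CpG sites.

    All sequences are assumed to be equal-length strings in the same coordinates.
    """
    all_seqs = list(human_seqs) + list(outgroup_seqs)
    length = len(all_seqs[0])
    masked = set()
    for i in range(length - 1):
        # Rule 1: a "CG" dinucleotide in any human sample or outgroup genome.
        direct = any(s[i] == "C" and s[i + 1] == "G" for s in all_seqs)
        # Rule 2 (potential ancestral CpG): C in some outgroup at position i
        # and G in some human sequence at position i + 1.
        ancestral = (any(o[i] == "C" for o in outgroup_seqs)
                     and any(h[i + 1] == "G" for h in human_seqs))
        if direct or ancestral:
            masked.update((i, i + 1))  # mask both positions of the pair
    return sorted(masked)
```

Both positions of a flagged pair are masked, which matches the description of CpG filtering in terms of position pairs.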

Filter	Coverage
Genomic filters
Transposable elements	762 Mb
Seg. duplications	132 Mb
Simple repeats	66 Mb
Rec. hotspots	75 Mb
ENCODE blacklist	11 Mb
CpG dinucleotides	68 Mb
union:	979 Mb
Missing data filters
Blocks with <100 putative neutral sites	384 Mb
Missing data in Complete Genomics	570 Mb
union:	676 Mb
Total
total positions filtered:	1,340 Mb
total unfiltered:	1,541 Mb

Filters used for INSIGHT analysis of human data. Links are provided to BED files, and genomic coverage is given in megabases (Mb).

4. Putative neutral sites

Estimates of neutral model parameters were computed by considering a collection of putative neutral sites that pass our filters. The collection of putative neutral sites was determined by eliminating sites likely to be under selection: (1) exons of annotated protein-coding genes and the 1000 bp flanking them; (2) conserved noncoding elements (identified by phastCons) and 100 bp flanking them; and (3) RNA genes from GENCODE v.11 and 1000 bp flanks. While a fraction of the remaining sites is likely to be functional, this set should be dominated by sequence evolving under neutral drift (see Fig. 4A in our paper).
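The elimination step described above amounts to interval subtraction, which can be sketched as follows. The sketch assumes half-open (start, end) coordinates on a single chromosome, as in BED files. The helper names `pad_and_merge`, `subtract`, and `putative_neutral` are hypothetical, and the other filters from Section 3 are not applied here.

```python
def pad_and_merge(intervals, flank):
    """Extend half-open (start, end) intervals by `flank` bp per side and merge overlaps."""
    padded = sorted((max(0, s - flank), e + flank) for s, e in intervals)
    merged = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def subtract(region, excluded):
    """Return the parts of half-open `region` not covered by sorted, merged `excluded`."""
    start, end = region
    kept, pos = [], start
    for s, e in excluded:
        if e <= pos:
            continue  # excluded interval lies entirely before the current position
        if s >= end:
            break     # excluded interval lies entirely after the region
        if s > pos:
            kept.append((pos, s))
        pos = max(pos, e)
    if pos < end:
        kept.append((pos, end))
    return kept

def putative_neutral(region, exons, conserved, rna_genes):
    """Remove exons (+1000 bp), conserved elements (+100 bp), and RNA genes (+1000 bp)."""
    excluded = (pad_and_merge(exons, 1000)
                + pad_and_merge(conserved, 100)
                + pad_and_merge(rna_genes, 1000))
    return subtract(region, pad_and_merge(excluded, 0))
```

For example, a 10 kb region containing one 200 bp exon and one 50 bp conserved element loses 2,200 bp and 250 bp respectively, leaving three putative neutral intervals.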

5. Genomic blocks

For estimation of genome-wide neutral polymorphism and divergence rates, we used a fixed collection of 10 kb overlapping windows. The windows were computed by first filtering out recombination hotspots estimated using the 1000 Genomes genetic map (downloaded from the 1000G FTP site), and then covering the portions between adjacent hotspots by 10 kb overlapping windows (see Figure). The overlap between consecutive windows was set to 5 kb (other than the two rightmost windows, which might have a longer overlap), and each 10 kb window was associated with a distinct genomic block b (genomic blocks are non-overlapping). We used the putative neutral sites in each 10 kb window to estimate a neutral polymorphism rate (θ_b) and a neutral divergence rate (λ_b), and those estimates were then associated with the appropriate genomic block. To avoid noise from sparse data, we masked genomic blocks with fewer than 100 putative neutral sites after filtering, and all sites in these blocks were filtered. The average number of unfiltered putative neutral sites in the remaining blocks is 4,300.
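The window layout described above can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the text: the function name `windows_and_blocks`, the placement of block boundaries at the midpoint of each overlap between consecutive windows (which gives 5 kb blocks centered in regularly spaced windows), and the handling of segments shorter than 10 kb as a single window.

```python
WINDOW = 10_000  # window length in bp
STEP = 5_000     # step between window starts, giving a 5 kb overlap

def windows_and_blocks(seg_start, seg_end, window=WINDOW, step=STEP):
    """Tile the half-open segment [seg_start, seg_end) between two hotspots.

    Returns a list of (window, block) pairs of half-open intervals.
    """
    if seg_end - seg_start <= window:
        whole = (seg_start, seg_end)  # assumption: short segments get one window
        return [(whole, whole)]
    starts = list(range(seg_start, seg_end - window + 1, step))
    if starts[-1] + window < seg_end:
        # Align a final window to the segment end; it may overlap the
        # previous window by more than 5 kb.
        starts.append(seg_end - window)
    windows = [(s, s + window) for s in starts]
    # Assumption: cut blocks at the midpoint of each overlap between
    # consecutive windows, so blocks are non-overlapping and cover the segment.
    cuts = [seg_start]
    for (_, e1), (s2, _) in zip(windows, windows[1:]):
        cuts.append((s2 + e1) // 2)
    cuts.append(seg_end)
    blocks = list(zip(cuts, cuts[1:]))
    return list(zip(windows, blocks))
```

For a 23 kb segment this yields four windows, with the last two overlapping by 7 kb, and four non-overlapping blocks that together cover the segment. The estimates θ_b and λ_b computed from a window would then be attached to its block.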

Illustration of the 10 kb overlapping windows used in genome-wide estimation of the neutral polymorphism rates (θ_b) and neutral divergence rates (λ_b), with the non-overlapping (5 kb) genomic blocks these values are associated with.