Genomes from a medieval mass burial show Ashkenazi-associated hereditary diseases pre-date the 12th century

To investigate Ashkenazi-associated Mendelian Disorders among the Chapelfield individuals we first collated a dataset of 178 SNVs interpreted as associated with disorders observed in Ashkenazi populations ( Data S1 E, with sources detailed) based on published data.InDel variants were detected by realignment (see above). We considered allele frequencies for these variants in gnomADand retained 159 loci where the population allele frequency for modern Ashkenazi Jewish (ASJ) was greater than for modern non-Finnish European (NFE) for disease-associated variants ( Data S1 F). We considered genotypes for these loci probabilistically, introducing a read error parameter α, defining the probability that a single allele is incorrectly read as one of the other three nucleotides. To determine the expected number of observed disease alleles at different rates of read error, datasets assuming ASJ and NFE population allele frequencies were simulated by sampling A,C,G,T nucleotides at each locus, for each individual from a multinomial distribution, using the observed total read depth as the number of trials. To calculate the exact probability of the observed allele reads, we applied a likelihood function utilizing α, summing the probability of the observed data for all 10permutations of the ten possible genotypes at each locus.

In order to assess the frequency of disease alleles in the Chapelfield individuals, we needed to address two key problems associated with ancient DNA data. Firstly, read errors were likely to be present, such as observing nucleotide T at position 11:71146886 for individual SB604, which is not present in any modern population in the gnomAD database. Secondly, read-depths were low, varying from 16.1 reads per locus for SB604 to only 0.17 reads per locus for SB606, with zero reads at 39.6% of loci when considering all 6 individuals separately (see Figure S4 ). We addressed these problems by considering genotypes probabilistically (rather than making categorical calls) and introducing a read error parameter α, defining the probability that a single allele is incorrectly read as one of the other three nucleotides. This value is used globally (same value for each individual, and at each locus), and we assume symmetry between nucleotides, such that the probability of A incorrectly read as C is the same for all other nine pairwise errors. For example, we assume the probability of a true T being read as G is α/3.

A single simulated dataset was generated in a three stage process. Firstly, the gnomAD allele counts of A,C,G,T (from a proposed population at a specific locus) were used as shape parameters in the Dirichlet distribution (plus one additional count for each nucleotide, as a uniform prior), to generate a single set of four allele frequencies. Secondly, these proposed allele frequencies were modified by the proposed read error rate α, according to the formulas D1, where: freqs = a vector of the proposed frequencies of A,C,G,T at a locus (summing to 1); error = the proposed read error rate α. Thirdly, allele counts were randomly sampled from the multinomial distribution, where the total observed counts (across all four nucleotides) were used as the ‘number of trials’ parameter, and the allele frequencies (modified by α) were used as the multinomial probabilities.

Likelihoods and likelihood ratios

Likelihoods were calculated using a four stage process that utilized the observed allele read counts, proposed population allele frequencies, and the read-error rate α. Firstly, for a single locus, we generated all 1,000,000 permutations of the six individuals’ ten possible genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT), and calculated the frequency of each genotype permutation, given the gnomAD population allele frequencies and assuming Hardy-Weinberg equilibrium. Where gnomAD data provided counts for exomes and genomes we used the combination (sum of counts) of both. Secondly, we calculated the likelihood of each individual’s ten genotypes (again at a single locus), using a proposed read-error rate α and the observed allele counts in the multinomial distribution as specified in formulas M1. Thirdly, we summed all 1,000,000 permutations of these likelihoods, weighted by the frequency of each genotype permutation (since each permutation is a possible explanation of the observed data). Fourthly, we repeated for each of the 159 loci, with the α parameter fixed across all loci, and the overall product (under the assumption that loci are independent) provided the exact probability of the observed data, under a model of thegnomAD allele frequencies and a single α parameter. This approach deliberately avoids making any categorical genotype calls, and instead maintains probabilistic genotypes for downstream calculations. This is of particular value when analyzing aDNA where allele read depths are typically low and read errors high. In comparison, data with high read coverage and low read error rates can be assigned genotypes with such high confidence that the computational cost of this permutational method is not justified. Note, for computational efficiency, where two of the four possible nucleotides have a zero count, these can be aggregated into a single ‘other’ category requiring only 46,656 permutations of six genotypes (V1/V1, V1/V2, V1/V3, V2/V2, V2/V3, V3/V3), see formulas M2, and similarly where three nucleotides have zero counts, only 729 permutations of three genotypes need calculating (V1/V1, V1/V2 and V2/V2), see formulas M3. Where all four nucleotides have zero counts there is no information, and the likelihood equals 1.

Formulas (in R code):

p1 <- 1-error

p2 <- error/3

p3 <- 0.5 - p2

Formulas M1 if 4 nucleotides have counts, all 10 genotypes need to be considered. Likewise if only 3 nucleotides have counts, the fourth remains a possibility if produced by a read error.

V1.V1 <- dmultinom(counts, prob=c(p1,p2,p2,p2))

V1.V2 <- dmultinom(counts, prob=c(p3,p3,p2,p2))

V1.V3 <- dmultinom(counts, prob=c(p3,p2,p3,p2))

V1.V4 <- dmultinom(counts, prob=c(p3,p2,p2,p3))

V2.V2 <- dmultinom(counts, prob=c(p2,p1,p2,p2))

V2.V3 <- dmultinom(counts, prob=c(p2,p3,p3,p2))

V2.V4 <- dmultinom(counts, prob=c(p2,p3,p2,p3))

V3.V3 <- dmultinom(counts, prob=c(p2,p2,p1,p2))

V3.V4 <- dmultinom(counts, prob=c(p2,p2,p3,p3))

V4.V4 <- dmultinom(counts, prob=c(p2,p2,p2,p1))

Formulas M2 if only 2 nucleotides have counts, the remaining two can be combined into a single ‘other’, so that 6 genotypes need to be considered:

V1.V1 <- dmultinom(counts, prob=c(p1,p2,p2+p2))

V1.V2 <- dmultinom(counts, prob=c(p3,p3,p2+p2))

V1.V3 <- dmultinom(counts, prob=c(p3,p2,p3+p2))

V2.V2 <- dmultinom(counts, prob=c(p2,p1,p2+p2))

V2.V3 <- dmultinom(counts, prob=c(p2,p3,p3+p2))

V3.V3 <- dmultinom(counts, prob=c(p2,p2,p1+p2))

Formulas M3 if only 1 nucleotide has counts, the remaining 3 can be combined into a single ‘other’, so that only 3 genotypes need to be considered:

V1.V1 <- dmultinom(counts, prob=c(p1,p2+p2+p2))

V1.V2 <- dmultinom(counts, prob=c(p3,p3+p2+p2))

V2.V2 <- dmultinom(counts, prob=c(p2,p1+p2+p2))

Our method calculates likelihoods under the assumption that the six individuals are randomly sampled from a proposed population, and therefore does not take into account relatedness. In the case of these particular data, this assumption has a conservative influence on the likelihood ratio for the following reason. The overwhelming majority of the likelihood ratio is driven by variants that are private to a single individual (SB676 21-33974609-G-C LR=113.4; SB605 7-83590853-G-A LR=67.8; SB676 5-112175211-T-A LR=48.6; SB696 22-50967020-C-T LR=2.7), which removes any influence of relatedness on the likelihood ratios. Four further non-private disease alleles were observed in SB604 and SB676 14-94770808-C-T LR=0.954; 14-97342370-C-T LR=1.079; 21-43808633-C-A LR=0.863; 21-45713715-C-T LR=0.870), but since the likelihood ratios at these loci overall slightly favor the european population (less than 1), adjusting for relatedness would have the effect of slightly increasing the likelihood ratio. In any case, our familial relationship analysis did not find a close relationship between SB604 and SB676 that would justify such an adjustment. In contrast, the closest relationships identified were between siblings SB605, SB606 and SB671 who had no disease alleles in common.