Thursday, January 12, 2012

Pinning Down Archaic Admixture Population Models

Many disputes critical to understanding the demographic history of Eurasian modern humans in the Upper Paleolithic era remain outstanding. They concern the population models used to describe how Neanderthal genes could have ended up in almost all modern Eurasians, at frequencies on the order of 2%-4% of our autosomal genome in a sample of many thousands or tens of thousands of individuals, despite a complete absence of Neanderthal mtDNA or Y-DNA in any genetically tested modern human in a sample of tens or hundreds of thousands of individuals that includes hundreds of ancient DNA samples. This population genetic data has been accumulated in a collective scientific enterprise that has deliberately oversampled populations likely to be genetically diverse outliers in both data sets, although far more outlier populations and ancient DNA populations are undersampled for autosomal genetics than for mtDNA.

One of the confounds in estimating what kind of process gave rise to the introgression of Neanderthal DNA into modern humans is the question of how much of the Neanderthal DNA originally present in hybrid individuals has been purged over time from modern humans, either due to random genetic drift in admixed modern human populations, or due to selective disadvantage associated with particular Neanderthal genes.

It helps, in comparing the possibilities, that we have significant shares of the Neanderthal genome from ancient DNA to compare against modern genomes.

Neanderthal genes that could have introgressed into modern humans can be broken into four categories: (1) genes in which the Neanderthal genome and modern human genome are indistinguishable (which is a very substantial share of the total, probably on the order of 95% or more), (2) Neanderthal genes with a positive selective advantage (there is some early indication that this may mostly consist of HLA genes, which are related to the immune system), (3) Neanderthal genes that have a selective disadvantage relative to modern human genes, which statistically should have been removed from the human genome over the relevant time span of at least 30,000 years or so, and quite possibly two to four times as long as that, even if the selective disadvantage is very modest, particularly as disadvantageous genes slowly become separated, through recombination over many generations, from nearby genes that may have a selective advantage, and (4) Neanderthal genes that are selectively neutral.

One can determine in modern populations which Neanderthal genes are present at elevated frequencies indicative of selective advantage and which are only present at a baseline level, in order both to estimate the true selectively neutral baseline level of admixture before selection started to act on the genes in modern humans with Neanderthal ancestry, and to estimate the magnitude of the advantage associated with those genes present at elevated frequency. This task is somewhat harder than it seems, because one has to address statistical noise that elevates the frequency of some random genes for reasons unrelated to selective advantage, but it is well within the capabilities of well established statistical methods.

One can also search, by direct comparison, for distinguishably Neanderthal genes that have not ended up in any modern human at all. There are basically three ways that this could happen: (1) the genes were never transferred in an admixture event, because there were a finite number of admixture events and only an approximately random half of the Neanderthal genome was transferred in each event, so some genes may never have been transferred in any of the events, (2) the genes were transferred in an admixture event and left the modern human genome via random genetic drift, or (3) the genes were transferred in an admixture event but, due to selective disadvantage associated with the genes, they were culled from the modern human genome. The percentage of Neanderthal specific genes known to exist which are found in no modern human populations can provide a very accurate estimate of the combined impact of these three factors, although by itself it doesn't do much to tell you how much each factor contributes.

It is mathematically trivial to relate the impact of the first factor to the number of admixture events that took place, and the relationship between the percentage of genes never transferred in admixture events and the number of admixture events is highly non-linear. For one admixture event, the percentage is 50%. For two it is 25%. In general, the never transmitted proportion of the genome is 1/(2^n), where n is the number of admixture events. In any scenario where there are seven or more admixture events in all of human history, the percentage of Neanderthal specific genes never transmitted in admixture events is below 1%, and at somewhere on the order of twelve to fourteen admixture events ever in all of modern human history, the impact of this factor would be completely undetectable with any level of statistical significance in an autosomal genome data set as large as the one that is currently in existence.
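To make the 1/(2^n) relationship concrete, here is a minimal sketch in Python. The function names are my own, invented for illustration, and the sketch assumes (as the paragraph above does) that each admixture event independently passes on a random half of the Neanderthal genome.

```python
def never_transmitted_fraction(n_events: int) -> float:
    """Fraction of the Neanderthal genome never passed on in any of
    n_events admixture events, assuming each event independently
    transmits a random half of the genome (so the fraction is 1/2^n)."""
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    return 0.5 ** n_events


def min_events_below(threshold: float) -> int:
    """Smallest number of admixture events for which the never
    transmitted fraction drops below the given threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    n = 0
    while never_transmitted_fraction(n) >= threshold:
        n += 1
    return n


if __name__ == "__main__":
    # Tabulate the fraction for a few event counts mentioned in the text.
    for n in (1, 2, 7, 12, 14):
        print(f"{n:2d} events: {never_transmitted_fraction(n):.4%} never transmitted")
    print("events needed to fall below 1%:", min_events_below(0.01))
```

The point of the exercise is only that the curve falls off extremely fast, so this factor stops mattering after a handful of events.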

If the effective population size of the modern human populations that admixed with Neanderthals was on the order of four hundred to seven hundred and fifty individuals, the effect of non-transmission of specific genes in any admixture event should be negligible, and even at an effective population size as low as two hundred, the impact of this factor should be a very small proportion of the total number of Neanderthal genes not observed in any modern human population. Yet most estimates of the effective population size of the founder population of modern human Eurasians are at least in the single digit thousands. Archaic admixture itself would inflate the apparent effective population size of that founder population, but at 2.5%-4% of the total population size it would not have an effect large enough to bring the true effective population size down to the low three digits, particularly to the extent that the estimates are corroborated by mtDNA and Y-DNA based estimates that have no archaic component.

This means that essentially all of the "missing" Neanderthal DNA (at least outside the sex chromosomes where there are clearly population structure and demographic history factors that are non-random at play) must statistically derive from either genetic drift or selective disadvantage.

We can then work to estimate both components separately using a variety of population genetic parameters, and work to look at the parameter space of assumptions that can produce outcomes consistent with the percentage of missing Neanderthal DNA that we observe.

Random drift of selectively neutral genes is easy to model with very accurate results using just a handful of parameters, either analytically, or numerically with Monte Carlo methods. Some of the key parameters are generation length, effective modern human population size at the time of admixture, number of admixture events, spacing of admixture events, boom and bust variability in effective modern human population size, and population growth (which can be quite accurately estimated in the long run from a variety of evidence, even if fine grained variability in this rate is hard to determine).
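As a rough illustration of the Monte Carlo approach, the sketch below simulates neutral drift under a simple Wright-Fisher model and reports how often an introgressed allele is lost. This is a toy under stated assumptions, not a calibrated model: the function name, the parameter names (n_copies, start_copies, growth) and the use of a constant per-generation growth factor are all my own simplifications, and it ignores the spacing of admixture events and boom and bust variability.

```python
import random


def simulate_loss_fraction(n_copies, start_copies, generations,
                           replicates, growth=1.0, seed=0):
    """Estimate the probability that a selectively neutral allele is
    lost by drift, using a toy Wright-Fisher model.

    n_copies     -- gene copies in the population at the start
    start_copies -- copies of the introgressed allele at the start
    generations  -- maximum number of generations to simulate
    replicates   -- number of independent Monte Carlo runs
    growth       -- per-generation population growth factor (assumed constant)
    """
    if not 0 <= start_copies <= n_copies:
        raise ValueError("start_copies must be between 0 and n_copies")
    rng = random.Random(seed)
    lost = 0
    for _ in range(replicates):
        size, count = n_copies, start_copies
        for _ in range(generations):
            freq = count / size
            size = max(1, round(size * growth))
            # Each gene copy in the next generation is drawn at random
            # from the current generation (binomial sampling).
            count = sum(1 for _ in range(size) if rng.random() < freq)
            if count == 0 or count == size:
                break  # allele lost or fixed; nothing more can change
        if count == 0:
            lost += 1
    return lost / replicates


if __name__ == "__main__":
    stable = simulate_loss_fraction(100, 3, generations=60, replicates=100)
    growing = simulate_loss_fraction(100, 3, generations=60, replicates=100,
                                     growth=1.02)
    print("estimated loss probability, constant population:", stable)
    print("estimated loss probability, growing population: ", growing)
```

A more serious version would let the population size follow an arbitrary trajectory and would inject new copies at each admixture event, but even this toy shows how few parameters are needed.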

For populations that experience growth in the long run (as modern humans in Eurasia obviously did), where the number of generations is very large, it turns out that generation length doesn't actually matter very much. When you have more than one thousand generations, a population that reaches the many millions sometime in the Upper Paleolithic, and an overall percentage of admixture at least on the order of the 2.5%-4% it has reached at long term fixation (which has apparently been reached for all Eurasians, given the supercontinental uniformity of that percentage), the amount of genomic loss that takes place due to random drift becomes insensitive to the number of generations, because random drift is a much more powerful effect, in a non-linear manner, when populations are small. At a leading order estimate, the likelihood of losing a gene entirely from a population in any given span of generations is a non-linear function of the absolute number of individuals in the population who carry that gene. Basically, the percentage likelihood that a gene will leave the population by random drift is roughly proportional to the probability that a random sample from the effective population, equal in size to the absolute number of gene carriers in the population, would contain zero carriers. Once the absolute number of carriers is several sampling error standard deviations away from zero for a sample of that size, the probability of losing a gene entirely due to random drift approaches zero.
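One way to put numbers on this leading order estimate is the standard one-generation Wright-Fisher calculation sketched below: if an allele has a given number of copies in a population, the chance that the next generation draws none of them falls off roughly exponentially with that number. This is my own reading of the approximation described above, and the function names are hypothetical.

```python
import math


def one_generation_loss_probability(carriers: int, population: int) -> float:
    """Probability that none of `population` random draws for the next
    generation picks the allele, when it is currently carried by
    `carriers` out of `population` (simple Wright-Fisher sampling)."""
    if population <= 0 or not 0 <= carriers <= population:
        raise ValueError("need 0 <= carriers <= population and population > 0")
    return (1.0 - carriers / population) ** population


def approximate_loss_probability(carriers: int) -> float:
    """Large-population approximation: the loss probability depends
    almost entirely on the absolute number of carriers, as exp(-carriers)."""
    return math.exp(-carriers)


if __name__ == "__main__":
    population = 10_000
    for carriers in (1, 2, 5, 10, 20):
        exact = one_generation_loss_probability(carriers, population)
        approx = approximate_loss_probability(carriers)
        print(f"{carriers:3d} carriers: exact {exact:.3e}, approx {approx:.3e}")
```

With even a few dozen carriers the per-generation chance of loss is already vanishingly small, which is why the early, small-population generations dominate the total amount of drift.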

Complicating this is a factor that also looks like random drift, which is mutation. While not listed as a separate factor, another way that a gene can be removed from the gene pool is through a mutation at that locus. The probability of this happening is a function of the number of generations involved and the effective population size of each generation, divided by the number of carriers of a particular gene, and discounted for the fact that lots of mutations are lethal and never enter the gene pool. This is the method used to make estimates of the age of mtDNA and Y-DNA haplogroups, and it isn't very accurate, but there is a considerable body of empirical evidence that puts order of magnitude bounds on the size of this effect. So, while the error bars on this component of the random loss of selectively neutral genes from the population might have extremes that vary by as much as a factor of two to ten if we were really being realistic about how precise our methods of mutation dating have proven to be in practice (perhaps more, given that the timing of the admixture event has something on the order of a factor of two uncertainty in it to begin with, that our estimates of generation length in Upper Paleolithic modern humans aren't terribly accurate, and that our effective population chart also has a pretty fuzzy line), if the effect is an order of magnitude lower than other sources of removal of genes from the population's genome, we can safely ignore it, even if the precise magnitude of the effect is not known with a great deal of certainty.

From the other direction, there have been a number of reasonably useful estimates of the proportion of genes in the human genome, and the proportion of genes in a variety of other species, which do, or do not, show indications of having a selective effect at any given time (which basically consists of genes that have not reached fixation in the species and for which there is no good reason to believe that selection maintains multiple varieties in stable proportions, as it does for HLA genes). In general, these studies have shown that the proportion of genes that are experiencing active selective pressures at any given time appears to be fairly modest, but not negligible either.

There is no really good way to estimate the relative numbers of selectively advantageous archaic genes and selectively disadvantageous archaic genes. The argument for more good genes than bad is that Neanderthals had more time to adapt to the new environment that modern humans were entering. The argument for more bad genes than good is that Neanderthals went extinct while modern humans didn't, so overall, modern humans had a selective advantage of some form over Neanderthals. But it isn't unreasonable to infer that the numbers of each should be within an order of magnitude of each other. There is also no particularly good reason to think that the proportion of the genome that is selectively neutral at any given point in time has changed very much, or was much different for Neanderthals than it is for modern humans. So, an examination of the number of Neanderthal genes present at elevated levels, which hence show signs of selective advantage, could cast some light, at least, on the proportion of Neanderthal genes that gave rise to selective disadvantages and were purged from the modern human genome. The early indications from this kind of analysis are that the proportion of Neanderthal genes still in the modern human genome which show signs of having been positively selected for is small relative to the total number of Neanderthal genes in the modern human genome.

Despite the fuzziness of all of this reasoning, from a quantitative perspective, the bottom line of this analysis is that we would expect a significantly disproportionate share of the genes missing from the Neanderthal genome to have been lost due to selectively neutral random drift rather than natural selection. Even this crude bound allows us to make fairly specific numerical estimates of the proportion of Neanderthal specific genes that were lost because they were selectively disadvantageous and the proportion that were lost due to one of a couple of forms of random genetic drift.

Placing numerical bounds and maximum likelihood estimates on the proportion of Neanderthal specific genes that were lost due to random genetic drift with this kind of analysis, in turn, allows us to significantly narrow the parameter space of population model assumptions that could produce the observed amount of random genetic drift. The observed proportion of random genetic drift in the Neanderthal genome would be particularly relevant in placing bounds on the parameter space for assumptions about effective modern human population size at the time of admixture, number of admixture events, spacing of admixture events, and the boom and bust variability in effective modern human population size. And there are independent ways to provide additional bounds on many of these parameters from other lines of population genetic data, from anthropology, and from the physical anthropology of Neanderthal remains, so the flexibility in one parameter doesn't inject too much flexibility into other parameters.

Also, a reasonably tightly bounded overall estimate of the magnitude of random genetic drift, derived from the proportion of the Neanderthal genome that has been purged from modern humans, provides a robust and fairly direct estimate, from the longest time period for which ancient DNA is available for hominins, of the rate at which selectively neutral genes are purged by genetic drift in modern humans. Because this estimate is relatively population model independent, it can be used in the analysis of non-Neanderthal admixture population genetics (e.g. in estimates related to Denisovan admixture, putative African archaic admixture, admixtures of modern human populations in the Upper Paleolithic era, and the accuracy of estimates of the probability that a change in the proportion of a particular gene in a population was due to random genetic drift or selection). The error bars on this direct measure of random genetic drift in autosomal genes over that time period would be much smaller than the error bars around estimates of any of the specific parameters that could be used to estimate it from first principles using population models alone. Thus, making this estimate in the Neanderthal case would materially improve the statistical power of all of our long term population genetic estimates, a contribution that may be unique and may not be available with greater precision from any other set of data for the foreseeable future.

Explicitly estimating the impact of selective effects and the loss of genes due to random genetic drift is also likely to establish that the total number of archaic admixture events was larger than an estimate that ignores these effects, because, on balance, these effects tend to reduce the number of Neanderthal genes in the modern human genome. Thus, the process of estimating these numbers is likely to reveal that Neanderthals and modern humans had sex more often than a crude back of the napkin estimate would suggest. And, if the kind of process assumptions that most naturally explain the disconnect between autosomal genetic data and uniparental genetic data are also incorporated into the analysis (Haldane's rule, which also impacts fertility assumptions, and predominantly modern human mothers for hybrid children born into modern human tribes that averted extinction, which implies that there are large numbers of uncounted cases involving Neanderthal mothers that were erased from modern human populations in the present), the amount of cross species sexual activity between Neanderthals and modern humans may have been quite a bit higher indeed than the current percentage of our autosomal genome attributable to Neanderthal genes would suggest, probably on the order of a factor of two to five, which would be roughly the difference between a once in a generation event (a crude estimate without these considerations) and something like a once every few years event.

My intuition is that the amount of allele loss due to random genetic drift acting on selectively neutral genes that is actually observed in the Neanderthal case would suggest that the magnitude of the impact of random genetic drift in purging selectively neutral genes from modern human populations is quite a bit smaller than could safely be inferred by a naive estimate based on other existing data and pure population modeling not supported by this kind of empirical calibration. Thus, I suspect that this data will, generally, favor findings that a given change in gene frequency is more likely to have been a selective effect than a random one, and that populations not subject to selective pressures are more genetically stable than one might naively expect even with a fairly careful theoretical analysis.
