Friday, July 3, 2015

Admixture Analysis Of Global and Ancient Genomes

What Is Admixture Analysis?

A computer program called Admixture uses a mathematical algorithm, whose application to genome data includes a little art as well as a lot of science, to fit data from large numbers of genomes as well as possible into a model.  This model assumes that each person's genome is a mix of different proportions of a preset number of hypothetical ancestral populations, determined in a manner that maximizes the quality of the fit using linear algebra.  It is also possible to tweak the program, for example, by designating some real individual as the exemplar of an ancestry component, rather than having the computer derive its clusters entirely without outside input.
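As a rough illustration of the kind of model being fit, the sketch below assumes the standard binomial admixture likelihood, in which an individual's expected allele frequency at each SNP is the ancestry-weighted blend of the ancestral populations' allele frequencies.  The function names and toy numbers are hypothetical, and this is not the Admixture program's actual code, which also has to estimate the ancestral frequencies themselves.

```python
import math

def expected_allele_freq(q, p, snp):
    """Ancestry-weighted allele frequency for one individual at one SNP.

    q: ancestry fractions, one per ancestral population (they sum to 1).
    p: p[k][snp] is the allele frequency in hypothetical population k.
    """
    return sum(q_k * p_k[snp] for q_k, p_k in zip(q, p))

def log_likelihood(genotypes, q, p):
    """Binomial log-likelihood of genotypes (0, 1 or 2 copies of an allele),
    leaving out the constant binomial coefficient."""
    total = 0.0
    for snp, g in enumerate(genotypes):
        f = expected_allele_freq(q, p, snp)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

# Toy data invented for illustration: K=2 populations, 3 SNPs.
p = [[0.9, 0.1, 0.8],
     [0.2, 0.7, 0.3]]
genotypes = [2, 0, 2]

mostly_first = log_likelihood(genotypes, [0.9, 0.1], p)
mostly_second = log_likelihood(genotypes, [0.1, 0.9], p)
print(mostly_first > mostly_second)  # True
```

Fitting amounts to searching for the ancestry fractions (and ancestral frequencies) that make this likelihood as large as possible across every individual and SNP at once.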

I don't know precisely which choices were used to generate the latest and greatest result from Eurogenes, which analyzes a sample of 2059 modern and ancient genomes chosen to maximally capture all varieties of modern human genetic variation in an Admixture run at K=10 (i.e. requiring the program to fit the individuals in the sample into percentage contributions from ten ancestral populations generated by the computer).  This sample includes almost every available complete ancient genome (which number in the hundreds) and some global databases of genomes that are widely used in the professional literature (such as the 1000 Genomes Project) to represent the rest of the world.

An Example Of Admixture Analysis

For example, the first ten samples in his analysis are African-Americans from Denver (because AA for African-American comes first in the alphabetical listing).  For each individual, the run reports a percentage of ancestry from each of ten groups, which have been labeled for convenience after the fact to give a sense of where each component is most often found.  These categories are (with abbreviations spelled out):

1. Middle Eastern
2. San Bushman
3. American Indian
4. Northern Siberian
5. East Asian
6. Hindu Kush
7. Sub-Saharan African
8. European Hunter-Gatherer
9. Oceanian
10. East Siberian

For example, individual number 15 from the African-Americans from Denver sample included in his Admixture run is determined to be:

88.6% Sub-Saharan African
8.5% European Hunter-Gatherer
1.6% Middle Eastern
0.5% San Bushman
0.4% American Indian
0.4% Hindu Kush

The proportion of the other four ancestral components is negligible.

In terms an average person would understand, this individual is 89% black, 10.5% white, and 0.4% American Indian.  These proportions reflect the American reality that African-Americans typically have higher proportions of African ancestry, and lower proportions of non-African ancestry, than is typical of people with some African ancestry in Latin America and the Caribbean.
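The arithmetic behind that summary can be sketched as follows.  The grouping of components into colloquial categories is an assumption inferred from the surrounding discussion, and the variable and function names are made up for illustration.

```python
# Percentages reported for individual 15; the four negligible
# components are left out because no values are given for them.
profile = {
    "Sub-Saharan African": 88.6,
    "European Hunter-Gatherer": 8.5,
    "Middle Eastern": 1.6,
    "San Bushman": 0.5,
    "American Indian": 0.4,
    "Hindu Kush": 0.4,
}

# Assumed mapping from colloquial labels to Admixture components.
groups = {
    "black": ["Sub-Saharan African", "San Bushman"],
    "white": ["European Hunter-Gatherer", "Middle Eastern", "Hindu Kush"],
    "American Indian": ["American Indian"],
}

def collapse(profile, groups):
    """Sum component percentages within each colloquial category."""
    return {
        label: round(sum(profile.get(c, 0.0) for c in components), 1)
        for label, components in groups.items()
    }

summary = collapse(profile, groups)
print(summary)  # {'black': 89.1, 'white': 10.5, 'American Indian': 0.4}
```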

This individual's African ancestry is overwhelmingly from populations more like typical Niger-Congo language speaking West Africans and much less like "Paleo-African" populations like the Khoi-San bushmen of the Kalahari desert and the Pygmies of the Congo jungle.  This reflects the typical sources of individuals in the American slave trade.

The mix of "European hunter-gatherer" ancestry, "Middle Eastern" ancestry, and "Hindu Kush" ancestry in the "white" component of this individual's ancestry is roughly in line with what you would expect in someone of Scottish origins (which is typical of Southern whites in the U.S., many of whom were Scotch-Irish).

Small but measurable amounts of Native American ancestry are common in African Americans.

All of this is exactly what one would expect from other data in a typical African-American from Denver.  One of the African-Americans from Denver in the sample, however, who is an exception, is almost half-white and not quite half-black, and is probably light skinned relative to a typical African-American in Denver.

Similar breakdowns are available for all 2059 people, modern and ancient alike, in the database, although it takes a certain amount of familiarity with how the individuals are identified to know which are modern, which are ancient, and what modern ethnic groups or archaeological cultures are represented by the label given to an individual in the spreadsheet.

Insights: Genetic Variation Is Highly Structured And Far From Maximal

The fact that a reasonably accurate description of someone's ancestry, relative to seven billion or so living people and untold numbers of deceased individuals who preceded us, can be summed up with a fair degree of specificity with percentages of ten ancestral components, is itself remarkable.

The reality of human genetic variation observed in the real world is dramatically narrower than the default assumption that each SNP is random relative to the entire human population, in which each individual would be their own "special snowflake".  Each individual is unique, but the differences within ethnic communities often colloquially described in terms of race, linguistic affiliation and ancestral religious identification, are often quite subtle.

Indeed, vast areas of the human genome are totally ignored by people interested in genealogy, forensic applications, or ancient DNA research, because all modern humans are identical in that part of the genome.  A significant component of the part of the genome that has reached fixation in modern humans has also reached fixation in archaic hominins (like Neanderthals and Denisovans, for which we have ancient DNA to compare to), in primates, in mammals, and in vertebrates.  Indeed, all multi-celled animals, no matter how primitive, share more than 40% of their DNA at locations that are so functionally important that they have reached fixation.  The more that parts of a person's genome are ancestry informative and variable within modern humans, the more likely it is that those parts of a person's genome are not important to evolutionary fitness.

Every Ethnicity Has At Least One Distinctive Genetic Profile

Still, a person's genome is ancestry informative and can often pin down a person's likely self-identified ethnicity, race, ancestral religious affiliation and familial place of origin with great specificity.  In Europe, for example, it can pin down the likely place of origin of someone whose ancestors are all from the same region to a location within a hundred miles or so.

For example, when I compared the mix of "white" ancestral components of the African-American individual from Denver described above to the profiles of other populations in the sample, it was possible to rule out a white ancestor from the Near East, Southern Europe or Iceland (because the Middle Eastern component was proportionately too small compared to the other ancestral components), or from Russia (because Russians generally have a significant Northern Siberian component).

Similarly, a recent African immigrant from Somalia would have about ten times or more as much of the San Bushman ancestry component as someone descended from slaves in the American Southeast (the typical background of African-Americans), even though recent Somali immigrants would not be unheard of in Denver's African-American population.

There are some cases where a population that culturally is just one ethnicity, such as "African-Americans" in the United States, can actually have several distinct genetic profiles (e.g. Ethiopian-Americans and other African-Americans in Denver might both be classified socially as African-American, but would have different genetic profiles).

This reflects the fact that the history of human migration and diversification through mutations, isolation into separate reproductive populations, adaptation to new environments, and admixture, has been highly structured and has involved a finite number of populations, that these populations had enough time to homogenize while isolated from other populations, and that there have been a modest and finite number of significant admixture events in modern human history.

Few People Are Pure Types

Another descriptive observation of the data set is that at the K=10 level, few individuals are "pure types" with 99%+ ancestry from a single ancestry category.

There is no one in the sample with more than 83.1% Middle Eastern ancestry.  There is no one with more than 86.6% Hindu Kush ancestry (a component that would be more accurately described as Kalash).

There are 2 of 2059 individuals with more than 99% European hunter-gatherer ancestry from many thousands of years ago (both of whom are ancient DNA samples with sequences released within the last couple of years).

There are 8 of 2059 individuals with more than 99% San Bushman ancestry.  There are 35 of 2059 individuals with more than 99% American Indian ancestry.  There are 6 of 2059 individuals with more than 99% Northern Siberian ancestry.  There are 19 of 2059 individuals with more than 99% East Asian ancestry.  There are 79 of 2059 individuals with more than 99% Sub-Saharan African ancestry.  There are 14 of 2059 individuals with more than 99% Oceanian ancestry (a component that would be more accurately described as Papuan).  There are 2 of 2059 individuals who are more than 99% East Siberian.

Thus, in a sample of 2059 modern and ancient individuals, only 163 are "pure type" individuals, or 165 counting the two ancient European hunter-gatherer samples (about 8% of the sample either way), while the rest have at least two measurable ancestral components in their genomes.  Two of the ten ancestral populations have no "pure type" representative (Middle Eastern and Hindu Kush), and a third has no modern "pure type" representatives.
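The tally can be rechecked from the counts quoted above.  This is only a sketch that sums those figures; the dictionary keys are labels chosen here for illustration.  Note that the counts add up to 165 when the two ancient European hunter-gatherer samples are included, and to 163 when they are set aside.

```python
SAMPLE_SIZE = 2059

# Individuals with more than 99% ancestry from a single component,
# as quoted in the text.
pure_type_counts = {
    "European Hunter-Gatherer (ancient samples)": 2,
    "San Bushman": 8,
    "American Indian": 35,
    "Northern Siberian": 6,
    "East Asian": 19,
    "Sub-Saharan African": 79,
    "Oceanian": 14,
    "East Siberian": 2,
    "Middle Eastern": 0,
    "Hindu Kush": 0,
}

all_pure = sum(pure_type_counts.values())
without_ancient_hg = (
    all_pure - pure_type_counts["European Hunter-Gatherer (ancient samples)"]
)

# Count and share of the sample, in percent.
print(all_pure, round(100 * all_pure / SAMPLE_SIZE, 2))                      # 165 8.01
print(without_ancient_hg, round(100 * without_ancient_hg / SAMPLE_SIZE, 2))  # 163 7.92
```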

Also, it is worth recognizing that representation in the sample is not proportionate to modern population size, and indeed, is deliberately chosen to overrepresent genetically distinct populations.  This is a maximally diverse sample, rather than a representative sample of human genetic diversity.

The populations that are pure types for the San Bushman, Northern Siberian, and East Siberian components (three of the seven ancestral types with modern representatives) are tiny relict populations that subsist in large part on hunting and gathering.

Pure type Papuans are present only on an island between Australia and China that has little contact with the outside world and uses traditional indigenous non-mechanized agriculture.  Only a very small percentage of Native Americans are "pure blooded" and those who are generally live on economically marginal reservations or in remote jungles or mountain villages.  The "pure type" individuals in all of these populations combined make up considerably less than 1% of the world's population alive today.

Only the Sub-Saharan African and East Asian components have pure type individuals who are present in modern populations that are not tiny and marginalized.

No One Has Measurable Amounts Of All Components

While few individuals are "pure types", no one in the sample has measurable contributions to their genome (defined as more than one part per 100,000) from all ten of the globally determined ancestral components.

I was able to identify only six individuals who had measurable amounts of nine of the ten ancestral components in the sample: Turkish4BA57, SaudiA7, HGDP00148 (Makrani, a South Asian Muslim ethnicity), Jordan646, usb25 (an Uzbek) and Yemenese1529.  Given that all of these individuals are from predominantly Muslim areas, it is plausible to infer that global, religiously mandated pilgrimages to Mecca have led to trace admixture in many Muslim populations from all over the Muslim world, and indirectly from almost everywhere around the globe.

Everyone else in the sample of 2059 individuals (modern and ancient), apart from the 163 pure types and the six individuals with nine components, had two to eight ancestral components, and the lion's share have fewer than eight.

With a cutoff that excludes negligible contributions from an ancestral component (say, less than 0.1%), there would be no individuals with nine ancestral components, and the average number of ancestral components per person would be much smaller.
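To make the effect of a cutoff concrete, here is a minimal sketch that counts components above a threshold.  It reuses the six percentages reported earlier for individual 15 from the Denver sample (the four negligible components are left out because no values are given for them), and the function name is hypothetical.

```python
def count_components(profile, cutoff_percent):
    """Number of ancestry components whose share exceeds the cutoff."""
    return sum(1 for share in profile.values() if share > cutoff_percent)

# Individual 15 from the Denver sample, in percent.
profile = {
    "Sub-Saharan African": 88.6,
    "European Hunter-Gatherer": 8.5,
    "Middle Eastern": 1.6,
    "San Bushman": 0.5,
    "American Indian": 0.4,
    "Hindu Kush": 0.4,
}

print(count_components(profile, 0.1))  # 6
print(count_components(profile, 1.0))  # 3
```

Raising the cutoff from 0.1% to 1% halves the number of components counted for this one person, which is why the choice of threshold matters so much for statements about how many components people have.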

In any given region or ethnicity, individuals typically have slightly varying percentages of just a few components.

For example, the seven Finnish people in the sample all have generally similar percentages of four of the ten ancestral components: Middle Eastern (9.1%-14.8%), Northern Siberian (4.6%-9.7%), Hindu Kush (4.8%-13.0%), and European Hunter-Gatherer (66.6%-76.2%).  Six of the seven had small amounts of East Siberian ancestry (0.7%-2.5%), and four also had trace amounts of American Indian ancestry (0.2%-1.2%), including the one with no East Siberian ancestry.  None of the Finnish individuals had any San Bushman, East Asian, Sub-Saharan African, or Oceanian ancestry.
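Ranges like these are what make it possible to rule a candidate origin in or out by eye.  The sketch below checks a profile against the four Finnish ranges quoted above.  The two test individuals are invented for illustration, and with only seven people behind the ranges this is a rough screen rather than a statistical test.

```python
# Observed ranges (percent) among the seven Finnish individuals.
finnish_ranges = {
    "Middle Eastern": (9.1, 14.8),
    "Northern Siberian": (4.6, 9.7),
    "Hindu Kush": (4.8, 13.0),
    "European Hunter-Gatherer": (66.6, 76.2),
}

def fits_ranges(profile, ranges):
    """True if every listed component falls inside its observed range."""
    return all(
        low <= profile.get(name, 0.0) <= high
        for name, (low, high) in ranges.items()
    )

# Hypothetical individuals, not taken from the spreadsheet.
candidate = {
    "Middle Eastern": 12.0,
    "Northern Siberian": 6.0,
    "Hindu Kush": 9.0,
    "European Hunter-Gatherer": 71.0,
    "East Siberian": 2.0,
}
outlier = dict(candidate, **{"Middle Eastern": 30.0,
                             "European Hunter-Gatherer": 53.0})

print(fits_ranges(candidate, finnish_ranges))  # True
print(fits_ranges(outlier, finnish_ranges))    # False
```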

A GUT Model That Makes Some Progress On SM Constants

Grand Unified Theories are notorious for trying to find a unified structure that encompasses the Standard Model particles, while not providing any meaningful insight into the origins of the fundamental mass and mixing parameters of the Standard Model.

An exception is the pre-print Feruglio, Patel and Vicino, "A realistic pattern of fermion masses from a five-dimensional SO(10) model" (July 2, 2015).  Its abstract states that:
We provide a unified description of fermion masses and mixing angles in the framework of a supersymmetric grand unified SO(10) model with anarchic Yukawa couplings of order unity. The space-time is five dimensional and the extra flat spatial dimension is compactified on the orbifold S1/(Z2 x Z'2), leading to Pati-Salam gauge symmetry on the boundary where Yukawa interactions are localised. The gauge symmetry breaking is completed by means of a rather economic scalar sector, avoiding the doublet-triplet splitting problem. The matter fields live in the bulk and their massless modes get exponential profiles, which naturally explain the mass hierarchy of the different fermion generations. Quarks and leptons properties are naturally reproduced by a mechanism, first proposed by Kitano and Li, that lifts the SO(10) degeneracy of bulk masses in terms of a single parameter. 
The model provides a realistic pattern of fermion masses and mixing angles for large values of tan β. It favours normally ordered neutrino mass spectrum with the lightest neutrino mass below 0.01 eV and no preference for leptonic CP violating phases. The right handed neutrino mass spectrum is very hierarchical and does not allow for thermal leptogenesis. We analyse several variants of the basic framework and find that the results concerning the fermion spectrum are remarkably stable.
As the authors explain further:
Below the GUT scale the theory looks like the MSSM and we expect standard SUSY gauge coupling unification. In order to suppress the higher order corrections in Eq. (2.8), we take c ≡ ΛπR ≈ O(100) so that the cut-off of the theory, Λ can be lifted up to the Planck scale . . . . The higher order corrections are at the percent level and remain smaller than experimental uncertainty in the fermion mass data we adopt. The theory provides a predictive framework for fermion masses and mixing angles[.]
This model does not reproduce the Standard Model constants exactly (it assumes that a group of constants relevant to determining the observable values are random numbers on the order of 1), is not terribly original, and has some rough spots that mark it as an incomplete work.  But, it at least takes a serious stab at developing a model that reproduces the hierarchy of Standard Model mass and mixing constants up to an order of magnitude level, and even risks displaying calculations made with it. This makes it quite notable relative to the ordinary flurry of toy models that make no concrete predictions at all about the Standard Model constants.

They renormalize the observable parameters up to the GUT scale, using results from prior literature for how those parameters run in the Minimal Supersymmetric Standard Model (MSSM), and derive their predicted values at the GUT scale.

By choosing a "normal" neutrino hierarchy, setting tan β = 50, and testing random values of the nine free parameters of the theory within the range 0.5-1.5 (i.e. of order 1), they manage to get results that are within two sigma of the observed values of the parameters about 0.05% of the time, i.e. one time in 2000 (compared to less than one in 100,000 trials in an inverted hierarchy, and a poor fit indeed with a smaller value of tan β).

In the MSSM, tan β is the ratio of the vacuum expectation values of the two Higgs doublets.  Since those two values combine in quadrature to give the electroweak scale of about 246 GeV, tan β = 50 means that one doublet has a vacuum expectation value of very nearly 246 GeV, while the other's is only about 5 GeV.

The model predicts, when fitted to known experimental values, a lightest neutrino mass of about 3.9 meV, an effective Majorana mass for neutrinoless double beta decay of 4.96 meV, and a spectrum of right handed neutrino masses of 190 GeV, 802 TeV, and 1.42*10^11 TeV.

I personally seriously doubt that there are right handed neutrinos at all, suspect that neutrinos do not have any Majorana mass, and expect that the lightest neutrino mass is on the order of 1 meV or less.  In other words, I think all of these predictions are wrong.  But, they are at least not yet contradicted by direct experimental evidence.

There are a host of good reasons to believe that the universe is not described by the MSSM or something quite similar to it, particularly now that the data from the first run of the LHC is out, but the MSSM does provide a more mathematically tractable way to deal with some general concepts related to grand unified theories that may be pertinent to one that actually does explain the Standard Model.  So, while these results must be taken with a heap of salt, it is nice to see someone working on a model they can do some calculations with in order to get some intuition about which corner of MSSM parameter space is the best fit to the real world, from which the nature of a GUT that actually does describe the universe might be more easily conceived.

Thursday, July 2, 2015

The Old European Culture blog

The Old European Culture blog, which commenced a little more than a year and a half ago, explores historical linguistics, archaeology, the legends of European people from prehistory, and their connection to archaeogenetics.  It is a true treasure with in-depth analysis, well reasoned conjectures, quality illustrations, and more.

The most recent post, for example, puts a deadly skirmish that took place about 1200 BCE near a river in the middle of nowhere to the southeast of Denmark into the larger historical context of Bronze Age cultures, in a way that brings to life the industries and commerce that drove those societies.

Another recent post makes some striking observations about the origins of the word for embassy in Greek and words with a related meaning and very similar form in South Serbian.  (This also got me thinking about pre-Greek substrates).

Yet another digs deep into Irish origin legends (after you look at that, it is also appropriate to look at the Wikipedia article on a possible non-Indo-European Goidelic substrate in the Celtic Irish language).

Go read it.

Elsewhere around the blogosphere there is a nice account of the sequencing of a complete woolly mammoth genome.