Thursday, March 1, 2012

Genome Bloggers Get Sophisticated

Amateur analysis of population genetics using large publicly available samples and widely available statistical analysis tools has reached a greater level of sophistication than most published papers on the subject did just a few years ago.

A good example is a recent blog post at Ethio Helix (hat tip Maju) analyzing almost all of the publicly available autosomal DNA data from modern African populations, with principal component analysis charts (in two and three dimensions), Admixture charts, maps showing population locations, charts showing linguistic affinities, disclosures of methodology, references to the main sources in the recent literature, and solid, if laconic, analysis.

This was derived from a data set of "publicly available data from 3970 individuals from around the world typed for 27,022 Autosomal SNPs, which can be found all over the 22 pairs of chromosomes (but not uniformly)." After trimming the global data for an African-oriented analysis and to reduce likely confounds from related individuals, the numbers that were left to be crunched consisted of "1,065 Individuals from Africa and 26,129 SNPs for analysis." Put another way, this blogger has made sense of roughly 27.8 million data points.
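The size of the data set follows directly from the two figures quoted above. A minimal sketch of the arithmetic, assuming one genotype call per individual per SNP with no missing calls (an assumption; the post does not report a missingness rate):

```python
# Figures quoted from the Ethio Helix post.
individuals = 1_065
snps = 26_129

# One genotype call per individual per SNP (assumes no missing data).
genotype_calls = individuals * snps

print(f"{genotype_calls:,} genotype calls")
```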

Reassuringly, the conclusions are generally in harmony with what would be expected based on prior uniparental analysis, history, ethnography, linguistics, known admixture histories, and geography, although with a few nuances and modest surprises thrown in.

Principal Component Analysis

One surprising result is the fact that the "Hadza" cluster (an East African hunter-gatherer population which one would naively expect to be quite divergent but not the most divergent) turns out to be quite a bit more of an outlier in autosomal genetics than the "KhoiSan" and "Pygmy" clusters that one would naively expect to be the most divergent.

The San, !Kung, Pygmies and Hadza separate from the other African populations in the second principal component of the principal component analysis. The first principal component separates East and North Africans from West/Central/South Africans (and could very well mark a degree of Eurasian admixture), while the third separates East Africans from all the rest.
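For readers unfamiliar with how such charts are produced, here is a minimal sketch of principal component analysis on a genotype matrix. It is not the Ethio Helix pipeline: the function name `pca_scores`, the two synthetic populations, and their allele frequencies are all hypothetical, and the standardization step is one common convention among several.

```python
import numpy as np

def pca_scores(genotypes, n_components=3):
    """Project individuals onto the top principal components.

    genotypes: (individuals x SNPs) array of minor-allele counts (0, 1 or 2).
    Returns the PC scores and the share of variance each PC explains.
    """
    g = np.asarray(genotypes, dtype=float)
    # Drop monomorphic SNPs: they have zero variance and carry no signal.
    g = g[:, g.std(axis=0) > 0]
    # Center and scale each SNP so common and rare variants are comparable.
    z = (g - g.mean(axis=0)) / g.std(axis=0)
    # SVD of the standardized matrix; the columns of u * s are the PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]

# Synthetic data only: two hypothetical populations whose allele
# frequencies have drifted apart at every SNP.
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, n_snps))
pop_b = rng.binomial(2, freq_b, size=(20, n_snps))

scores, explained = pca_scores(np.vstack([pop_a, pop_b]))
```

Each row of `scores` is one individual, and plotting the first column against the second gives the kind of two-dimensional chart described above. Which populations a given component separates depends entirely on the samples included, which is why a small, inbred sample can dominate a component.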

The Four Paleo-African Admixture Components

In a ten population admixture analysis, two separate nearly pure Pygmy populations emerge (Biaka and Mbuti), as would be expected from prior studies; the San and !Kung are all predominantly from the same component; and there is a Hadza component that is predominant in that population and rare in the others.

The remaining people of Africa predominantly have origins in the six remaining ancestral components labeled East Africa 1, East Africa 2, Eastern Bantu, West Africa, North Africa and Central-West Africa.

The North African Component and Afro-Asiatic Pre-History

The North African component, which is predominant in North Africa, is also found in all of the Ethiopian populations to a substantial extent (about a third of ancestry of Oromo Ethiopians and about half of other Ethiopians). All of the populations with high levels of this component speak Afro-Asiatic languages, except the Fulani, who have not quite 40% of this component and speak a Niger-Congo language. Arabic, which is now the predominant language of North Africa, and the Afro-Asiatic languages of Ethiopia are Semitic languages, but the North African component is also high in populations that speak languages in the Berber language family within the Afro-Asiatic family (which was predominant among Afro-Asiatic linguistic populations of North Africa until the Islamic empire expanded ca. 1400 years ago), and it is high in Egypt, where Coptic languages were predominant for much of the historic era.

The North African component has a similar footprint to Y-DNA haplogroup E1b1b, and to areas where mtDNA haplogroups other than L(xMxN) (all of which are usually viewed as Eurasian backmigrants) are found at fairly high frequencies.

It has long been apparent that the Fulani are distinctive in uniparental genetics relative to other Niger-Congo language speakers, and this analysis continues to support a scenario in which a substantial ancestral population to the Fulani underwent language shift from an Afro-Asiatic language to Niger-Congo languages, either abruptly, or piecemeal as a result of ongoing bride exchange that this Sahelian, traditionally nomadic pastoralist population might have had with Afro-Asiatic speaking populations to the North and East of its traditional territorial extent.

The lower level of the North African component in Oromo speakers in Ethiopia relative to other Ethiopians is consistent with the notion that there was a meaningful demic component to the transition from a prior language to Ethiosemitic languages in Ethiopia. Ethiosemitic languages appear to all descend from a single proto-Ethio-Semitic language probably in the general time frame of the Bronze Age. Oromo speakers are probably closer to a typical Ethiopian ancestry before that point, although it isn't implausible to suggest that Oromo, an out group in the Afro-Asiatic languages, might represent the linguistic consequences and creolization of Nilo-Saharan people and Ethiopian pre-Semitic people who admixed on a basis somewhat favoring the Ethiopian side but not decisively.

The linguistically Afro-Asiatic Mada and Hausa peoples have almost none of this component (although this description of the Mada people of Northern Cameroon asserts that they speak a Niger-Congo language). The Mada, who are currently predominantly Muslims (probably converting only centuries after the initial emergence of the Islamic empire in North Africa) and may have begun using Arabic in daily life, may be coded as Afro-Asiatic rather than Niger-Congo for that reason.

The Hausa speak the most widely spoken Afro-Asiatic language in the Chadic language family, and are notable in both their paternal and maternal uniparental genetic makeup, with the most distinctive aspect being the high frequency of Y-DNA haplogroup R1b-V88 in Hausa men, a haplogroup much less common in other African populations and quite rare outside the Sahel. My lay take on the uniparental haplogroup mix in the Hausa people a few years ago, when I was looking at tables in academic papers on that population, was that they seemed less admixed with neighboring populations than, for example, the Fulani, which was suggestive of a more recent ethnogenesis in Africa from a coherent founding population, probably with a significant Eurasian component. Y-DNA haplogroup R1b is predominant in most of Western Europe, but the R1b-V88 haplogroup diverges phylogenetically from the European R1b Y-DNA haplogroups at the most basal level of R1b, suggesting a fairly ancient back migration from Eurasia of R1b-V88 paternal founders to Africa that was a separate wave from that associated with other linguistically Afro-Asiatic populations of Africa. Given that the Hausa are not greatly different from the Igbo and Yoruba in autosomal genetics, however, the autosomal data seem to support the case for a Hausa ethnogenesis based upon an elite dominated language shift with only a very small demic component, in a population predominantly identical in ancestry to other West African Niger-Congo language speakers.

The North African component is also found at more than trace levels in a couple of linguistically Nilo-Saharan populations and the Sandawe.

Eastern Bantu and Central West Africa and West Africa

As Ethio Helix notes looking at the Fst distances of the populations: "The 'West African', 'West-Central African' and 'Eastern Bantu' clusters are quite close to each other as can be expected."

The Eastern Bantu component is never found in more than trace levels apart from the Central West Africa component. Most populations with a significant Eastern Bantu component also have a West African component and generally a smattering of one or both of the Pygmy components. The Sandawe and some of the South African populations have Eastern Bantu, Central West Africa and Pygmy components, but no West African component.

Conceptually, it probably makes the most sense to think of the Bantu expansion as involving an uneven mix of Eastern Bantu, Central West Africa, West Africa, and both of the Pygmy populations that varies locally based upon founder effects and differing waves of Bantu expansions involved.

It also makes sense to have a hierarchical model to describe the demographic impacts of Bantu expansions. In populations outside West Africa that have several of these components and also other components, one can imagine (1) the Eastern Bantu, Central West Africa and West Africa components together as slight variations on a common Niger-Congo heritage within the core of at least a couple of expanding Bantu waves with different proportions, (2) the two Pygmy components as minor introgressions at the early stage of Bantu expansion, and (3) the remaining components as ancestral or substrate populations in the region that were linguistically overwhelmed by expanding Bantus who made a major demographic contribution to the resulting populations.

Indeed, given how slight the Fst distinctions are between Eastern Bantu, Central West Africa and West Africa, a K=8 admixture scenario that lumps these three Admixture-created ancestry groups would probably be a more helpful way to model African autosomal genetic diversity than the K=10 model run by Ethio Helix, at least at the Africa-wide level. Viewing these three Admixture program generated ancestral components as individually on a par with the other seven is probably misleading. The possibility of subdividing them at all is probably only made possible by the very large population size in the Bantu descended peoples of Africa, and it may not make a lot of sense to read too much into the distinctions within these mixes.
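The lumping suggested above can be approximated after the fact by summing the three components in each individual's K=10 result. A minimal sketch follows; the proportions are made-up illustrative numbers, not values from the Ethio Helix run, and the helper `lump` and the label "Niger-Congo core" are hypothetical names. Summing is only an approximation of what a fresh Admixture run at K=8 would return, since a rerun is free to draw its clusters differently.

```python
# Hypothetical K=10 proportions for a single individual (illustrative only).
k10 = {
    "Biaka Pygmy": 0.02,
    "Mbuti Pygmy": 0.03,
    "Khoisan": 0.00,
    "Hadza": 0.00,
    "East Africa 1": 0.05,
    "East Africa 2": 0.10,
    "North Africa": 0.05,
    "Eastern Bantu": 0.40,
    "West Africa": 0.15,
    "Central-West Africa": 0.20,
}

# The three closely related components discussed above.
NIGER_CONGO = ("Eastern Bantu", "West Africa", "Central-West Africa")

def lump(proportions, members, label):
    """Merge the named components into a single component called `label`."""
    merged = {k: v for k, v in proportions.items() if k not in members}
    merged[label] = sum(proportions.get(k, 0.0) for k in members)
    return merged

k8 = lump(k10, NIGER_CONGO, "Niger-Congo core")
```

Here `k8` has eight entries that still sum to one, with the three closely related components reported as a single share.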

East Africa 2

This analysis still leaves the East Africa 2 component to make sense of, and while it is somewhat close in Fst distance to East Africa 1 (similar to the distances of the Khoisan and two Pygmy populations from each other), it is still legitimately distinct from each of the other components and clearly distinguishable as independent.

About 11% of Egyptian ancestry is assigned to East Africa 2, as is about 8% of Libyan ancestry, but it appears at only trace levels in other North African, West African, and South African populations. It is also absent in Pygmy populations, and in the Central African Kongo and Fang populations.

It is present at significant levels in the Central African Bulala (35%), Mada (20%), Hema (15%), and Kaba (8%) populations, all of which are on the Eastern half of the data points described as Central African in the analysis. The other three non-Pygmy Central African data points are near the Atlantic coast and geographically very close to many of the West African data points.

East Africa 2 is the predominant residual left in all of the East African populations but the Hadza after one removes the North African contribution and the contribution attributable to the Bantu components. It makes up about:

53% in Maasai,
50% in Oromo Ethiopian,
40% in Ethiosemitic Ethiopians,
38% in the Sandawe,
8% in two Eastern Bantu populations,
22% in Alur, a Nilo-Saharan population that is about 15% Pygmy and the remainder from the three core Bantu components.

East Africa 1

The East Africa 1 component appears to be about 12% in the Maasai, about 5%-8% in Ethiopian populations, at trace levels of 4% or less in a few other East African and Central African populations, and at trace levels in a couple of other specific African populations in other areas. Like the Ancestral South Indian component in India, this apparent ancestral population does not appear to exist in a pure or predominant type in any population now living.

All of the Nilo-Saharan language speaking populations seem to have some amount of this component, but not necessarily at levels much greater than non-Nilo-Saharan populations in the same geographic area. East Africa 1 is a considerably smaller component than East Africa 2 in every population that has more than a trace component of either.

It is hard to know what to make of this Admixture result. We know with considerable certainty when the Bantu components arrived in East Africa, and when at least a chunk (10% of the total or so) of the North African component arrived in Ethiopia. While the point is considerably more controversial, it is not unreasonable to view the North African component in its entirety as intrusive relative to either of the East African components, probably not much further back in time than the Holocene (i.e. on the order of 10,000-12,000 years ago, around the time of the Natufians).

Fst distance is roughly a proxy for time depth of separation. The Fst distance between East Africa 1 and East Africa 2 is on the same order of magnitude as the distances of the Khoisan and the two Pygmy groups from each other. The common estimate for the divergence of the Khoisan and Pygmy groups from other African populations, based upon uniparental markers, is on the order of 70,000 years ago. The Pygmy groups are believed to have diverged from each other sometime on the order of 20,000 years ago. All three of these Paleoafrican populations are much more distant from other African populations than they are from each other in the Fst chart. So, a time depth estimate for the divergence of East Africa 1 and East Africa 2 on the order of 20,000 years, around the time of the Last Glacial Maximum, would seem to make sense as a point of divergence between these two ancestral populations, although in this region, the pivotal points would probably be migrations triggered by shifting aridity in the Sahara and its vicinity.

Given the relative frequencies of East Africa 1 and East Africa 2, it seems as if East Africa 1 has a focus around the White Nile and a significant presence in the Blue Nile. It might have an original core in Sudan or near Lake Chad that may have intruded into and thoroughly admixed with other East African populations (except the Hadza and Sandawe) before the North Africans or Bantu arrived on the scene.
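One textbook way to see why Fst can stand in for time depth is the pure-drift approximation Fst ≈ 1 - exp(-t / (2 Ne)), where t is the number of generations since the split and Ne is the effective population size. The sketch below inverts that relation. It is an illustration only and not the calibration used in the sources discussed here: the function name `divergence_years`, the effective size of 10,000, the 25 year generation time, and the Fst values are all assumed for the example, and the model ignores migration and changes in population size.

```python
import math

def divergence_years(fst, effective_size, generation_years=25.0):
    """Years since a split, from Fst ≈ 1 - exp(-t / (2 * Ne)).

    Assumes two isolated populations of constant effective size Ne,
    with t measured in generations.
    """
    if not 0.0 <= fst < 1.0:
        raise ValueError("fst must be in the range [0, 1)")
    generations = -2.0 * effective_size * math.log(1.0 - fst)
    return generations * generation_years

# Hypothetical inputs chosen only to show how the estimate scales.
for fst in (0.02, 0.04):
    years = divergence_years(fst, effective_size=10_000)
    print(f"Fst = {fst:.2f} -> about {years:,.0f} years")
```

For small Fst the relation is close to linear, so doubling Fst roughly doubles the implied time depth. The estimate also scales directly with the assumed effective size, which is why a small, inbred population can show a large Fst without an ancient split.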

The overall genetic picture also seems, in my view, to tend to support an origin for Afro-Asiatic in North Africa or the Levant, rather than, for example, Ethiopia. Neither East Africa 1 nor East Africa 2 is an Afro-Asiatic linguistic population marker to any meaningful degree, while the markers that do track Afro-Asiatic speakers relatively well seem to look like back migrants from Eurasia.

While no population is a pure example of the mix of East Africa 1 and East Africa 2 that was probably in place before North African or Bantu admixture, the common elements of the Maasai (65%), the Ethiopians and the Sandawe, all of which have large contributions from this East African source, would be most representative in modern peoples of this ancestral mix of East African populations.

UPDATE On 3/2/12 Per Ethio Helix discussions and comments:

1. The Mada are actually Chadic and not Niger-Congo as the source I linked incorrectly stated. Like the Hausa, they are heavy in R1b.

2. The Hadza outlier could be a product of high levels of inbreeding and cryptic relatedness in the sample of seventeen individuals, something Maju noted (comparing it to the analogous situation of the Kalash of South Asia) and that I thought but didn't put in writing. This reduces the value of the autosomal data relative to the uniparental data on the Hadza for determining their genetic affinities.

3. As I looked more closely at the outlier of the Alur people, who have 15% Pygmy descent, the source soon became clear. They are quite close geographically to the Eastern (i.e. Mbuti) Pygmy population (sample size 36), closer than any other of the East African populations, and closer than any of the Central West African populations other than the Hema people (both the Alur and Hema are Nilo-Saharan linguistically). Given that the Alur sample size is ten and that the Hema sample size (which has far less Pygmy admixture) is fifteen, just two or three individuals with an undisclosed Pygmy grandparent could probably give rise to the observed result. Query if it would be appropriate to pool the Alur and the Hema samples, as they are very geographically close to each other and speak related languages.

4. I note the absence of samples from Mozambique and Madagascar. The Madagascar case, I know from other studies, approaches 50% Austronesian and 50% Bantu. But, the one study that has been done of autosomal genetics in Mozambique of which I am aware indicated that the ancestral component that was not attributable to a Bantu component was one that is not found in pure form in any African population, and is distinct from any other population in Africa, a bit like East Africa 1 and East Africa 2 in East Africa. So, its absence is more disappointing. I will look for a reference.

5. I note a post about fourteen months ago at Wash Park Prophet on African uniparental genetic markers (see also here and here). A deep global uniparental lineage analysis there is also notable. A few notes about the climate history of the Lake Chad Basin and African American population genetics are relevant. An interesting outlier relevant to African population structure is an mtDNA link of some widely dispersed geographic groups from Africa to Scandinavia to Siberia:

Genetically, some of the closest relatives to the Berbers of North Africa are the Saami of Northern Scandinavia. The Berbers, together with the Yakut of Northeastern Siberia and the Fulbe of the Sahel in Africa, share closely related versions of mitochondrial DNA lineage U5b1b. This lineage is not the only close connection. Lineages of mtDNA lines H1, H3 and V are also closely related in the two groups.

These mtDNA links may plausibly be as old as the expansive influence of the Natufian culture, an Epipaleolithic culture originating in the Levant that appears to have cultivated pre-domesticated versions of wild cereals and lived a sedentary lifestyle, or perhaps even a bit older, although H1, H3 and V look post-LGM to me.

I examined Chadic and Fulani genetics in a post about two years ago. Chadic mtDNA is explored in this post at Mathilda's Anthropology Blog, which is a knowledgeable but opinionated blog focused mostly on African genetics, physical anthropology and archaeology with a North African bent (the post at that blog was on June 13, 2010, and the author was suffering from a serious progressive condition at that time, so I worry now and then about what has become of her). A post on the genetics of an Afro-Dravidian hypothesis also touches on relevant issues. Earlier studies of African autosomal genetics are here. See also my comments at Gene Expression on the genetics of Ethiopia, Madagascar and Mozambique with links to references.

In terms of trying to make sense of the population structure of Africa before Bantu expansion and before West Eurasian back migration to Africa predominantly in the region where Afro-Asiatic languages are now spoken, one important hole in the analysis done at Ethio Helix is the lack of samples from Mozambique. A 2010 article in the European Journal of Human Genetics found genetic traces of a substrate population in Mozambique that was an ancestral component distinct from any of the other ancestral populations of Africa. As the body text in that open access article explains (citations omitted):

The southeastern Bantu from Mozambique are remarkably differentiated from the western Niger-Congo speaking populations, such as the Mandenka and the Yoruba, and also differentiated from geographically closer Eastern Bantu samples, such as Luhya.

These results suggest that the Bantu expansion of languages, which started ~5000 years ago at the present day border region of Nigeria and Cameroon, and was probably related to the spread of agriculture and the emergence of iron technology, was not a demographic homogeneous migration with population replacement in the southernmost part of the continent, but acquired more divergence, likely because of the integration of pre-Bantu people.

The complexity of the expansion of Bantu languages to the south (with an eastern and a western route), might have produced differential degrees of assimilation of previous populations of hunter gatherers. This assimilation has been detected through uniparental markers because of the genetic comparison of nowadays hunter gatherers (Pygmies and Khoisan) with Bantu speaker agriculturalists.

Nonetheless, the singularity of the southeastern population of Mozambique (poorly related to present Khoisan) could be attributed to a complete assimilation of ancient genetically differentiated populations (presently unknown) by Bantu speakers in southeastern Africa, without leaving any pre-Bantu population in the area to compare with.

In other words, there is genetic evidence that there was an entire "lost race" in Mozambique, distinct from Khoisans, Pygmies, East Africans and West Africans, with no pure blooded remaining members that disappeared when it was subsumed into the population of Bantus expanding into the area.

Those Bantus had a linguistic and cultural legacy from a quite precisely identified small region near the coastal border of Nigeria and Cameroon, but had assimilated a mish mash of African peoples into their genetic melting pot, including the ancestral peoples of Mozambique, on their way to Southeast Africa.

Attaching dates to the prehistoric population structure of Africa is assisted by sources such as a 2010 survey of archaeogenetics discussed here, and the observation that genetic dissimilarity as measured by the statistical metric called Fst has an empirically calibrated and theoretically sensible correspondence with how many years ago two populations diverged.

The divergence of a proto-population shared by Khoisan and Pygmies from other African populations is estimated at 70,000 years ago. The split between the Khoisan and the Pygmies is estimated at 35,000 years ago, and the split between the two Pygmy populations is estimated at 18,000 years ago.

I observed before that the divergence measured by Fst between the two East African components in the Ethio Helix analysis is on the same order of magnitude as that between the two Pygmy populations, and as that between the Khoisan and the more closely related Pygmy population, which would imply a time frame of between 18,000 and 35,000 years ago.

The extent of the Hadza outlier is probably simply a product of a high level of inbreeding, but if it were taken at face value, it would imply that the Hadza diverged from other African populations perhaps twice as long ago as the proto-population that gave rise to the Khoisan and Pygmy populations, i.e. probably around the time of, or shortly before, the Out of Africa event, ca. 100,000-120,000 years ago. This would be the most basal split in the modern human lineage. This would also be more remote than the aboriginal Australians and Papuans who have been isolated for ca. 45,000-50,000 years from other Eurasians, but who also have substantial Neanderthal and Denisovan admixture that the Hadza presumably lack.

The pertinent part of the Wikipedia entry on the Hadza, who currently number about a thousand, states (citations omitted):

The Hadza are not closely related to any other people. Hadza was once classified with the Khoisan languages because it has clicks, but there is no real evidence they are related, and Hadza is now usually considered an isolate. Genetically, the Hadza do not appear to be particularly closely related to other Khoisan-language-speakers: even the Sandawe, who live just 150 km away, diverged from the Hadza more than 15,000 years ago. Genetic testing also suggests significant admixture has occurred between the Hadza and Bantu, Nilotic and Cushitic-speaking populations in the last few thousand years. Today a small number of Hadza women marry into neighbouring groups such as the Bantu Isanzu and the Nilotic Datoga, but these marriages often fail and the woman and her children return to the Hadza. In previous decades rape or capture of Hadza women by outsiders seems to have been common. The reverse situation (Hadza men marrying non-Hadza women) is very rare today, probably because their neighbours view the Hadza as having low status, although during a famine in 1918-20 some Hadza men are reported as taking Isanzu wives.

The Hadza's ancestors have probably lived in their current territory for a very long time. Hadzaland is just 50 km from Olduvai Gorge, an area sometimes called the "Cradle of Mankind" because of the number of hominin fossils found there, and 40 km from the prehistoric site of Laetoli. Archaeological evidence suggests that the area has been continuously occupied by hunter gatherers much like the Hadza since at least the beginning of the Later Stone Age, 50,000 years ago. Although they do not make rock art today, several rock art sites within their territory, probably at least two thousand years old, are considered by the Hadza to have been created by their ancestors, and their own oral history does not suggest they moved to Hadzaland from elsewhere.

Until about 500 BCE Tanzania was exclusively occupied by hunter-gatherers akin to the Hadza. The first agriculturalists to enter the region were Cushitic-speaking cattle herders from the Horn of Africa. Around 500 CE the Bantu expansion reached Tanzania, bringing populations of farmers with iron tools and weapons. The last major ethnic group to enter the region were Nilotic pastoralists who migrated south from Sudan in the 18th century. Each of these expansions of farming and herding peoples displaced earlier populations of hunter-gatherers, who would have generally been at a demographic and technological disadvantage, and vulnerable to the loss of environment resources (i.e., foraging areas and habitats for game) as a result of the spread of farmland and pastures. Groups such as the Hadza and the Sandawe are therefore remnants of indigenous hunter-gatherer populations that were once much more widespread, and are under pressure from the continued expansion of agriculture into areas which they have traditionally occupied.

This Southeast African population becomes distinguishable in the third principal component in that form of analysis, and at K=4 in a structure analysis (after the ancestral population for Paleoafricans emerges at K=3).

Put the pieces together and one can imagine as of ca. 16,000 years ago (within a few millennia of the peopling of the Americas by modern humans), an ancestral West African population, two ancestral East African populations in addition to the Hadza, an ancestral Mozambican population, an ancestral Khoisan population, an ancestral Eastern Pygmy population, an ancestral Western Pygmy population, and an ancestral North African population that has left behind few discernible traces and was soon to be demographically overwhelmed by West Eurasian back migration even if they have left some residual genetic traces in the modern residents of North Africa. If recent suggestions from technical population wide autosomal DNA studies that predicted Neanderthal admixture in Eurasians are correct, the Pygmy and Khoisan populations at that time may have experienced the introgression of genes from a couple of archaic human populations that had died out just a couple of thousand years earlier. Each of these ancestral populations would have had considerable ethnic diversity, since low level admixture between neighboring populations of hunter-gatherers would have maintained elevated diversity in each group (the boundaries between one and the next may have been clines rather than distinct groupings), and since they would not have experienced multiple rounds of serial founder effects to the same extent as Eurasian populations that are more distant from Northeast Africa.


Maju said...

You don't follow Harappa Ancestry Project, do you? Zack already went on the African genome like that in the past and there's also some paper: Hadza are ultra-outliers but that's because they are hyper-inbred, much like the Kalash in Asia.

Then the "North African" component is more like the '(West) Eurasian' component: nothing more, nothing less.

As for the rest, I do not have much to say: I generally dislike mixing languages and genes, and for you it's a habit instead.

andrew said...

I wouldn't mix language with genes if there weren't intelligible correlations. A lot of my interest in the African paleogenetics is rooted in the questions:

(1) What was East Africa like pre-Neolithic/pre-Afro-Asiatic?
(2) Why does Afro-Asiatic have less of a genetic commonality than any other language family?
(3) Were there "races" in Africa that have been obliterated by expansions of other peoples (e.g. Bantu and Afro-Asiatic)?
(4) Where did Sahel agriculture really emerge and did it happen in a couple of places or just one?
(5) What was the paleoclimate of Africa in the period from just pre-Out of Africa to the present?
(6) When did archaics go extinct in Africa?
(7) Were there empires that came and went in Africa before they were attested historically?
(8) How far do seeming non-African cultural universals (e.g. flood myths) extend into Africa?
(9) Which African cultures are closest to the ancestors of the proto-Eurasians?
(10) How long were modern humans range restricted within Africa?

andrew said...

Also, I do follow the Harappa Ancestry Project, but only on and off, not consistently. I'd pondered the hyper-inbred issue as a possibility, but it is so irritating.

I wonder what would happen if you took out all of the Hadza but one and put that person in the PC charts and Admixture analysis, sort of like Otzi. After all, saying a population is hyper-inbred is a bit like saying that they are an effective sample size of one.

T. Kosmatka said...

Amazing, detailed work here. I'm reading it through for a second time now. Thanks for posting this.