john hawks weblog

paleoanthropology, genetics and evolution

DIY genomics

  • Finding more Neandertal genes, chromosome 19 edition

    Thu, 2011-03-31 18:46 -- John Hawks

    When I last wrote about the Neandertal genome, I showed that across the X chromosome, Europe and China have different Neandertal genes. There is overlap between the two, but as a generalization few Neandertal haplotypes that are common in Europe are also common in China, and vice-versa. I described the basic method for finding Neandertal haplotypes in recent people last month ("Neandertal segments of X chromosomes").

    Almost all of the Neandertal haplotypes found in the X chromosomes of recent people are relatively rare, occurring in fewer than 10 percent of individuals. The largest fraction of Neandertal haplotypes occurs in only a single person in the HapMap samples.

    But is this a pattern that occurs on the autosomes, or does it reflect X chromosome dynamics in some way?

    That's not a hard question to answer, and I went looking first at chromosome 19. There are fewer haplotypes, because chromosome 19 is shorter than the X, but the overall pattern is the same. Most Neandertal haplotypes are rare in the HapMap samples, and relatively few are common in both the CEU and CHD samples.

    Neandertal haplotypes on chromosome 19 histogram in CEU and CHD HapMap samples

    I put the origin at the rear; CEU (European ancestry in Utah) number of copies goes toward the left, CHD (Chinese immigrants in Denver) toward the right. You can see that most of the cases are clumped on the extreme edge of both axes. There are not higher counts in CHD; the two axes are at different scales because of one extremely common region in Europeans, as noted below.

    I've received a few comments on the 3-d histograms. I don't like them much, either, and I'm looking for an alternative. This one in particular is miserable, because it's out of scale. I'd like to plot these in 2-d using shading to denote bin counts. Unfortunately I haven't found a quick and dirty program that will do this in 2-d, and I've got too wide a range of bin counts for a bubble plot to do it without a lot of tweaking. So I'm stuck with these for now. I can either write about them and share them or spend my time finding a better graphing solution.

    I've done a few more comparisons. When we look for Neandertal 10-SNP haplotypes in CEU versus TSI (the sample from Tuscany), we find mostly the same haplotypes in both samples. A haplotype in 10 copies in CEU is certain to be in TSI, and vice-versa.

    Neandertal haplotypes on chromosome 19 histogram in CEU and TSI HapMap samples

    Number of copies in CEU goes across the bottom, TSI back into the picture. This is such a striking difference from the CEU-CHD comparison. It's very comforting to me, because this is totally the expected pattern -- CEU and TSI should have the same things, because they share most of their population history! I will mention that for the X chromosome, CHB and JPT have a similar pattern; they mostly share the same stuff. This helps lend some significance to the finding below that GIH is also pretty different from all these other samples.

    You can see that there is one locus where CEU has more than 100 copies. The little cluster there indicates that this haplotype extends over more than 10 SNPs; in fact it's 13 SNPs with possibly 2-3 flanking SNPs forming a decay pattern on either side, and the total length is around 150 kb. There are more than 80 copies in Tuscans, and more than 40 in Gujaratis, but only a single copy in the Chinese sample. Three genes lie in this interval but none point to any obvious hypothesis (to me, at least) about why the Neandertal haplotype would be especially common in western Eurasia. I note it because this is the first Neandertal haplotype I've found with a frequency up over 20 percent or so; this one is about 60 percent in CEU and 50 percent in TSI.

    The Gujarati (GIH) sample adds its own distinct twist. There is some overlap between GIH and CEU, and some overlap between GIH and CHD. But by and large the same pattern obtains as between Europe and China: India has its own Neandertal common variants, not widely shared with either CEU or CHD. For example, here's the CHD comparison; CHD going toward left, GIH toward right. The basic pattern is that most cases are clustered on the edge of the graph, few are scattered across most of the area, and there's no consistent pattern among them. Still, the highest-frequency GIH case is the same as the high-frequency haplotype noted in CEU and TSI above.

    Neandertal haplotypes on chromosome 19 histogram in CHD and GIH samples

    These examples should demonstrate pretty clearly that this is not solely an X chromosome phenomenon; basically we're looking at the effects of drift in small ancient populations after they mixed with Neandertals.

    I did have an excellent question today after my talk where I discussed this pattern -- how do we know that this isn't separate mixture events giving rise to different Neandertal-derived variants in different recent humans?

    That's not a trivial question to answer, and I don't think we could easily rule out the hypothesis in the abstract. But the fact that these populations have very similar fractions of Neandertal contribution overall does suggest a single history of mixing. I'll give this some more consideration as I look across the rest of the genome.

  • Europe and China have different Neandertal genes

    Tue, 2011-03-22 01:00 -- John Hawks

    When last we saw the Vi 33.16 X chromosome, I was wresting out its secrets by looking for SNP haplotypes shared by this Neandertal with the European and African samples from the HapMap ("Neandertal segments of X chromosomes"). Neandertal haplotypes in the CEU (Utah, European ancestry) sample, that are not also found in African samples, are candidate loci for Neandertal ancestry outside Africa.

    In my earlier post, I pointed out some drawbacks and weaknesses of this simple approach. The SNPs have poorer power than sequence data, and we will miss relevant short haplotypes. Some Neandertal-derived alleles are probably present at low frequencies in Africa. Excluding rare African alleles will cause us to miss these cases. What we will find is a filtered set of Neandertal candidate loci, where we don't control the filter.

    Finding these haplotypes lets us look at their frequencies within the European sample. As I pointed out, most of the Neandertal haplotypes in the CEU sample are rare, one or two copies. A handful are quite common, up to 30-40 copies in the sample. A good-sized set occurs in 5-10 copies.

    We know from Green and colleagues' comparisons that at least three people outside of Africa have the same fraction of Neandertal ancestry -- one from France, one from China, and one from Papua New Guinea. But there's no reason to think they have inherited the same segments from Neandertals. The overall proportion of Neandertal ancestry is very slight, less than five percent. If five percent of loci were 100 percent Neandertal, then everyone would have the same Neandertal loci. But that's not the way they are distributed. Different individuals certainly have different Neandertal genes.

    A rare allele in one sample is quite likely not to appear in geographically distant samples. So for many of the Neandertal haplotypes in the CEU sample, we shouldn't expect to see them in China. And, as you can tell from the figure below, that is in fact the case.

    Europe-China Neandertal X chromosome comparison

    What you're looking at is a 3-D histogram of Neandertal candidate haplotypes in China and Europe. The number of copies in the CEU HapMap sample is on the X axis, the number of copies in the CHB HapMap sample on the Z axis, going back into the picture. From the leftmost corner, at the origin, going along the X axis is the set of haplotypes present in CEU but absent in China. As you can see, the most frequent outcome is one copy in either one sample or the other. This being a histogram, those are both lumped into the highest bar at the origin.

    Here's a detail of the area near the origin, turned upward so we're looking at almost an X-Z plot.

    Europe-China Neandertal X chromosome comparison

    As we go down the X axis, you see there are many haplotypes with 3 or 4 copies in CEU and none in CHB. In fact, there are very few that have 3 copies in CEU and any in CHB -- many fewer altogether than occur in 3 copies in CEU and none at all in CHB. The ones that have 10 or so copies in both samples are, well, scarce.

    This is very striking. China and Europe by and large have different Neandertal-derived haplotypes. Haplotypes from Neandertals that are common in Europe -- say, with more than two or three copies -- are mostly rare in China. And vice-versa; haplotypes that are common in CHB are rare in CEU.

    Why should this be? Green and colleagues [1] hypothesized an early population mixture of Africans and Neandertals in West Asia, before that population dispersed throughout the rest of Eurasia. This hypothesis was meant to explain why China and Europe have the same proportion of Neandertal genes.

    I think that is also consistent with the fact that China and Europe have different Neandertal genes. If the population mixture was followed by substantial genetic drift as the West Asian population dispersed in different geographic directions, drift would randomly increase the frequency of some haplotypes in one direction, others in the other direction. Europe and China would end up with the same proportion of Neandertal ancestry, but it would be distributed very differently among loci.
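
    A toy simulation makes the drift argument concrete. This is a minimal sketch, not the analysis behind the figures: it assumes a simple Wright-Fisher model, and the function names, the sample of 100 chromosomes, the 3 percent starting frequency, the 10 percent threshold for "common", and the number of generations are all made up for illustration. Two daughter populations start from the same loci at the same frequency and then drift independently.

```python
import random

def drift(freqs, n_chrom, generations, rng):
    """Wright-Fisher drift: resample n_chrom chromosomes per locus each generation."""
    freqs = list(freqs)
    for _ in range(generations):
        freqs = [sum(rng.random() < p for _ in range(n_chrom)) / n_chrom
                 for p in freqs]
    return freqs

def simulate_split(n_loci=300, start_freq=0.03, n_chrom=100, generations=30, seed=1):
    """Let two daughter populations drift independently from one mixed ancestor."""
    rng = random.Random(seed)            # seeded so the run is repeatable
    ancestral = [start_freq] * n_loci    # hypothetical: every locus starts at 3 percent
    west = drift(ancestral, n_chrom, generations, rng)
    east = drift(ancestral, n_chrom, generations, rng)
    return west, east

west, east = simulate_split()
common_west = {i for i, f in enumerate(west) if f >= 0.10}
common_east = {i for i, f in enumerate(east) if f >= 0.10}
print("mean frequency:", sum(west) / len(west), sum(east) / len(east))
print("common in west:", len(common_west),
      "common in east:", len(common_east),
      "common in both:", len(common_west & common_east))
```

    Under these assumptions the expected frequency stays the same in both populations, while which loci happen to drift upward is decided independently in each one.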

    Next, we'll examine whether this pattern is the same for the rest of the chromosomes. Or maybe something even more interesting...


    References

  • Neandertal segments of X chromosomes

    Wed, 2011-02-23 16:06 -- John Hawks

    Last year, this Neandertal genome came out. No doubt you've heard about it. So maybe by now you're wondering where the new science is that's being done on this genetic information.

    We've been ramping up here in my lab for a few months, working with these data. My students have a couple of projects that we'll be keeping close to the vest. But for the most part I think we'll share stuff as we go along. This is all open access data, and there are some questions of fundamental interest that are actually pretty easy to resolve.

    The initial Neandertal genome draft publication [1] came with some analysis of the genome-wide similarity of the Neandertal draft genome and a few human genomes. A new review of the basic method of comparison has appeared in Molecular Biology and Evolution, by Eric Durand and colleagues [2]. The basic idea is that a branching model between populations without gene flow predicts that two members of one population have equal amounts of sequence similarity to a third individual in another population. If that third individual turns out to be closer to one or the other of the first two, we can reject the hypothesis that those first two are part of a population that has branched without gene flow away from the third individual's population. When we bring an African and a European as the first two individuals, and a Neandertal as the third, we find that the European is in fact closer to the Neandertal. So we can infer gene flow from Neandertals into the ancestors of Europeans. This comparison is nearly equally significant when we compare an African and a Chinese individual, or an African and an individual from Papua New Guinea. Thus we can infer that Neandertals contributed genes to the ancestors generally of present-day non-Africans, not specifically present-day Europeans. The amount of gene flow that can explain the pattern of genetic similarities adds up to around 2.5% of the total ancestry of non-Africans today. Again, it's not a direct observation; it's a model that explains the greater similarity of the Neandertal genome to people outside Africa than within sub-Saharan Africa.
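
    The comparison described above is usually summarized as a D statistic, which counts the sites where the Neandertal shares a derived allele with one of the two living individuals but not the other. Here is a minimal sketch, assuming one allele per individual per site and an outgroup sequence (for example, chimpanzee) to define the ancestral state. The outgroup, the function name, and the toy sequences are my assumptions for illustration, not data from the draft genome.

```python
def d_statistic(african, european, neandertal, outgroup):
    """Return (D, n_abba, n_baba) for four aligned allele sequences."""
    abba = baba = 0
    for p1, p2, p3, anc in zip(african, european, neandertal, outgroup):
        if p3 == anc:
            continue                      # Neandertal is ancestral here: uninformative
        if p1 == anc and p2 == p3:
            abba += 1                     # European shares the derived allele
        elif p2 == anc and p1 == p3:
            baba += 1                     # African shares the derived allele
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return d, abba, baba

# Toy alignment, six sites (made up for illustration).
african    = "AAGTCG"
european   = "AGGTTA"
neandertal = "AGGCTG"
outgroup   = "AAGTCA"

d, n_abba, n_baba = d_statistic(african, european, neandertal, outgroup)
print(d, n_abba, n_baba)
```

    Without gene flow the two counts should be about equal and D should sit near zero; a D above zero in this orientation means the European sequence is closer to the Neandertal than the African one is.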

    As you can see, this leaves open a key question. We don't know whether genetic similarities between Neandertals and present non-Africans are the same in different areas outside Africa.

    The whole-genome comparisons have great statistical power to test the hypothesis of gene flow in general. With a hundred thousand or so actual sequence differences between Neandertals and any given human genome, the method can potentially detect very small amounts of gene flow. What we're seeing in the Neandertal data is anything but small -- it amounts to greater non-African similarity to Neandertals at thousands and thousands of sites.

    But comparison of three whole genomes gives us very little power to identify the specific loci affected by gene flow. If a French genome has three percent ancestry from Neandertals, we can predict that other genomes in France probably do also. That's a consequence of independent assortment -- we're not looking at people who actually have Neandertal grandparents, we're looking at a population that had Neandertal ancestors thousands of generations ago. So all French genomes are probably more-or-less alike in the Neandertal quotient. But will they have the same three percent of Neandertal-derived alleles? Almost certainly not: each Neandertal-derived locus would have to be fixed in France for them to be identical in all genomes. Much more likely, a much larger number of Neandertal-derived alleles exist at an average frequency of three percent. Such a distribution would predict that the average Neandertal-derived variant found in our first French genome has only a 3 percent probability of showing up in a second genome. Looking at one genome in one population will find only a small fraction of loci that have been affected by Neandertal gene flow.

    Hence, if we want to answer the question about different populations, we need to look at a reasonably large sample of individuals. We need to know whether a Neandertal-derived variant in France occurs at the same frequency in China, and vice-versa. Are there loci where a Neandertal allele occurs at 10 percent in France, but never in China? Does a full list of loci with one or more Neandertal-derived variants include any interesting functional genes? Answering these questions would tell us a lot about the demographic and adaptive conditions that led to our Neandertal heritage.

    Enter the HapMap

    You'd think that a genome-wide set of SNP genotypes would be useful for testing hypotheses of population history. The HapMap has more than 3 million SNP genotypes from hundreds of individuals from China, Japan, Utah, and Nigeria, and more than a million genotypes from nearly a thousand other individuals from other populations. In other words, it's the kind of sample that could tell us a lot about the frequencies of Neandertal-derived alleles if we could find them.

    But the HapMap project didn't identify its set of genotypes to help us reconstruct population history. The aim was to find most common variants, and secondarily to add more variants in low-variation regions to allow linkage mapping of medically interesting phenotypes. SNP sites were disproportionately found in some populations (first, Europeans) more than others. These processes of SNP discovery led to ascertainment biases, in which the difference between samples depends not only on their histories, but also on where we chose to look.

    Ascertainment bias is a real pain if we want to test the hypothesis of Neandertal genetic contribution to today's humans. Look at it this way: Suppose we find a rare SNP allele in Europeans, absent in Africans, but present in the Neandertal genome. Looks like a piece of support for Neandertal ancestry of Europeans. If those sites outnumber the sites where we find a rare allele in Africans shared with Neandertals, not in Europeans, then that would seem like the same scenario outlined above -- a case where one of the living populations carries more Neandertal similarities than the other. Evidence of gene flow, right?

    Ascertainment bias leaves another possibility: Maybe we looked harder for rare variants in one of the living populations. If so, the lack of rare Neandertal-shared variants in the other population may be an accident of our SNP discovery procedure.

    There are ways around this problem. For instance, if the Neandertal genome carries many derived alleles for SNPs shared with Europeans, it weighs strongly in favor of recent genetic exchanges instead of ancient incomplete lineage sorting. But this basic question of "which population has more Neandertal ancestry" may still be hard to resolve.

    Haplotypes from Neandertals

    Green and colleagues [1] also presented a second approach for testing Neandertal ancestry. They used SNP data to identify regions of the genome where non-African populations appear to have a "deep root" to their genealogy, but Africans do not. These regions are rare across the genome; they focused on 100-kb intervals, finding only a dozen genome-wide that fit their criteria. But each of these is a case where non-Africans appear to have an ancient genealogical split between two haplotypes, all the SNPs lining up to distinguish one branch of the genealogy from another. If both are not represented in Africa, then presumably one of them came from some non-African ancient population. And indeed, they found ten of the deep branches within the Neandertal genome.

    This approach makes use of the information that SNP data provide about linkage. A segment of a chromosome from a living human that is similar to a Neandertal segment may be explained either by recent ancestry from Neandertals or from incomplete lineage sorting from the ancient human-Neandertal common ancestors. But if that segment is long, it probably isn't from the ancient common ancestors of humans and Neandertals, because recombination should have broken up the linkage across that long interval. Hence, long haplotypes shared by living humans and Neandertals are best explained by recent mixture. If those long haplotypes are predominantly found in non-Africans but not Africans, it tends to confirm that they have come from recent population mixture with Neandertals.

    But how long should these intervals be? This is an area where we can improve on the approach taken by Green and colleagues [1]. A hundred kilobases is way too long to represent the average Neandertal-derived haplotype. The average rate of recombination across the genome is around one centimorgan per megabase -- meaning that an interval of one million base pairs has a one percent chance of recombination per generation. That's a chance of 1/1000 of recombination per 100 kb per generation, meaning that half the linkage across 100 kb should be broken up in roughly 700 generations. For humans, half the linkage at that distance decays after only 18,000 years or so, except in regions of low recombination. If we go as far back as 100,000 years ago, half of the linkage decays across regions as short as 18 kilobases. That means if we look at windows 20 kb long for evidence of Neandertal-derived haplotypes, we are likely to miss a large fraction of them. Hundred-kilobase intervals will miss nearly all of them.
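
    The arithmetic above is easy to check. This sketch assumes a uniform rate of 1 cM per Mb, simple exponential decay of linkage, and a generation time of 25 years; the generation time is my assumption here, and the function names are just for illustration.

```python
import math

RECOMB_PER_BP = 1e-8    # 1 cM per Mb, the genome-wide average quoted above
GENERATION_YEARS = 25   # assumed generation time

def half_life_generations(interval_bp):
    """Generations until half of the haplotypes spanning interval_bp have recombined."""
    r = RECOMB_PER_BP * interval_bp          # per-generation chance of a crossover
    return math.log(0.5) / math.log1p(-r)    # solves (1 - r) ** t == 0.5

def half_life_interval_bp(years):
    """Interval length whose linkage is half decayed after the given number of years."""
    t = years / GENERATION_YEARS
    r = 1.0 - 0.5 ** (1.0 / t)
    return r / RECOMB_PER_BP

gens = half_life_generations(100_000)
print(round(gens), "generations, about", round(gens * GENERATION_YEARS), "years")
print(round(half_life_interval_bp(100_000) / 1000, 1), "kb after 100,000 years")
```

    With these assumptions a 100-kb interval loses half its linkage in a little under 700 generations, and after 100,000 years the half-decayed interval is between 17 and 18 kb.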

    Bottom line, we want to look at intervals as short as we can. But if we look too short, we won't have much evidence to work with. The 3 million SNPs in the HapMap version 2 give us one site every kilobase on average. Ten kilobases will give us around 10 SNPs. A 10-SNP haplotype may sound impressive, but if most of those SNPs have a derived allele at low frequency (say, less than 10 percent), then it starts to become more likely that a given haplotype resembles the Neandertal genome just because they share ancestral SNP alleles. Ideally we'd like more SNPs, but in reality the Neandertal sequence draft is likely to lack several, so if we want 10 SNPs' worth of comparison, we'll need to look at longer intervals.

    And really, HapMap 2 is a small sample to try to find low-frequency haplotypes from Neandertals. By analogy with the method used by Green and colleagues, we can find haplotypes that are present in the CEU (European ancestry) sample, present in the Neandertal genome draft, but absent in the YRI (West African ancestry) sample. But HapMap 2 includes only 120 genomes from each of the YRI and CEU samples. If we have a variant in Europe at 1 percent, we're pretty likely to miss it. Worse, if we find a haplotype in Europe at 1 percent, we're really not able to reject the hypothesis that it's in Africa at the same frequency, even if no copies of it are in YRI. We can help fix this problem by looking at HapMap phase 3 samples, which include two more African populations, bringing the total sample up to more than 300 within Africa. But there are fewer SNPs in HapMap 3, limiting our comparisons to longer windows. One could even contemplate the HGDP sample as a way to add even more individuals to our comparative samples. But that sample has many fewer SNPs, so we would need really long intervals to test the hypothesis of Neandertal ancestry for particular haplotypes.
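
    The sampling problem can be put in numbers. A minimal sketch, assuming the 120 genomes are 120 independently sampled chromosomes and that each one carries the variant with probability equal to its population frequency; the larger sample sizes are illustrative round numbers, not exact HapMap 3 counts.

```python
def prob_zero_copies(freq, n_chromosomes):
    """Chance that a variant at population frequency freq is absent from the sample."""
    return (1.0 - freq) ** n_chromosomes

for n in (120, 300, 600):
    print(n, "chromosomes:", round(prob_zero_copies(0.01, n), 3))
```

    With 120 chromosomes, a 1 percent variant turns up in zero copies about 30 percent of the time, so its absence from YRI alone is weak evidence that it is absent from Africa. Larger samples shrink that chance quickly.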

    By the end of this I'll surely be pining for sequence data. Of course for that we haven't long to wait. But I have an aim for which genotype data are at the moment the only feasible approach. So I'm a bit stuck: Using a bigger sample means using longer intervals, which means I'll miss more and more Neandertal-derived haplotypes. But we should thereby get reasonable power to find any common haplotypes derived from Neandertals.

    Phasing and the haploid Neandertal

    The HapMap 2 samples and some of the HapMap 3 samples were taken from pairs of parents, where a child was also genotyped. Those trios make it possible to determine which SNP alleles were linked on the parents' chromosomes, providing a natural "phase" for the haplotypes. For some other samples, the phase was inferred algorithmically, using assumptions about population history and knowledge about which haplotypes are present in the populations with trios. Phasing algorithms are not ideal, because the assumptions about population history (inferred in many cases from the data) may be false. But over the relatively short intervals we're considering here, phasing will probably not lead to false positives.

    Neandertal draft genomes are themselves more of a problem. Each sampled individual is known from a large number of short reads, which (with some luck) can be aligned with the human genome map. The present data include many gaps. More important, there are only a very small number of places where the number of reads is high enough to determine whether a Neandertal individual was a homozygote or not. The Neandertal consensus sequences are built by taking the most frequent base from these reads aligned to any given site in the human genome. That means that the Neandertal "haplotype" across any set of SNP loci may well be a jumbled chimera of two different haplotypes carried by the Neandertal individual. For the current analyses, I have kept the Neandertal individuals separate -- so the haplotypes here were derived only from the Vindija 33.16 individual. If we use a consensus sequence taken from multiple individuals, we will have fewer gaps but potentially more jumble of different haplotypes.

    There's not much to be done about this problem. It should mostly cause us to miss true instances of Neandertal genetic ancestry, and we may be able to quantify the extent of this error in some high-coverage areas.

    (UPDATE 2011-02-24): I should mention, my lab has found that the Neandertal consensus sequences themselves have issues; the consensus-building algorithm appears in many cases to have included the human reference genome SNP allele in the place of the allele found in the majority of Neandertal reads. We are not yet sure how extensive this phenomenon is across the genome, but we have found it recurrently. We hypothesize that this is because of the priors on accepting calls with low read quality; the reference sequence seems to heavily bias the algorithm even in the presence of multiple contrary reads. We will have to check SNP calls manually in candidate regions.

    OK, so let's find the Neandertal regions!

    The strategy is fairly clear. I'll take a 10-SNP window from the HapMap, determine the haplotype of the Vindija 33.16 genome, see if that Neandertal haplotype occurs in the CEU HapMap sample, and then see if it also occurs in the YRI, MKK and LWK samples. When I find a haplotype shared with the Neandertal in Europe but not in Africa, I'll take that as a candidate haplotype for Neandertal ancestry.

    I probably want to be a little more permissive than that, actually. A Neandertal haplotype that is present in Europe, and present but rare in Africa may still be a good candidate. A Neandertal haplotype that does not match at all SNPs may also nonetheless be a good candidate, considering that the consensus is often merging two true haplotypes together. There's not much I can do about the consensus problem, because I don't have any way of figuring out the missing information except in rare cases with multiple sequence reads. But to address the first problem I can relax my criteria a bit with respect to variation inside Africa.
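
    The strategy can be sketched in a few lines. This is an illustration under simplifying assumptions, not the code used for the results: haplotypes are strings of SNP alleles coded 0/1 at the same ordered sites, "N" marks a site missing from the Neandertal draft, windows containing a missing site are simply skipped, and the names candidate_windows and max_african are hypothetical. The toy example uses 3-SNP windows to stay readable.

```python
def candidate_windows(neandertal, ceu_haps, african_haps, window=10, max_african=0):
    """Return [(start, ceu_copies, african_copies)] for each candidate window."""
    hits = []
    for start in range(len(neandertal) - window + 1):
        target = neandertal[start:start + window]
        if "N" in target:
            continue  # Neandertal allele missing somewhere in this window
        ceu_copies = sum(h[start:start + window] == target for h in ceu_haps)
        afr_copies = sum(h[start:start + window] == target for h in african_haps)
        # Candidate: seen in Europe, and absent (or rare) in the African samples.
        if ceu_copies > 0 and afr_copies <= max_african:
            hits.append((start, ceu_copies, afr_copies))
    return hits

# Toy data (made up): eight SNP sites, three CEU and two African haplotypes.
neandertal   = "0110N101"
ceu_haps     = ["01100101", "00000101", "01101000"]
african_haps = ["00000101", "11000000"]

strict  = candidate_windows(neandertal, ceu_haps, african_haps, window=3)
relaxed = candidate_windows(neandertal, ceu_haps, african_haps, window=3, max_african=2)
print(strict)
print(relaxed)
```

    Raising max_african is the permissive version: the window starting at site 5 has one African copy, so it is rejected by the strict rule and kept by the relaxed one.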

    Sliding the window down the chromosome will allow me to find the length of Neandertal-identical haplotypes in each individual, which could lead to an estimate of linkage decay. Across the genome, this will yield an estimate of the time that population mixture with Neandertals took place.
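
    Measuring the length of a Neandertal-identical stretch in one individual might look like the sketch below. It assumes the same 0/1 string coding as a simplification, treats any mismatch (including a missing Neandertal site) as the end of a run, and takes SNP positions in base pairs as a parallel list. The function name and the toy positions are hypothetical, and the step from run lengths to a date of mixture is not attempted here.

```python
def matching_runs(neandertal, haplotype, positions, min_snps=10):
    """Return [(first_bp, last_bp, n_snps)] for maximal runs identical to the Neandertal."""
    runs = []
    start = None
    n_sites = len(neandertal)
    for i in range(n_sites + 1):
        # Past the last site counts as a mismatch, which closes any open run.
        same = i < n_sites and neandertal[i] == haplotype[i]
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_snps:
                runs.append((positions[start], positions[i - 1], i - start))
            start = None
    return runs

# Toy data (made up): eight SNPs with their base-pair positions.
neandertal = "01101101"
haplotype  = "01100101"
positions  = [1000, 2100, 2900, 4200, 5000, 6100, 7000, 8300]

print(matching_runs(neandertal, haplotype, positions, min_snps=3))
```

    The physical length of each run is last_bp minus first_bp, which is the quantity whose distribution would feed an estimate of linkage decay.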

    Several other observations should lend some confidence in particular candidate haplotypes. The more a candidate includes derived alleles that are not themselves common in Africa, the more convincing it will be. If it does represent a "deep root" -- that is, if no close relative of the Neandertal haplotype occurs in the African sample, that also helps. The region with Neandertal identity shouldn't be too long. It might be quite common -- a few Neandertal-derived alleles may have been positively selected in later populations. But most of them are likely to be rare -- so I should expect to see many of them in only one or two copies in the CEU sample.

    I'm obviously interested in whether different populations (for example, Europe and China) have the same Neandertal-derived haplotypes. I'll leave that off for now -- there's much too much in this post already.

    So to be clear, this procedure will find haplotypes that are likely to have come into non-African populations from Neandertals. No single test will confirm these; but a combination of factors may be compelling for individual haplotypes. We can identify which genes may be in or near an interval where a candidate haplotype is found, but in all likelihood we will not have any known functional polymorphisms in the SNP data. This procedure then will provide no evidence that a particular Neandertal-derived allele has any functional effect in any living people.

    Some results

    I'll be reporting an awful lot more about results over the next few days. My first series of comparisons was the X chromosome, for reasons that will become clear shortly. On the X, there are 396 intervals where a 10-SNP Neandertal haplotype is identical to some CEU phased haplotypes and two or fewer within African HapMap samples.

    They vary in frequency in more or less the expected way -- a few of them are relatively common (10 or more copies out of the CEU sample, for example), but most have only one or two copies in CEU.

    These vary substantially in length, mostly because some areas have very low Neandertal coverage. A few are more than 100 kb in length; most are 30 kb or less.

    The haplotype with the strongest signature -- a 100-kb interval encompassing 26 SNPs in the Vindija 33.16 genome -- is found in more than 15 (and centrally, in 22) CEU individuals and in no African individuals. The interval spans part of the DMD gene (associated with Duchenne's muscular dystrophy). Conveniently, this is precisely the interval identified by Yotova and colleagues [3] as a site with Neandertal-derived alleles in non-African populations. They used comparisons at the sequence level, finding the Neandertal-derived variant at a frequency of 9% overall outside Africa. I have not yet confirmed that the SNP haplotype corresponds to this Neandertal-derived allele at the sequence level, but we should be able to manage that using public genomes. It's a nice confirmation that we're looking at the right kind of candidate loci.


    References

    Synopsis: My research is outlining regions of human genomes that were derived from Neandertals. Here are some of the methods.

  • Genetics and archaeology, 2

    Tue, 2010-03-16 13:38 -- John Hawks

    I've just received the book, Climate Change in Prehistory: The End of the Reign of Chaos, by William Burroughs. I'll be reading it and reviewing it during the next couple of weeks.

    For the time being, I found a short passage of the book's introduction that helped me to put into words something I've been thinking about this week.

    Before this passage, Burroughs has described the sources of new evidence about climate and its effects on humans in the past. One of these areas is genetics, in particular the emergence of mtDNA and Y chromosome haplotypes as markers relevant to ancient migrations. The other is Greenland and Antarctic ice cores, which by 2004 had allowed coarse-scale temperature reconstructions over the last 800,000 years or so.

    After these, he discusses archaeology -- what we might usually consider to be the most direct source of information about humans in the past. But as Burroughs describes the situation, the relevance of archaeology is somehow fundamentally more difficult to describe:

    It is often easier to write with confidence on fast-developing and relatively new areas of research, such as climate change and genetic mapping, than to review the implications of such new developments for a mature discipline like archaeology. Because the latter consists of an immensely complicated edifice that has been built up over a long time by the painstaking accumulation of fragmentary evidence from a vast array of sources, it is hard to define those aspects of the subject that are most affected by results obtained in a completely different discipline. Furthermore, when it comes to many aspects of prehistory, the field is full of controversy, into which the new data are not easily introduced. As a consequence, there is an inevitable tendency to gloss over these pitfalls and rely on secondary or even tertiary literature to provide an accessible backdrop against which new developments can be more easily projected (Burroughs 2005:10).

    I think this is a revealing quote. From the standpoint of someone describing an emerging science, as Burroughs is doing in the book, there must be intense frustration. It seems so simple when you compare climate data and genetic data. Humans underwent some catastrophic population declines in the past, and there were big climate fluctuations. What could be simpler? But then, you get to the archaeological record where nothing is simple at all.

    Imagine the author had written the paragraph above as an exercise in self-reflection. Either of two things might logically follow:

    1. ... and therefore the simple conclusions of the immature sciences may be wrong.

    or

    2. ... and therefore those wishy-washy archaeologists had better get their act together.

    I won't prejudge which of these Burroughs comes to -- for that, I'll need to review the rest of the book. But you can see the temptation to arrive at the second -- the supposedly "mature" science is hopelessly mired in meaningless debates. The new sciences of genetics and climate change will finally bring simplicity and allow a new revolution of archaeological insight.

    I'd like to write a few words in favor of maturity.

    What marks a "mature" discipline is the emergence of informed critiques focused on the limits of methods of analysis. When archaeology was immature, before the 1950s or so, almost all archaeologists were simple (some say "naive") positivists. They excavated and found the traces of ancient people, just as today's archaeologists do. And what they found was what there must have been. Find a handaxe, you know people made handaxes; find a temple, you know they worshipped gods of some kind. Dig in a mound, find a grave, you know that the people had rituals associated with death that required substantial non-subsistence directed labor.

    Of course, today's archaeologists tend to be positivists, too. There's no sense twiddling around with hypotheses that will never be testable. The religion of Neandertals? Well, it's one thing to speculate about it, but the fact is that it's devilishly hard to test hypotheses about religion from the material remains of any pre-monumental culture. In the absence of information, we may as well stick to the facts.

    But there's a deeper sense in which archaeologists have a much more complicated view of their evidence. Archaeology has gone through many periods where different researchers developed and applied distinctive analytical techniques. These techniques have often been incommensurable. Sometimes they settle debates. For example, the systematic study of skeletal element representation and cutmark taphonomy has gone far toward testing (and verifying) the occurrence of hunting in some Early Pleistocene contexts. The hunting versus scavenging debate still goes on, with renewed emphasis on active or confrontational scavenging. But knowledge advanced by means of analytical critique.

    These kinds of internal critique have fueled many of the great debates in archaeology. For example, the technical standardization promoted by François Bordes enabled a new kind of systematic comparison of assemblages with each other. But those new data gave rise to several vociferous differences of interpretation. Where Bordes had favored a cultural interpretation of site differences, Lewis Binford critiqued the emerging pattern along functional lines. Later Harold Dibble and others critiqued the stability of artifact types, noting the emergence of some categories as side effects of the reduction sequence. These critiques did not lead to any quick resolutions, but they allowed archaeologists to deepen our understanding of the cognitive and functional circumstances of artifact production and transmission. They taught us the limits of comparison by showing the weakness of particular artifact types as markers of cultures.

    In human genetics, we have the assumption that particular haplotypes are markers of populations. Critiques of that assumption go back more than fifteen years, but I think it fair to say that they have not taken hold. It's worth asking, "Why not?" Why does a tradition of effective critique emerge in some areas of science but not others?

    A large part of the answer is the culture of practice in human evolutionary genetics. Let me give an example. Last week, I had my students read a selection of review papers published this month in Current Biology. I mentioned those papers here a couple of weeks ago ("Genes and archaeology"). These papers are reviews of the basic findings of genetics as applied to the last 50,000 years of evolution in most of the major regions of the world.

    Toward the end of our session, I asked, "What methods did you find unifying this set of papers?" That is, what basic methodology do they have in common?

    The students really couldn't find any shared methodology, beyond a few issues strongly connected to the data. For example, there was a shared reliance in most of the papers on the two uniparentally inherited gene systems -- mtDNA and the Y chromosome. Several of the papers came down to issues regarding the exact mtDNA chronology, and none of them seemed to deal seriously with the discrepancies between mtDNA and Y chromosome timescales. But when it came to methods of analysis -- how do we go from genotypes and haplotypes to some knowledge that populations had a particular history -- the papers had no systematic way of answering those questions.

    The demographic models developed to test hypotheses about human evolution are different in almost every study of human genetic variation. Since our evolutionary history has been complicated, simple mathematical models won't often be very effective tests of events in our evolution. So we need to apply simulation modeling of various kinds.

    The necessary computer programs tend to be written by graduate students and postdocs. Principal investigators -- the scientists in charge of the lab -- are rarely directly involved in this kind of work in human genetics, although there are exceptions. The development of distinctive simulation methods in many different labs raises important issues about replicability and code quality -- some students document their code well and have extensive backgrounds in computer programming, but most do not. This situation is terrible from the standpoint of developing a shared analytical methodology -- when the students leave the lab, or when the dataset changes, the next group of students and postdocs usually ends up developing new methods.

    Some groups work with standardized simulation code that has published documentation. But the students and postdocs apply distinctive parameters that rarely match those used by other research groups. That is, the programs may be standard, but the parameters are idiosyncratic. Maybe they choose parameters that provide the best fit to a particular dataset. Or maybe they choose them through a set of discussions at the laboratory level. In any event, when the data change, and when the students and postdocs change, the models change.

    That means the results of different studies may be incommensurable, even if they look the same. A reviewer who just reads the conclusions of such analyses may think that they are all consistent with the same story -- even though the simulations in one paper actually may contradict the results of other papers. Papers appear unified at the level of conclusions, but not by virtue of having a shared system of methods.
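    To make the point about idiosyncratic parameters concrete, here is a minimal sketch rather than anyone's actual pipeline. It runs the same neutral Wright-Fisher drift code under two parameter sets and writes the parameters out next to the result. Everything in it is hypothetical: the names `DriftParams`, `wright_fisher` and `run_and_record`, the "lab" labels, and all the numbers are made up for illustration.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DriftParams:
    """Hypothetical parameter record; none of these values come from a real study."""
    pop_size: int       # number of diploid individuals (2N gene copies)
    generations: int    # generations to simulate
    start_freq: float   # initial allele frequency
    seed: int           # RNG seed, recorded so the run can be repeated


def wright_fisher(params: DriftParams) -> float:
    """Return the allele frequency after neutral Wright-Fisher drift."""
    rng = random.Random(params.seed)
    copies = 2 * params.pop_size
    freq = params.start_freq
    for _ in range(params.generations):
        # Binomial sampling of gene copies for the next generation.
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
        if freq in (0.0, 1.0):  # allele lost or fixed; nothing more can change
            break
    return freq


def run_and_record(params: DriftParams) -> str:
    """Run one simulation and return a JSON record of parameters plus result."""
    result = wright_fisher(params)
    return json.dumps({"params": asdict(params), "final_freq": result}, sort_keys=True)


# Same code, different (made-up) parameter choices in two imaginary labs.
lab_a = DriftParams(pop_size=100, generations=50, start_freq=0.1, seed=1)
lab_b = DriftParams(pop_size=1000, generations=50, start_freq=0.1, seed=1)
print(run_and_record(lab_a))
print(run_and_record(lab_b))
```

    The two records share a program but not a model. A reader who sees only the final frequencies has no way to tell that, which is why the parameter block is stored with the output.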

    Now, what does archaeology have to do with this? Well, in the case of human evolution, we have an archaeological record. It would be sensible for archaeologists to contribute to the project of genetic modeling and simulation methods -- that way, we would be testing models that could be critiqued on the basis of archaeological reality as well as genetics. But the students and postdocs who develop simulation models in genetics don't know archaeology. And most of the archaeologists don't know genetics -- so they discuss models only at the level of conclusions, not at the level of parameters.

    The tradition in archaeology for the last fifty years has supported the development of robust critiques. Likewise, the tradition in evolutionary genetics has supported such developments -- witness the rise of neutral theory, the "selfish gene" revolution, the innovation of evolutionary game theory. Each of these involved the discovery of weaknesses in old population models, based in part on a growing program of empirical research on natural populations and mathematical models.

    I don't want to push this comparison beyond reason. There is a point of overcaution -- of superfluous critique that can impede progress. Archaeologists have beached themselves on the shoals of such critiques many times.

    But human evolutionary genetics remains immature. We should be cautious about the details of population models, and we should try to identify lines of critique that will improve them. Some critiques have begun to emerge, and I will be highlighting those over the next several weeks in my course. In addition, I'll be discussing some lines of inquiry based on open access datasets that will illustrate problems in recent human evolution, along with some potentially productive approaches for solving them.

  • Is there a common coding variant of FOXP2 in southern Africa?

    Sun, 2010-02-28 20:10 -- John Hawks

    Today I was looking through the online data files for the South African genome. Those online files are available from the Data Libraries entry of the Galaxy bioinformatics tool website.

    I noted last week that some of the most interesting data -- in particular, the genotypes for new SNPs -- are not yet available to download ("Online toolkits -- the good and the frustrating"). But in the meantime there are some very interesting things there. In particular, the sequencing team has made available a list of amino-acid-coding mutations present in one or more of the five individuals (four Bushmen and Desmond Tutu) for whom the team obtained exome sequence.

    If you look at the summary information for this list, it gives the position of each amino-acid-coding mutation against the human reference genome (hg18), along with the position and identity of the amino acid change. It then gives a "prediction" of whether the mutation is damaging to gene function.

    This kind of prediction can be very misleading. The categories of effects include "tolerated" and "damaging", but these are based on whether the site tends to be conserved in other mammal lineages, and whether the new amino acid is very different in affinity (and possible conformation) compared to the reference. There's no "beneficial" -- even though some fraction of these polymorphisms are probably retained because of selection on the mutant allele.

    I say that because one of the five individuals (TK1) has an amino-acid-coding mutation in FOXP2.

    Yeah, that surprised me when I found it.

    As you'll remember, the coding sequence of FOXP2 is pretty strongly conserved in other mammals. Two amino-acid-coding substitutions in humans separate us from other primates, and an additional one separates primates from the mouse genome (Enard et al. 2002). This area of the genome looks like it has undergone a recent sweep in human populations, with relatively little variation and a strong excess of rare mutations surrounding the gene. Coop and colleagues (2008) gave a point estimate of the time of a sweep in humans as 42,000 years ago, which I wrote about at the time ("FOXP2 is really recent, it really did introgress (if it's not contamination)"). That estimate has to be massively too young -- it's not plausible that a sweep could be that recent and fixed worldwide.

    Meanwhile, last year, Ptak and colleagues (2009) followed up on my suggestion that there might really have been a recent sweep, but one near FOXP2, instead of involving one of the two human amino acid substitutions. They found statistical linkage between flanking sites immediately around the gene, which would be unlikely after a fixed sweep of FOXP2 itself. That linkage is quite likely if the human-specific substitutions were already fixed, and much later another nearby site underwent a partial sweep. It remains to be demonstrated, however, what nearby site is a plausible candidate for a recent partial sweep.

    So, finding variations near FOXP2 is very relevant to the history of this gene region. If there is an ongoing sweep involving some site near the gene, we should expect that some human populations haven't undergone the sweep yet, or have the selected haplotype at a lower frequency than others. The existing datasets from Africa -- mainly HapMap and HGDP sets -- are insufficient to test the hypothesis because they include only common SNP variants at low density. But sequence data from South Africa can give us a direct estimate of the nucleotide diversity around FOXP2, thereby letting us test for the presence of a recent sweep.
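    As a rough illustration of what "nucleotide diversity" measures, here is a minimal sketch: it is the average number of pairwise differences per site among a sample of aligned sequences. The function name and the toy sequences are made up for illustration. Real genome data would also need handling for missing calls, unphased genotypes and sequencing error, none of which is attempted here.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site among equal-length aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatched sites for every pair of sequences, then sum over pairs.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)


# Toy alignment (hypothetical): four sequences, ten sites, two variable sites.
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGTACGTAA"]
pi = nucleotide_diversity(toy)
print(pi)
```

    A region that has recently undergone a sweep would be expected to show a low value relative to the genomic background, which is the comparison the sequence data make possible.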

    The amino acid coding variant in one of these Bushman genomes came to me as a total surprise. Using the alignment with hg18, the location of the mutation is at position 114089380 on chromosome 7. The mutation changes a leucine in the wild-type sequence to a proline in the mutant, and the algorithm classifies it as "damaging" -- probably because the two residues are very different in their hydropathy. This position is not one of the two human-specific amino acid substitution sites. In fact it is in the forkhead box domain of the protein itself, which is the DNA-binding motif. Without going further into the biochemistry, I really can't guess what the effect of the mutation would be. I'm not really sure it's relevant -- after all, if it is a singleton in the population it might well be a recessive with no effect on the carrier phenotype.
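    The hydropathy contrast between leucine and proline can be put in numbers with the Kyte-Doolittle scale, one widely used hydropathy index. This is only an illustration of the size of the difference. The prediction algorithm behind the "damaging" label may use other criteria entirely, and the function name `hydropathy_shift` is made up here.

```python
# Kyte-Doolittle hydropathy index (Kyte & Doolittle 1982), one-letter residue codes.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_shift(ref, alt):
    """Change in Kyte-Doolittle hydropathy when residue `ref` is replaced by `alt`."""
    return KYTE_DOOLITTLE[alt] - KYTE_DOOLITTLE[ref]


# Leucine (L) to proline (P), the substitution described in the text.
shift = hydropathy_shift("L", "P")
print(round(shift, 1))
```

    On this scale leucine sits near the hydrophobic end and proline on the hydrophilic side of zero, so the substitution is a large shift. Whether that is what drives the classification remains a guess.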

    Still, the mutation could be common in the Bushman population. Our point estimate of the mutation's frequency is one in eight. Maybe it's a new variant that confers some advantage; maybe it's a result of a founder effect tens of thousands of years ago. It could even be widespread within Africa. We won't know until we have more genomes.
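    A minimal sketch of the arithmetic behind "one in eight", under two assumptions that the posted data do not confirm: that the carrier is heterozygous, and that the relevant sample is the eight chromosomes of the four Bushman individuals. It also computes an exact (Clopper-Pearson) binomial interval by bisection, to show how loosely a single observation constrains the frequency. The function names are made up for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval, found by bisection."""
    def solve(f, target):
        # f is decreasing in p on [0, 1]; find the p where f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Assumed counts: one mutant copy among eight sampled chromosomes.
mutant_copies, chromosomes = 1, 8
point_estimate = mutant_copies / chromosomes
lower, upper = clopper_pearson(mutant_copies, chromosomes)
print(point_estimate, lower, upper)
```

    With these assumed counts the interval runs from well under one percent to a bit over one half, so the data are compatible with anything from a rare variant to a very common one.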

    The mutation is not in any of the regions sequenced by Krause and colleagues (2007) in the Neandertals from El Sidrón. I wouldn't expect it to be there -- as a derived variant, it would be unlikely to evolve in parallel in Neandertals and southern African populations. But who knows what else we'll find?

    References:

    Coop G, Bullaughey K, Luca F, Przeworski M. 2008. The timing of selection at the human FOXP2 gene. Mol Biol Evol 25:1257. doi:10.1093/molbev/msn091

    Ptak S, Enard W, Wiebe V, Hellmann I, Krause J, Lachmann M, Pääbo S. 2009. Linkage disequilibrium extends across putative selected sites in FOXP2. Mol Biol Evol 26:2181-2184. doi:10.1093/molbev/msp143

    Krause J, Lalueza-Fox C, Orlando L, Enard W, Green RE, Burbano HA, Hublin J-J, Bertranpetit J, Hänni C, Fortea J, de la Rasilla M, Rosas A, Pääbo S. 2007. The derived FoxP2 variant of modern humans was shared with Neandertals. Curr Biol 17:1-5. doi:10.1016/j.cub.2007.10.008

    Enard W, Przeworski M, Fisher SE, Lai CSL, Wiebe V, Kitano T, Monaco AP, Pääbo S. 2002. Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418:869-872. doi:10.1038/nature01025

    Schuster SC and many others. 2010. Complete Khoisan and Bantu genomes from southern Africa. Nature 463:943-947. doi:10.1038/nature08795

  • Online toolkits -- the good and the frustrating

    Sun, 2010-02-21 14:51 -- John Hawks

    In pursuit of my DIY genomics posts, I've been playing around with the Galaxy bioinformatics web tools. The team responsible for the South African genomes published the data to Galaxy, and their uploads are easy to get -- either to download, or to work with the online Galaxy platform.

    Working with a resource like this helps to illustrate both how tremendously useful bioinformatics tools can be and how frustrating it can be to figure them out. Some things are a breeze; others are completely obscure. Documentation for the uploads is skimpy so far -- one thing that drove me up the wall is that SNPs are listed by genome, but without indicating genotypes -- is the individual a homozygote or a heterozygote? The paper by Schuster and colleagues describes their genotype calling procedure, but the results turn out not to be posted along with their other data. I'm sure they'll become available as the data are updated, but I did waste some time figuring out how the releases correspond to descriptions in the online supplementary material from the paper.

    Despite occasional frustrations, we seem to be heading in the direction of all-in-one online bioinformatics toolkits. Galaxy, for example, lists several advantages on a promo page. A couple of entries:

    Now your results are reproducible! | When publishing results, replace “the data were analyzed using a collection of in-house scripts” with a URL pointing to Galaxy’s history. Your reviewers will have no further questions. That’s reproducible genomics!

    ...

    No tools for new datatypes | Some datatypes generated by high throughput genomics are so new that there are no tools to analyze them. For example, how do you extract sequences of coding exons from the latest 28-way alignments of vertebrate genomes or analyze quality scores from 454/Solexa/SOLiD? With Galaxy.

    I live at the mathematical end of this stuff. I work with models of populations and assume that sequences are known, you know, as if we looked at them and read off the ACGT's. But in reality, a lot of complexity lies between models and the biochemistry. Going from sequencing reads to genomes, and aligned genomes, involves a lot of analysis. Many of the details differ entirely between different sequencing platforms. As we continue to move toward whole-genome analyses of populations and other species, it's really important to have an abstraction that allows for different underlying sequencing models, while allowing replication of the population genetics modeling.

    The disadvantage of a single widely used tool is that it can limit creativity and lock people into a certain way of processing data. Locked-in assumptions sometimes lead to wrong conclusions -- as we've seen in human genetics many times over the years. But the advantage is that it allows everybody access to the same methods and data, so that results can be replicated and augmented with new observations.

  • What do you do with all these genomes?

    Fri, 2010-02-19 22:47 -- John Hawks

    I'm teaching a class right now in which the students are tackling this very issue: We're getting an awful lot of new genomic data from living humans. What can we learn about human prehistory from these people's genes? And where will the Neandertal genome fit in when we have it?

    Let's consider the genomes from the last two weeks. One is an ancient individual from Greenland. One complete genome comes from a Namibian Tuu-speaking man named !Gubi, another from Archbishop Desmond Tutu. Three additional exomes, covering the protein-coding fraction of the genome, were obtained from three additional Bushman men, two Ju/'hoansi and one !Kung speaker. We can add these to existing complete genome sequences from some 8 individuals, and draft genomes from chimpanzee, gorilla, orangutan, and macaque.

    The data quality varies substantially among these genomes. Some have been sequenced at 20x coverage or higher; others much less so. In fairly short order, these few will be joined by a thousand more.

    The density of coverage of these genomes makes them unique resources, but we know a lot more about the variability of genes in different populations from SNP genotype data. Connecting the two kinds of data -- finding the actual nucleotide changes that explain regional signatures of selection, for example -- will be an important research goal in the near future.

    I'm going to do my best over the next few weeks to post new analyses involving these human genomes. A lot of answers are fairly trivial to get, once you have the sequences. Heck, just getting the things is the barrier for most of us -- how do you get a genome, and once you have it, how do you read it?

    These won't be tutorials, exactly; they'll be case studies of a sort. Research fragments, some of which will be part of papers coming out of my lab. I'll tag these posts with the category, "DIY genomics".
