john hawks weblog

paleoanthropology, genetics and evolution

recent selection

  • Copy number variation in 1000 Genomes

    Sat, 2010-10-30 13:01 -- John Hawks

    When I wrote earlier in the week about the 1000 Genomes Project results, I mentioned that a second paper was being published in Science. That paper, by Peter Sudmant and colleagues [1], works to quantify the amount of copy number variation of genes in the genomes of the study participants.

    It can be challenging to study copy number variation using shotgun sequencing methods, because each duplicated part of the genome creates multiple alignment targets for short reads. One way to deal with this problem is to use the drawbacks of shotgun sequencing as an advantage: Look for template regions of the genome that have much higher read depth than others. These places include many where a gene has been duplicated in the target genome, giving one-and-a-half or twice the number of reads for each duplication. Looking at read depth genome-wide is a quick way to assess copy number variation at sites where it was previously unknown. Once these are ascertained in a sample of genomes, they can be targeted for further study, including characterizing the boundaries of the duplicate region.
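    The read-depth idea can be sketched in a few lines. This is a minimal illustration, not the pipeline used by Sudmant and colleagues: it assumes depth has already been counted in fixed windows, that depth scales linearly with copy number, and that most windows sit at the diploid baseline so the median depth stands for two copies. The function names and the 2.75-copy threshold are hypothetical.

```python
from statistics import median


def estimate_copy_number(window_depths, baseline_copies=2):
    """Scale each window's read depth so the median window equals the baseline.

    Assumption: most windows are unduplicated, so the median depth
    corresponds to `baseline_copies` (two, for a diploid autosome).
    """
    typical = median(window_depths)
    if typical <= 0:
        raise ValueError("median depth must be positive")
    return [baseline_copies * depth / typical for depth in window_depths]


def flag_gains(window_depths, min_copies=2.75):
    """Return indices of windows whose estimated copy number suggests a gain.

    The 2.75 threshold is a hypothetical cutoff between two and three copies.
    """
    estimates = estimate_copy_number(window_depths)
    return [i for i, copies in enumerate(estimates) if copies >= min_copies]


# Toy depths: windows 4 and 5 have about 1.5x the typical depth (three copies),
# window 7 has about 2x (four copies).
depths = [30, 31, 29, 30, 45, 46, 30, 60, 29]
print(flag_gains(depths))
```

    Real data would also need correction for GC content and mappability, which this sketch ignores.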

    The paper describes this methodology in some detail, with various embellishments to get more precise answers to certain kinds of structural questions. They developed a large set of SNPs that differentiate paralogous gene copies, among other things allowing them to examine which members of various gene families had been duplicated, and whether events were shared between populations.

    Through our analysis, we identified that duplicated regions are more likely to be stratified between human populations when compared with copy number variation within unique regions of the genome. For example, 59 (92%) of the top 64 stratified gene families overlap segmental duplications (P –16). Remarkably, many of these highly polymorphic genes map to duplications that promote recurrent rearrangements associated with intellectual disability, autism, schizophrenia and epilepsy. We hypothesize that the extreme polymorphism may contribute to genomic instability associated with disease and may predispose certain populations to different chromosomal rearrangements (30).

    Segmental duplications can be relatively effective ways to change the amount of gene product without changing the gene product. In other words, a duplication can increase the dosage of a particular gene product. That can sometimes be very useful. For example, salivary amylase production varies among people due to the number of duplicate copies of the gene [2]. The copy number variation is related to population history of agricultural subsistence -- old agricultural populations have more amylase copies. It's a simple case where the dietary ecology favors a dosage increase for an enzyme.

    Gene duplications and other structural changes to the genome are rare events -- any particular kind of change is substantially less likely than a single nucleotide mutation at a given point in the genome. So it is of some interest to consider which regions are actually invariant in copy number -- duplications that occurred on the human lineage but have been conserved in more recent populations -- because these may reflect old adaptations essential to the evolution of hominins. Here's what the paper concludes:

    We have also defined the ~49% of gene duplicates that are largely invariant in copy among humans. Although this is based only on an assessment of 159 genomes from select populations, the fact that this fraction of genes remains copy number invariant in a milieu of recurrent unequal crossover suggests functional importance. Among these, we find a number of genes involved in neurological development and disease. We note that many of these duplicated genes are themselves incomplete and may represent nonprocessed pseudogenes, which may modulate the expression of the ancestral gene. The characterization of the most recently duplicated genes should facilitate identification of those that acquired new functions (neofunctionalization) versus those that have become pseudogenes or have partitioned their function among duplicate copies (31).

    I was going to write that there's not much analysis in the paper and let it go at that. But the paper has a 108-page supplement.

    I know I write this like once a week, but what the heck is the point of a 4-page paper with a 108-page supplement? Granted, 7 of the supplement pages are the author list (!!), but I view the whole thing mainly as a rip-off for the people who did the analyses in the supplement. Why don't they get their own first-authored publications? Are other journals satisfied to accept first-authored versions of analyses that have already been in a supplement in Science?

    The supplement lists 64 gene families including segmental duplications that differ substantially in average copy number among the CEU, YRI and CHB/JPT samples to which the low-coverage whole-genome sequencing has been applied thus far. The table (S8) lists the mean copy number in the three populations and the total variance in copy number; the key statistic is a value called Vst, which is analogous to FST for length variations.
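    For readers who want to see how a Vst-style statistic behaves, here is a minimal sketch. It assumes one common definition, Vst = (Vt - Vs) / Vt, where Vt is the variance in copy number across all individuals and Vs is the within-population variance averaged with weights proportional to sample size. The exact estimator in the supplement may differ, and the population labels and copy numbers below are invented.

```python
from statistics import pvariance


def vst(copy_numbers_by_pop):
    """Vst = (Vt - Vs) / Vt for a dict mapping population -> list of copy numbers."""
    everyone = [cn for pop in copy_numbers_by_pop.values() for cn in pop]
    total_var = pvariance(everyone)  # Vt: variance across all individuals
    if total_var == 0:
        return 0.0  # no variation at all, so no differentiation
    # Vs: within-population variance, weighted by sample size
    within_var = sum(
        len(pop) * pvariance(pop) for pop in copy_numbers_by_pop.values()
    ) / len(everyone)
    return (total_var - within_var) / total_var


# Invented copy numbers for three hypothetical samples
toy = {
    "pop_a": [2, 2, 3, 2],
    "pop_b": [4, 5, 4, 4],
    "pop_c": [2, 3, 2, 2],
}
print(round(vst(toy), 3))
```

    Like FST, the value runs from 0 (populations share the same copy number distribution) toward 1 (all the variance lies between populations).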

    These are not generally duplications of whole genes, and their boundaries don't generally correspond to the boundaries of coding regions or exons. Without further analysis, it is not clear which of these duplicated regions may have functional import. Many of the additional copies may be inactive, either because of pseudogenization or because the duplication may not include the promoter/enhancer elements needed for gene expression. Some of the duplications occur in regions with known pseudogenes. The "involvement" of some genes in these regions with neurological development and disease is interesting, but the paper attempts no statistical assessment of this. It's a list of candidates, with some interesting ones that are obviously worth further examination, but without a clear story for any of them.

    It is maybe interesting that salivary amylase didn't make the list. It's not clear from the supplement whether that is an omission or whether its population differentiation, great as it is, is not as high as the lower cutoff. The greatest differentiation for amylase copy number is between populations that are not yet represented in the 1000 Genomes whole-genome sequencing.

    That raises an interesting question: What if we applied the same methods to the read data from some of the other public genomes? The Bushman genomes from earlier this year are an especially interesting sample because they are notably not drawn from a long-time agricultural population. In which areas would they score atypical copy number variation compared to the 1000 Genomes samples?



  • Now for anthropological genomics

    Wed, 2010-10-27 15:30 -- John Hawks

    The first of the papers describing results from the 1000 Genomes project has been released today in Nature [1].

    This is "big project" genomics news. Like many announcements of this kind, it represents more of a public relations milestone than actual scientific advance. Some of the project data have been publicly available for a while -- the 1000 Genomes and HapMap projects have to their great benefit been based on the strategy of immediate data release. The new paper and its supplements include many summary statistics and report on new genetic variants that have been found -- there's a lot of information here. But most of the interesting science is just getting started. A paper like this really represents the opening of a race to use the new data for innovative research.

    Here in my lab, we are exploring the ways that whole genome sequencing can change our study of human population history. A large part of this is our work on recent selection, of course ("Why human evolution accelerated"). Whole-genome sequencing is not essential to finding many recently selected regions of the genome, but it will help enormously in narrowing down the actual functional changes that affected fitness in past populations.

    Whole-genome sequencing will rapidly improve our ability to resolve the population history of Pleistocene humans. For older events -- going back to the origins of Homo -- whole-genome sequencing will give us samples of genealogies from across the genome. We will be able to resolve some very ancient episodes of population mixture, and we have a chance of testing what kinds of events accompanied the rise of our genus. Even for events of the Late Pleistocene and Holocene, for which haplotypes of SNP markers can be useful without resequencing, whole-genome sequencing can be tremendously valuable. Reconstructing haplotypes from diploid genotypes requires us to make some assumptions about the demography of the population, which may be exactly what we are trying to discover. A sample of genomes sequenced at high read coverage will free us from some of those assumptions. It's really exciting stuff for an anthropologist.

    All those are reasons why the data will be useful for us in the long term. But at the moment, the data are not nearly so rich. The current paper reports:

    1. Whole-genome sequencing at 42x coverage of six individuals: one three-person family trio from Utah and one family trio from Nigeria.

    2. Low-coverage (2x-6x) sequencing of 59 Yoruba, 60 Utah residents, 30 Chinese and 30 Japanese individuals. These are a subsample of the original HapMap samples.

    3. Sequencing at 50x coverage of 8140 exons in 697 individuals. These are a subset of the HapMap v.3 population samples, including Yoruba, Luhya, Utah, Tuscan, Japanese and Chinese samples. These exons come from 906 genes targeted "randomly".

    It's pretty far from a thousand genomes, and even farther from the stated goal of 2400 genomes. The low-coverage genomes are not sufficient to call genotypes across most of the genome. This is a persistent problem with "whole-genome" sequencing projects so far. A person's whole genome is mostly diploid -- two copies of most everything. Recently, we've seen several "whole-genome" sequences where each base is given a consensus value. SNP variants may be called against other people's genomes, but rarely is there sufficient coverage to call SNPs within the individual. There are exceptions -- a handful of public whole genomes are at high coverage. The exon sequencing here should be enough to call SNPs in these functional regions with great confidence. The family trios also should have enough to call SNPs. So some of these will be our first chance to do actual population genetics on diploid genome-wide sequence data.
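    A rough calculation shows why low coverage makes it hard to call heterozygous sites within an individual. The sketch below is an idealization, not the project's genotype caller: it assumes read depth is Poisson-distributed, that each read is equally likely to come from either chromosome, and that there is no sequencing error. The requirement of two reads per allele is a hypothetical threshold.

```python
import math


def prob_both_alleles_seen(mean_coverage, min_reads_per_allele=2):
    """Chance that a heterozygous site shows each allele in enough reads.

    Under the Poisson assumption, reads from each chromosome are independent
    Poisson counts with mean `mean_coverage / 2`.
    """
    lam = mean_coverage / 2.0
    # Probability that one allele is seen in fewer than the required reads
    p_too_few = sum(
        math.exp(-lam) * lam**k / math.factorial(k)
        for k in range(min_reads_per_allele)
    )
    return (1.0 - p_too_few) ** 2


for coverage in (2, 4, 6, 42):
    print(coverage, round(prob_both_alleles_seen(coverage), 3))
```

    Under these assumptions a 4x genome shows both alleles at least twice at only about a third of its heterozygous sites, while at 42x nearly every such site qualifies.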

    One important piece of analysis in the paper is the confirmation of a low rate of de novo mutations in the children of the family trios. I discussed a result last spring that came to a very low rate of per-site mutation ("A low human mutation rate may throw everything out of whack"). The rate in that paper was 1.1 x 10^-8 per site per generation. The current paper comes to a rate between 1.0 and 1.2 x 10^-8. I have some more written on this issue and I'll integrate the new finding and post it later in the week. This aspect of the study is pretty important to our understanding of human evolution.

    The paper makes an interesting distinction between "accessible" and "inaccessible" portions of the genome -- accessibility meaning ease of mapping and aligning sequence reads:

    Accurate identification of genetic variation depends on alignment of the sequence data to the correct genomic location. We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads (Supplementary Information). This approach balances the need to reduce incorrect alignments and false-positive detection of variants against maximizing the proportion of the genome that can be interrogated.

    For the low-coverage analysis, the accessible genome contains approximately 85% of the reference sequence and 93% of the coding sequences. Over 99% of sites genotyped in the second generation haplotype map (HapMap II; ref. 4) are included. Of inaccessible sites, over 97% are annotated as high-copy repeats or segmental duplications. However, only one-quarter of previously discovered repeats and segmental duplications were inaccessible.

    It's an interesting decision -- just focus and report on the majority of the genome where alignment is easier.

    The paper discusses selection briefly. There's not much new here other than the identification of candidate causal variants for some selected haplotypes.

    First, it provides a more comprehensive catalogue of fixed differences between populations, of which there are very few: two between CEU and CHB+JPT (including the A111T missense variant in SLC24A5 (ref. 38) contributing to light skin colour), four between CEU and YRI (including the −46 GATA box null mutation upstream of DARC (ref. 39), the Duffy O allele leading to Plasmodium vivax malaria resistance) and 72 between CHB+JPT and YRI (including 24 around the exocyst complex component gene EXOC6B); see Supplementary Table 7 for a complete list. Second, it provides new candidates for selected variants, genes and pathways. For example, we identified 139 non-synonymous variants showing large allele frequency differences (at least 0.8) between populations (Supplementary Table 8), including at least two genes involved in meiotic recombination—FANCA (ninth most extreme non-synonymous SNP in CEU versus CHB+JPT) and TEX15 (thirteenth most extreme non-synonymous SNP in CEU versus YRI, and twenty-sixth most extreme non-synonymous SNP in CHB+JPT versus YRI). Because we are finding almost all common variants in each population, these lists should contain the vast majority of the near fixed differences among these populations. Finally, it improves the fine mapping of selective sweeps (Supplementary Fig. 14) and analysis of the dynamics of location adaptation. For example, we find that the signal of population differentiation around high Fst genic SNPs drops by half within, on average, less than 0.05 cM (typically 30–50 kb; Fig. 5d). Furthermore, 51% of such variants are polymorphic in both populations. These observations indicate that much local adaptation has occurred by selection acting on existing variation rather than new mutation.

    This last point is not especially demonstrated by the new sequencing data. What we are looking at is few complete sweeps, but that's expected even if all the selected variants were novel mutations -- there just hasn't been time to fix many variants. It remains to be shown to what extent this selection involves standing variants, partial sweeps of new mutations, or parallel adaptations ("Spatial dispersal, parallel adaptation, and the 'Stooge effect'"). We'll probably see a lot more interesting work on recent selection coming out of the new data.

    Science has a companion paper to the Nature data summary, focusing on copy number variation and gene duplications. I will review that one separately.

    UPDATE (2010-10-27): Dienekes pulls out an interesting passage about the Y chromosome sequences, which in at least one case recover many markers separating haplogroups once thought to be much closer to each other. Not sure what to make of that yet.



  • Neolithic milk fog

    Sun, 2010-10-17 14:11 -- John Hawks

    Razib points today to an article in Der Spiegel about the revival of folk migration as an explanation for the Neolithic in Europe. His post ("Völkerwanderung back with a vengeance") is worth reading. The general issues here are very interesting right now because the increase in data has made it possible to propose and test more and more complex scenarios. The simple scenario, gradual demic diffusion, appears wrong in many details. Archaeological cultures appeared and spread in spurts, which we now know were often composed of genetically very different people.

    The article in Der Spiegel is titled, "How Middle Eastern Milk Farmers Conquered Europe".

    The main idea of the article is that our understanding of the spread of Neolithic cultures into Europe has been revolutionized by ancient DNA and more sophisticated chemical analysis of artifacts. That's more or less correct. We really are thinking much more these days about folk migrations bringing new people into Europe. We know that lactase persistence was a recent evolutionary phenomenon in European groups, which was absent before the early Neolithic.

    Problem is: from the standpoint of ancient DNA samples, the lactase persistence mutation was also absent within the early Neolithic! The article is full of details that are wrong or misleading. Most important, it links the appearance and proliferation of the lactase persistence trait with the LBK. This might appear to make sense. The chemical analyses have supported the importance of dairying and presumably milk consumption in the LBK. But the genes of the LBK skeletons don't have the lactase persistence marker.

    The absence of lactase persistence in these early Neolithic people is entirely to be expected. Such an allele couldn't become common until the selection pressure was in place. People had to be drinking milk habitually at key times of vulnerability to establish this selection pressure. Even when the selection pressure is very strong, as it was for lactase persistence, the initial growth of a selected allele is very slow. It did not become common in Europe until thousands of years after it first appeared.
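    A toy recursion illustrates how long a strongly favored allele stays rare. This is a minimal sketch under simple genic selection in a very large population (no drift, no dominance), which is not a fitted model of lactase persistence. The selection coefficient and the starting frequency are hypothetical values chosen only for illustration.

```python
def next_frequency(p, s):
    """One generation of genic selection with advantage s for the favored allele."""
    return p * (1.0 + s) / (1.0 + s * p)


def generations_to_reach(p_start, p_target, s):
    """Count generations for the allele to rise from p_start to at least p_target."""
    if s <= 0 or not 0.0 < p_start < 1.0:
        raise ValueError("need s > 0 and 0 < p_start < 1")
    p, generations = p_start, 0
    while p < p_target:
        p = next_frequency(p, s)
        generations += 1
    return generations


s = 0.05  # hypothetical selective advantage
rare_phase = generations_to_reach(1e-4, 0.01, s)    # 0.01% up to 1%
common_phase = generations_to_reach(0.01, 0.5, s)   # 1% up to 50%
print(rare_phase, common_phase)
```

    With these made-up numbers the allele takes about as many generations to climb from 0.01% to 1% as it does from 1% to 50%, so for a long stretch it would be nearly invisible in a small sample of ancient skeletons.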

    So lactase persistence did not distinguish early Neolithic people in Europe from agriculturalists in the Near East, because neither of those populations had it at any detectable frequency. All the stuff in the article about how lactase persistence originated in Central Europe? It's irrelevant to whether these ancient populations were connected or not.

    What does distinguish the early Neolithic in central Europe is the mitochondrial DNA. I've discussed this several times in the last few years ("Early European mtDNA: only mysterious if you want it to be", and most recently "French Neolithic discontinuities"). The early Neolithic in Central Europe and France is characterized by several common haplogroups that are absent or rare in both earlier and later Europeans.

    It remains to be seen whether we can document a clear analogue of this mtDNA observation with nuclear genetic data. We know a lot about the variation of present-day Europeans, but most attention to geographic relationships has been run through coarse filters -- maps of the first two principal components are very striking in their correspondence to geography, but they really don't address the timing of movements that may have contributed to the pattern.

    The differences between early Neolithic and later Europeans suggest that post-Neolithic migrations -- real Völkerwanderung -- actually had a major impact on the European gene pool. What we see today is not a pattern established 6000 years ago, but a palimpsest richly painted with strokes from successive migrations.

    One aspect of this scenario: There's no reason to link the early Neolithic with Indo-European languages. There were many later widespread population movements that might have carried this language family, and we know that these later movements were genetically decisive -- at least, as concerns the maternal genealogy. The relation of Y chromosome haplogroups with mtDNA haplogroups is a critical question, but even more necessary is the development of an effective means of testing these hypotheses with nuclear genotype data.

  • Spatial dispersal, parallel adaptation, and the "Stooge effect"

    Thu, 2010-10-14 00:06 -- John Hawks

    Peter Ralph and Graham Coop have an interesting paper in the current Genetics, titled, "Parallel Adaptation: One or Many Waves of Advance of an Advantageous Allele?" [1]

    Fisher [2] famously considered the case in which an advantageous allele is dispersing through a spatially dispersed population, showing that the dispersal forms a "wave of advance". This work was the foundation for a lot of progress in understanding spatial dynamics of organisms.

    As I discussed in 2008 ("Overstating the obvious"), one of the consequences of the Fisher wave model for human evolution is that advantageous alleles will spread very slowly through the population. During the course of the Holocene, a strongly selected mutation might move only across a radius of a thousand or so kilometers. That provides one explanation for why new advantageous alleles haven't spread very far beyond their points of origin -- they just haven't had time yet.
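    The arithmetic behind that radius is easy to reproduce. The sketch below uses the standard asymptotic speed of a Fisher wave, v = sigma * sqrt(2s), where sigma is the standard deviation of parent-offspring dispersal distance per generation and s is the selective advantage. The particular values of sigma, s and generation length are hypothetical inputs, not estimates from any study.

```python
import math


def fisher_wave_speed(sigma_km, s):
    """Asymptotic wave speed in km per generation: sigma * sqrt(2 s)."""
    return sigma_km * math.sqrt(2.0 * s)


def radius_after(years, sigma_km, s, generation_years=25.0):
    """Distance the wavefront covers in `years`, ignoring the initial lag."""
    generations = years / generation_years
    return fisher_wave_speed(sigma_km, s) * generations


# Hypothetical inputs: 10 km dispersal per generation, 5% advantage, 10,000 years
print(round(radius_after(10_000, sigma_km=10.0, s=0.05)))
```

    Because speed grows only with the square root of s, even a fourfold stronger advantage would merely double the distance covered.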

    Another reason why an allele might not have spread widely is interference from other alleles with similar effects. I mentioned this process last year ("Spatial variation and near-fixed selected alleles"):

    Greg Cochran and I have been discussing this idea for some time. We call it the "Stooge effect". Think of the Three Stooges all trying to run through a door at the same time and getting stuck in the middle. That's what these genes are doing -- all of them are competing to respond to selection, but each is slowed by the presence of the others.

    Ralph and Coop have cleverly combined the "Stooge effect" phenomenon with spatial dispersal. They suppose a case in which two separate advantageous mutations arise in different geographic locations, each affecting the same trait. Each begins to spread independently as a Fisher wave of advance. What happens when they meet?

    As they show, the dynamics in this case give rise to a static equilibrium -- once the "waves of advance" meet, they stop moving, forming a stable boundary. A new favorable mutation makes headway only so long as it has no equally favorable mutation to compete against.
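    A small one-dimensional simulation shows the standstill. This is a minimal deterministic sketch, not Ralph and Coop's model or code: it assumes two equally favored alleles, A and B, that each grow only at the expense of the ancestral type, and it uses an explicit finite-difference scheme on a line of demes. All parameter names and values are hypothetical.

```python
def simulate_two_waves(n_cells=200, steps=1500, s=0.1, d=0.2):
    """Two equally advantageous alleles spreading from opposite ends of a line.

    s is the selective advantage per time step; d is the diffusion number
    (D * dt / dx**2), which must stay at or below 0.5 for this scheme.
    Returns the final frequency profiles of alleles A and B.
    """
    a = [0.0] * n_cells
    b = [0.0] * n_cells
    a[0] = 0.5    # allele A arises at the left edge
    b[-1] = 0.5   # allele B arises at the right edge
    for _ in range(steps):
        new_a, new_b = a[:], b[:]
        for i in range(n_cells):
            left = max(i - 1, 0)               # reflecting boundaries
            right = min(i + 1, n_cells - 1)
            room = 1.0 - a[i] - b[i]           # frequency of the ancestral type
            # diffusion term plus logistic growth into the remaining "room"
            new_a[i] = a[i] + d * (a[left] - 2 * a[i] + a[right]) + s * a[i] * room
            new_b[i] = b[i] + d * (b[left] - 2 * b[i] + b[right]) + s * b[i] * room
        a, b = new_a, new_b
    return a, b


a, b = simulate_two_waves()
middle = len(a) // 2
print(round(a[0], 3), round(a[middle], 3), round(a[-1], 3))
```

    Once the two fronts meet, the ancestral type is gone and the growth term vanishes for both alleles, so the boundary stays at the midpoint and only blurs slowly through diffusion.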

    I like the way they used both analytical approaches and simulations to come to this outcome. The appearance of stable boundaries in a reaction-diffusion system has long been known (demonstrated first by Alan Turing, actually!). But to my knowledge, no one has considered this specific case from an analytical perspective.

    The Fisher equation is not all that simple for most students to work with. If you become familiar with the equation, you will notice the key aspect is that it has two separate components -- a logistic (or reaction) component representing the increase in frequency at a single point in space, and a diffusion component representing the dispersal across space.
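    Written out in one spatial dimension, the two components look like this. The notation is an assumption chosen for illustration: p(x, t) is the frequency of the selected allele at position x and time t, s is its selective advantage, and sigma^2 is the variance of dispersal distance per generation.

```latex
\frac{\partial p}{\partial t}
  = \underbrace{\frac{\sigma^{2}}{2}\,\frac{\partial^{2} p}{\partial x^{2}}}_{\text{diffusion}}
  + \underbrace{s\,p\,(1-p)}_{\text{logistic (reaction)}},
\qquad
v = \sigma\sqrt{2s}
```

    Here v is the asymptotic speed of the wave of advance. Setting s to zero leaves only the diffusion term, which spreads existing copies without adding new ones.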

    The muscle of the dispersal process comes from the logistic component. Without the intrinsic growth of the selected allele, the dispersal of individuals along the boundary would not carry many copies of the selected allele into new geographic areas. If the local selective advantage dies, the wave of advance rapidly stalls. A static equilibrium arises, with the frequency of the selected allele forming a cline that correlates with the local selection pressure.

    Ralph and Coop's model approximates this case, in a dynamical sense. Each new selected mutation forms an increasing zone in which the selective advantage of other mutations is zero. When those other mutations encounter this zone, they form a stable cline. The cline is stable in the short term, but the diffusion component still disperses copies of an allele; they just lack the muscle to continue their deterministic expansion.

    The most interesting simulations by Ralph and Coop show the two-dimensional case, in which the stable boundaries emerge in a "tessellation" pattern.

    Tessellations

    Figure 6 from Ralph and Coop (2010), showing "tessellations" in 2-d simulations of waves of advance.

    The lower three panes in the figure show the stability of the boundaries between the selected alleles. They proceed to fixation locally, but their dispersal stops where they come into contact with other adaptive alleles. Over the very long term, the population will mix -- the diffusion process will slowly carry all these alleles throughout the species' range. Look at the process after a million generations and the entire zone will be gray. But this dispersal occurs at the neutral rate, where the diffusion term is the only factor driving the dispersal.

    What about humans?

    My graduate student Zach Throckmorton and I have been working in this area for a while now. One of the things that impresses us is the way that much more interesting dynamics can emerge when you alter the assumptions. I learned some of this stuff by talking to Frank Livingstone, who gave a lot of thought to these issues of spatial dispersal and selection as applied to malaria resistance alleles.

    In particular, Frank thought about the case where one allele has a slightly larger advantage than another. In some contexts, this allows the "better" allele to overtake and swamp the expansion of the "weaker" (but nonetheless adaptive) one. In others, the two come to a near standstill, one displacing the other only very gradually. Much depends on the timing of the two mutations and the local conditions controlling their initial dispersal.

    Ralph and Coop briefly consider this case in their paper, noting that the difference in fitness advantage of two alleles will allow one to advance into the range of the other, albeit at a slower rate. In humans, we may be seeing a smaller subset of cases, where one or more of the alleles have not yet established a wavefront. In these cases, the arrival of another wave can disrupt the spatial pattern of the rarer allele. The diploid case gives rise to the possibility of more complex epistases. Well-defined boundaries between selected alleles are rare, and where they occur (as may be the case with HbC and HbS in Africa), many have focused on negative epistasis as an explanation.

    Also, alleles are unlikely to substitute perfectly for each other. In many cases, they may work synergistically -- individuals carrying two selected alleles that affect the same function may outperform those carrying only one such allele. At some point, new selected mutations may start to have diminishing returns, even on a trait like skin pigmentation where dozens of alleles may have been selected in widespread human populations. So the current distribution may to some extent be "frozen", but by a more complicated dynamic than the simple intersection of waves of advance.

    As Coop and colleagues showed last year [3], and we discussed in 2007 [4], there are really only a few genes that have approached local fixation in recent human evolution. The current spatial pattern of recently selected alleles doesn't look like a tessellation with many alleles near local fixation. Over most of the Old World, it looks like populations have a very large number of very new alleles, far from fixation, and few up over 70 percent in frequency.

    So the specific scenario in this paper by itself probably does not explain the overall empirical pattern in humans. But if we consider the current pattern as a transient, approximating the early stages of dispersal for many selected alleles, we may not be terribly far off the mark.

    Mutation-limited evolution

    This is a long dense paper and there's a lot in it. One further aspect of the paper that I think is essential is the way that Ralph and Coop reiterate the basic point that more people means more mutations. In their case, they focus on population density over space (population number, when you multiply them) as a constraint on the number of possible adaptive mutations. They apply this idea as a hypothesis to account for parallel adaptations that may have emerged in recent human evolution.

    Multiple mutational origins are likely if the characteristic length is shorter than the physical dimensions of the region. Eurasia measures >8000 km across, and so Table 1 suggests that multiple origins at a single base pair are very unlikely at the lower population density. On the other hand, if the mutational target is large, then multiple origins are likely at low densities, while at high densities independent origins are ubiquitous. The complementary cases of (rho = 2, µ = 10^-8) and (rho = 0.002, µ = 10^-5) give identical characteristic lengths of 3000 km, although the timescale on which the mutations spread differs. Thus for these two parameter combinations we can expect a few mutations to dominate within continents and for multiple mutations to be common in a population spread across an area the size of Eurasia. Obviously these calculations are very crude, as population densities vary through space and time, and dispersal across continents is not simply a function of geographic distance and individual dispersal. Nevertheless, these calculations suggest that it is plausible that for adaptive traits with reasonable mutational targets (e.g., a change anywhere within a gene or pathway) even low population densities can lead to parallel adaptation across an area the size of Eurasia, and higher densities almost certainly will.

    We note that as human population densities have increased dramatically over time, so too has the probability of parallel adaptation. It is interesting therefore to note that a number of recent human adaptations (e.g., sickle cell alleles) involve repeated changes at very small mutational targets in relatively small geographic areas, while older adaptations from single changes (e.g., skin pigmentation) are more broadly spread.

    They are describing a scenario in which small human populations would have been mutation-limited -- that is, the number of new mutations is small, making it unlikely that adaptive mutations will happen in any given generation. In such populations, the rate of adaptation is limited by the availability of new mutations. In an extreme -- in the very small effective sizes of Pleistocene human populations -- the rate of adaptation may be extremely slow and regional populations may come to differ at many weakly selected loci, which spread very slowly.

    As the population grows, strongly adaptive mutations become more and more likely to happen somewhere in the species' range. Yet they are still relatively rare -- meaning that they have an opportunity to spread fairly far before encountering another equally strongly selected mutation affecting the same trait.

    This process can give rise to very large differences on a continental scale, even when the selection pressures in different regions do not differ. In humans, the dispersal of selected alleles across space may have been significantly accelerated by actual dispersals of populations. It is not a mere coincidence that very widespread alleles in Eurasia also tend to be much older than 20,000 years old -- long-distance dispersals prior to that time had a higher chance of leaving a lasting influence on subsequent populations.

    But as the population gets bigger and bigger, parallel mutations are more and more likely to happen. As Ralph and Coop point out, at the extreme of large population size and likely mutations, you shouldn't see any new mutations emerging and spreading over very large areas. Any of these mutations would be very likely to encounter other new mutations that do the same thing.

    Is this likely in humans? Clearly some mutations have happened recurrently. Making a broken gene is easy -- there's a large mutational target, since a large fraction of nonsynonymous substitutions might do the job. So if there's a net selective advantage to breaking a gene, we ought to see that happen recurrently in human populations.

    In contrast, if the mutational target is very small, then mutations will still be rare even in a very large population. If only one base change can have an adaptive effect, that precise change will happen less than once in 10^9 births (remember that not just any mutation at a site, but some particular mutation is what we may need). If a rare duplication or gene conversion is the necessary change, then it may be much rarer.
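
    As a back-of-envelope sketch of what mutation limitation means, the snippet below multiplies a per-birth chance of the precise change by the number of births per generation. The one-in-a-billion figure is the upper bound given above. The population sizes, the selection coefficient and the use of Haldane's approximation (a new beneficial mutation escapes loss by drift with probability of roughly 2s) are illustrative assumptions, not estimates.

```python
# Illustrative only: how often does one precise base change arise and survive drift?
PER_BIRTH_CHANCE = 1e-9   # upper bound quoted in the text for one specific change
SELECTION_COEFF = 0.05    # assumed selective advantage s (hypothetical)


def new_copies_per_generation(births, per_birth_chance=PER_BIRTH_CHANCE):
    """Expected number of newborns per generation carrying the precise change."""
    return births * per_birth_chance


def generations_per_established_copy(births, s=SELECTION_COEFF,
                                     per_birth_chance=PER_BIRTH_CHANCE):
    """Expected wait (in generations) for a copy that escapes drift.

    Uses Haldane's approximation: P(establishment) is about 2s.
    """
    established_per_generation = (
        new_copies_per_generation(births, per_birth_chance) * 2.0 * s
    )
    return 1.0 / established_per_generation


# Hypothetical numbers of births per generation, from small to large populations.
for births in (1e4, 1e6, 1e8):
    wait = generations_per_established_copy(births)
    print(f"births/generation = {births:.0e}: "
          f"about {wait:,.0f} generations per established copy")
```

    With ten thousand births per generation the wait runs to something like a million generations; with a hundred million births it drops to about a hundred. That is the contrast between a mutation-limited population and one that is not.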

    Looking across the last few million years, when human population numbers were much smaller than the Holocene, we can be pretty sure that some aspects of our evolution were mutation-limited. The changes that took hold in our ancestors were the ones that happened, and that survived the winnowing of genetic drift. Many changes that would have been adaptive didn't happen in our ancestors. They just weren't lucky enough.

    But some of those changes would still be adaptive now, if we could get them. And we have had much larger numbers in the last 10,000 years. Homo erectus needed these mutations, but we only now are seeing them selected in the human population.

    Malaria adaptation

    Hemoglobinopathies are among the cases of easy mutations -- where breaking a gene is adaptive. It's not just any broken version of alpha- or beta-globin that does the job, though. The hemoglobin needs to be impaired in certain ways to impede the parasites while maintaining blood function. This provides many of the classic cases of human adaptation, and Ralph and Coop turn to this system for examples of parallel adaptation:

    The sickle cell allele HbS at the β-globin gene in humans provides a particularly interesting case of putative parallel adaptation. The HbS allele (β6 Glu-Val) has been driven to intermediate frequencies by selection within the past 10,000 years due to increased resistance to malaria of heterozygotes for the allele (HALDANE 1949; ALLISON 1954; CURRAT et al. 2002; KWIATKOWSKI 2005). The HbS allele is present on at least four major distinct haplotypes in Africa, each at intermediate frequency within a different geographic region; the haplotypes are named after the population sample where they were first discovered (Central African Republic, Senegal, Benin, and Cameroon). This is consistent with multiple origins of this single-base-pair change. Note that a distinct, malaria resistance allele, HbC (β6 Glu-Lys), has also arisen in Africa at the same codon as the HbS allele (TRABUCHET et al. 1991; AGARWAL et al. 2000; WOOD et al. 2005a), increasing our confidence that the mutational input was high enough to allow multiple types to arise. However, FLINT et al. (1998) thought the hypothesis of multiple new mutations arising at a single base pair was extremely unlikely and proposed that it was more likely that gene conversion had spread a single mutation across multiple haplotypes.

    The theory we have developed can be used to assess the plausibility of the multiple mutational origins of the sickle cell allele, by exhibiting parameter combinations that yield characteristic lengths consistent with the separation of the sample locations. [Recall that the wave of advance, and thus also our model, works in the case of heterozygote advantage (ARONSON and WEINBERGER 1975).] The different HbS haplotypes co-occur within a few thousand kilometers of each other (see Table 5 of FLINT et al. 1998) (noting that these locations are unlikely to reflect the geographic mutational origins, and mutations will have been spread by large population movements). As the HbS changes occur at a single base pair, the mutation rate would have been 10^-8, and we take an s = 0.05 (as in CURRAT et al. 2002). If human dispersal at that time was well approximated by a Gaussian kernel with sigma = 100 km, then a characteristic length of 1000 km would require an effective density of individuals of rho = 25 km^-2, while if sigma = 10 km, then we would require only rho = 2.5 km^-2. This latter set of parameters does not seem unrealistic, considering our knowledge of population density and dispersal parameters, so our model suggests that the hypothesis of multiple origins is not unreasonable.
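
    A rough way to check those numbers: the sketch below assumes that the characteristic length in two dimensions scales as the cube root of v/lambda, where v = sigma * sqrt(2s) is the Fisher wave speed and lambda = 2 * s * mu * rho is the rate of successful mutations per square kilometer per generation. That scaling, and the dropping of any constant factor of order one, is my reading of the argument rather than a formula quoted above, so treat the output as order-of-magnitude only.

```python
import math


def wave_speed_km_per_gen(sigma_km, s):
    """Fisher wave speed for dispersal distance sigma and advantage s."""
    return sigma_km * math.sqrt(2.0 * s)


def successful_mutations_per_km2_per_gen(s, mu, rho_per_km2):
    """Assumed rate of mutations that arise and escape drift (probability ~2s)."""
    return 2.0 * s * mu * rho_per_km2


def characteristic_length_km(sigma_km, s, mu, rho_per_km2):
    """Assumed 2-D scaling: (v / lambda) ** (1/3), constants of order one dropped."""
    v = wave_speed_km_per_gen(sigma_km, s)
    lam = successful_mutations_per_km2_per_gen(s, mu, rho_per_km2)
    return (v / lam) ** (1.0 / 3.0)


MU = 1e-8   # per-site mutation rate from the quoted passage
S = 0.05    # selection coefficient from the quoted passage

# The two parameter pairs mentioned in the quoted passage.
for sigma_km, rho in ((100.0, 25.0), (10.0, 2.5)):
    length = characteristic_length_km(sigma_km, S, MU, rho)
    print(f"sigma = {sigma_km:>5} km, rho = {rho:>4} per km^2 -> "
          f"characteristic length of about {length:,.0f} km")
```

    Under this scaling the length depends on sigma and rho only through their ratio, which is why the two parameter pairs in the quote land on the same distance, on the order of 1000 km.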

    I think they've got the basic idea correct here, but there are some additional details to consider. The distribution of HbE is not quite so easy to understand if parallel mutations are really so likely, and of course there is the negative epistasis of different alleles (and the thalassemias) which impacts their dispersal ability when they become moderately common. The dynamic may be of similar form to the one described here, but boundaries between alleles may be reinforced by the fitness costs of carrying multiple ones.

    This situation raises the issue of path dependence. Some mutations have "first mover" advantages. Once they are common, other adaptive mutations may still occur -- even mutations that are better from the standpoint of fitness -- but be lost or grow very slowly because their net fitness advantage over the common mutant is slight. Where HbE is common, new HbS alleles are unlikely to invade quickly. Where HbS is common, new HbE mutants are similarly unlikely to invade -- even though HbE has a higher fitness.

    Network effects among genes may also dominate the spatial dynamics. HbS spread most widely in the context of populations that were already Duffy null, and in which G6PD deficiency was rapidly increasing. The first conditioned the parasite environment -- P. vivax had a strong disadvantage in Duffy null populations, P. falciparum made up most of the parasite load. G6PD deficiency should have impacted the relative advantage of HbS, more and more as it became more common. Those are two loci among many that alter malaria dynamics in Africa compared to South and Southeast Asia.

    Conclusions

    There is much more to say about this paper -- it's 22 journal pages. But I think I've given an impression of what's there and how the ideas may impact our interpretation of recent human evolution. Many of the central concepts were presaged by earlier work in 2007 and 2008, as reviewed here on the blog. The new analytical and simulation work, I really like.

    Hopefully we can get out some shorter papers that will focus on aspects of these problems as applied to humans. A message that comes across very clearly in our work and this new paper is that different time periods in our evolutionary history must have had very different selection dynamics. Pleistocene humans were not only in a different ecology than us, they experienced a radically lower potential for adaptation.


  • Quote: Boyd on New World pigmentation clines

    Tue, 2010-09-28 16:44 -- John Hawks

    I'm using some statistics out of William Boyd's 1956 printing of Genetics and the Races of Man[1]. It gives a good accounting of blood group data known more than fifty years ago, which I'm using to illustrate my intro lectures. Meanwhile, there are some interesting passages, from the standpoint of today's knowledge of the human genome and its variation.

    On skin pigmentation -- this is the earliest statement I've run across of the argument that the New World pigmentation cline is shallower than the Old World cline because of the relative recency of occupation (pp. 178-180):

    The aborigines of the New World, though not by any means identical, agree in having on the whole considerable skin pigmentation. If pigmentation is adaptive, and conforms to climate, why are not the Eskimo and the inhabitants of Tierra del Fuego as light as Europeans? This looks like a considerable difficulty, but the solution is probably comparatively simple. The aborigines of the New World have not been here for more than about 25,000 years, or about 1000 generations. They are by origin Asiatic, and in Asia skin pigmentation is fairly heavy. Unless the selection of light skin as opposed to dark were fairly intense, the time elapsed has simply not been enough to allow for much adaptation to occur (12). As a matter of fact, the populations which might have been expected to become lighter, namely the Fuegans and the Eskimo, have probably had a shorter time in which to achieve this end than other American aborigines, for it is reasonable to suppose that the Fuegians did not reach their present home until long after their northern neighbors were well installed. And all students of the Eskimo agree in recognizing them as probably the most recent (aside of course from the whites) arrivals in America. It could well be that there has just not been enough time for selection to bleach the skins of the American aborigines.

    Reference 12 is Haddon's Races of Man, which I have requested from the library.

    I'm following up, because skin pigmentation is one of the traits most clearly subject to recent rapid selection. The new mutations that lighten skin tone in Europe and Asia are only partially shared between those populations. Many alleles are very common in one population, but nearly absent in the other. So far, the estimates of dates for these new variants are all within the last 20,000 years, but many remain undated. So we can't specify the level of pigmentation of people 15,000-20,000 years ago, yet, but it would have been substantially darker than those populations today.

    Which leaves us with the same question, but from the opposite perspective. We now know that pigmentation evolved rapidly in Eurasia, the strong gradient of pigmentation having increased greatly within the last 20,000 years. We also know that the occupation of temperate South America began quite early, with people having been there longer than 10,000 years. So why did the New World end up with a more gradual cline -- darker pigmentation in the temperate and Arctic regions, lighter in the tropics than in the Old World? Was selection less intense? Can we attribute the difference to demography? Or chance?

    Boyd next alluded to a demographic explanation -- low population density:

    In any case, the pre-Columbian population was so sparse compared with that of Asia and India that on a statistical basis alone we should be justified in asserting that skin pigmentation conforms to climate.

    Them's some tricky statistics.

    We would of course today recognize that the sheer number of people is not especially relevant; much more powerful is the independent occurrence of a similar response in two long-separated populations. But Boyd was concerned with a different issue: Some had been claiming pigmentation as a neutral trait, making it more useful as a race marker:

    This has been denied chiefly by those who were concerned to prove skin color a non-adaptive character, so that it might safely be used in the classification of races (12). Since the more up-to-date students of anthropology have given up the idea of relying on non-adaptive characters, or even believing that any such exist (13), there is no longer much dispute about the probable adaptive value of skin color (emphasis added).

    Well, makes me glad to be an "up-to-date" student! There in fact has been an ongoing debate about "non-adaptive characters" as concerns the relationship of Pleistocene people. Many geneticists were surprised to discover the persistence of Neandertal genes, but in fact the skeletons of Upper Paleolithic Europeans clearly bear Neandertal traits. The debate for the last 30 years hasn't been chiefly about the presence of these traits, but instead about whether they were adaptive. Some argued that adaptive traits were not suitable evidence for a relationship, because they could emerge by parallelism in distinct populations.

    Others observed that adaptive traits were more likely to be shared among populations linked by gene flow.

    Now, of course, we have remaining unanswered questions about these shared traits. The shared traits are clearest between Upper Paleolithic Europeans and European Neandertals. We don't have genetic information yet telling us about the extent of Neandertal gene sharing with these early Europeans. Was it more than elsewhere? The traits would argue for it.

    What about the Neandertal genes in populations far from Europe? One might expect Neandertal-like morphology to show up at some low level. Of course morphological features are polygenic, so that phenotypic resemblance falls much faster than genic identity. And Holocene populations have continued to evolve. Maybe early Asian skeletal remains like the Upper Cave skulls (ca. 11,000-20,000 years old) actually reflect that Neandertal heritage to a greater extent than recent samples.

    Then there is the likelihood of other contributions, more local ones, to later populations.

    Returning to the topic of pigmentation, many of us used to assume that the light skin of Europeans in part reflects Neandertal ancestry. That is, just as Boyd suggested, it would have taken a lot longer than 25,000 years to get the current strong cline of skin pigmentation in the Old World. If you could have longer, getting lighter pigmentation from earlier inhabitants of Europe, for example, you could explain a stronger cline with the same strength of selection.

    I no longer think this is necessary. It's still possible that we got some pigmentation variants from Neandertals, but we haven't found any yet. And we've been looking. It does seem that Neandertals had some of their own pigmentation variants. Maybe we'll find many more of those, maybe not.


  • Falciparum malaria came from gorillas

    Wed, 2010-09-22 15:38 -- John Hawks

    Malaria in humans is caused by one of five different species of Plasmodium parasites. The deadliest of these is P. falciparum, especially within Africa where native resistance to P. vivax is high. Where the vivax parasites seem to have been around for at least tens of thousands of years, P. falciparum in many ways looks relatively young. Its comparative lack of genetic variation suggests either a recent origin from some other primate species, or an intense bottleneck or selective sweep affecting the parasite's demography. In either case, the falciparum history seems to indicate that its present widespread distribution is a very recent phenomenon -- possibly within the last 5000 years.

    Because P. falciparum is phenotypically similar to the major chimpanzee malaria parasite, P. reichenowi, most scientists have assumed that we got falciparum malaria from chimpanzees. But in a new report, Weimin Liu and colleagues [1] have surveyed parasite variation in gorillas, bonobos and chimpanzees across Africa, finding that human falciparum parasites all group in with a single small clade of gorilla parasites. The other primates carry many varieties of parasites, with typical individuals being highly heteroplasmic -- that is, carrying several different strains.

    From the discussion:

    Using single-template amplification strategies and a much larger collection of ape specimens than previously analysed, we show here that wild-living chimpanzees and western gorillas are naturally infected with at least nine Plasmodium species. Among more than 1,100 SGA-derived mitochondrial, apicoplast and nuclear gene sequences from 80 chimpanzee and 55 gorilla samples, we found a total of nine sequences that were related to P. malariae, P. ovale or P. vivax (Supplementary Table 5). All others grouped within one of six chimpanzee- or gorilla-specific lineages representing distinct Plasmodium species, three of which had not previously been described. Significantly, all currently available human P. falciparum sequences constitute a single lineage nested within the G1 clade of gorilla parasites. This indicates that human P. falciparum is of gorilla origin, and not of chimpanzee [9, 10, 12], bonobo [11] or ancient human [5] origin, and that all known human strains may have resulted from a single cross-species transmission event. What is still unclear is when gorilla P. falciparum entered the human population and whether present-day ape populations represent a source for recurring human infection. It has been suggested that the limited levels of genetic diversity seen at many loci in human P. falciparum reflect a relatively recent selective sweep [8]. Our data suggest that this bottleneck or ‘Eve event’ was instead the consequence of cross-species transmission of a gorilla parasite. It is difficult to date this event without having reliable dates with which to calibrate the Plasmodium phylogenetic trees.

    What's interesting about the study is the sheer coverage of wild primates, and the application of multiple gene trees, which suggests that this is a recent origin of human parasites instead of introgression and selection of a single gene. I don't know if it makes any difference whether the disease came from gorillas or chimpanzees, but it certainly helps to confirm that it is new and not a long-time coevolution. That explains the burst of recent selection associated with resistance genes, especially within Africa.


  • Polygenic traits and directional selection

    Sat, 2010-09-18 13:41 -- John Hawks

    This has been an eventful week for those of us who study the dynamics of recent selection in humans. The most significant event was the publication of a paper describing genetic analysis of a long selection experiment in Drosophila. Although the experiment differs from most natural instances of selection in some important ways, the results give some very helpful corroboration that the recent human pattern of adaptive evolution was rapid and of an expected pattern for massive selection on many traits.

    Meanwhile, Jonathan Pritchard and Anna Di Rienzo have a short review in the current Nature Reviews Genetics [1], discussing the idea that a large fraction of adaptive evolution may be difficult to find with current genetic evidence.

    Their idea is that polygenic adaptations are unlikely to occur by successive "sweeps" of new adaptive mutations.

    It seems likely to us that, as in traditional quantitative genetic models, many — possibly even most — adaptive events in natural populations occur by polygenic adaptation. Polygenic adaptation could allow rapid adaptive shifts, yet would often go undetected using conventional methods for detecting selection. Moreover, polygenic adaptation is qualitatively different from the models of adaptive substitutions that dominate the population genetics literature.

    This is not a new idea, but Pritchard and Di Rienzo review it in a productive way, and the topic is worth some deeper thought...

    An adaptive genetic substitution is often modeled as an episode of logistic growth. A new mutation, initially in a single copy, increases exponentially in numbers until it is very common in the population. After this point, it continues to increase in frequency up to fixation, but progressively slowly. The entire process takes hundreds or a few thousands of generations, which sounds like a long time but is actually very rapid compared to the deep genealogical histories of most genetic loci. The initial rapid increase in numbers carries a region of linked sequence along with the selected variant. This "hitchhiking" region is highly visible because of the co-association of nearby allelic variants. Thus, if such a "sweep" is ongoing, we should have little trouble finding it. In humans we've found a lot of them, which is a big piece of evidence for the rapidity of human evolution during the past 40,000 years.
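
    To put rough numbers on "hundreds or a few thousands of generations", here is a minimal deterministic sketch of the logistic model. It assumes simple genic selection with advantage s per generation, ignores drift entirely, and starts the allele at a single copy in a diploid population of size N. The values of N and s are hypothetical.

```python
import math


def logistic_frequency(p0, s, t):
    """Allele frequency after t generations of logistic growth from frequency p0."""
    growth = p0 * math.exp(s * t)
    return growth / (1.0 - p0 + growth)


def generations_between(p_start, p_end, s):
    """Generations needed to move from p_start to p_end (inverse of the logistic)."""
    odds_start = p_start / (1.0 - p_start)
    odds_end = p_end / (1.0 - p_end)
    return math.log(odds_end / odds_start) / s


N = 10_000                 # hypothetical diploid population size
p0 = 1.0 / (2 * N)         # a single new copy

for s in (0.01, 0.05):     # hypothetical selection coefficients
    to_half = generations_between(p0, 0.5, s)
    to_99 = generations_between(p0, 0.99, s)
    print(f"s = {s}: about {to_half:,.0f} generations to 50%, "
          f"about {to_99:,.0f} generations to 99%")
```

    The early phase is close to exponential, and the crawl from 50 percent toward fixation takes a disproportionate share of the total time, which is the "progressively slowly" part of the description.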

    But all that describes the dynamics of a single, strongly selected, mutation. What if a trait comes under selection, but the variation in the trait is explained not by a single gene, but by dozens or hundreds of genes? Pritchard and Di Rienzo outline such a scenario:

    The key point is that we should expect such an adaptation to occur by small allele frequency shifts spread across many loci. As a hypothetical example, consider the adaptation of human height — a trait for which there are probably hundreds of SNPs that each affect height by a few millimeters. Strong selection for increased height could be very effective, as height is extremely heritable. But at the level of individual SNPs, the effect of selection would be rather weak, exerting just a small upward pressure in favour of each of hundreds of 'tall' alleles. Suppose that at 500 SNPs, the tall alleles each increase the expected height of a person by 2 mm. Then, an average shift of just 10% in the population allele frequency of each tall allele would increase average height in the population by 20 cm (assuming that SNPs contribute additively). Although these numbers are hypothetical, they illustrate that, for a highly polygenic trait, a dramatic adaptive response could result from modest allele frequency changes at many loci. This model is different from classical sweep models. Most importantly, adaptation could occur without dramatic allele frequency changes and without adaptive fixation events.
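
    The arithmetic in that hypothetical is easy to reproduce. The sketch below assumes purely additive effects and two allele copies per person at each SNP, and it reads the 10% as an absolute change in allele frequency. The function name is mine.

```python
def mean_height_shift_mm(n_snps, effect_mm_per_allele, frequency_shift, copies=2):
    """Change in mean height when every 'tall' allele shifts by frequency_shift.

    Each person carries `copies` alleles per SNP, so the expected number of tall
    alleles per person rises by n_snps * copies * frequency_shift.
    """
    return n_snps * copies * frequency_shift * effect_mm_per_allele


shift_mm = mean_height_shift_mm(n_snps=500, effect_mm_per_allele=2.0,
                                frequency_shift=0.10)
print(f"Mean height shift: {shift_mm / 10:.0f} cm")  # 20 cm, matching the quote
```

    Note that the 20 cm figure only comes out if both allele copies are counted; with one copy per person the same shift would give 10 cm.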

    But the description isn't precisely what would happen in the case of selection on stature. Consider:

    1. It is true that alleles that already exist in the population provide the most immediate opportunity for change under directional selection. Any short-term phenotypic evolution we see is likely to be caused by changes in the frequency of standing variants.

    2. Some of the alleles that affect stature are constrained by their effects on other phenotypes. They might not change, even under directional selection on stature.

    3. Stature may be affected by hundreds of loci, but these do not account for equal proportions of the additive variance. Loci are subject to selection roughly in proportion to the additive variance in fitness they explain. Directional selection on stature will change the allele frequencies for a few loci quite a bit more quickly than most.

    The distribution of effect sizes is fairly well known for stature in humans. For example, Park and colleagues [2] this spring plotted the distribution of effect sizes for variants discovered by GWAS in 63,000 Europeans:

    [Figure: Effect size distribution of variants found to explain heritability of stature, Crohn's disease and breast, prostate and colorectal (BPC) cancers in human genome-wide association studies]

    In the figure, (a) is based on observed loci -- for stature, this includes 30 loci that reached significance in the GWAS without follow-up genotyping. There is a pretty severe ascertainment bias against small effect sizes, so curve (b) attempts to model the actual distribution correcting for ascertainment. Curve (c) is normalized to give the three conditions the same observed range.

    You can see that if we suddenly started selecting for height, most of the genetic response would come from a very small proportion of the loci that explain the current additive variance. These would be the subset of loci in the large-effect-size tail of the distribution, excluding those that are constrained by their role in other phenotypes under selection.

    4. As an allele becomes common enough (going up toward fixation), the locus will account for less and less of the additive variance in fitness. To maintain the same response to selection, other alleles must pick up the slack. Over time, groups of different alleles will come into focus of selection, sort of like the "cover flow" feature of an iPod. Some alleles increase in frequency across a transient in the mid-frequency range, only to be gradually replaced by others. Most of the phenotypic change occurs as alleles cross rapidly from 40 to 60 percent or so.

    5. A few loci will be special. These account for an appreciable fraction of additive variance even though the favored allele is very rare. As they become common, these favored alleles change in frequency more and more rapidly, and account for more and more of the additive variance. They suck up the oxygen of selection. These alleles will look like a classic sweep.

    6. Over many generations, new mutations may occur that also have strong effects on the trait. They will follow the "special" pattern described in 5.

    The question is how many loci of this type can we expect to exist? We all know that there are two patterns that could account for the heritability of traits like stature, where no common variants have very strong effects. Either the additive variance is spread across many rare variants with large effects, or instead across many common variants with small effects. Pritchard and Di Rienzo's scenario accentuates the second of these -- a small frequency change in many common variants with small effects.

    But if even a small fraction of the additive variance is explained by a few rare variants with strong effects, these may cause most of the phenotypic change, and may look a lot like a standard selective sweep.

    Pritchard and Di Rienzo note that the two options -- a rapid sweep of one or a few loci, versus slight frequency changes in many loci -- are not mutually exclusive. Most cases of directional selection on phenotypes may involve both patterns. If so, that will be very helpful, because we can use the easy-to-find sweeps to target analysis of harder-to-find frequency changes.
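
    The numbered points above lean on how much additive variance a locus contributes at a given allele frequency. For a purely additive biallelic locus this is 2p(1-p)a^2, where p is the allele frequency and a is the effect per allele copy. The sketch below uses that standard expression with made-up numbers: 200 common small-effect loci and one rare large-effect locus. None of these values come from the stature data.

```python
def additive_variance(p, a):
    """Additive variance from one biallelic locus: 2 * p * (1 - p) * a**2."""
    return 2.0 * p * (1.0 - p) * a * a


def variance_share(focal, background):
    """Fraction of total additive variance due to the focal (p, a) locus."""
    focal_var = additive_variance(*focal)
    total = focal_var + sum(additive_variance(p, a) for p, a in background)
    return focal_var / total


# Hypothetical architecture: many common loci with small effects...
common_loci = [(0.5, 1.0)] * 200
# ...plus one large-effect allele tracked as it rises in frequency.
LARGE_EFFECT = 5.0

for p in (0.01, 0.05, 0.30, 0.50, 0.90):
    share = variance_share((p, LARGE_EFFECT), common_loci)
    print(f"large-effect allele at p = {p:.2f}: {share:.1%} of additive variance")
```

    The contribution of any one locus peaks at p = 0.5 and falls away symmetrically on either side, which is why alleles do most of their phenotypic work in the mid-frequency range and why a rare large-effect allele gains influence as it becomes common.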

    They sketch a strategy for examining the evolution of such traits.

    One type of approach will be to identify phenotypes that may have undergone adaptive changes in particular environments, such as adaptations to cold climate, high altitude or novel ecological conditions. To dissect the genetic basis of such adaptations, one might collect phenotyped samples from closely related populations that have and have not experienced the selective pressure of interest and use GWA mapping to identify relevant quantitative trait loci (QTLs). Additionally, one would want to measure the extent of phenotypic adaptation — estimated as the difference in average phenotype between the adapted and non-adapted populations when they are living under matched conditions (exact matching of conditions may be difficult in human studies). Then one could ask: what fraction of the phenotypic difference can be explained by alleles with large versus small frequency differences? Are the phenotypic effect sizes of QTLs with large allele frequency differences greater than those with subtle frequency shifts [10]? What fraction of the phenotypic difference cannot be explained by detected sweep signals or QTLs at all (and hence might result from the cumulative effect of many weak QTLs)?

    In another type of scenario, one might hypothesize that a particular aspect of the environment is an important selective factor (for example, climate or diet) but it is unclear what all the relevant phenotypes are. In this case, we might study adaptation by looking at sets of populations that have independently adapted to the same selective pressures. One type of signal would be alleles that show parallel frequency shifts in response to similar environmental pressures in distantly related populations (although this type of approach is unlikely to be powerful for alleles with very small effects).

    These are exactly the kind of tests that we are working on here at Wisconsin. We have some pretty promising ideas, I think. If you're on a dissertation grant panel, would you please give some money to my students who want to apply these approaches?

    I mean, really, this is the best application of anthropology to develop new genetic approaches, rich in theory and in empirical evidence. Humans are the ideal model organism, because we know the histories and ecologies of different populations. Since the development of agriculture, we've had several ongoing natural selection experiments in our species.

    Nor can we ignore the longer prehistory of human populations. I tend to think that a lot of recent selection has involved new genetic solutions in cases of strong stabilizing selection. A trait like brain size does not evolve under classic directional selection, but instead as a consequence of shifting patterns of stabilizing selection. With intense selection on multiple functions, such traits are constrained in their evolutionary response. Slight frequency changes are not likely to relax such constraints, but a new mutation of large effect might break a long-standing genetic logjam.

    So I think Pritchard and Di Rienzo have outlined many important issues in this review. They have the potential to be highly productive for people with a little talent for applying theory to the data.


  • New data on Ashkenazi population history

    Thu, 2010-08-26 19:37 -- John Hawks

    Bray and colleagues [1] report on genotyping of 471 people of Ashkenazi Jewish descent. This is one of the largest samples of a single human population, and is therefore very interesting for studies of population history and recent natural selection.

    There's a lot in the paper. One of the key findings is that the Ashkenazi population doesn't look bottlenecked -- in fact, it looks outbred compared to Europeans generally. The paper also documents a high amount of admixture with non-Ashkenazi Europeans, ranging from 35% to 55%. Figuring out the actual history of the population -- when and where its ancestors lived and how they interacted with other people -- is beyond the scope of this kind of analysis. But I expect that somebody can put together a really compelling historical account using these data.

    I turned quickly to the issue of selection. They are able to substantiate evidence of positive selection on several disease-causing alleles in the Ashkenazi population, including the Tay-Sachs allele. The lack of evidence for bottlenecks or founder effects pretty much takes away the alternative explanation. Yet they were unable to show statistical evidence of selection on some other disease-causing alleles in Ashkenazi populations:

    To explore whether regions of selection in the AJ population included any loci of known Ashkenazi diseases, we examined 21 disease- and cancer-susceptibility loci with known mutations found at higher frequency in the Ashkenazi population. Only 6 of the 21 genes fell in or near (within 500 kb) the top 5% of the AJ iHS windows (Table 2). Among these is the Tay-Sachs disease gene, HEXA, whose selection has been widely debated (4, 5, 14–16) and was found ~400 kb downstream of a window on chromosome 15 identified in the top 1% of the AJ iHS hits. Although none of the SNPs interrogated immediately adjacent to the HEXA locus showed elevated iHS signals, it is possible that the nearby region may contain regulatory elements under selection that affect HEXA expression. Cochran et al. (14) speculated that selection of many of the AJ-prevalent disease loci, especially the lysosomal diseases, conferred an increase in intelligence that was necessary historically for the AJ economic survival. Our data shows evidence of strong selection at or near only six disease loci, including only one out of the four AJ-prevalent lysosomal storage diseases, thus arguing that most AJ disease loci are not under strong positive selection, but rather rose to their current frequency through genetic drift after a bottleneck. However, we cannot exclude the possibility that selection of some AJ disease loci are outside the limits of detection by the extended haplotype tests, which are known to have less power to detect selection of lower frequency alleles (38, 41).

    It seems to me that this passage probably wasn't written by the same author who showed the lack of evidence for founder effects a few pages before. In this case, the confusion probably comes from the fact that the "detection of positive selection" is actually a refutation of the hypothesis of genetic drift. With a larger sample it will be possible to test the hypothesis with greater power.

    Disease-causing alleles are at low frequencies currently, making them unlikely to rise to the top percentages of the statistics. It would be interesting to control for current frequency, but I haven't seen a test that uses frequency information in this way.

    It's quite remarkable to reflect on the idea that positive selection has now been demonstrated on six disease-causing alleles in the Ashkenazi population. Every one of these is a case of overdominance -- where the heterozygote carrying an allele has some selective advantage, while the homozygote carrying two copies has a disorder. I was having a conversation with a very prominent geneticist a few months ago, who claimed that no case of overdominance in humans had ever been demonstrated except sickle cell. Now, that was obviously false even at the time -- as I pointed out, the many hemoglobinopathies are fairly clear examples. But we've come an awfully long way.
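
    To make the overdominance logic concrete, here is a minimal sketch of the standard one-locus model with heterozygote advantage. The selection coefficients are illustrative assumptions, not estimates from Bray and colleagues: heterozygotes are given fitness 1, normal homozygotes a hypothetical 2 percent disadvantage, and disease homozygotes are treated as lethal. Under those assumptions the disease allele settles at a low equilibrium frequency rather than being lost or fixed.

```python
def equilibrium_frequency(s_normal, s_disease):
    """Equilibrium frequency of the disease allele under heterozygote advantage.

    Fitnesses are measured relative to the heterozygote (fitness 1):
      normal homozygote  = 1 - s_normal
      disease homozygote = 1 - s_disease
    """
    return s_normal / (s_normal + s_disease)


def next_frequency(q, s_normal, s_disease):
    """One generation of selection with random mating; q is the disease allele frequency."""
    p = 1.0 - q
    w_normal = 1.0 - s_normal
    w_disease = 1.0 - s_disease
    mean_fitness = p * p * w_normal + 2.0 * p * q + q * q * w_disease
    # Disease alleles come from heterozygotes (p*q) and surviving disease homozygotes.
    return (p * q + q * q * w_disease) / mean_fitness


# Hypothetical values chosen only for illustration.
S_NORMAL = 0.02   # assumed disadvantage of the normal homozygote
S_DISEASE = 1.0   # disease homozygote treated as lethal

q = 0.001  # start the allele rare
for _ in range(5000):
    q = next_frequency(q, S_NORMAL, S_DISEASE)

q_hat = equilibrium_frequency(S_NORMAL, S_DISEASE)
print(f"simulated frequency: {q:.4f}, predicted equilibrium: {q_hat:.4f}")
```

    With a lethal homozygote, the equilibrium frequency is close to the size of the heterozygote advantage itself, which is why even a real advantage leaves the allele rare.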

    From data like these, we're going to learn a huge amount about low-frequency selected alleles. The Tay-Sachs-causing allele is one of the most common recessive lethal genes in any human population, but like all genes subject to strong selection in homozygotes, it remains rare. Finding selection on these kinds of alleles is very hard unless sample sizes increase to several hundred individuals. Here we are seeing evidence of selection in historic populations -- within the last 2000 years. More will be coming.


    References

    1. Bray SM, Mulle JG, Dodd AF, Pulver AE, Wooding S, and Warren ST. 2010. Signatures of founder effects, admixture, and selection in the Ashkenazi Jewish population. Proceedings of the National Academy of Sciences of the United States of America [Internet] 107:16222–16227. Available from: http://dx.doi.org/10.1073/pnas.1004381107
  • Recent selection, the new paradigm

    Mon, 2010-07-19 23:15 -- John Hawks

    Nicholas Wade gives some recent highlights of research into ongoing selection in humans.

    We are at the center of this research [1], as we connected the widespread pattern of positive selection to human demographic history -- a growing population, with major ecological changes, has both the pressure and opportunity to respond by new adaptive mutations. The result was an acceleration of the rate of positively selected mutations, so that a large proportion of the genome shows evidence of ongoing selective sweeps in one or more human populations. So I'm excited to see the continuing interest in this topic.

    According to Wade's account, the initial skepticism of many geneticists to this idea seems to have mostly evaporated. I think that much of the caution was reasonable conservatism -- few people expected to see such widespread effects of selection. Only those of us who were thinking of the changes in the Neolithic and later were really prepared to interpret the evidence. But now, the sheer accumulation of studies has shown that our initial estimates may have been too conservative.

    About 21 genome-wide scans for natural selection had been completed by last year, providing evidence that 4,243 genes — 23 percent of the human total — were under natural selection. This is a surprisingly high proportion, since the scans often miss various genes that are known for other reasons to be under selection. Also, the scans can see only recent episodes of selection — probably just those that occurred within the last 5,000 to 25,000 years or so. The reason is that after a favored version of a gene has swept through the population, mutations start building up in its DNA, eroding the uniformity that is evidence of a sweep.

    Unfortunately, as Joshua M. Akey of the University of Washington in Seattle, pointed out last year in the journal Genome Research, most of the regions identified as under selection were found in only one scan and ignored by the 20 others. The lack of agreement is “sobering,” as Dr. Akey put it, not least because most of the scans are based on the same Hap Map data.

    From this drunken riot of claims, however, Dr. Akey believes that it is reasonable to assume that any region identified in two or more scans is probably under natural selection. By this criterion, 2,465 genes, or 13 percent, have been actively shaped by recent evolution. The genes are involved in many different biological processes, like diet, skin color and the sense of smell.

    That's 13 percent with statistical evidence in two or more studies. Keep in mind that our present sample size is small enough that we can't reject the hypothesis of genetic drift on things that have frequencies lower than ten percent in a given population. So probably the variants we know about are the tip of a larger iceberg of rare selected variants, which originated within the last few thousand years and haven't had time to increase to higher frequencies. Some may have stalled out at lower frequencies, because of epistases or changes in the environment.

    The proportion of affected genes should approach some asymptote, as lower-frequency variants will be likely to hit the same gene categories again and again. Diet, skin color, smell, disease, brain: all are systems that have been under strong selection pressure in recent human evolution. That may provide a promising way to uncover functional relationships among genes. Wade's description of Anna Di Rienzo's work seems to be along those lines.

    Many workers seem to realize now that humans don't live in hunter-gatherer environments. But a disappointment for me is that the article doesn't discuss the role of demography in generating this unique evolutionary pattern. Demography provides an important filter on the results of genome-wide analyses, also. The power of statistical methods is not uniform across different ages of adaptive alleles. Some methods miss older events while all methods miss very recent ones.

    Statistical power is an important reason why some studies find more evidence of selection in Europe and East Asia compared to Africa. The demography of those regions means that Africa has a broader distribution of ages of positively selected mutations: more older events, fewer events corresponding to the peak population growth of early agriculturalists.

    There is some stuff in the article about "soft sweeps" -- the hypothesis that much recent phenotypic change may result from selection on standing genetic variation in ancient populations. An allele that already existed neutrally in the population can come under new selection, and that kind of selection won't trigger the criteria for genome-wide selection scans.

    I have some thoughts about this phenomenon that I'll write up and share. We know that there were some big phenotypic changes in the Late Pleistocene and early Holocene, and initially these changes should mostly have involved standing genetic variation. New adaptive mutations were coming into these populations at a relatively slow rate. When a new mutation is still rare, it doesn't have much impact on the average phenotype in the population. So if we see a fast change to the average phenotype, we know that new mutations aren't responsible, at least not initially.
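
    The arithmetic behind that last point can be sketched in a few lines. This assumes a purely additive allele, and the effect size is hypothetical and measured in phenotypic standard deviations. The shift in the population mean is the expected number of copies per person (twice the allele frequency) times the effect per copy.

```python
def mean_shift(allele_freq, effect_per_copy):
    """Shift in the population mean phenotype caused by an additive allele.

    Each person carries 2 * allele_freq copies on average, and each copy
    adds effect_per_copy to the phenotype.
    """
    return 2.0 * allele_freq * effect_per_copy


# Hypothetical effect size: half a standard deviation per copy.
EFFECT = 0.5

for freq in (0.001, 0.01, 0.1, 0.5):
    shift = mean_shift(freq, EFFECT)
    print(f"allele frequency {freq:>5}: mean shifts by {shift:.3f} SD")
```

    Even with a large per-copy effect, an allele at one percent moves the population mean by only a hundredth of a standard deviation under these assumptions.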

    But it doesn't take very many genes to cause phenotypic changes. And if small populations have few new adaptive mutations, they also have relatively little standing variation. So the importance of soft sweeps to our evolution may be great, even if their numbers are ultimately small.


    References

    1. Hawks J, Wang ET, Cochran G, Harpending HC, and Moyzis RK. 2007. Recent acceleration of human adaptive evolution. Proceedings of the National Academy of Sciences, U. S. A. [Internet] 104:20753–20758. Available from: http://dx.doi.org/10.1073/pnas.0707650104
  • More on Tibet, demography and selection

    Tue, 2010-07-06 12:30 -- John Hawks

    My post about the Tibetan high altitude selection story last Friday summarized the research and included some criticism of the demographic model applied in the paper by Yi and colleagues. This weekend, I had some correspondence from study coauthor Rasmus Nielsen.

    Nielsen was kind enough to provide a lot of information about how they arrived at their demographic model. Also, his comments are of substantial interest as a perspective on science journalism. I have posted them in their entirety, and have added my own perspective below them. Click through to read on:

    Nielsen:

    I read your blog on the EPAS1 gene. You write that my answers to Nicholas Wade in the NYT article are lame. I couldn't agree more. Reading the quotes Wade put together from a long phone interview and two replies to follow-up requests by email for further information - I could get quite convinced about my lameness myself. Let me give you our side of the story:

    (1) Regarding effective population size estimation: we fit several different demographic models to the data. The best fitting one according to the Akaike information criterion was chosen in the paper to use for the coalescence simulations. But notice that we made no strong claims about population sizes in the paper. They appear in the supplementary information to ensure that other people could reproduce our study. The main objective for fitting a demographic model was to allow us to perform coalescence simulations under a model that fit the data well. The model described in the paper fits the data very well and was the best fitting model we could find. As such - it was our best option for how to calculate p-values - and was certainly, in our opinion, better than providing no p-values, or use p-values based on some simpler model that did not fit the data. Had we used another model with different values of Ne, we would have obtained less accurate p-values.

    However, we did not interpret the effective population size estimates strongly - mostly because we do not believe they have very much to do with census population sizes. I would argue that this is true for both this study and other similar studies on other populations. Estimated effective population sizes are not only a function of changes in population size, natural selection, male/female ratios and variance in offspring number. They also rely on the structure of the populations. A population organized into many small sub-populations might have an Ne that is substantially larger than N, while a population without sub-structure might have a much smaller Ne than the census size if there have been fluctuations in the population size or higher variance in offspring number than that expected from a Poisson. Therefore, it is wrong to interpret estimates of Ne as estimates of actual number of individuals - or to believe that there is some simple general relationship between effective population size and true number of individuals. For this reason, we did purposefully not provide an interpretation of the estimates of Ne in terms of actual values of N and I feel that our work is not being represented accurately by arguing that we obtained estimates of the number of Han individuals or Tibetans living 3000 years ago. That does not mean that we cannot try to understand why we get such a small Ne for Hans 3000 years ago and such a large estimate for Tibetans. The most likely explanation for the Hans is that there have been other bottlenecks that we have not modeled - before or after. If we estimate Ne for Europeans today using a model that does not take all the bottlenecks into account, we get estimates of about 5-15,000 individuals. I don't think anybody would claim that there are only 5-15,000 Europeans alive today. Similarly, our estimate for Ne for the Hans 3000 years ago is in the hundreds presumably because there were some previous bottlenecks that we have not modeled.
    Ancestral bottlenecks can be extremely hard to date from frequency spectrum data - and you end up getting the same likelihood for a long time period with small population sizes and a short time period with extremely small population sizes. There have been several published papers making this point, the first one I believe to be Adams and Hudson. 2004. Genetics 168:1699-171. Changing our model to having a larger population size 3000 years ago but with an appropriately modeled preceding bottleneck would produce more or less the same p-values - because it would produce the same expected frequency spectrum (or at least something very similar).

    Regarding the large Tibetan population size, it may likely be affected by population structure within Tibet and/or by admixture with other individuals. Both of these factors would inflate the estimate of Ne. We did try some other models - but ended up choosing this particular model because it fit the data the best. It seemed, therefore, most appropriate for the coalescence simulations. Again, I want to emphasize that we did not attempt to estimate number of individuals living in particular places during particular times - we were interested in finding a model which fit the distribution of allele frequencies well so that we at least could make some attempt at estimating relevant p-values. We never claimed that there were just a few hundred Han individuals alive 3000 years ago - in the same way that we are not arguing that there are only 5-15,000 Europeans alive today.

    (2) Regarding the divergence time: none of the models we fitted could explain the data with a divergence time much larger than 3000 years. If you look at the figure in the paper, you can see that there is an extremely strong correlation between the allele frequencies in Hans and in Tibetans. This is very difficult to explain with a long divergence time of genetically separated populations. To maintain such a strong correlation for a large amount of time, the Tibetan population (and the Han population) would have to be enormously large - and this is incompatible with the observed levels of variation in the population. We could not find a model that fit the data and which included a large divergence time no matter what we did. But there are of course many factors going into these estimates - including a calibration of number of mutations with the chimp, a number of demographic assumptions, and assumptions regarding generation times. If we are making errors on these assumptions - the estimates could change in one way or another. For that reason I feel it is most conservative to avoid arguing that our analysis definitely rejects that the divergence time could be 6000 years. The main objective of the paper was after all to investigate the evolution of altitude adaptation. The demographic analysis was there mostly to allow us to do the coalescence simulations - but we also used them to make the argument that this selection has occurred quite recently - and not say 10k or 20k years ago. It is quite clear from the data that such long divergence times cannot be supported by the data.

    This being said, we of course want to know if this short genetic divergence time is compatible with other evidence. I would argue that it is. There have been several migrations into Tibet. It is entirely compatible with the archaeological record that individuals living in Tibet today genetically mostly are descendants of migrants arriving around 3000 years ago even though the first migrants appeared much earlier. In terms of the selection - and when it has been acting - we want to determine when selection acted to increase the frequency on EPAS1 mutations in the ancestry of the individuals living in Tibet today. If they are genetically descendants of individuals migrating into Tibet just a few thousand years ago - then this is the relevant data for describing when selection has been acting on the EPAS1 mutations. As an aside I should also say that this has nothing to do with when the mutation(s) arose. Selection has in this case most likely been acting on standing variation.

    You argue in your blog that more could be done with this data in terms of demography. We agree. The paper was about altitude adaptation not demography. We are still working on the data and are hoping to produce a follow-up paper on the demographic analyses. We weren't sure how much interest there would be in the results - but the interest from you and other people in this is certainly a motivation to keep working on it as hard as possible.

    I hope you will post this reply on your blog and comment on it. If you do so - I would ask that you post it in its entirety. I learned a lot from the interview with Wade. I certainly now understand why politicians keep giving the same 2-line reply over and over again to journalists asking them questions. If a journalist talks sufficiently long with an interviewee - it will be possible for them to find some sentences that they can put together in some way to make the interviewee look foolish - if that's what they want to do.

    Me:

    Thanks so much for writing with this! I will of course post your comments, and I appreciate very much the time you spent detailing the work, especially on a holiday weekend.

    What you've written here basically agrees with my take on the text of your paper: the demographic model is useful as a test because it is conservative; it is not an attempt at population history. I've reviewed effective size at some length [readers can find a review that I wrote, and I can forward reprints on request]. As you write, this study does not differ substantially from many others in the use of effective size estimates.

    As an anthropologist I am very concerned at the proliferation of population models that are nonsensical from a demographic standpoint. Yes, the p-value will be much the same for EPAS1, but the model is hugely conservative with respect to anything with less extreme differentiation. Other studies are essentially alike; lowball demographic numbers are useful in their conservatism but give an incorrect view about the relation of demography and selection.

    Besides, you have to consider the mechanism by which the best-fit model has come to be so extreme. As you note, the effective size estimated under the assumption of neutrality actually will reflect the non-neutral dynamics across the exome. The HapMap doesn't give rise to anything like the model of an extreme and recent bottleneck that the exome data yield, yet of course both these genome-wide sets must have undergone the same demography. The difference is that the exome is limited to the coding fraction of the genome, pointing to selection on some (probably large) fraction of coding loci. The small effective size within the last 3000 years is mathematically equivalent to a statement that the data include genealogies with many coalescences in those 3000 years. Again, this doesn't happen in a population of hundreds of thousands of individuals unless there was rapid selection.
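
    A rough sketch of that equivalence, under a neutral Wright-Fisher model with constant diploid size: the chance that two lineages find a common ancestor within a given number of generations depends only on the effective size. The 25-year generation time and the two population sizes below are assumptions chosen for illustration, not values taken from the paper.

```python
def prob_coalesce_within(generations, effective_size):
    """Probability that two lineages coalesce within `generations`.

    Assumes a neutral, constant-size diploid Wright-Fisher population, where
    two lineages coalesce in any one generation with probability 1 / (2N).
    """
    per_generation = 1.0 / (2.0 * effective_size)
    return 1.0 - (1.0 - per_generation) ** generations


YEARS = 3000
YEARS_PER_GENERATION = 25          # assumed generation time
generations = YEARS // YEARS_PER_GENERATION

# Hypothetical sizes: a bottleneck-scale Ne versus a large census-scale population.
for size in (500, 200_000):
    prob = prob_coalesce_within(generations, size)
    print(f"N = {size:>7}: P(coalescence within {YEARS} years) = {prob:.4f}")
```

    Under neutrality the large population yields almost no coalescences in this window, so a data set rich in recent coalescences has to be explained either by a very small effective size or by something other than drift.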

    So it seems to me that the data must reflect the high incidence of recent selection within mainland China. This is exactly what we expect based on the real demography of massive population growth across the same interval and adaptation to post-agricultural ecologies. Although the headline of the paper is about high altitude adaptation in Tibet, the real story is the massive selection in China of other genes.

    If this is correct, then I think there is much promising work to do by using real demographic estimates. Deriving the demographic model from the data themselves is really just throwing away useful information that is abundantly documented archaeologically and historically.
