john hawks weblog

paleoanthropology, genetics and evolution

gene-phenotype associations

  • Polygenic traits and directional selection

    Sat, 2010-09-18 13:41 -- John Hawks

    This has been an eventful week for those of us who study the dynamics of recent selection in humans. The most significant event was the publication of a paper describing genetic analysis of a long selection experiment in Drosophila. Although the experiment differs from most natural instances of selection in some important ways, the results give some very helpful corroboration that the recent human pattern of adaptive evolution was rapid and of an expected pattern for massive selection on many traits.

    Meanwhile, Jonathan Pritchard and Anna Di Rienzo have a short review in the current Nature Reviews Genetics [1], discussing the idea that a large fraction of adaptive evolution may be difficult to find with current genetic evidence.

    Their idea is that polygenic adaptations are unlikely to occur by successive "sweeps" of new adaptive mutations.

    It seems likely to us that, as in traditional quantitative genetic models, many — possibly even most — adaptive events in natural populations occur by polygenic adaptation. Polygenic adaptation could allow rapid adaptive shifts, yet would often go undetected using conventional methods for detecting selection. Moreover, polygenic adaptation is qualitatively different from the models of adaptive substitutions that dominate the population genetics literature.

    This is not a new idea, but Pritchard and Di Rienzo review it in a productive way, and the topic is worth some deeper thought...

    An adaptive genetic substitution is often modeled as an episode of logistic growth. A new mutation, initially in a single copy, increases exponentially in numbers until it is very common in the population. After this point, it continues to increase in frequency up to fixation, but progressively more slowly. The entire process takes hundreds or a few thousand generations, which sounds like a long time but is actually very rapid compared to the deep genealogical histories of most genetic loci. The initial rapid increase in numbers carries a region of linked sequence along with the selected variant. This "hitchhiking" region is highly visible because of the co-association of nearby allelic variants. Thus, if such a "sweep" is ongoing, we should have little trouble finding it. In humans we've found a lot of them, which is a big piece of evidence for the rapidity of human evolution during the past 40,000 years.
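
    To put rough numbers on that logistic picture, here is a minimal sketch in Python. It assumes a deterministic model of genic selection, in which the odds p/(1 - p) of the favored allele grow by a factor of (1 + s) each generation. The selection coefficient (s = 0.02) and the population size (10,000 diploid individuals) are made-up values chosen only for illustration; they are not estimates from any of the papers discussed here, and the function names are hypothetical.

```python
def sweep_trajectory(s, n_diploid, p_stop=0.99):
    """Deterministic frequency path of a favored allele under genic selection."""
    p = 1.0 / (2 * n_diploid)  # a new mutation starts as a single copy
    path = [p]
    while p < p_stop:
        # Carriers leave (1 + s) offspring for every 1 left by non-carriers,
        # so the odds p / (1 - p) are multiplied by (1 + s) each generation.
        p = p * (1 + s) / (1 + s * p)
        path.append(p)
    return path


def generations_between(path, lo, hi):
    """Count the generations the allele spends at frequencies in [lo, hi)."""
    return sum(1 for p in path if lo <= p < hi)


# Illustrative values only (assumptions, not estimates).
path = sweep_trajectory(s=0.02, n_diploid=10_000)
print("generations to reach 99%:", len(path) - 1)
print("generations spent below 10%:", generations_between(path, 0.0, 0.10))
print("generations spent from 10% to 90%:", generations_between(path, 0.10, 0.90))
print("generations spent from 90% to 99%:", generations_between(path, 0.90, 0.99))
```

    Under these assumptions the allele spends most of its history at low frequency, crosses the middle of the range comparatively quickly, and then slows again on the way to fixation. A real sweep also involves genetic drift, especially while the allele is rare, which this sketch ignores.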

    But all that describes the dynamics of a single, strongly selected, mutation. What if a trait comes under selection, but the variation in the trait is explained not by a single gene, but by dozens or hundreds of genes? Pritchard and Di Rienzo outline such a scenario:

    The key point is that we should expect such an adaptation to occur by small allele frequency shifts spread across many loci. As a hypothetical example, consider the adaptation of human height — a trait for which there are probably hundreds of SNPs that each affect height by a few millimeters. Strong selection for increased height could be very effective, as height is extremely heritable. But at the level of individual SNPs, the effect of selection would be rather weak, exerting just a small upward pressure in favour of each of hundreds of 'tall' alleles. Suppose that at 500 SNPs, the tall alleles each increase the expected height of a person by 2 mm. Then, an average shift of just 10% in the population allele frequency of each tall allele would increase average height in the population by 20 cm (assuming that SNPs contribute additively). Although these numbers are hypothetical, they illustrate that, for a highly polygenic trait, a dramatic adaptive response could result from modest allele frequency changes at many loci. This model is different from classical sweep models. Most importantly, adaptation could occur without dramatic allele frequency changes and without adaptive fixation events.
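
    The arithmetic in that hypothetical is easy to check. The sketch below assumes, as I read the quote, that the 2 mm effect is per copy of a tall allele and that each person carries two copies of every locus; under that reading a 10% frequency shift adds 0.2 tall alleles per person per locus on average. The variable names are mine.

```python
n_snps = 500            # number of height-affecting SNPs in the hypothetical
effect_mm = 2.0         # height added by each copy of a 'tall' allele, in mm
freq_shift = 0.10       # average increase in the frequency of each tall allele
copies_per_person = 2   # assumption: diploid, two copies of every locus

# The expected tall-allele count per person at one locus is 2 * p, so a shift
# of freq_shift changes expected height by 2 * freq_shift * effect_mm there.
shift_per_locus_mm = copies_per_person * freq_shift * effect_mm
total_shift_cm = n_snps * shift_per_locus_mm / 10.0

print(f"per-locus shift: {shift_per_locus_mm:.1f} mm")
print(f"total shift: {total_shift_cm:.1f} cm")
```

    So the 20 cm figure follows from additivity across loci and across the two copies at each locus. Each individual locus contributes well under a millimeter.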

    But the description isn't precisely what would happen in the case of selection on stature. Consider:

    1. It is true that alleles that already exist in the population provide the most immediate opportunity for change under directional selection. Any short-term phenotypic evolution we see is likely to be caused by changes in the frequency of standing variants.

    2. Some of the alleles that affect stature are constrained by their effects on other phenotypes. They might not change, even under directional selection on stature.

    3. Stature may be affected by hundreds of loci, but these do not account for equal proportions of the additive variance. Loci are subject to selection roughly in proportion to the additive variance in fitness they explain. Directional selection on stature will change the allele frequencies for a few loci quite a bit more quickly than most.

    The distribution of effect sizes is fairly well known for stature in humans. For example, Park and colleagues [2] this spring plotted the distribution of effect sizes for variants discovered by GWAS in 63,000 Europeans:

    Figure: Effect size distribution of variants found to explain heritability of stature, Crohn's disease and BPC (breast, prostate and colorectal) cancers in human genome-wide association studies

    In the figure, (a) is based on observed loci -- for stature, this includes 30 loci that reached significance in the GWAS without follow-up genotyping. There is a pretty severe ascertainment bias against small effect sizes, so curve (b) attempts to model the actual distribution correcting for ascertainment. Curve (c) is normalized to give the three conditions the same observed range.

    You can see that if we suddenly started selecting for height, most of the genetic response would come from a very small proportion of the loci that explain the current additive variance. These would be the subset of loci in the large-effect-size tail of the distribution, excluding those that are constrained by their role in other phenotypes under selection.

    4. As an allele becomes common enough (going up toward fixation), the locus will account for less and less of the additive variance in fitness. To maintain the same response to selection, other alleles must pick up the slack. Over time, groups of different alleles will come into focus of selection, sort of like the "cover flow" feature of an iPod. Some alleles increase in frequency across a transient in the mid-frequency range, only to be gradually replaced by others. Most of the phenotypic change occurs as alleles cross rapidly from 40 to 60 percent or so.

    5. A few loci will be special. These account for an appreciable fraction of additive variance even though the favored allele is very rare. As they become common, these favored alleles change in frequency more and more rapidly, and account for more and more of the additive variance. They suck up the oxygen of selection. These alleles will look like a classic sweep.

    6. Over many generations, new mutations may occur that also have strong effects on the trait. They will follow the "special" pattern described in 5.
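
    Points 3 through 5 can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a model fitted to any data. It treats each locus as contributing additive variance 2p(1 - p)a², where p is the frequency of the favored allele and a is its effect, and it assumes the selection coefficient on each allele is proportional to its effect, so that the per-generation change is approximately intensity × a × p(1 - p). The three loci, their effect sizes and the selection intensity are all made up.

```python
def additive_variance(p, a):
    """Additive variance from one biallelic locus: 2 * p * (1 - p) * a**2."""
    return 2.0 * p * (1.0 - p) * a ** 2


def step(p, a, intensity):
    """One generation of weak directional selection on the trait.

    Assumption: the selection coefficient on an allele is intensity * a,
    so the frequency change is approximately intensity * a * p * (1 - p).
    """
    return p + intensity * a * p * (1.0 - p)


# Hypothetical loci: (label, starting frequency, effect per allele copy in mm)
loci = [
    ("common, small effect", 0.50, 1.0),
    ("common, large effect", 0.50, 5.0),
    ("rare, large effect", 0.01, 5.0),
]
intensity = 0.005  # made-up strength of selection per mm of effect

freqs = {name: p for name, p, _ in loci}
effects = {name: a for name, _, a in loci}
for generation in range(0, 401):
    if generation % 100 == 0:
        # Report each locus's frequency and its share of the additive variance.
        total = sum(additive_variance(freqs[n], effects[n]) for n in freqs)
        shares = ", ".join(
            f"{n}: p={freqs[n]:.2f}, "
            f"share={additive_variance(freqs[n], effects[n]) / total:.0%}"
            for n in freqs
        )
        print(f"gen {generation}: {shares}")
    for n in freqs:
        freqs[n] = step(freqs[n], effects[n], intensity)
```

    With these made-up values, the common large-effect locus accounts for most of the variance at the start and then fades as its favored allele nears fixation, while the rare large-effect allele takes over the response as it climbs through the middle frequencies. Pleiotropic constraints, drift and new mutations (points 2 and 6) are left out.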

    The question is how many loci of this type can we expect to exist? We all know that there are two patterns that could account for the heritability of traits like stature, where no common variants have very strong effects. Either the additive variance is spread across many rare variants with large effects, or instead across many common variants with small effects. Pritchard and Di Rienzo's scenario accentuates the second of these -- a small frequency change in many common variants with small effects.

    But if even a small fraction of the additive variance is explained by a few rare variants with strong effects, these may cause most of the phenotypic change, and may look a lot like a standard selective sweep.

    Pritchard and Di Rienzo note that the two options -- a rapid sweep at one or a few loci, versus slight frequency changes at many loci -- are not mutually exclusive. Most cases of directional selection on phenotypes may involve both patterns. If so, that will be very helpful, because we can use the easy-to-find sweeps to target analysis of harder-to-find frequency changes.

    They sketch a strategy for examining the evolution of such traits.

    One type of approach will be to identify phenotypes that may have undergone adaptive changes in particular environments, such as adaptations to cold climate, high altitude or novel ecological conditions. To dissect the genetic basis of such adaptations, one might collect phenotyped samples from closely related populations that have and have not experienced the selective pressure of interest and use GWA mapping to identify relevant quantitative trait loci (QTLs). Additionally, one would want to measure the extent of phenotypic adaptation — estimated as the difference in average phenotype between the adapted and non-adapted populations when they are living under matched conditions (exact matching of conditions may be difficult in human studies). Then one could ask: what fraction of the phenotypic difference can be explained by alleles with large versus small frequency differences? Are the phenotypic effect sizes of QTLs with large allele frequency differences greater than those with subtle frequency shifts? What fraction of the phenotypic difference cannot be explained by detected sweep signals or QTLs at all (and hence might result from the cumulative effect of many weak QTLs)?

    In another type of scenario, one might hypothesize that a particular aspect of the environment is an important selective factor (for example, climate or diet) but it is unclear what all the relevant phenotypes are. In this case, we might study adaptation by looking at sets of populations that have independently adapted to the same selective pressures. One type of signal would be alleles that show parallel frequency shifts in response to similar environmental pressures in distantly related populations (although this type of approach is unlikely to be powerful for alleles with very small effects).

    These are exactly the kind of tests that we are working on here at Wisconsin. We have some pretty promising ideas, I think. If you're on a dissertation grant panel, would you please give some money to my students who want to apply these approaches?

    I mean, really, this is the best application of anthropology to develop new genetic approaches, rich in theory and in empirical evidence. Humans are the ideal model organism, because we know the histories and ecologies of different populations. Since the development of agriculture, we've had several ongoing natural selection experiments in our species.

    Nor can we ignore the longer prehistory of human populations. I tend to think that a lot of recent selection has involved new genetic solutions in cases of strong stabilizing selection. A trait like brain size does not evolve under classic directional selection, but instead as a consequence of shifting patterns of stabilizing selection. With intense selection on multiple functions, such traits are constrained in their evolutionary response. Slight frequency changes are not likely to relax such constraints, but a new mutation of large effect might break a long-standing genetic logjam.

    So I think Pritchard and Di Rienzo have outlined many important issues in this review. They have the potential to be highly productive for people with a little talent for applying theory to the data.


    References

  • Sergey Brin and genetic research

    Wed, 2010-07-07 08:30 -- John Hawks

    While I was out of town, Wired ran a long article about Google cofounder Sergey Brin and his quest to find the genetic causes of Parkinson's disease. There is much of interest here. The piece gives an account of present-day genomic research from a unique point of view.

    Brin is a smart person, with a family history of Parkinson's and knowledge that he carries a risk allele. So he is directing a lot of money and attention toward new ways of approaching gene-disease associations. He is one of the major financial backers of the direct-to-consumer genomics company 23andMe, and the husband of its cofounder Anne Wojcicki. Google, of course, has prospered by making unconventional uses of data. That's an approach that many are starting to apply to science:

    Increasingly, though, scientists—especially those with a background in computing and information theory—are starting to wonder if that model could be inverted. Why not start with tons of data, a deluge of information, and then wade in, searching for patterns and correlations?

    This is what Jim Gray, the late Microsoft researcher and computer scientist, called the fourth paradigm of science, the inevitable evolution away from hypothesis and toward patterns. Gray predicted that an “exaflood” of data would overwhelm scientists in all disciplines, unless they reconceived their notion of the scientific process and applied massive computing tools to engage with the data. “The world of science has changed,” Gray said in a 2007 speech—from now on, the data would come first.

    I think that "fourth paradigm" probably overdignifies the approach, which looks like a regression to a naive positivism. As described in the book, The Fourth Paradigm: Data-Intensive Scientific Discovery, the idea is rather more -- a unification of theory and massive amounts of data. Data really do speak for themselves, say Fourth Paradigmers, but they speak quietly with a lot of noise drowning them out. So if you collect vast amounts of data, you have a chance to sort out the whispers of real associations from all the junk.

    The article gives a vivid example:

    Langston offers a case in point. Last October, the New England Journal of Medicine published the results of a massive worldwide study that explored a possible association between people with Gaucher’s disease—a genetic condition where too much fatty substances build up in the internal organs—and a risk for Parkinson’s. The study, run under the auspices of the National Institutes of Health, hewed to the highest standards and involved considerable resources and time. After years of work, it concluded that people with Parkinson’s were five times more likely to carry a Gaucher mutation.

    Langston decided to see whether the 23andMe Research Initiative might be able to shed some insight on the correlation, so he rang up 23andMe’s Eriksson, and asked him to run a search. In a few minutes, Eriksson was able to identify 350 people who had the mutation responsible for Gaucher’s. A few clicks more and he was able to calculate that they were five times more likely to have Parkinson’s disease, a result practically identical to the NEJM study. All told, it took about 20 minutes. “It would’ve taken years to learn that in traditional epidemiology,” Langston says. “Even though we’re in the Wright brothers early days with this stuff, to get a result so strongly and so quickly is remarkable.”

    But there are a few stumbling blocks. Unknown associations are relatively weak. Most of the phenotypes are polygenic. With heritabilities less than one, there remain unknown environmental causes of most phenotypes, which are not captured in genetic data and which may interact with different genes.

    At present, I don't think anyone in genetics is really operating on a "Fourth Paradigm" level. The massive datasets are building, but few authors are working with existing population genetic theory in ways that would enhance the pattern-matching exercise. If you look through papers describing genome-wide association studies, there are a lot of bivariate statistics, and some multivariate descriptive statistics (like principal components analysis). About all the theory is case-control statistical design. Look through a paper on genetic variation and you're likely to see a STRUCTURE analysis and some coalescent simulations.

    The article puts this well:

    'We have no grand unified theory,' says Nicholas Eriksson, a 23andMe scientist. 'We have a lot of data.'

    Genetic data today are a huge contrast from the past, both in sample size and in coverage. There are a lot of new low-hanging fruit. In the future, the easy stuff will be gone and theory will become more and more important. It remains unclear to me how much progress on health may be made by pattern-matching alone, and how much will require new theoretical advances. Given the problems explaining heritability so far, it may be that we'll need new theory sooner rather than later.

    Genetic data are slowly being joined by environmental data of various kinds. The article contextualizes the study of environmental variables by telling the story of the initial discovery and long use of aspirin, followed by the slow realization, after the drug had been in common use for a long time, that long-term use has health interactions of its own.

    The second coming of aspirin is considered one of the triumphs of contemporary medical research. But to Brin, who spoke of the drug in a talk at the Parkinson’s Institute last August, the story offers a different sort of lesson—one drawn from that period after the drug was introduced but before the link to heart disease was established. During those decades, Brin notes, surely “many millions or hundreds of millions of people who took aspirin had a variety of subsequent health benefits.” But the association with aspirin was overlooked, because nobody was watching the patients. “All that data was lost,” Brin said.

    The answer is simple: Collect all the data and see what percolates out of them. Heck, probably Google already has enough data about everybody based on their web searches, if they could just connect those to the 23andMe database. If you have a few years of web searches, I wonder just how much that tells you about a person's other phenotypes?

    Remember (as the article points out), Google is the company that can predict flu outbreaks faster than the CDC.

    Still, aside from the obvious technological progress, I'm a little more sober about the prospects of making rapid health improvements. Consider:

    This approach—huge data sets and open questions—isn’t unknown in traditional epidemiology. Some of the greatest insights in medicine have emerged from enormous prospective projects like the Framingham Heart Study, which has followed 15,000 citizens of one Massachusetts town for more than 60 years, learning about everything from smoking risks to cholesterol to happiness. Since 1976, the Nurses Health Study has tracked more than 120,000 women, uncovering risks for cancer and heart disease. These studies were—and remain—rigorous, productive, fascinating, even lifesaving. They also take decades and demand hundreds of millions of dollars and hundreds of researchers. The 23andMe Parkinson’s community, by contrast, requires fewer resources and demands far less manpower. Yet it has the potential to yield just as much insight as a Framingham or a Nurses Health. It automates science, making it something that just … happens. To that end, later this month 23andMe will publish several new associations that arose out of their main database, which now includes 50,000 individuals, that hint at the power of this new scientific method.

    Today's sequencing techniques make it much cheaper to do some things that used to be very expensive. But we've done a lot of gold-plated medical studies, and have more coming soon. The most important barrier to progress is not the lack of money; it is the difficulty of altering biological systems without adverse complications.

  • Drug discovery and GWA

    Sun, 2010-02-14 11:59 -- John Hawks

    Gene Expression's p-ter makes an interesting point about weak genome-wide associations and drug development.

    Any doctor knows where I'm going with this: one of the best-selling groups of drugs in the world currently are statins, which inhibit the activity of (the gene product of) HMGCR. Of course, statins have already been invented, so this is something of a cherry-picked example, but my guess is that there are tens of additional examples like this waiting to be discovered in the wealth of genome-wide association study data. Figuring out which GWAS hits are promising drug targets will take time, effort, and a good deal of luck; in my opinion, this is the major lesson from Decode (which is not all that surprising a lesson)--drug development is really hard.

    Yes, figuring out gene functional networks is the hard part; also, how alleles may interact in unexpected ways with different genetic backgrounds.

  • Hunting a myostatin SNP-phenotype association

    Wed, 2009-08-12 15:19 -- John Hawks

    I happened to be reading some literature on myostatin today and ran across a recent paper (Kostek et al. 2009).

    Conclusion: MSTN 2379 A > G and FST -5003 A > T were associated with baseline muscle strength and size among African Americans only. These ethnic-specific associations are hypothesis generating and should be confirmed in a larger sample of African Americans.

    From that description, it looks like a gene-population association in the same vein as APOE, where an Alzheimer’s risk allele predicts disease incidence well in Europeans but not Africans. Myostatin regulates muscle growth, so here the idea would be that an allele has an effect that depends on genetic background, to the extent that it might affect the allele's evolution in those populations. (Saunders et al. (2006) found that myostatin has two common alleles within Africa that look like they may have been recently selected; the two alleles are rare outside Africa.)

    Well, looking more deeply into the sample, we find that it’s not so impressive as it might look:

    Results: Baseline MVC was greater among African Americans who were carriers of the MSTN G2379 allele (AG/GG, n = 15) than the A2379A homozygotes (n = 8; 64.2 ± 6.8 vs 49.8 ± 8.7 kg). African Americans who were carriers of the FST T-5003 allele (n = 12) had greater baseline 1RM (11.9 ± 0.7 vs 8.8 ± 0.5 kg) and CSA (24.4 ± 1.3 vs 19.1 ± 1.2 cm²) than African Americans with the A-5003A genotype (n = 14; P < 0.05). No MSTN or FST genotype and muscle phenotype associations were found among the other ethnic groups (P ≥ 0.05).

    Those tiny sample sizes (n = 15, n = 8) come from stratifying a much larger sample (n = 645) into ancestry groups. The very large European component of that large sample (n = 509) showed no gene-phenotype associations. What’s left is a significant result (p < 0.05) considering only 23 people.

    This may not be unusual – this allele is rare in Europeans, so the two samples may be pretty close to each other in statistical power. But it’s not exactly a vote of confidence in favor of a large effect size for the allele, even within the small African-American sample.

    That’s a common story in gene-phenotype association studies. Possibly it will replicate in a larger sample – and hopefully if it doesn’t replicate, somebody will still publish the result so that we’ll know about it.

    References

       Kostek MA, et al. 2009. Myostatin and follistatin polymorphisms interact with muscle phenotypes and ethnicity. Medicine and Science in Sports and Exercise 41:1063–1071. doi:10.1249/MSS.0b013e3181930337.

       Saunders MA, et al. 2006. Human adaptive evolution at Myostatin (GDF8), a regulator of muscle growth. Am J Hum Genet 79:1089–1097. doi:10.1086/509707.
