john hawks weblog

paleoanthropology, genetics and evolution

HapMap

  • Finding more Neandertal genes, chromosome 19 edition

    Thu, 2011-03-31 18:46 -- John Hawks

    When I last wrote about the Neandertal genome, I showed that across the X chromosome, Europe and China have different Neandertal genes. There is overlap between the two, but as a generalization few Neandertal haplotypes that are common in Europe are also common in China, and vice-versa. I described the basic method for finding Neandertal haplotypes in recent people last month ("Neandertal segments of X chromosomes").

    Almost all of the Neandertal haplotypes found in the X chromosomes of recent people are relatively rare, occurring in fewer than 10 percent of individuals. The largest fraction of Neandertal haplotypes occur in only a single person in the HapMap samples.

    But is this a pattern that occurs on the autosomes, or does it reflect X chromosome dynamics in some way?

    That's not a hard question to answer, and I went looking first at chromosome 19. The number of haplotypes is fewer, because chromosome 19 is shorter than the X. The overall pattern is the same. Most Neandertal haplotypes are rare in the HapMap samples, and relatively few are common in both the CEU and CHD samples.

    Neandertal haplotypes on chromosome 19 histogram in CEU and CHD HapMap samples

    I put the origin at the rear; CEU (European ancestry in Utah) number of copies goes toward the left, CHD (Chinese immigrants in Denver) toward the right. You can see that most of the cases are clumped on the extreme edge of both axes. There are not higher counts in CHD; the two axes are at different scales because of one extremely common region in Europeans, as noted below.

    I've received a few comments on the 3-d histograms. I don't like them much, either, and I'm looking for an alternative. This one in particular is miserable; because it's out of scale. I'd like to plot these in 2-d using shading to denote bin counts. Unfortunately I haven't found a quick and dirty program that will do this in 2-d, and I've got too wide a range of bin counts for a bubble plot to do it without a lot of tweaking. So I'm stuck with these for now. I can either write about them and share them or spend my time finding a better graphing solution.

    I've done a few more comparisons. When we look for Neandertal 10-SNP haplotypes in CEU versus TSI (the sample from Tuscany), we find mostly the same haplotypes in both samples. A haplotype in 10 copies in CEU is certain to be in TSI, and vice-versa.

    Neandertal haplotypes on chromosome 19 histogram in CEU and TSI HapMap samples

    Number of copies in CEU goes across the bottom, TSI back into the picture. This is such a striking difference from the CEU-CHD comparison. It's very comforting to me, because this is totally the expected pattern -- CEU and TSI should have the same things, because they share most of their population history! I will mention that for the X chromosome, CHB and JPT have a similar pattern, they mostly share the same stuff. This helps lend some significance on the finding below that GIH is also pretty different from all these other samples.

    You can see that there is one locus where CEU has more than 100 copies (the little cluster there indicates that this haplotype extends over more than 10 SNPs, in fact it's 13 SNPs with possibly 2-3 flanking SNPs forming a decay pattern on either side; the total length is around 150 kb. There are more than 80 copies in Tuscans, and more than 40 in Gujaratis, but only a single copy in the Chinese sample. Three genes lie in this interval but none point to any obvious hypothesis (to me, at least), about why the Neandertal haplotype would be especially common in western Eurasia. I note it because this is the first Neandertal haplotype I've found with a frequency up over 20 percent or so; this one is about 60 percent in CEU and 50 percent in TSI.

    The Gujarati (GIH) sample adds its own distinct twist. There is some overlap between GIH and CEU, and some overlap between GIH and CHD. But by and large the same pattern obtains as between Europe and China: India has its own Neandertal common variants, not widely shared with either CEU or CHD. For example, here's the CHD comparison; CHD going toward left, GIH toward right. The basic pattern is that most cases are clusted on the edge of the graph, few are scattered across most of the area, and there's no consistent pattern among them. Still, the highest-frequency GIH case is the same as the high-frequency haplotype noted in CEU and TSI above.

    Neandertal haplotypes on chromosome 19 histogram in CHD and GIH samples

    These examples should demonstrate pretty clearly that this is not solely an X chromosome phenomenon; basically we're looking at the effects of drift in small ancient populations after they mixed with Neandertals.

    I did have an excellent question today after my talk where I discussed this pattern -- how do we know that this isn't separate mixture events giving rise to different Neandertal-derived variants in different recent humans?

    That's not a trivial question to answer, and I don't think we could easily rule out the hypothesis in the abstract. But the fact that these populations have very similar fractions of Neandertal contribution overall does suggest a single history of mixing. I'll give this some more consideration as I look across the rest of the genome.

  • SNPtastic India

    Wed, 2009-09-23 14:49 -- John Hawks

    The cover story in Nature this week is a paper about the population history of India, from David Reich's lab. It's an important contribution to our knowledge of human genetic variation, and provides a very interesting set of data for further investigation of modern human origins, the dispersal of agriculture into the subcontinent, and the history of more recent Indian populations.

    Here's the abstract:

    India has been underrepresented in genome-wide surveys of human variation. We analyse 25 diverse groups in India to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the 'Ancestral North Indians' (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, whereas the other, the 'Ancestral South Indians' (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39–71% in most Indian groups, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the indigenous Andaman Islanders are unique in being ASI-related groups without ANI ancestry. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years owing to endogamy. We therefore predict that there will be an excess of recessive diseases in India, which should be possible to screen and map genetically.

    The number of individuals is not huge for the purposes of population genetic analysis -- only 132 people from 25 groups -- but it is very significant in terms of recent samples. By comparison, it is around double the number of effective individuals in any of the HapMap v.1 populations, genotyped at more than 560,000 SNPs.

    The results of the study are basic population genetic issues, including the degree of endogamy, the pattern of regional differentiation, the likelihood of discovering new recessive genetic disorders by additional sampling. Some notes:

    Population mixture. The authors propose that today's groups descend in varying proportions from two ancient (and no longer existing) populations, which they call "ancestral North Indian" and "ancestral South Indian".

    I'm always skeptical of mixture models, especially when the putative source populations no longer exist. There are just too many ways that structured migration or dispersal can lead to the appearance of mixture. People once thought of "Alpines" as a mixture of pure Nordic and Mediterranean elements, after all-- and that was just because their heads were mesocephalic.

    Still, with a half-million SNPs, it's possible to do a better job testing the hypothesis of mixture versus structured migration. The authors in this paper didn't -- they applied a simplified "3 Population Test" that compares the empirical allele frequencies to proportions expected under only two scenarios: simple mixture or complete isolation. It seems to me that the null should be simple isolation by distance, which would give the same result as "mixture" according to their test. If you really want to look for population mixture, you need to involve the dimension of time, for example, by demonstrating the antiquity of haplotypes that have mixed together.

    So I don't accept this ancestral division, certainly not at face value. It does seem plausible that West Asian (and thereby European-related) genes have introgressed into India over time, perhaps in association with the growth of high-density agricultural populations. Maybe some of this gene flow occurred under the influence of positive selection, but processes of elite dominance and differential growth may have been sufficient.

    Regional differences. The results show a greater degree of regional genetic differentiation in India than has been found for continental Europe. Still, with an FST of only 0.01, we're not talking about major population splits here. With that number, the subcontinent is closer to panmixia than one might expect for a region its size. The authors suggest that founder effects explain the regional differentiation:

    We propose that the high FST among Indian groups could be explained if many groups were founded by a few individuals, followed by limited gene flow. This hypothesis predicts that within groups, pairs of individuals will tend to have substantial stretches of the genome in which they share at least one allele at each SNP. We find signals of excess allele sharing in many groups (Supplementary Fig. 2), which as expected tend to occur in the groups that have the highest FST values from all others (P = 0.002 for a correlation). To estimate the age of founder events, we measured the genetic distance scale over which allele-sharing decays, and verified the robustness of our procedure by simulation (Supplementary Fig. 3). Six Indo-European- and Dravidian-speaking groups have evidence of founder events dating to more than 30 generations ago (Supplementary Fig. 2), including the Vysya at more than 100 generations ago (Fig. 2). Strong endogamy must have applied since then (average gene flow less than 1 in 30 per generation) to prevent the genetic signatures of founder events from being erased by gene flow.

    I don't think that explanation works. With those times in generations, we're talking about events within the last 600-2000 years. Since all these calculations are done on the whole dataset assuming complete neutrality, I think we should look more closely at the distribution of LD across loci. It seems likely that some of the high-LD loci that appear to point to founder effects will actually be found to be selected.

    Relationships of Indian to non-Indian populations. One of the real problems of assuming a tree with no migration is that it leads to statements like this:

    [T]he ANI [ancestral North Indian] and CEU [HapMap European sample] form a clade, and further analysis shows that the Adygei, a Caucasian group, are an outgroup (Supplementary Note 4). Many Indian and European groups speak Indo-European languages, whereas the Adygei speak a Northwest Caucasian language. It is tempting to assume that the population ancestral to ANI and CEU spoke 'Proto-Indo-European', which has been reconstructed as ancestral to both Sanskrit and European languages, although we cannot be certain without a date for ANI–ASI mixture.

    Some of the common ancestors of some living Europeans and some Indians were probably speakers of proto-Indo-European speakers. But we can easily refute the hypothesis that all of the common ancestors did so -- some of those common ancestors lived more than 40,000 years ago, as is well-known from the mtDNA chronology. The tree model with complete isolation does not explain the data. So as simple as it is -- and as well-used by Cavalli-Sforza and others -- it would be better to use a more accurate model.

    UPDATE (2009-09-24): Gene Expression has a full review of the paper.

    UPDATE (2009-09-27): Very interesting angle by Suvrat Kher at Reporting on a Revolution:

    The Indian Press has made a hash of the finding....

    But I can't blame the press entirely. The scientists who gave interviews to the press didn't mention this. They wimped out on reporting this potential inflammatory and politically incorrect finding. This is just poor and irresponsible science outreach on part of the scientists. How can you ignore a finding that is staring out at you from the very paper you are talking about? The press may be guilty of not digging in but it was just reporting what the scientists told them.

    References:

    Reich D, Thangaraj K, Patterson N, Price AL, Singh L. 2009. Reconstructing Indian population history. Nature 461:489-494. doi:10.1038/nature08365

  • Spatial variation and near-fixed selected alleles

    Thu, 2009-06-11 14:39 -- John Hawks

    I couple of people have asked me about a new paper in PLoS Genetics by Graham Coop and colleagues, titled, "The role of geography in human adaptation." The paper is open access, and while the details of genetic measures and simulations can be hard to follow, I think it's a great example of the way recent work on selection and human diversity has been structured.

    I'll just expand on a few of the topics in the paper, and discuss how they relate to the previous findings about the number and age of selected variants in human populations.

    Here's the paper's abstract:

    Various observations argue for a role of adaptation in recent human evolution, including results from genome-wide studies and analyses of selection signals at candidate genes. Here, we use genome-wide SNP data from the HapMap and CEPH-Human Genome Diversity Panel samples to study the geographic distributions of putatively selected alleles at a range of geographic scales. We find that the average allele frequency divergence is highly predictive of the most extreme FST values across the whole genome. On a broad scale, the geographic distribution of putatively selected alleles almost invariably conforms to population clusters identified using randomly chosen genetic markers. Given this structure, there are surprisingly few fixed or nearly fixed differences between human populations. Among the nearly fixed differences that do exist, nearly all are due to fixation events that occurred outside of Africa, and most appear in East Asia. These patterns suggest that selection is often weak enough that neutral processes—especially population history, migration, and drift—exert powerful influences over the fate and geographic distribution of selected alleles.

    The paper looks for "nearly fixed" genetic differences between populations, and finds relatively few of them. That's relatively well-known; the FST-based test has been done on fewer populations with similar results (e.g., Williamson et al. 2007; Barreiro et al. 2008). This paper has the HGDP panel, which includes many more populations, and therefore is able to add geographic resolution to these older results. They find that the geographic distribution of near-fixed alleles is clinal; there aren't strong boundaries delimiting the geographic distributions of most apparently selected alleles. That means that the same demographic forces affecting neutral genetic variation have also affected recently selected alleles.

    Is that surprising? As we pointed out in our 2007 paper, the recent demographic history of human populations has included a lot of population growth. This means that the number of adaptive mutations should have increased during the last 10,000--20,000 years. High-FST selected alleles can only reflect selected mutations that are older than this (old enough to reach near fixation in one population), or are extraordinarily strong. A few mutations are exceptionally strong in their selective advantages -- SLC24A5 and lactase persistence seem to be examples. But as long as adaptive mutations are intrinsically rare, very few of them could have occurred in the small populations of 20,000 years ago or earlier, even if many happened in the large populations of the Holocene. So I think the new paper actually reinforces the interpretation of acceleration. The pattern we're seeing today with new mutations just can't be a feature of human evolution before around 20,000 years ago.

    If selection is affected by demographic processes, does that mean that it is "weak"? Clearly, "weak" is a matter of scale. Adaptive genes disperse through a spatially structured population very slowly, even if they confer very large fitness advantages. That means that their dispersal is highly dependent upon demographic conditions, such as the disproportionate growth of some populations or occasional long-distance gene flow. Locally, an allele may rapidly increase under selection, but that effect may have little influence on the evolution of distant populations.

    We see that pattern with genes known to be under strong selection in humans, like the ones that help some people resist malaria. Sickle cell, hemoglobin C and E, alpha- and beta-thalassemia, ovalocytosis, G6PD deficiency all have restricted geographic ranges that parallel the clinal pattern of neutral genes. There is an important difference: the patterns of these genes diverge in areas where malaria risk changes rapidly with geography (like coastal versus inland areas of Mediterranean Europe), and some of them have wide geographic distributions compared to their young haplotype ages (like sickle cell). But even in the latter cases, most are too rare to elevate the FST of surrounding SNP markers. Malaria adaptations are a tremendous example of the way that demographic conditions limit strong selection.

    Africa versus other populations

    Derived alleles are expected to have lower frequencies on average than ancestral alleles. So if a population has a bias toward higher-frequency derived alleles, that may be evidence against neutral evolution. The paper finds that this bias is greater in non-African populations than within Africa:

    The overall genic enrichment is present in all three population comparisons, and each tail seems to be similarly enriched for high- FST genic SNPs. However, the number of derived alleles in each tail does differ substantially and is biased towards derived alleles outside Africa and especially in east Asia. Thus, the statistical evidence for enrichment of events inside Africa is weaker than for the other two populations (we return to this point later).

    In general, populations outside Africa have a genome-wide bias toward higher frequencies of derived alleles. The causes of that bias aren't clear -- ascertainment may account for some of the bias but cannot account for all of it; it's possible that early demographic events may explain some of the bias but the pattern isn't obvious.

    The FST-based tests of neutrality are most powerful when a new allele has swept several rare mutations with it to near-fixation. Rare mutations tend to be derived ones. So the power of the test depends on how many rare mutations there are to start with, and what their frequencies are in other populations that didn't have the same selected allele.

    It's one of many issues that make finding selection in African populations slightly different from elsewhere. I think that Africans have undergone as much, and very possibly more, selection by new adaptive mutations as other populations. But our 2007 work suggested that the modal age of the selection we ascertain in Africa may be older than in other regions. That would be consistent with demographic history, since Late Pleistocene African populations were larger than others. But it's possible that genome-wide features like faster LD decay, higher heterozygosity, and more ancestral versus derived variants may also influence our estimates of the timing and number of selected alleles in Africa.

    Polygenic adaptation

    Toward the end of the paper, the authors discuss the pattern of local adaptation in a more general sense. Why should there be relatively few near-fixed genetic differences between populations, if human ecological changes suggest that local adaptation should have been a powerful force in our recent evolution? One possibility is acceleration -- most of the variants are too recent to have reached near-fixation in any single population.

    But the authors mention another possible influence that we've also been thinking about: epistatic interactions among new variants. For example, lots of skin pigmentation loci are known to have been under recent selection, but only a couple of them have reached near-fixation in any population. The rest are at lower frequencies. Since these alleles all affect the same phenotype, they're subject to diminishing returns. As one lighter-pigment allele becomes common, it reduces the strength of selection on the others. The population doesn't have to fix for any of them; in fact, selection probably cannot drive more than one or two up to fixation since the rest of them compete with each other.

    Over the very long term, this situation would be sorted out. A handful of loci that optimize skin pigmentation might ultimately go to high frequencies or fixation, for some alleles the costs may exceed the benefits and they will disappear. Others, relatively neutral to each other, may fix by drift. But the "very long term" is a span of hundreds of thousands of generations. Here we're talking about a few hundred generations at most. So human populations aren't anywhere near an optimum, they're in a transient where epistatic interactions may be quite important.

    Greg Cochran and I have been discussing this idea for some time. We call it the "Stooge effect". Think of the Three Stooges all trying to run through a door at the same time and getting stuck in the middle. That's what these genes are doing -- all of them are competing to respond to selection, but each is slowed by the presence of the others.

    It's not a new idea -- Frank Livingstone used to talk about this general concept with different malaria adaptations. What's new is the increasing evidence that humans are really in a transient with a lot of genes out of equilibrium. It's very possible that for some phenotypes, standing variation has been an epistatic block on the selection of new mutations. For others, the emergence of some new mutations has limited the trajectory of selection on others.

    Conclusion

    All in all, I think this paper is a nice contribution to our understanding of the pattern and rate of recent positive selection in human populations. Certainly, the HGDP sample will continue to be a very informative addition to our understanding of spatial dynamics in ancient humans. The addition of the new HapMap v.3 samples may be even more important, because these represent further regions with roughly the same discovery power as the initial three HapMap samples. And of course, we have the 1000 Genomes sample coming up, adding significant potential for discovering rarer selected variants.

    References:

    Coop G, Pickrell JK, Novembre J, Kudaravalli S, Li J, et al. 2009. The Role of Geography in Human Adaptation. PLoS Genet 5(6): e1000500. doi:10.1371/journal.pgen.1000500

  • Mutual information between strings of loci

    Sat, 2009-03-28 21:26 -- John Hawks

    Fourth in a series on mutual information and genetic linkage. If you’re happening upon it for the first time, you can find the entire series or the first post, “Information theory: a short introduction”.

    After the last post, you might wonder what the big deal is about these information theoretic measures of linkage. After all, we’ve got lots of other measures of linkage to choose in population genetics, with many years of theory behind them. The basic conclusion about genetic drift was that it adds mutual information to samples over short regions, but that recombination over longer areas washes it out. If the net effect is no linkage, why would we bother to come up with some non-standard linkage measure?

    One answer: If the existing linkage measures were so great for testing neutrality, then we might expect some of the recent genome-wide selection scans to have used them. But they didn’t – instead we have several partially incompatible methods, all of which eschew the usual measures of linkage.

    I’m not going to summarize all the reasons for this, at least not yet. But there is one thing that the current positive selection tests have in common: they all attempt to quantify the decay of linkage across long strings of loci. Each of them—LRH, iHS, LDD—distills into a single number the pattern of reduction of linkage across physical distance.

    After the last post, that may seem odd. If we know the physical distance (or more to the point, the map distance), we can predict the amount of linkage under drift. That really ought to be enough to test neutrality. Yet the existing tests insist on a second dimension: the rate of decline of linkage. Partly, that’s because these tests try to separate positive selection from other things that violate neutrality, like inversions. But it also reflects a real limitation of the methods: they ignore some of the information available in the data, and thereby limit their power to test the hypothesis of genetic drift based on linkage. So they must turn to the rate of linkage decay as a test of drift.

    The trouble with homozygosity

    Massive genotyping projects are not sequencing projects. They provide long lists of genotypes, not sequences. If you’ve got a long list of genotypes, one of the measures of variation immediately available to you is homozygosity. Homozygosity among multiple sites, or joint homozygosity is not mathematically the same as the mutual information between sites, but it also is increased by linkage. To the extent that genetic drift predicts the distribution of long-range linkage, you can test neutrality by looking at the joint homozygosity of two or more sites. For example, the extended haplotype heterozygosity (EHH) is the probability that a pair of gametes that share a single core haplotype will also share identical haplotypes at longer distances. Homozygosity itself is not a test of linkage—for that we have to combine the probabilities of identity of all alleles in some way.

    Homozygosity-based tests of neutrality set up this combination as a ratio: the ratio of homozygosity for one allele or core haplotype versus the homozygosity of other alleles or haplotypes. That procedure throws away a lot of information carried by the fraction of haplotypes that are nonidentical at distance. Since that fraction of non-homozygote mutual information, unlike homozygosity, actually increases with distance, it may in some circumstances provide a more powerful test of neutrality. The use of a ratio of homozygosity also reduces the chance to detect certain classes of non-neutral loci—most obviously, the ones that have two or more selected alleles.

    To put some numbers to these ideas, let’s consider we have two loci, A and B. The alleles of A are a1 and a2, B has b1 and b2. Each of these alleles has an associated frequency, for example p(a1), and the frequency of the haplotype a1b1 will be written p(a1b1). If A and B are independent (unlinked) then the expected value of p(a1b1) would simply be p(a1)p(b1).

    The usual measure of linkage, D, is calculated as

    D = p(a b)p(a b) - p(a b )p(a b )        1 1   2 2     1 2   2 1
    (1)

    It thus involves the frequencies of all four possible haplotypes. At linkage equilibrium, the two terms are expected to be the same, so that D = 0.

    Several other measures of linkage are in common use (and sometimes lead to confusion). There are several reasons why you might want a different measure, and one of the biggest is that D depends on the frequencies of the alleles a1 and b1—low-frequency polymorphisms will have systematically lower linkage values for the same evolutionary scenario. A very good thing about D is that it does not require us to know the frequencies of the individual alleles a1 or b1, only the frequencies of their joint haplotypes. But the expectation that the products in the equation are equal depends on a 2 × 2 contingency table. Three-, four- or more-locus haplotypes demand some other measure.

    Using the same terminology, the homozygosity of the haplotype a1b1 is given as p(a1b1)2. By itself, this tells us nothing at all about linkage. If we combine it with the homozygosity of one or the other allele, (for example with the difference p(a1)2- p(a1b1)2, we will have a measure of the reduction in homozygosity with distance. That reduction in homozygosity still doesn’t tell us about linkage, without considering p(b1). But if we scale the reduction by taking it as a ratio with the reduction in homozygosity of one or more other haplotypes, those other haplotypes give us indirect information about p(b1). So we have an indirect measure of linkage, based on the relative rate of linkage decay.

    Relying on the rate of decay of linkage isn’t such a bad idea if we don’t know the rate of recombination between markers. In that case, we can use the rate of decay of homozygosity of other haplotypes as a scaling factor. But there is a problem lurking: The decay of homozygosity around a marker or core haplotype depends on its frequency. Under drift, higher-frequency haplotypes decay over shorter distances. So our test will be biased in a way that depends on the frequency spectrum of haplotypes.

    It seems much more direct to estimate the mutual information between A and B. We don’t need to use an indirect linkage measure when we have all the frequencies needed to calculate a direct measure. Mutual information is a useful measure because it extends easily to many loci. But it has a disadvantage: it requires us to obtain an independent estimate of the recombination fraction between our markers.

    Maybe that’s not such a problem. High-resolution recombination maps are available for the HapMap markers. If we use those maps, then we can test neutrality with markers at fixed map distances instead of physical distances. That ought to let us easily quantify the genome-wide fraction of mutual information that originated in a given time interval in the past. But we’ll have to remain cautious, since the map distances are worked out under the assumption of neutrality.

    Why multilocus haplotypes are useful

    Why do we care about many loci, when we can test neutrality using only one? Suppose we’re interested in the mutual information across a 500 kb stretch of the genome. We could pick a single marker on each end of the 500 kb, and measure the mutual information between them. That’s basically what I did in the case of drift in the last post, “When genetic drift reduces entropy”.

    That isn’t slicing the sample very finely. Suppose that our two markers both have major alleles at 80 percent. That makes the expected frequency of the most common joint haplotype 64 percent, and the least common 4 percent. In our sample of 120 gametes, we’ll conclude that the two are significantly associated if those values go to around 70 and 10 percent, respectively. That’s like salting the sample with seven or eight copies of the rare haplotype.

    On the other hand, we could salt our sample with more than a dozen copies, evenly split between the two intermediate-frequency haplotypes, and never get a significant result. Even if the expected rare haplotype were completely absent, it wouldn’t be enough to reject the hypothesis of random association, much less the hypothesis of genetic drift. Natural selection can cover its tracks, if two or more linked haplotypes have both increased in frequency.

    Is this likely to be a problem very often? Past some distance, positive selection loses its impact on linked haplotypes, because there’s too much recombination. Near a selected site, linkage is strong enough that there will usually be a single major haplotype hitchhiking upward. In between, there may be minor haplotypes hitchhiking more weakly but we will generally be able to look closer to find a major haplotype. But if there are multiple haplotypes under selection, perhaps because more than one adaptive mutation have occurred, the locus may well look completely neutral.

    An example in human populations may be MC1R. Harding et al. (2000) sequenced the gene in 106 Africans and over 356 Europeans. They found a diverse array of amino acid mutations in Europeans that were absent in their African sample, concluding that purifying selection on skin pigmentation had prevented such mutations from becoming common in Africa. In contrast, purifying selection was relaxed in Europe, allowing several loss-of-function variants to increase in frequency. They found no statistical evidence for positive selection on these loss-of-function variants in Europe. Likewise, no other recent tests for positive selection have shown MC1R to stand out as selected.

    But wait a minute. Where did all those European redheads come from?

    Harding and colleagues found five different coding polymorphisms with frequencies more than 7 percent in Europe, but completely absent in their sample of Africans. That adds up to roughly half the copies of MC1R in Europe, all derived alleles. That kind of emergence of new alleles doesn’t happen by chance, at least not in a population history like Europe’s. But none of the usual tests will reject neutrality when five different alleles have increased under selection. And individually, each of these alleles is too rare to register a rejection of neutrality, at least, if we apply the usual tests.

    What makes this gene so troublesome? There’s no problem estimating the frequency of any single derived allele in Europe, and it’s easy to figure out its extended homozygosity. But if we then compare that to other alleles at the same locus, well, some of them have long haplotypes, too. So none of them have an unusual pattern of LD decay compared to the others.

    What to do? We could try to get a much larger genetic sample, or a large sample of individuals from a part of Europe where one of these alleles is especially common.

    But since I’m lazy, I figure I’ll just consider the mutual information carried by haplotypes of multiple loci. Individually, no single allele may have a frequency high enough to reliably show linkage in a two-locus comparison. Individually, no single allele will reject neutrality by a homozygosity-based test, because the other selected haplotypes decay over very long areas, making the ratio for any single haplotype insignificant. Collectively, the selected haplotypes may carry linkage over long distances. Together they will reject neutrality, even if individually they don’t.

    But first, we need to know the distribution of multilocus mutual information under neutrality.

    Degrees of freedom

    If you can calculate a χ2 and its degrees of freedom, you can skip next four paragraphs.

    In the second post I showed that the mutual information is distributed as a χ2. That’s a statistical distribution with one parameter—the degrees of freedom. The concept of degrees of freedom is one of the trickiest definitions in statistics. Roughly, it means the number of independent factors that contribute to a single observation.

    Let’s take the case of a contingency table, where we have a number of rows, r, and a number of columns, c, where the total number of cells is r ×c. If all the cells were independently distributed, then the number in each cell is expected to be the fraction in its row times the fraction in its column. That’s the expected value. If you take the number you observed in a cell, subtract the number you expected, square that difference, divide it by the expected value, and add up the result for every cell in the table, that’s the χ2 statistic. This statistic is high when the observed numbers are far from the expected ones. It’s low when they are close to expected.

    For a 2 × 2 table, we can ask, how many ways are there for the χ2 statistic to be high? Any one of the cells may have an unusually large number of observations, so we might be tempted to say there are four different ways to get a high value. But it’s obvious that if one value is too high, the value in the same row and the other column must be too low. Likewise, the observed number in the other row and the same column must be too low. And if both those diagonal cells are too low, well then the observed value in the cell diagonal to our high cell must also be high. In other words, all four cells depend on each other. Push one, and you push them all. So there’s really only one axis along which the cells can vary—one degree of freedom.

    With a 3 × 2 table, things are a little different. We can hold one row entirely constant, and just change the proportions in the other two. If one cell has a high observed value, that might be because both of the other rows were shortchanged. Or it might be just a single row that’s low, the other being high as well. That makes two distinct ways that we could get a high χ2 statistic, two degrees of freedom. In general, there are (r - 1)(c - 1) degrees of freedom in a contingency table. It’s always one less than the number of rows or columns because at least one row and column must be shortchanged when a cell is higher than expected.

    Degrees of freedom with comparisons of multiple haplotypes

    From the previous post, you might have anticipated how we could do simulations of drift in samples of multiple markers. For the following, I’m going to examine the mutual information between two sets of three SNP markers. That is, I have a stretch of chromosome roughly 50 kb long from end to end (or more properly, 0.05 cM), with three markers tightly linked at either end. Using the same methods as the previous post, I’m applying the same three-stage population history with genetic drift only to these six markers.

    Here are the results for 10,000 trials:

    PIC

    That doesn’t look very much like the chart that I got in the previous post, using the same population history. Here is that one:

    PIC

    What’s going on? That red line in the second chart is the χ2 distribution with one degree of freedom. The first chart doesn’t match that at all.

    Why not? The second chart shows the mutual information between two markers, with two alleles each. That’s one degree of freedom. The first chart shows the mutual information between two sets of three markers. Three markers give the possibility of eight distinct haplotypes (23). So we have an 8 × 8 contingency table, with 49 degrees of freedom. Here is that distribution, in red, along with the result of our 10,000 simulated samples:

    PIC

    Except, oops, it doesn’t look like 49 degrees of freedom either! What the heck is going on?

    If you’ll remember from the second post in the series, our empirical distribution isn’t going to fit a chi-square unless we have enough observations. Sure, in theory, three markers could give us 8 haplotypes. But in practice, the three markers on each end of our 50 kb region are so tightly linked that we get many fewer haplotypes. Sometimes we get only three or four haplotypes on each end. If each set of three markers were independent of the other, they might fit a chi-square with four degrees of freedom, or nine, or six or twelve. In fact, the mutual information observed in these 10,000 simulations accords with a mixture of distributions. Here are the number of degrees of freedom in the 10,000 simulations, as a histogram:

    PIC

    Most of the trials have six or nine degrees of freedom, with 12 being a substantial proportion as well. No primes, there, except for 2 and 3. Only a tiny fraction of trials have as many as 20 degrees of freedom between the 3-marker sets. None get anywhere close to the theoretical 49, because the three-marker sets are too tightly linked to express all possible haplotypes.

    So if we want a theoretical distribution to match to our simulated one, we are going to need a mixture of different χ2 in proportion to the number represented in the simulated set. Here’s that comparison:

    PIC

    The red line is a mixed distribution, in which different χ2 distributions are blended in the proportions represented by the histogram above. That’s the distribution of mutual information between the two marker sets, conditioned on the haplotypes actually present in each set, under the hypothesis of independence. The simulations lie to the right of this distribution, meaning that small-sample bias and genetic drift have added mutual information to the simulations, compared to the expectation in the absence of linkage.

    The only difference between these simulations and the ones in the previous post is the number of markers. Here, we’re looking at mutual information between two three-marker sets; there, we were looking at mutual information between two one-marker loci. Both examples used the same distance and the same population history. Genetic drift has had a similar effect—in both sets of trials, there is a slight excess of cases with high mutual information. With both, there are around twice as many at the tail as expected if the markers were independent.

    This leads to a couple of conclusions:

    1. Under this population history, genetic drift is not sufficient to cause very large amounts of mutual information to be shared by distant sites.
    2. A slight alteration to the χ2 test—for example, adding some proportion to the critical values—might allow us to test the hypothesis of neutrality for any given case without the necessity of all those simulations.

    If we were to add more and more markers, we would eventually find that the bias in mutual information due to small sample size would get bigger. So there’s some maximum number of markers giving useful information in any given sample. But I just picked 3 markers out of a hat. Adding more markers, up to some point, should increase our ability to see rare haplotypes that might diverge from the expectations of genetic drift.

    Next: Conditional mutual information: finding the haplotypes that explain linkage disequilibrium

    References


       Harding RM, et al. 2000. Evidence for variable selective pressures at MC1R. Am J Hum Genet 66:1351–1361.

  • Ceci n'est pas un pothole

    Thu, 2009-03-26 10:54 -- John Hawks

    In 2005 I wrote this:

    "Unusual compared to the rest of the genome" is a phrase you should expect to hear a lot of in the next few years.

    I was looking back at that old post today, as I'm writing new stuff about bottlenecks. It's about the ability to detect selection using the HapMap data -- written just as I was starting to think about recent selection:

    Suppose we wanted to use a detailed topographic survey of a road to find the potholes. But for everyday roads, there is a problem -- there are lots of bumps and grooves that aren't potholes. And different parts of the road are more or less bumpy. It would help a lot if we could use the empirical distribution of bumps to simulate a section of road -- then we could figure out whether anomalies in the real road were likely to be potholes or not.

    Now suppose that the road isn't just pocked with the occasional pothole -- it has a pothole every three or four feet. Remember why we're using simulations -- not only do we not know where the potholes are, we don't know how common they are. So our simulations based on the pothole-rich road will find that pothole-sized bumps are normal. If pothole-sized bumps are not unusual, then our simulation can have only one result: a pothole is not a pothole.

    So I've been writing about the same problem for over three years -- the problem of ignoring history and archaeology when applying models of population history, and how they skew simulations of genetic drift. Time to do something about it, I guess.

  • When genetic drift reduces entropy

    Sun, 2009-03-15 20:09 -- John Hawks

    This is the third in a series on information theory and tests for recent selection. The first post, “Information theory: a short introduction”, covered some of the basics of entropy. The second post, “Information theory and mutual information between genetic loci”, showed that mutual information between independent sites will be distributed as a χ2.

    We tend to think of genetic drift as a random process. Random processes operating repeatedly over time are called “stochastic,” and changes in gene frequency under genetic drift are certainly that.

    Since entropy is a measure of uncertainty, it might seem natural to think that stochastic changes in gene frequency would increase the entropy in a population. After all, the gene frequency in a population under genetic drift will be more and more uncertain over time. So, considering the frequency of a single allele as the system, genetic drift appears to increase entropy over time.

    But even this simple system isn’t quite so simple as it might appear. Sure if you start out knowing the allele frequency, then genetic drift will increase your uncertainty over time. You will become less and less able to say that it lies in any given interval. But what if you don’t start out knowing? What if all you know is that the locus has been subjected to t generations of genetic drift?

    As t increases, the probability of fixation of the locus also increases. The net effect is to reduce the entropy in the system – going from uncertainty about the allele frequency to more and more certainty that it will be either one or zero. The only thing that will stop this process is some other evolutionary force – mutation, migration from other populations, balancing selection. Each of these will have its own distinctive effects on the entropy of the single-locus system.

    Drift and recombination

    We are going to consider a system of two partially linked loci. That means that the two are near enough to each other on a single chromosome that they are usually inherited together, but that a small fraction of individuals will have recombination between the two loci in each generation.

    If the two loci were unlinked, the evolution of allele frequencies at one would have no effect on the other. The mutual information between the two loci would accord with the description in the last post in the series. As described there, the distribution of mutual information estimates taken from different samples of such populations would approximate a chi-square distribution.

    But our two loci will be linked. Genetic drift tends to affect the allele frequencies at the two loci in the same way. Haplotypes consisting of one allele from each locus will increase or decrease under drift. Over time, haplotype frequencies will tend to diverge, some becoming rare, others common. Because the alleles of the two loci are evolving in tandem, the joint entropy of the two-locus system will be lower than expected from the sum of the individual entropies of the two loci. Mutual information will be greater than expected under the hypothesis of independence.

    How much? That depends on the amount of recombination between the two loci and the power of genetic drift. Recombination introduces random noise, by breaking up pairs of alleles and shuffling them. That tends to increase the joint entropy of the two-allele system. The more recombination, the less we can predict the allelic state of one locus from what we observe at the other.

    Genetic drift is more powerful in small populations; less so in large populations. It can also be enhanced by population structure, or diminished by migration.

    So in principle, the reduction in joint entropy from genetic drift opposes the increase in joint entropy from recombination. As the physical distance between the two loci increases, the power of recombination will increase. As the population size increases, the power of genetic drift will decrease. That means we need to know something about local recombination patterns, and something about ancient human demography to get a good estimate of the effects of these processes on human genes.

    But given some knowledge about these things, we should be able to get a pretty accurate idea of the distribution of mutual information that would result from drift alone in ancient human populations.

    Simulating some samples

    These days, the person who wants to investigate the effects of genetic drift on samples of genes has many options. Several well-documented computer programs allow simulations of genetic drift allowing the user to choose parameters like effective population size, recombination rate, marker type and density. Some of these programs use a forward-time approach – that is, starting with a randomized sample and propagating it forward through time. Others use a coalescent approach, which “starts” with a sample of genes taken from the present and propagates their ancestry backward in time. There’s a good diversity to choose from, unlike the old days (10 years ago) when you more or less had to write your own.

    For this series, I’m going to use a package called CoaSim. The package was written by Thomas Mailund and colleagues, and is available under a free software license, including documentation. This software employs a coalescent approach to generate an “ancestral recombination graph” (ARG) among a sample of genes, under a specified model of demographic history. The ARG is a tree of genealogical relationships among the sampled genes, factoring in a history of recombination between them. From the ARG, the software can add markers (such as SNPs or microsatellites), resulting in simulated datasets of sequences that could have evolved under the specified demographic history.

    Similar functionality is available in some other packages. I’m using CoaSim in this case because it includes Python bindings. I use Python for a number of other analytical methods, so this cuts my development time. That’s a faster path from investigation to results.

    So, let’s consider some samples that might result from a long history of genetic drift in a single population. First, I’ll assume that the effective size of my population is 10,000 diploid individuals. I’m going to simulate 10000 samples from populations matching this description. Each sample consists of two SNP markers drawn from 120 gametes – in other words, two copies from each of 60 individuals, with current sample frequencies random between 0.1 and 0.9. If you’ll recall from the last post in the series, we need a decent number of individuals representing each bin for our mutual information estimate to be reasonable. Excluding rare SNPs helps to make sure we have that.

    The mutual information between these SNPs will depend on the rate of recombination between them. To begin with, I’ll assume that the two markers are 50 kilobases apart, and that the rate of recombination is 1cM/Mb, or around 10-8 per base.

    PIC

    The red line represents a chi-square distribution with 1 degree of freedom. That would be the expectation if these two loci were independent, and the histogram is clearly biased to the right compared to that expectation. That means that the mutual information between two linked loci is higher, on average, than that between two unlinked loci. This is a simulation under drift alone, not an empirical distribution. Drift reduces the joint entropy of two linked loci more than it reduces the individual entropy of each locus.

    How much mutual information are we talking about? From the last post, we know that twice the sample mutual information, measured in nats, is approximated by a chi-square. Here on the high end, we have a value of 50, which corresponds to 25 nats per sample. In our sample of 120 gametes, that is around 0.2 nats per gamete, or around 0.3 bits per gamete. That’s not a remarkable amount – it’s about the entropy of a single biallelic locus with allele frequencies of 0.95 and 0.05.

    Now, what happens if we keep recombination the same, but reduce the strength of genetic drift? The following chart shows what happens if I assume an effective size of 100,000, with the rest of the parameters kept the same – 50 kb, 120 chromosomes, frequencies between 0.1 and 0.9.

    PIC

    Here, the histogram is much closer to the expected chi-square distribution under independence. There’s still a slight bias upward, and out of 10,000 samples, a few that have rather high mutual information values. But the effect of genetic drift in this scenario is very slight compared to the case where N = 10000.

    As it turns out, the effect of a tenfold larger population is precisely the same as if I kept population size the same and moved the loci ten times further apart. The mutual information scales with the product of recombination and effective population size. So two loci 500 kb apart in an effective population of 10,000 would on expectation have the same mutual information as two loci 50 kb apart in an effective population of 100,000. Another way of putting it: By reducing the coalescent probability tenfold, recombination has ten times the opportunity to scramble associations between the two loci.

    When population size expands

    Human populations have grown by more than a thousand-fold during the last 40,000 years. Compared to the population of 40,000 years ago, today’s human population should have markedly less mutual information between linked genetic loci. If we want to find out how much, we’ll have to have a model of the recent human demography.

    Here, I’ve done some simulations with changing population size. First, let’s assume that we have two loci 50 kb apart. I’m going to start with a very simplistic model of human demographic history. This model has three stages, or “epochs:”

    1. For the last 400 generations, the effective size is 1,000,000 individuals.
    2. Before that, up to 800 generations ago, the effective size is 100,000 individuals.
    3. Before 800 generations ago, the effective size is 10,000 individuals.

    This is both simplistic and conservative. It’s basically assigning a size of a million to the post-agriculture population, and 100,000 to the population after the Last Glacial Maximum. We know that those early values are way too small for the African population, which did not markedly contract during the LGM and has had an effective size up over 30,000 for at least the last 100,000 years. And obviously, the effective population for the past 10,000 years has been much higher than a million. The European population may have had more of a bottleneck at the LGM, but today’s Europeans have substantial ancestry in West Asia, so 10,000 is conservative there as well. The point is not that these values are realistic; it is that they are almost certainly too small.

    I’ve picked the simple scenario to make it a little more transparent what is going on with the results. Here they are:

    PIC

    With this population history, two loci 50 kb apart have nearly the same distribution of mutual information as two independent loci. There’s a slight bias toward higher values, but very little. Out at the tail, it looks like there are around twice as many loci as predicted by the χ2 approximation, and this is still a very small number.

    I thought it might be useful to consider what the mutual information would be under the same population history for two loci that are closer together. Here’s the same scenario, except the loci are 5 kb apart:

    PIC

    With the markers ten times closer here, there is only a tenth as much recombination between them. That makes the result a lot more like the first case above, where the markers were 50 kb apart and the effective size was 10,000 back to infinity. Markers 5 kb apart should have a good chance of sharing significant mutual information under drift alone in a human-like population history. However, as in the first case above, the mutual information generated by drift is still fairly slight.

    Growth and recombination

    Let’s step back and consider what’s going on with these examples. Mutual information is determined by the balance between recombination and coalescence. Increase the effective size, and you reduce the probability of coalescence between lineages. That gives recombination more of a chance to scramble the loci, eliminating mutual information between them.

    A human-like population history has a very large effective size within the last several thousand years. The large effective size reduces the probability of coalescence down to almost zero across that time period. So the ancestry of almost every gene copy has more than 400—and often up to 1000—generations all to itself. If recombination is ticking along at 1cM/Mb, then physical distances out to around 50 kb start to saturate for recombinants across the last 20,000 years.

    Consider: two lineages that coalesce 400 generations in the past are separated by 800 generations of time. The two lineages were initially identical. Our two markers are 50 kb apart, given them a per-generation probability of recombination of 0.0005. The probability that neither undergoes recombination during those 400 generations is:

    (1- 0.0005)800 = 0.67

    Two thirds remain the same; one third of pairs will have recombined with other lineages. Half of pairs that coalesce 700 generations ago will have recombined. In contrast, after 700 generations, only seven percent of pairs separated by 5 kb will have recombined. Pairs that remain identical carry mutual information; each pair that recombines loses this mutual information.

    Under drift alone, the mutual information builds slowly. From the first chart we know that the mutual information at this physical distance in a small population will be significant, but not very high. Most of this information is still retained in the samples represented by the last chart, where they are separated by only 5 kb. But over longer distance this mutual information dissipates rapidly in a growing population.

    So if we observe substantial mutual information between loci in today’s populations, we need some explanation other than genetic drift. Selection is one possible explanation—but to consider that more closely, we will want to consider other information theoretic features of gene samples.

    Next: Extending to strings of loci

  • Ascertainment into the future, what worth datasets?

    Wed, 2009-03-11 12:29 -- John Hawks

    A reader helpfully pointed me to a new paper in PNAS that looks at the sampling scheme of the 1000 Genomes Project from the point of view of SNP discovery. They frame the question as "How many variants are yet to be found?"

    Here's part of the abstract:

    Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (∼350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (∼150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants.

    Well, if the main goal of the 1000 Genomes project is better chip design for the future, then this question -- what fraction of rare variants will be ascertained in the sample -- is the most pertinent.

    However, I for one am looking forward to the larger sample for a different reason. It should allow us to test for recent selection on lower-frequency variants. It may also help to localize the sites under selection. For that, I'm not all that interested in finding the one-percent alleles, I'm more interested in having a sufficient number of copies of them to test hypotheses about change over time.

    At the end of the abstract, there is this interesting sentence:

    Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.

    I can't get the PDF today, so I'll be waiting to find out what this actually means. That is, I have no idea what they mean by "environmental response" genes.

    References:

    Ionita-Laza I, Lange C, Laird NM. 2009. Estimating the number of unseen variants in the human genome. Proc Nat Acad Sci (early online) doi:10.1073/pnas.0807815106

  • Information theory and mutual information between genetic loci

    Fri, 2008-10-10 22:52 -- John Hawks

    This is the second in a series on information theory and tests for recent selection. The first entry, "Information theory: a short introduction" reviewed the basic concepts of information measures and their background.

    The International HapMap is a massive project to determine the genotypes for up to 3 million single nucleotide polymorphisms (SNPs) in samples of people from 11 population samples around the world. The current data release (Phase 3) includes genotypes for a subset of over 1.5 million SNPs in 1,115 people. The 11 population samples include people of African ancestry from the US Southwest, Utah residents of Northern and Western European ancestry, Han Chinese from Beijing, people of Chinese ancestry from Denver, people in the Houston Gujarati Indian community, Japanese people from Tokyo, Luhya and Maasai people from Kenya, people of Mexican ancestry from Los Angeles, Italians in Tuscany, and Yoruba from Ibadan, Nigeria.

    As impressive as this effort is, we may wonder why exactly SNP genotyping of so many people is a valuable enterprise in itself. The project’s homepage includes this short statement:

    The goal of the International HapMap Project is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared. By making this information freely available, the Project will help biomedical researchers find genes involved in disease and responses to therapeutic drugs.

    There are theoretical and practical objections to this simple explanation (as I discussed here last month). However, what no one involved with the project seems to have expected is the extent to which the data would demonstrate the importance of recent adaptive evolution in human populations.

    Here, I am describing some of the ways that we can test hypotheses about natural selection by using the SNP genotypes from the HapMap. This is a theory-centric description, with some digression into practical aspects of handling the genotype data. First, I consider how we might use information theoretic concepts to test the hypothesis of independence between two genetic loci.


    Entropy and genotypes

    The data from the HapMap consist of an array of biallelic genotypes for each population. The size of this array is different for each population; we may consider it to have m rows, each row corresponding to a single SNP locus, and n columns, each corresponding to a person. The entries are genotypes: AA, AT, CG, and so on. Each SNP is biallelic, so it hardly matters what we call the two alleles—we can arbitrarily label them a and b. Thus, the three possible genotypes may be labeled aa, ab and bb.

    Of course, from a sample of genotypes, we can readily estimate the frequencies of the two alleles in the population. The sample allele frequency ˆp (a) = ˆp (aa) + 12ˆp (ab), where the estimate ˆp (aa) is the number of aa individuals in the sample divided by the sample size n. It is worth expressing some statistical caution about estimates. Although the HapMap SNPs are biased toward common alleles, nevertheless some of them are rare indeed. And as we stretch down the chromosome to consider multilocus haplotypes, many may be vanishingly rare or even singular in the population, even though they happen to occur in the sample.

    Now, how shall we estimate the entropy of a single locus? It seems there are several ways we might look at the question.

    1. Allelic entropy We might use the entire sample of genotypes at the locus to estimate the allele frequencies. In that case, the entropy of the SNP locus would be estimated by
      Hˆ(SNP ) = - [ˆp(a)log ˆp(a)+ ˆp(b)log ˆp(b)]
      (1)

      For a SNP locus with empirical genotype frequencies p(aa) = 0.16, p(ab) = 0.48 and p(bb) = 0.36, this estimate of allelic entropy would be 0.97 bits.

    2. Genotypic entropy Or, we might consider the genotype frequencies as the essential elements of the system. In this case, the entropy would be estimated by
      Hˆ(SNP) = - [ˆp(aa) log ˆp(aa)+ ˆp(ab)logpˆ(ab)+ ˆp(bb)log ˆp(bb)]
      (2)

      For the same genotype frequencies listed above, this genotypic entropy would be estimated as 1.46 bits.

    3. Sample entropy Or, we might consider the sample of genotypes as a series of n repeated trials. That would make the sample entropy in the example 1.46n bits. We might also calculate a sample entropy considering the gametes instead of the genotypes—but this would work out to be the same, as long as we don’t know which gametes came from which parents of heterozygotes.

    To understand this last point, imagine that the sample is a deck of shuffled cards. Each card may be red with probability p(a) or black with probability p(b). If we drew cards one at a time, we could record a sequence (red, red, black,…) of 2n cards. Each card drawn thereby has an exact position in the sequence. If p(a) = 0.4 and p(b) = 0.6 as above, then this entropy will be 1.94n bits. Now, imagine instead that we draw the cards two at a time, and record only these pairs (homozygote red, homozygote black, heterozygote,…). We will have a sequence half as long, and this sequence does not include the exact position of the red and black cards, only their position as part of a pair. The entropy of this sequence of n genotypes is only 1.46n bits. This gives some insight into the nature of the sample entropy as defined above: It is the entropy of a sequence of genotypes.

    By contrast, the allelic entropy is the uncertainty associated with drawing a single copy of the SNP at random from the sample. The genotypic entropy is the uncertainty associated with drawing a genotype (two SNP copies) from the sample. Again, these are empirical estimates derived from the sample; they represent the underlying population just to the extent the sample does.

    Which of these estimates will be useful to us? The sample entropy tells us how many bits of hard drive space we need to store the HapMap data under an efficient coding scheme. That’s interesting, but not necessarily what we have in mind. The other two are sample estimates of an underlying population characteristic—either allele or genotype frequencies. In what follows, we will be interested in quantifying how much our uncertainty about one individual’s SNP genotype is reduced by knowing that same individual’s genotype for some other SNP. It would seem that the appropriate measure for this problem will be the genotypic entropy.

    Mutual information between SNPs

    The joint entropy represented by two SNP loci may be less than the sum of their individual entropies. That is to say, there may be mutual information between the two loci. Mutual information can arise between SNPs for several reasons. Most obviously, if the sample of individuals actually consists of members of two distinct populations with different allele frequencies, then two distinct SNPs may have mutual information because of this hidden population structure. This source of mutual information is exploited by admixture mapping—a process that involves taking people with recent ancestry from two or more populations and finding genomic regions that are linked by virtue of the fact that little recombination has yet had time to reshuffle haplotypes that may be locally common but globally rare. Or two loci may share mutual information because they are physically linked.

    As an example of mutual information estimation, consider two sets of genotypes (the rows of the following table):

    A = 0 1 0 2 0 1 1 0 1 0 1 0 2 1 0 0 1 0 1 0
    B = 2 2 2 0 1 1 2 2 0 2 1 1 0 1 2 1 2 2 1 2

    Each SNP locus has three genotypes; twenty people are represented in the sample, each column represents the genotypes of a single individual. The empirical estimate of genotypic entropy from the first SNP locus (Ĥ(A)), based on 10 zeros, 8 ones and 2 twos, is 1.36 bits per individual. The same estimate for the second SNP locus (Ĥ(B)), based on 3 zeros, 7 ones, and 10 twos, is 1.44 bits per individual. The sum of the individual entropies for the two SNP loci is 2.8 bits. But the joint entropy of the two loci (Ĥ(A,B)), based on three (0,1), seven (0,2), one (1,0), four (1,1), three (1,2), and two (2,0) joint genotypes, is only 2.36 bits. Hence, the two SNP loci share an estimated 0.44 bits of mutual information (e.g., Î(A;B) = 0.44 bits).

    What does this mean? Consider the following contingency table, based on the genotypes above:

    0 1 2
    0 3 7
    1 1 4 3
    2 2

    None of the sampled individuals have the joint genotype (0,0) and none have (2,1) or (2,2). But even more important, a full seven have (0,2) even though only 2.5 would be expected if the two SNPs were independent.

    The contingency table is a clue (for the statistically-minded). The mutual information is telling us something like Pearson’s χ2 test of association. If we know the genotype for SNP A, we can do better than chance predicting the genotype for SNP B.

    Significance testing for mutual information

    The comparison with Pearson’s χ2 test raises another obvious question: How do we test whether a given amount of mutual information is statistically significant? In the example above, we have estimated 0.44 bits of mutual information between SNP A and SNP B. Is this a lot? How much mutual information should we expect between two random samples of data?

    Let’s take an even simpler contingency table — the relation between two coin flips. Each flip may have a 50-50 chance of producing “heads” or “tails”. On average, we expect to see one fourth (H, H), one fourth (T, T), one fourth (H, T) and one fourth (T, H) in our results. But if we perform this experiment (2 coin flips) a finite number of times, we will almost always see some slight divergence from these proportions. Ten sets of coin flips absolutely can’t give us one fourth of each result, because ten doesn’t divide by four evenly. On top of that problem, we might get six or seven (H, H) by chance, even if we only expect 2.5. Any of these cases will give us a positive non-zero estimate of mutual information, even if there is no causal connection between the coin flips.

    Another way of stating this observation is that the estimate of mutual information, Î(A;B) is biased. We should expect the estimate from a small sample to be larger than the true mutual information in the population at large.

    The analogy between mutual information and the χ2 test is more apt than it might appear. In fact, over many trials, the distribution of sample estimates of mutual information will approximate a χ2 distribution, multiplied by a factor 2nlog 2, where n is the number of cases in the sample. Our sample of genotypes of A and B above has 20 individuals and 3 possible genotypes, so 40(log 2)Î(A;B) should be distributed as a χ2 with 4 degrees of freedom.

    Reflecting on the three ways we might estimate the entropy of a locus, above, this is twice the mutual information calculated from the sample entropy, measured in nats instead of bits. Even though we are interested in estimating the population characteristic, the genotypic entropy, the significance of our estimate can only be evaluated by considering the entropy of the sample.

    And indeed, we can perform a randomization of the two sets of genotypes and show the correspondence. Here, I’ve done ten thousand permutations of the genotypes in A with respect to those in B:

    Simulation outcome

    The bars are the histogram representing the 10,000 permuted samples, the curve is the χ2 density function with 4 degrees of freedom. You can see that the permutations show significant clumpiness. With only 20 sampled individuals, some fractional combinations come up quite often while others are impossible. Also, the permutations are more biased than the χ2 would predict—there are too few small values for Î(A;B). This is another small sample effect. Generally, the χ2 approximation fails when there are fewer than 5 observations in a cell, and shouldn’t really be trusted with fewer than 10. Here, we have nine cells and only 20 total observations. But our value of 0.44 bits—equal to a sample mutual information of 12.2 nats on the scale of the figure—is significant according to the (uncorrected) χ2 approximation with p = 0.016, and according to the permutation test with p = 0.014. If we are very concerned about the deviation from the χ2 distribution, we might decide to use Fisher’s Exact Test on the underlying contingency table.

    With more observations, data do tend to converge to a χ2 distribution. For example, here I have run 10,000 permutations of a sample of 10,000 individuals, for two loci, each locus with 10 alleles at equal frequencies. The curve is a χ2 distribution with 81 degrees of freedom:

    Simulation outcome

    Very nice.

    This comparison suggests a couple of things. First, we can always do a permutation test of the hypothesis of independence. Take a sample of paired genotypes, shuffle up one of them, and see if the observed pairs share higher mutual information than a large fraction (say, 95%) of the permuted sets. (Naturally, if at this point you are already thinking of a genome-wide survey, you will need to consider ways to correct for multiple comparisons….)

    Second, we can get approximate results for mutual information using a χ2 test. All we do is multiply the mutual information estimate Î(A;B) by 2nlog 2 and compare it to the appropriate significance level of the χ2 distribution with the appropriate number of degrees of freedom. This approximation will be poor for small samples, including the HapMap samples. Again, if we were testing the hypothesis of independence in those cases, we would likely want to use Fisher’s Exact Test instead.

    But in what follows, we will generally not be testing the hypothesis that two loci are independent; we will be testing the hypothesis that they are linked under neutrality. In that context, these statistical tests of independence will be useful for quickly weeding out genetic regions where linkage is negligible. Then, we can employ different tests for regions where linkage appears to be more substantial, tests that make more effective use of the properties of mutual information.

    Next: Genetic drift reduces mutual information

  • Positive selection and the tip of the iceberg

    Mon, 2008-09-29 10:16 -- John Hawks

    Razib points to a new paper by Johansson and Gyllensten, in which they develop a comparison of FST and haplotype block length as a test of positive selection. The paper is interesting enough (and open access) -- they give a list of a few variants that are likely selected in different populations.

    What I wanted to point to was this figure:

    Selection versus drift, Johansson and Gyllensten 2008

    That pretty much encapsulates the problem of detecting recent positive selection, with current methods. The distribution of selected alleles looks significantly different from the distribution of neutral alleles, but there is a tremendous overlap. Particularly when it comes to choosing an arbitrary cutoff between the two distributions.

    Imagine if you had a sample of men and women, and you chose an arbitrary cutoff of stature to distinguish them. Say, everyone over 5 foot 7 is a man. Well, that will do better than chance, but you've included a lot of women in your sample of men, and vice versa. Now, suppose you thought that men were inherently rare compared to women. Say, 100 women for every man. A cutoff of 5 foot 7 inches is going to include many more false positives (i.e., tall women) than genuine men. So you choose a very conservative cutoff, one that is not likely to include very many women. Maybe 6 foot 5. The people you see who are over 6 foot 5 are extremely likely to be men -- not certainly, you still will catch some very tall women -- but quite likely men. But you've excluded 95 percent of the men to do this.

    That's the situation we are in with respect to detecting selection. There is an enormous set of false negatives -- truly selected alleles that are indistinguishable by means of an arbitrary cutoff from neutral alleles. In the figure above, Johansson and Gyllensten suppose that each ascertained variant (at s=0.01) represents almost 5 in the population. So far, few have made much of the point that a small number of selected alleles under a very stringent cutoff must correspond to a large number that don't make the cutoff. (our 2007 paper being an exception). The issue is not only ascertainment; it is the shape of the non-ascertained distribution.

    Still, one may hope to do better at identifying selected alleles. So far, people have been hacking at these statistical distributions with a hatchet. We need a scalpel.

    References:

    Johansson A, Gyllensten U. 2008. Identification of local selective sweeps in human popluations since the exodus from Africa. Hereditas 145:126-137. doi:10.1111/j.2008.0018-0661.02054.x

  • Information theory: a short introduction

    Fri, 2008-09-26 21:37 -- John Hawks

    I lectured this week in my Biology of Mind course about information theory, and in particular the concept of Shannon entropy. I’ve typed up a few notes for my students, and I’m cross-posting them on my own blog because they are relevant to another topic I’ll be writing about: discovery and testing of natural selection in the human genome. You see, the kind of data that are presently being collected as part of the International HapMap , single nucleotide polymorphisms (SNPs), are naturally treated by information theoretic measures. So first, it may help to define the essential concepts of information theory.

    Many readers will have heard of the concept of entropy in connection to the thermodynamics of physical systems. Indeed, one common statement of the Second Law of Thermodynamics is that a closed system must increase in entropy over time. Entropy is a statistical characteristic of a system, related to the probabilities that the particles of a system will be found in given states at any given time. In thermodynamic terms, a system’s entropy is related to our ability to extract work from the system. The Second Law implies that work cannot endlessly be extracted from a closed system without the addition of energy from outside. By 1927, Leó Szilárd (probably my favorite physicist) had shown that the entropy of a physical system can be naturally defined in terms of information. In other words, one way of looking at entropy is in terms of uncertainty about the state of any given particle in a system, and one might apply energy to a system in order to reduce this uncertainty (for example, by concentrating particles in one part of the system.

    Claude Shannon developed the concept of entropy as applied to communication systems. By doing so, he established the field of information theory. Shannon developed several fundamental theorems, including a derivation of the relation between channel capacity (bandwidth) and noise, studies of optimal encoding strategies, and a means of treating continuous as well as finite communications. His most basic definition is that of information entropy, which has also come to be called the Shannon entropy. This definition places entropy as a measure of our uncertainty about the state of a system—in particular, as applied to information, a system of signs.

    Shannon published “A mathematical theory of communication” in 1947, describing his theoretical work. Along with this article, the (UW-Madison) mathematician Warren Weaver wrote a popular treatment of Shannon’s work, titled “Recent contributions to the mathematical theory of communication.” I mention these articles because it is very hard to improve upon them; they are clear in their exposition and notation. The two can be found together in the 1948 book, The Mathematical Theory of Communication, which has been reprinted several times. I can’t recommend this book highly enough.

    Keeping that in mind, I won’t be reiterating the essentials of information theory here; I just want to give a basic understanding that can be applied to other problems—particularly with respect to SNP datasets.

    Entropy and outcomes

    Suppose we have an electron in a box. Its spin may be “left” or “right”, with equal probability. What is our uncertainty about the electron’s spin? One way of looking at the question: We are just as uncertain about the electron as we would be about the flip of a fair coin. The uncertainty in both cases has the same quantity, even though the systems are in other respects totally different from each other. Hence, it is desirable that our definition of uncertainty not depend on the actual physical characteristics of a system, but only upon the probabilities of signs within the system.

    If we were not uncertain at all, the probability of one outcome would be 1 and the other would be zero. For instance, if we had a two-headed coin, we would be absolutely certain of flipping heads. Naturally, a definition of uncertainty should assign zero to the case in which we already know the outcome. But for a fair coin, we have a probability of 0.5 for one outcome, and a probability of 0.5 for the other. We are uncertain, and our measure of uncertainty should have a positive value in this case, whatever unit we may choose.

    Now suppose we have a nucleotide of DNA. It may be adenine, guanine, cytosine or thymine, each with probability 0.25 (1/4). How many coin flips would give us the same amount of uncertainty? The answer is two: Two flips have four possible outcomes (0,0; 0,1; 1,0; 1,1) with equal probability (0.25) for each. Again we have an equivalence between two systems in the amount of uncertainty about the outcome. However, in this case we can see that it takes two trials of one system to attain the same uncertainty as one trial of the other system. It would seem that we should be twice as uncertain about the nucleotide as we are about a single coin flip. Indeed, three nucleotides of DNA (a codon) will give us 64 possible outcomes—the same as six coin flips.

    The number of possible outcomes of a set of trials grows as the exponent of the number of trials. So for example, 5 coin flips yield 25 = 32 possible outcomes; 10 coin flips yield 210 = 1024 possible outcomes. If the coin is fair, then every outcome is equally probable; meaning that the probability of any sequence of 10 coin results might be observed with probability 1/1024. A consideration of this system over a slightly larger scale will give some idea of the power of encoding. With 10 coin flip results, we might choose a room in a 1024-bed hospital at random. With 100 coin flips, we may describe a system of 1.2 × 1030 elements—enough to randomly choose a point on the Earth’s surface to within a millionth of an inch.

    If we are uncertain where we have hidden our microdot, a hyper-GPS could communicate its location anywhere in the world to us with a string of 100 heads or tails. The information that will remove our uncertainty is related to the logarithm of the number of outcomes. This leads to a mathematical definition of uncertainty in terms of logarithms. In particular, for a system X with possible outcomes x1,x2,,xn, the information entropy (H(X)) is:

              ∑n H (X ) = -   p(xi)logp(xi)           i=1
    (1)

    The logarithm is conventionally taken as a base-2 logarithm, so that the measure of entropy is the binary digit, or bit. A single coin flip has two possible outcomes each with probability 0.5. The equation gives us:

    H (coin flip) = - [0.5 log0.5+ 0.5log 0.5] = 1 bit
    (2)

    This equation allows us also to handle systems in which the probabilities of different outcomes are not all equal. For example, suppose there are two outcomes, with probability 0.9 and 0.1. What is the uncertainty?

    H (X ) = - [0.9 log0.9+ 0.1log 0.1] = 0.47 bits
    (3)

    Here we are less than half as uncertain as in the case of a fair coin—and indeed, that is the point. If we had a coin that consistently gave only 10% tails, we would on average be considerably more certain about the outcome. On average, we can communicate two flips of our unfair coin with only one bit. Exactly how this can be done is by using the right kind of encoding. The point for us is that a system with much less uncertainty than a coin flip can be specified with less information than a coin flip.

    Mutual information

    Now, suppose we have two distinct events. If these events are independent, then the joint entropy represented by both is simply the sum of their individual entropies:

    H (X,Y ) = H (X)+ H (Y)
    (4)

    Why is this? Consider two coin flips. In the combined system including both flips, we have four possible outcomes (0,0; 0,1; 1,0; 1,1). If our two flips are independent, then the probability of each combined outcome is the product of the probabilities of the individual outcomes. That is, p(0,1) = p(0)p(1), and log p(0)p(1) = log p(0) + log p(1). Then, equation 4 can be derived from 1 by algebraic manipulation.

    But, if the two events are not independent—that is, if the outcome of one depends on the outcome of the other, then their joint entropy must be less than the sum of their individual entropies.

    H (X,Y ) ≤ H (X)+ H (Y)
    (5)

    And the difference between the joint entropy and the sum of the individual entropies is a measure of the correlation between the two events. We call this difference the mutual information of the two events, and we define it mathematically:

    I(X;Y ) = H (X)+ H (Y)- H (X,Y )
    (6)

    Consider a game of Blackjack at a casino. Most people play probabilities as if every card were equally likely to be dealt in every hand. Under this scenario, the house has a consistent edge—this is, after all, how casinos make money. But in fact every card is not equally likely to be dealt. In particular, there is a serial correlation among the cards dealt in a Blackjack game. If the King of Hearts is dealt, it is rather less likely to occur again very quickly. In other words, there is mutual information between the dealing of a card and the outcomes of later hands: Once the King of Hearts is observed, its probability of being dealt in later hands declines. A clever player with a good memory may make use of this mutual information to guide his bets: putting down more money when the house is less likely to draw face cards, for instance. Players who can “count cards” in this way would be the doom of the casinos if they allowed it to continue. To prevent it, they reduce the extent of serial correlations by dealing cards from boots of four or more decks, and the casinos eject players suspected of counting.

    Redundancy

    Moreover, a system comprised of two distinct events may include redundancy. We may consider a very prominent system in which two independent events, each with two outcomes, give rise to a combined system with only three, not four outcomes. Consider a Mendelian gene A with two alleles a1 and a2. When two heterozygotes (a1a2) mate, their offspring will have one of three genotypes: a1a1, a1a2, or a2a2. If we want to communicate the genotypes of their children, we will need an average of 1.5 bits for each.

    This seems counter-intuitive, considering that each gamete from the parents is a1 or a2 with equal probability. That is, p(a1) = p(a2) = 0.5. The entropy represented by each gamete is a coin flip’s worth—one bit. It might seem that combining two gametes, each derived independently from a different parent, we should have 2 bits of entropy, not only 1.5.

    Yet it is a simple matter to show that we can transmit information about the children’s genotypes using only 1.5 bits. Let’s encode a heterozygote as a “0”, an a1 homozygote as “10” and an a2 homozygote as “11”. In this encoding, a single bit communicates whether the child is a homozygote or heterozygote, and if homozygote is followed by a second bit communicating which of the two alleles. Now, if our couple of heterozygotes have eight children, four of whom are heterozygotes and two of each homozygote, we can transmit their family’s genotypes as “000010101111”. Eight children. Twelve bits. That’s 1.5 bits per genotype. In some unlikely cases (all homozygotes) we will use more bits, in others (all heterozygotes) we will use less.

    In this case, two different outcomes from the point of view of inheritance are in fact not distinguished in the genotype of each child. “Heterozygotes” include two distinct classes: those who inherit a1 from the mother (and a2 from the father) and those who inherit a2 from the mother (and a1 from the father). The system that gives rise to the genotypes is relevant to the probability that each genotype will occur (as Mendel discovered), but it is not very relevant to the way we describe those genotypes. All we know about heterozygotes is that they have two different alleles. When we want to predict phenotypes, we may not care which allele came from which parent. When we collect genotypes (on an Affy or Illumina chip, for example), we are in a poor position to determine which allele came from which parent. Thus, even though two bits of gametes went into the physical system creating the genotypes, only 1.5 bits is sufficient to communicate them.

    What is the import of this redundancy? Clearly, if we are interested in children’s genotypes, then a system that tracks their parents’ gametic contributions is a poor way of encoding the information. That system includes redundant information, with respect to genotypes. But to put the problem another way, knowing the children’s genotypes leaves us with uncertainty about their parents’ gametic contributions. If we want to devise an accurate paternity test, we will sometimes need to know more than the child’s genotype.

    Next: Information theory and mutual information between genetic loci

Pages

Subscribe to HapMap

Neandertals

For years, I've worked on their bones. Now I'm working on their genes. Read more about the science studying these ancient people.

Denisova

From a finger bone of an ancient human came the record of a completely unexpected population. My lab is working on the science of the Denisova genome.

Acceleration

The advent of agriculture caused natural selection to speed up greatly in humans. We're uncovering some of the ways that populations have rapidly changed during the last 10,000 years.

Malapa

Just outside Johannesburg, the Malapa site is producing some of the most exciting finds in human evolution. This site is the headquarters of the Malapa Soft Tissue Project.