john hawks weblog

paleoanthropology, genetics and evolution

information theory

  • Blueprints and recipes

    Tue, 2011-05-17 08:30 -- John Hawks

    Greg Mayer has a post on preformationism and epigenesis on the Why Evolution Is True blog:"Development is epigenetic".

    He later quotes Richard Dawkins in a similar light, but I'm linking because of Mayer's own useful synopsis of the blueprint analogy versus the recipe analogy for development.

    Preformationism, though wrong, is frequently reinforced by the common (though badly mistaken) practice of referring to DNA or the genome as a “blueprint” for the organism. It is of course no such thing. A blueprint is a two dimensional representation of a three dimensional object. There is, in a blueprint, a scaled representation of all the parts of the object. We can tell, for example, that the window on the second floor is 4 m above and 2m to the left of the door. There is nothing like that in your DNA: there isn’t a gene for your left eye, which is a scaled distance away from the gene for your right eye. Your DNA (and your development) is much more akin to a recipe. In a raisin cake recipe, there isn’t a line in the recipe that says place a raisin 2 cm in from the upper left hand corner (there would be, if we had a blueprint for the cake). Rather, if you combine the right ingredients, in the right sequence, in the right environment, the result is a cake with raisins distributed through it at a certain density.

    In the end both these analogies entail some mechanism. A blueprint needs some past mechanism capable of producing an iconic representation of the final object. A recipe needs some mechanism capable of recording a sequence of steps. Neither of those is impossible to evolve (Mayer briefly mentions the iconic nature of the arrangement of Hox genes), but it's pretty clear that the blueprint analogy does not apply to most developmental processes.

    I was thinking about this issue in light of the nativist and learning theoretic views of language development. In that problem, the question is about the locus of the recipe -- did evolution lay down special instructions for language learning, or does the language environment contain most of the structure necessary for children to learn without special instructions beyond those used for learning many other kinds of behavior? Chomsky argued that language environments cannot in principle supply the necessary structure, so biology must have done so ("Language and spandrels"). But he was essentially preformationist in this position, even to the extent of denying that language could have evolved. He instead preferred to see language as a side-effect of other evolutionary processes, or emerging as a physical principle from humanlike brains.

    Anyway, I'll return to this later, I just wanted to register a note on preformationism and epigenesis in relation to the issue.

  • Gleick interview on information

    Sat, 2011-04-09 23:16 -- John Hawks

    The Guardian has an interview with James Gleick about his new book, The Information: A History, a Theory, a Flood. The book focuses in part on Claude Shannon and his development of information theory, which leads to one of the most interesting passages in the interview:

    But as you note, information is not knowledge. We are more painfully aware of that now than ever. In explaining Shannon's work I kept having to emphasise his point about the irrelevance of meaning; yet we know full well that meaning is what we really care about. This loomed larger and larger. There's a hilarious moment in 1950 in a New York hotel meeting room when Shannon tries to explain "information" to anthropologists and psychologists such as Margaret Mead and Lawrence Frank, and they're a little outraged. Where are the humans in this picture? Where are our brains? If it's just wires and transistors, who cares?

    Oh, if I were a companion of the Doctor, this is the second place I'd like to see.

  • Bitwise consciousness

    Mon, 2010-09-20 19:51 -- John Hawks

    Carl Zimmer writes about theories of consciousness in today's Science NY Times, and describes the work of my Wisconsin colleague, Giulio Tononi.

    But Dr. Tononi’s theory is, potentially, very different. He and his colleagues are translating the poetry of our conscious experiences into the precise language of mathematics. To do so, they are adapting information theory, a branch of science originally applied to computers and telecommunications. If Dr. Tononi is right, he and his colleagues may be able to build a “consciousness meter” that doctors can use to measure consciousness as easily as they measure blood pressure and body temperature. Perhaps then his anesthesiologist will become interested.

    That's fortuitous because I'm lecturing about information theory tomorrow in my "Biology of Mind" course. The article goes on about how to measure consciousness using information theory terms. I'm not sure it's a practical theory of conscious experience, yet, but I think the information theory concepts are fundamentally important to understanding the adaptive evolution of brains on a more basic level.

    I'm always impressed reading back through Darwin, who a hundred years before information theory began to consider what we might describe as transmission properties of animal communication.

    As far as Tononi's ideas -- there is a logic here that is very appealing. Information is about encoding and transmission. Cryptography, for example, requires that we study the transmission properties of a channel to try to understand the encoding. That is, in a sense, what Tononi is proposing. Where most people have considered only the encoding properties, he proposes understanding the transmission properties.

  • Mutual information between strings of loci

    Sat, 2009-03-28 21:26 -- John Hawks

    Fourth in a series on mutual information and genetic linkage. If you’re happening upon it for the first time, you can find the entire series or the first post, “Information theory: a short introduction”.

    After the last post, you might wonder what the big deal is about these information theoretic measures of linkage. After all, we’ve got lots of other measures of linkage to choose in population genetics, with many years of theory behind them. The basic conclusion about genetic drift was that it adds mutual information to samples over short regions, but that recombination over longer areas washes it out. If the net effect is no linkage, why would we bother to come up with some non-standard linkage measure?

    One answer: If the existing linkage measures were so great for testing neutrality, then we might expect some of the recent genome-wide selection scans to have used them. But they didn’t – instead we have several partially incompatible methods, all of which eschew the usual measures of linkage.

    I’m not going to summarize all the reasons for this, at least not yet. But there is one thing that the current positive selection tests have in common: they all attempt to quantify the decay of linkage across long strings of loci. Each of them—LRH, iHS, LDD—distills into a single number the pattern of reduction of linkage across physical distance.

    After the last post, that may seem odd. If we know the physical distance (or more to the point, the map distance), we can predict the amount of linkage under drift. That really ought to be enough to test neutrality. Yet the existing tests insist on a second dimension: the rate of decline of linkage. Partly, that’s because these tests try to separate positive selection from other things that violate neutrality, like inversions. But it also reflects a real limitation of the methods: they ignore some of the information available in the data, and thereby limit their power to test the hypothesis of genetic drift based on linkage. So they must turn to the rate of linkage decay as a test of drift.

    The trouble with homozygosity

    Massive genotyping projects are not sequencing projects. They provide long lists of genotypes, not sequences. If you’ve got a long list of genotypes, one of the measures of variation immediately available to you is homozygosity. Homozygosity among multiple sites, or joint homozygosity is not mathematically the same as the mutual information between sites, but it also is increased by linkage. To the extent that genetic drift predicts the distribution of long-range linkage, you can test neutrality by looking at the joint homozygosity of two or more sites. For example, the extended haplotype heterozygosity (EHH) is the probability that a pair of gametes that share a single core haplotype will also share identical haplotypes at longer distances. Homozygosity itself is not a test of linkage—for that we have to combine the probabilities of identity of all alleles in some way.

    Homozygosity-based tests of neutrality set up this combination as a ratio: the ratio of homozygosity for one allele or core haplotype versus the homozygosity of other alleles or haplotypes. That procedure throws away a lot of information carried by the fraction of haplotypes that are nonidentical at distance. Since that fraction of non-homozygote mutual information, unlike homozygosity, actually increases with distance, it may in some circumstances provide a more powerful test of neutrality. The use of a ratio of homozygosity also reduces the chance to detect certain classes of non-neutral loci—most obviously, the ones that have two or more selected alleles.

    To put some numbers to these ideas, let’s consider we have two loci, A and B. The alleles of A are a1 and a2, B has b1 and b2. Each of these alleles has an associated frequency, for example p(a1), and the frequency of the haplotype a1b1 will be written p(a1b1). If A and B are independent (unlinked) then the expected value of p(a1b1) would simply be p(a1)p(b1).

    The usual measure of linkage, D, is calculated as

    D = p(a b)p(a b) - p(a b )p(a b )        1 1   2 2     1 2   2 1
    (1)

    It thus involves the frequencies of all four possible haplotypes. At linkage equilibrium, the two terms are expected to be the same, so that D = 0.

    Several other measures of linkage are in common use (and sometimes lead to confusion). There are several reasons why you might want a different measure, and one of the biggest is that D depends on the frequencies of the alleles a1 and b1—low-frequency polymorphisms will have systematically lower linkage values for the same evolutionary scenario. A very good thing about D is that it does not require us to know the frequencies of the individual alleles a1 or b1, only the frequencies of their joint haplotypes. But the expectation that the products in the equation are equal depends on a 2 × 2 contingency table. Three-, four- or more-locus haplotypes demand some other measure.

    Using the same terminology, the homozygosity of the haplotype a1b1 is given as p(a1b1)2. By itself, this tells us nothing at all about linkage. If we combine it with the homozygosity of one or the other allele, (for example with the difference p(a1)2- p(a1b1)2, we will have a measure of the reduction in homozygosity with distance. That reduction in homozygosity still doesn’t tell us about linkage, without considering p(b1). But if we scale the reduction by taking it as a ratio with the reduction in homozygosity of one or more other haplotypes, those other haplotypes give us indirect information about p(b1). So we have an indirect measure of linkage, based on the relative rate of linkage decay.

    Relying on the rate of decay of linkage isn’t such a bad idea if we don’t know the rate of recombination between markers. In that case, we can use the rate of decay of homozygosity of other haplotypes as a scaling factor. But there is a problem lurking: The decay of homozygosity around a marker or core haplotype depends on its frequency. Under drift, higher-frequency haplotypes decay over shorter distances. So our test will be biased in a way that depends on the frequency spectrum of haplotypes.

    It seems much more direct to estimate the mutual information between A and B. We don’t need to use an indirect linkage measure when we have all the frequencies needed to calculate a direct measure. Mutual information is a useful measure because it extends easily to many loci. But it has a disadvantage: it requires us to obtain an independent estimate of the recombination fraction between our markers.

    Maybe that’s not such a problem. High-resolution recombination maps are available for the HapMap markers. If we use those maps, then we can test neutrality with markers at fixed map distances instead of physical distances. That ought to let us easily quantify the genome-wide fraction of mutual information that originated in a given time interval in the past. But we’ll have to remain cautious, since the map distances are worked out under the assumption of neutrality.

    Why multilocus haplotypes are useful

    Why do we care about many loci, when we can test neutrality using only one? Suppose we’re interested in the mutual information across a 500 kb stretch of the genome. We could pick a single marker on each end of the 500 kb, and measure the mutual information between them. That’s basically what I did in the case of drift in the last post, “When genetic drift reduces entropy”.

    That isn’t slicing the sample very finely. Suppose that our two markers both have major alleles at 80 percent. That makes the expected frequency of the most common joint haplotype 64 percent, and the least common 4 percent. In our sample of 120 gametes, we’ll conclude that the two are significantly associated if those values go to around 70 and 10 percent, respectively. That’s like salting the sample with seven or eight copies of the rare haplotype.

    On the other hand, we could salt our sample with more than a dozen copies, evenly split between the two intermediate-frequency haplotypes, and never get a significant result. Even if the expected rare haplotype were completely absent, it wouldn’t be enough to reject the hypothesis of random association, much less the hypothesis of genetic drift. Natural selection can cover its tracks, if two or more linked haplotypes have both increased in frequency.

    Is this likely to be a problem very often? Past some distance, positive selection loses its impact on linked haplotypes, because there’s too much recombination. Near a selected site, linkage is strong enough that there will usually be a single major haplotype hitchhiking upward. In between, there may be minor haplotypes hitchhiking more weakly but we will generally be able to look closer to find a major haplotype. But if there are multiple haplotypes under selection, perhaps because more than one adaptive mutation have occurred, the locus may well look completely neutral.

    An example in human populations may be MC1R. Harding et al. (2000) sequenced the gene in 106 Africans and over 356 Europeans. They found a diverse array of amino acid mutations in Europeans that were absent in their African sample, concluding that purifying selection on skin pigmentation had prevented such mutations from becoming common in Africa. In contrast, purifying selection was relaxed in Europe, allowing several loss-of-function variants to increase in frequency. They found no statistical evidence for positive selection on these loss-of-function variants in Europe. Likewise, no other recent tests for positive selection have shown MC1R to stand out as selected.

    But wait a minute. Where did all those European redheads come from?

    Harding and colleagues found five different coding polymorphisms with frequencies more than 7 percent in Europe, but completely absent in their sample of Africans. That adds up to roughly half the copies of MC1R in Europe, all derived alleles. That kind of emergence of new alleles doesn’t happen by chance, at least not in a population history like Europe’s. But none of the usual tests will reject neutrality when five different alleles have increased under selection. And individually, each of these alleles is too rare to register a rejection of neutrality, at least, if we apply the usual tests.

    What makes this gene so troublesome? There’s no problem estimating the frequency of any single derived allele in Europe, and it’s easy to figure out its extended homozygosity. But if we then compare that to other alleles at the same locus, well, some of them have long haplotypes, too. So none of them have an unusual pattern of LD decay compared to the others.

    What to do? We could try to get a much larger genetic sample, or a large sample of individuals from a part of Europe where one of these alleles is especially common.

    But since I’m lazy, I figure I’ll just consider the mutual information carried by haplotypes of multiple loci. Individually, no single allele may have a frequency high enough to reliably show linkage in a two-locus comparison. Individually, no single allele will reject neutrality by a homozygosity-based test, because the other selected haplotypes decay over very long areas, making the ratio for any single haplotype insignificant. Collectively, the selected haplotypes may carry linkage over long distances. Together they will reject neutrality, even if individually they don’t.

    But first, we need to know the distribution of multilocus mutual information under neutrality.

    Degrees of freedom

    If you can calculate a χ2 and its degrees of freedom, you can skip next four paragraphs.

    In the second post I showed that the mutual information is distributed as a χ2. That’s a statistical distribution with one parameter—the degrees of freedom. The concept of degrees of freedom is one of the trickiest definitions in statistics. Roughly, it means the number of independent factors that contribute to a single observation.

    Let’s take the case of a contingency table, where we have a number of rows, r, and a number of columns, c, where the total number of cells is r ×c. If all the cells were independently distributed, then the number in each cell is expected to be the fraction in its row times the fraction in its column. That’s the expected value. If you take the number you observed in a cell, subtract the number you expected, square that difference, divide it by the expected value, and add up the result for every cell in the table, that’s the χ2 statistic. This statistic is high when the observed numbers are far from the expected ones. It’s low when they are close to expected.

    For a 2 × 2 table, we can ask, how many ways are there for the χ2 statistic to be high? Any one of the cells may have an unusually large number of observations, so we might be tempted to say there are four different ways to get a high value. But it’s obvious that if one value is too high, the value in the same row and the other column must be too low. Likewise, the observed number in the other row and the same column must be too low. And if both those diagonal cells are too low, well then the observed value in the cell diagonal to our high cell must also be high. In other words, all four cells depend on each other. Push one, and you push them all. So there’s really only one axis along which the cells can vary—one degree of freedom.

    With a 3 × 2 table, things are a little different. We can hold one row entirely constant, and just change the proportions in the other two. If one cell has a high observed value, that might be because both of the other rows were shortchanged. Or it might be just a single row that’s low, the other being high as well. That makes two distinct ways that we could get a high χ2 statistic, two degrees of freedom. In general, there are (r - 1)(c - 1) degrees of freedom in a contingency table. It’s always one less than the number of rows or columns because at least one row and column must be shortchanged when a cell is higher than expected.

    Degrees of freedom with comparisons of multiple haplotypes

    From the previous post, you might have anticipated how we could do simulations of drift in samples of multiple markers. For the following, I’m going to examine the mutual information between two sets of three SNP markers. That is, I have a stretch of chromosome roughly 50 kb long from end to end (or more properly, 0.05 cM), with three markers tightly linked at either end. Using the same methods as the previous post, I’m applying the same three-stage population history with genetic drift only to these six markers.

    Here are the results for 10,000 trials:

    PIC

    That doesn’t look very much like the chart that I got in the previous post, using the same population history. Here is that one:

    PIC

    What’s going on? That red line in the second chart is the χ2 distribution with one degree of freedom. The first chart doesn’t match that at all.

    Why not? The second chart shows the mutual information between two markers, with two alleles each. That’s one degree of freedom. The first chart shows the mutual information between two sets of three markers. Three markers give the possibility of eight distinct haplotypes (23). So we have an 8 × 8 contingency table, with 49 degrees of freedom. Here is that distribution, in red, along with the result of our 10,000 simulated samples:

    PIC

    Except, oops, it doesn’t look like 49 degrees of freedom either! What the heck is going on?

    If you’ll remember from the second post in the series, our empirical distribution isn’t going to fit a chi-square unless we have enough observations. Sure, in theory, three markers could give us 8 haplotypes. But in practice, the three markers on each end of our 50 kb region are so tightly linked that we get many fewer haplotypes. Sometimes we get only three or four haplotypes on each end. If each set of three markers were independent of the other, they might fit a chi-square with four degrees of freedom, or nine, or six or twelve. In fact, the mutual information observed in these 10,000 simulations accords with a mixture of distributions. Here are the number of degrees of freedom in the 10,000 simulations, as a histogram:

    PIC

    Most of the trials have six or nine degrees of freedom, with 12 being a substantial proportion as well. No primes, there, except for 2 and 3. Only a tiny fraction of trials have as many as 20 degrees of freedom between the 3-marker sets. None get anywhere close to the theoretical 49, because the three-marker sets are too tightly linked to express all possible haplotypes.

    So if we want a theoretical distribution to match to our simulated one, we are going to need a mixture of different χ2 in proportion to the number represented in the simulated set. Here’s that comparison:

    PIC

    The red line is a mixed distribution, in which different χ2 distributions are blended in the proportions represented by the histogram above. That’s the distribution of mutual information between the two marker sets, conditioned on the haplotypes actually present in each set, under the hypothesis of independence. The simulations lie to the right of this distribution, meaning that small-sample bias and genetic drift have added mutual information to the simulations, compared to the expectation in the absence of linkage.

    The only difference between these simulations and the ones in the previous post is the number of markers. Here, we’re looking at mutual information between two three-marker sets; there, we were looking at mutual information between two one-marker loci. Both examples used the same distance and the same population history. Genetic drift has had a similar effect—in both sets of trials, there is a slight excess of cases with high mutual information. With both, there are around twice as many at the tail as expected if the markers were independent.

    This leads to a couple of conclusions:

    1. Under this population history, genetic drift is not sufficient to cause very large amounts of mutual information to be shared by distant sites.
    2. A slight alteration to the χ2 test—for example, adding some proportion to the critical values—might allow us to test the hypothesis of neutrality for any given case without the necessity of all those simulations.

    If we were to add more and more markers, we would eventually find that the bias in mutual information due to small sample size would get bigger. So there’s some maximum number of markers giving useful information in any given sample. But I just picked 3 markers out of a hat. Adding more markers, up to some point, should increase our ability to see rare haplotypes that might diverge from the expectations of genetic drift.

    Next: Conditional mutual information: finding the haplotypes that explain linkage disequilibrium

    References


       Harding RM, et al. 2000. Evidence for variable selective pressures at MC1R. Am J Hum Genet 66:1351–1361.

  • When genetic drift reduces entropy

    Sun, 2009-03-15 20:09 -- John Hawks

    This is the third in a series on information theory and tests for recent selection. The first post, “Information theory: a short introduction”, covered some of the basics of entropy. The second post, “Information theory and mutual information between genetic loci”, showed that mutual information between independent sites will be distributed as a χ2.

    We tend to think of genetic drift as a random process. Random processes operating repeatedly over time are called “stochastic,” and changes in gene frequency under genetic drift are certainly that.

    Since entropy is a measure of uncertainty, it might seem natural to think that stochastic changes in gene frequency would increase the entropy in a population. After all, the gene frequency in a population under genetic drift will be more and more uncertain over time. So, considering the frequency of a single allele as the system, genetic drift appears to increase entropy over time.

    But even this simple system isn’t quite so simple as it might appear. Sure if you start out knowing the allele frequency, then genetic drift will increase your uncertainty over time. You will become less and less able to say that it lies in any given interval. But what if you don’t start out knowing? What if all you know is that the locus has been subjected to t generations of genetic drift?

    As t increases, the probability of fixation of the locus also increases. The net effect is to reduce the entropy in the system – going from uncertainty about the allele frequency to more and more certainty that it will be either one or zero. The only thing that will stop this process is some other evolutionary force – mutation, migration from other populations, balancing selection. Each of these will have its own distinctive effects on the entropy of the single-locus system.

    Drift and recombination

    We are going to consider a system of two partially linked loci. That means that the two are near enough to each other on a single chromosome that they are usually inherited together, but that a small fraction of individuals will have recombination between the two loci in each generation.

    If the two loci were unlinked, the evolution of allele frequencies at one would have no effect on the other. The mutual information between the two loci would accord with the description in the last post in the series. As described there, the distribution of mutual information estimates taken from different samples of such populations would approximate a chi-square distribution.

    But our two loci will be linked. Genetic drift tends to affect the allele frequencies at the two loci in the same way. Haplotypes consisting of one allele from each locus will increase or decrease under drift. Over time, haplotype frequencies will tend to diverge, some becoming rare, others common. Because the alleles of the two loci are evolving in tandem, the joint entropy of the two-locus system will be lower than expected from the sum of the individual entropies of the two loci. Mutual information will be greater than expected under the hypothesis of independence.

    How much? That depends on the amount of recombination between the two loci and the power of genetic drift. Recombination introduces random noise, by breaking up pairs of alleles and shuffling them. That tends to increase the joint entropy of the two-allele system. The more recombination, the less we can predict the allelic state of one locus from what we observe at the other.

    Genetic drift is more powerful in small populations; less so in large populations. It can also be enhanced by population structure, or diminished by migration.

    So in principle, the reduction in joint entropy from genetic drift opposes the increase in joint entropy from recombination. As the physical distance between the two loci increases, the power of recombination will increase. As the population size increases, the power of genetic drift will decrease. That means we need to know something about local recombination patterns, and something about ancient human demography to get a good estimate of the effects of these processes on human genes.

    But given some knowledge about these things, we should be able to get a pretty accurate idea of the distribution of mutual information that would result from drift alone in ancient human populations.

    Simulating some samples

    These days, the person who wants to investigate the effects of genetic drift on samples of genes has many options. Several well-documented computer programs allow simulations of genetic drift allowing the user to choose parameters like effective population size, recombination rate, marker type and density. Some of these programs use a forward-time approach – that is, starting with a randomized sample and propagating it forward through time. Others use a coalescent approach, which “starts” with a sample of genes taken from the present and propagates their ancestry backward in time. There’s a good diversity to choose from, unlike the old days (10 years ago) when you more or less had to write your own.

    For this series, I’m going to use a package called CoaSim. The package was written by Thomas Mailund and colleagues, and is available under a free software license, including documentation. This software employs a coalescent approach to generate an “ancestral recombination graph” (ARG) among a sample of genes, under a specified model of demographic history. The ARG is a tree of genealogical relationships among the sampled genes, factoring in a history of recombination between them. From the ARG, the software can add markers (such as SNPs or microsatellites), resulting in simulated datasets of sequences that could have evolved under the specified demographic history.

    Similar functionality is available in some other packages. I’m using CoaSim in this case because it includes Python bindings. I use Python for a number of other analytical methods, so this cuts my development time. That’s a faster path from investigation to results.

    So, let’s consider some samples that might result from a long history of genetic drift in a single population. First, I’ll assume that the effective size of my population is 10,000 diploid individuals. I’m going to simulate 10000 samples from populations matching this description. Each sample consists of two SNP markers drawn from 120 gametes – in other words, two copies from each of 60 individuals, with current sample frequencies random between 0.1 and 0.9. If you’ll recall from the last post in the series, we need a decent number of individuals representing each bin for our mutual information estimate to be reasonable. Excluding rare SNPs helps to make sure we have that.

    The mutual information between these SNPs will depend on the rate of recombination between them. To begin with, I’ll assume that the two markers are 50 kilobases apart, and that the rate of recombination is 1cM/Mb, or around 10-8 per base.

    PIC

    The red line represents a chi-square distribution with 1 degree of freedom. That would be the expectation if these two loci were independent, and the histogram is clearly biased to the right compared to that expectation. That means that the mutual information between two linked loci is higher, on average, than that between two unlinked loci. This is a simulation under drift alone, not an empirical distribution. Drift reduces the joint entropy of two linked loci more than it reduces the individual entropy of each locus.

    How much mutual information are we talking about? From the last post, we know that twice the sample mutual information, measured in nats, is approximated by a chi-square. Here on the high end, we have a value of 50, which corresponds to 25 nats per sample. In our sample of 120 gametes, that is around 0.2 nats per gamete, or around 0.3 bits per gamete. That’s not a remarkable amount – it’s about the entropy of a single biallelic locus with allele frequencies of 0.95 and 0.05.

    Now, what happens if we keep recombination the same, but reduce the strength of genetic drift? The following chart shows what happens if I assume an effective size of 100,000, with the rest of the parameters kept the same – 50 kb, 120 chromosomes, frequencies between 0.1 and 0.9.

    PIC

    Here, the histogram is much closer to the expected chi-square distribution under independence. There’s still a slight bias upward, and out of 10,000 samples, a few that have rather high mutual information values. But the effect of genetic drift in this scenario is very slight compared to the case where N = 10000.

    As it turns out, the effect of a tenfold larger population is precisely the same as if I kept population size the same and moved the loci ten times further apart. The mutual information scales with the product of recombination and effective population size. So two loci 500 kb apart in an effective population of 10,000 would on expectation have the same mutual information as two loci 50 kb apart in an effective population of 100,000. Another way of putting it: By reducing the coalescent probability tenfold, recombination has ten times the opportunity to scramble associations between the two loci.

    When population size expands

    Human populations have grown by more than a thousand-fold during the last 40,000 years. Compared to the population of 40,000 years ago, today’s human population should have markedly less mutual information between linked genetic loci. If we want to find out how much, we’ll have to have a model of the recent human demography.

    Here, I’ve done some simulations with changing population size. First, let’s assume that we have two loci 50 kb apart. I’m going to start with a very simplistic model of human demographic history. This model has three stages, or “epochs:”

    1. For the last 400 generations, the effective size is 1,000,000 individuals.
    2. Before that, up to 800 generations ago, the effective size is 100,000 individuals.
    3. Before 800 generations ago, the effective size is 10,000 individuals.

    This is both simplistic and conservative. It’s basically assigning a size of a million to the post-agriculture population, and 100,000 to the population after the Last Glacial Maximum. We know that those early values are way too small for the African population, which did not markedly contract during the LGM and has had an effective size up over 30,000 for at least the last 100,000 years. And obviously, the effective population for the past 10,000 years has been much higher than a million. The European population may have had more of a bottleneck at the LGM, but today’s Europeans have substantial ancestry in West Asia, so 10,000 is conservative there as well. The point is not that these values are realistic; it is that they are almost certainly too small.

    I’ve picked the simple scenario to make it a little more transparent what is going on with the results. Here they are:

    PIC

    With this population history, two loci 50 kb apart have nearly the same distribution of mutual information as two independent loci. There’s a slight bias toward higher values, but very little. Out at the tail, it looks like there are around twice as many loci as predicted by the χ2 approximation, and this is still a very small number.

    I thought it might be useful to consider what the mutual information would be under the same population history for two loci that are closer together. Here’s the same scenario, except the loci are 5 kb apart:

    PIC

    With the markers ten times closer here, there is only a tenth as much recombination between them. That makes the result a lot more like the first case above, where the markers were 50 kb apart and the effective size was 10,000 back to infinity. Markers 5 kb apart should have a good chance of sharing significant mutual information under drift alone in a human-like population history. However, as in the first case above, the mutual information generated by drift is still fairly slight.

    Growth and recombination

    Let’s step back and consider what’s going on with these examples. Mutual information is determined by the balance between recombination and coalescence. Increase the effective size, and you reduce the probability of coalescence between lineages. That gives recombination more of a chance to scramble the loci, eliminating mutual information between them.

    A human-like population history has a very large effective size within the last several thousand years. The large effective size reduces the probability of coalescence down to almost zero across that time period. So the ancestry of almost every gene copy has more than 400—and often up to 1000—generations all to itself. If recombination is ticking along at 1cM/Mb, then physical distances out to around 50 kb start to saturate for recombinants across the last 20,000 years.

    Consider: two lineages that coalesce 400 generations in the past are separated by 800 generations of time. The two lineages were initially identical. Our two markers are 50 kb apart, given them a per-generation probability of recombination of 0.0005. The probability that neither undergoes recombination during those 400 generations is:

    (1- 0.0005)800 = 0.67

    Two thirds remain the same; one third of pairs will have recombined with other lineages. Half of pairs that coalesce 700 generations ago will have recombined. In contrast, after 700 generations, only seven percent of pairs separated by 5 kb will have recombined. Pairs that remain identical carry mutual information; each pair that recombines loses this mutual information.

    Under drift alone, the mutual information builds slowly. From the first chart we know that the mutual information at this physical distance in a small population will be significant, but not very high. Most of this information is still retained in the samples represented by the last chart, where they are separated by only 5 kb. But over longer distance this mutual information dissipates rapidly in a growing population.

    So if we observe substantial mutual information between loci in today’s populations, we need some explanation other than genetic drift. Selection is one possible explanation—but to consider that more closely, we will want to consider other information theoretic features of gene samples.

    Next: Extending to strings of loci

  • Kulturkreise of the Paleolithic Era

    Tue, 2008-11-25 16:25 -- John Hawks

    I'm reading through the English translation of The Culture Historical Method of Ethnology by Wilhelm Schmidt -- one of the practitioners of the Vienna School in the early 20th century. This is in pursuit of my project describing early 20th-century diffusionism, the next logical point is the Kulturkreise, or "culture circle" theory.

    While I formulate my thoughts on that topic, I thought it might be worth highlighting this passage, in which Schmidt considers the problem of reconstructing culture relations among prehistoric hunter-gatherers -- the kind with which Paleolithic archaeologists would now be concerned. The passage lies in a broader section about external causes of culture similarities, and the basic point seems to be that the vastness of time allows few distinctive similarities among the very ancient cultures of Paleolithic people:

    On the other hand, the comparison of ethnological with the prehistorical periods, as is carried out by O. Menghin in his Weltgeschichte der Steinzeit, shows how long those periods of the food gatherers must have lasted, which without any noticeable "progress" (in external culture) cover thousands (and even tens of thousands) of years. We pass over in silence the astounding speed of the historical events of modern times, for neither could that profusion of events of "early historical" times have happened in the epochs with which we are here concerned, because there existed only very small human groups who lived thinly distributed over the earth, which was still so sparsely peopled. At any rate, the external course of events could by far not have been rich and developed, and the resonance of the individual events could also not by any means describe such broad circles and, if so, then only in a long course of time.

    Of course, that does not decide the question of te plenitude and richness of psychic events of the peoples of that time, although also the small number of individuals in the single groups and the great distances separating the single groups restricted these to certain limits. The smallness of a group could especially increase the danger that the group received none of the leading individuals, who are of particular importance for the awakening and preservation of intellectual life (Schmidt 1939:251-252).

    I find the last paragraph here to be quite modern in its conception of information transfer and storage within individuals. In the first paragraph, Schmidt concerns himself with the pace of culture change, with culture viewed as a reactive system in the face of external events. The long span of Paleolithic time has few extrinsic changes to which cultures might have obviously adapted, so that contrasts between cultures on an extrinsic basis might be expected to be minor.

  • Julian Steward and the logic of diffusion

    Thu, 2008-10-30 00:17 -- John Hawks

    I've had a tremendous response to the last entry in the diffusion series, which discussed the treatment of cultural diffusion by the Boasian school. I really appreciate the pointers, which have taken me in some interesting directions that I might have missed. Meanwhile, I'm continuing on with my review.

    In the last post, I pointed out that Boas already had described the main elements necessary for a formal description of diffusion. Indeed, diffusion was of special importance to the Boasian view of culture, because it provided a mechanism by which culture history could come to be partially independent of race history. But despite its importance, neither Boas nor his students brought themselves to a reductive theory of diffusion. That is no surprise, in the context of early American anthropology, which was in many respects unwilling to generalize rules from the particular descriptions generated by ethnographers.

    But one of Kroeber and Lowie's students, Julian Steward, did attempt a relatively formalized description of cultural diffusion. In a short 1929 article, "Diffusion and independent invention: a critique of logic," the 27-year-old Steward showed frustration with anthropology's failure to grapple with the diffusion problem in a concrete way.

    There exists a large proportion of anthropological data which admits of no clear-cut methodology but is usually handled according to inference and common sense logic. While this method may be soundly rational, the possibility of an enormous subjective element and fallacious logic is ever present and is demonstrated by the existence of the diffusion controversy. This controversy is made possible not only by the personal bias of the investigator but also by confusion of the principles upon which the solution is based.

    It is not my purpose to present a rule-of-thumb method for the settlement of the diffusion controversy but to inquire into its logical implications and discover whether these are not capable of formulation. While this will but formulate the principles implicit in most work, it will also reveal the possibility of certain confusions and inconsistencies (Steward 1929:491).

    Why was there a "diffusion controversy"? Steward describes the controversy as emerging from two extreme viewpoints about culture history -- "extreme diffusionists" and "evolutionists." In Steward's description, this conflict comes down to a logical error of comparison.

    He describes this error with an example: "inverted speech," which is "a custom of clowns and others of saying the reverse of what is meant."

    This may be illustrated by inverted speech which occurs in North America in the Plains area, California, and the Southwest, and also occurs in Australia. Shall we account for these four occurrences by diffusion or independent invention. The solution depends upon inference from the assembled facts, but what is the logic of our reasoning? We ask: How probable is communication between these areas? How difficult an achievement is inverted speech? It is tempting immediately to postulate diffusion between the North American occurrences but independent invention for Australia. This would be solely on a basis of distribution and by this we should be prone to judge the uniqueness of the element (Steward 1929:491-492).

    Steward uses "uniqueness" as a synonym for improbability: an improbable feature shared by two societies is likely to result from a unique occurrence that had diffused to both. Continuing with inverted speech, Steward notes that the assumption of diffusion among the North American instances leads naturally to the conclusion that inverted speech is improbable -- after all, it only occurred once in the entire continent. But the assumption of independent invention of the trait in Australia

    ...lead us to regard inverted speech as not so difficult an invention after all, for it clearly has been invented a second time. What logical justification would there be for the assumption that independent invention is inherently less possible for the Plains, California, and the Southwest than for Australia because the first three happen to be geographically more accessible (492)?

    Steward claimed that this comparison amounts to question-begging. Sure, if we assume that communication between two societies is likely, we will be predisposed to interpret diffusion. But this does not give us any real information as to whether independent invention happened.

    To address this problem, Steward suggests three "principles" to guide the interpretation of shared culture elements. I'm going to list these, which Steward presents as statements about probabilities, and recast them in terms of information theory.

    (1) The probability of independent invention is directly proportionate to the difficulty of communication between the localities

    In other words, if messages may pass easily between two locations, then we predict that the information comprising culture elements may pass easily also. Steward further describes two signs that may indicate the difficulty of communication. If the two localities share many culture elements, then we can infer that a regular communication was probably present. And if the localities communicated over a very long time, they may share many more things than if they have communicated only over a short time.

    (2) The probability of independent invention is directly proportionate to the uniqueness of the element.

    Steward adds:

    The uniqueness of a culture element -- that is, the probability of its being invented -- is the most difficult problem to determine. This will be decided by the investigator upon his experience and knowledge of the cultural setting and circumstances under which it may have been invented. But his decision must not depend upon either of the other two principles stated here. To the probability of an element of culture arising in a particular culture, the existence of this element in other localities and the difficulty of communication between the localities are totally irrelevant.

    This is probably the most important point in the essay: It would be desirable to maintain a separate test of the diffusion hypothesis with reference to the "uniqueness" of the element in question, so that this test could reinforce the test based on communication. As it was, each of these tests appeared to obviate the other, resulting in a circle. But Steward's description of this point is totally unclear, and benefits from an information theoretic analysis.

    If we recast "uniqueness" as "reduction in Shannon entropy" (e.g., "information" in the technical sense), and understand that the entropy is a function of probabilities of the components of a culture element, then we can reformulate this principle: Independent invention is less probable when the shared element has a high information content. Still, this cannot be absolute: high information content is difficult to communicate. Communicating a long message across a noisy channel (like a culture contact) requires some work to maintain the fidelity of the information -- and that work requires an incentive.

    In this sense, the "uniqueness" criterion must really break down into at least three separate properties: (1) the information content of the trait, (2) its value, and (3) synergy with the existing culture system. Value may refer to function, such as status marking or foraging utility. A culture element that conflicts with existing knowledge is less likely to be transmitted accurately; whereas one that is synergistic with existing knowledge might be picked up easily from a distant culture. In effect, "synergy" refers to the extent that the information content of a culture element is already familiar within the culture background.

    (3) The probability of independent invention is inversely proportionate to the probability of derivation from a common ancestral culture.

    This is a phyletic view of shared culture elements, that they come from original populations that split into daughter cultures. Again, when many elements are shared, this increases the probability of sharing for any single element.

    Yet after this discussion of principles, Steward comes to an uncomfortable conclusion. We are still left utterly unable to assess whether inverted speech was invented independently in different North American regions, or whether it diffused from one origin. Certainly it helps to know that we don't know that, but that doesn't give us a method, just a critique.

    What the problem really requires is some kind of measurement of the probabilities of invention and transmission. In the case of genetic transmission, we have well-defined probabilities, because actually there is very little information entropy in the system of a single gene with a finite (and small) number of alleles.

    But for culture elements, the information content may be much greater. With more information, we can imagine more ways for the information to be apportioned, as well as more different ways that transmission of the information might be disrupted.

    As to the "diffusion controversy," that will be discussed further in the next post, where I cover Leslie White's record on diffusion.

    References:

    Steward JH. 1929. Diffusion and independent invention: a critique of logic. Am Anthropol 31:491-495.

  • Information theory and mutual information between genetic loci

    Fri, 2008-10-10 22:52 -- John Hawks

    This is the second in a series on information theory and tests for recent selection. The first entry, "Information theory: a short introduction" reviewed the basic concepts of information measures and their background.

    The International HapMap is a massive project to determine the genotypes for up to 3 million single nucleotide polymorphisms (SNPs) in samples of people from 11 population samples around the world. The current data release (Phase 3) includes genotypes for a subset of over 1.5 million SNPs in 1,115 people. The 11 population samples include people of African ancestry from the US Southwest, Utah residents of Northern and Western European ancestry, Han Chinese from Beijing, people of Chinese ancestry from Denver, people in the Houston Gujarati Indian community, Japanese people from Tokyo, Luhya and Maasai people from Kenya, people of Mexican ancestry from Los Angeles, Italians in Tuscany, and Yoruba from Ibadan, Nigeria.

    As impressive as this effort is, we may wonder why exactly SNP genotyping of so many people is a valuable enterprise in itself. The project’s homepage includes this short statement:

    The goal of the International HapMap Project is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared. By making this information freely available, the Project will help biomedical researchers find genes involved in disease and responses to therapeutic drugs.

    There are theoretical and practical objections to this simple explanation (as I discussed here last month). However, what no one involved with the project seems to have expected is the extent to which the data would demonstrate the importance of recent adaptive evolution in human populations.

    Here, I am describing some of the ways that we can test hypotheses about natural selection by using the SNP genotypes from the HapMap. This is a theory-centric description, with some digression into practical aspects of handling the genotype data. First, I consider how we might use information theoretic concepts to test the hypothesis of independence between two genetic loci.


    Entropy and genotypes

    The data from the HapMap consist of an array of biallelic genotypes for each population. The size of this array is different for each population; we may consider it to have m rows, each row corresponding to a single SNP locus, and n columns, each corresponding to a person. The entries are genotypes: AA, AT, CG, and so on. Each SNP is biallelic, so it hardly matters what we call the two alleles—we can arbitrarily label them a and b. Thus, the three possible genotypes may be labeled aa, ab and bb.

    Of course, from a sample of genotypes, we can readily estimate the frequencies of the two alleles in the population. The sample allele frequency ˆp (a) = ˆp (aa) + 12ˆp (ab), where the estimate ˆp (aa) is the number of aa individuals in the sample divided by the sample size n. It is worth expressing some statistical caution about estimates. Although the HapMap SNPs are biased toward common alleles, nevertheless some of them are rare indeed. And as we stretch down the chromosome to consider multilocus haplotypes, many may be vanishingly rare or even singular in the population, even though they happen to occur in the sample.

    Now, how shall we estimate the entropy of a single locus? It seems there are several ways we might look at the question.

    1. Allelic entropy We might use the entire sample of genotypes at the locus to estimate the allele frequencies. In that case, the entropy of the SNP locus would be estimated by
      Hˆ(SNP ) = - [ˆp(a)log ˆp(a)+ ˆp(b)log ˆp(b)]
      (1)

      For a SNP locus with empirical genotype frequencies p(aa) = 0.16, p(ab) = 0.48 and p(bb) = 0.36, this estimate of allelic entropy would be 0.97 bits.

    2. Genotypic entropy Or, we might consider the genotype frequencies as the essential elements of the system. In this case, the entropy would be estimated by
      Hˆ(SNP) = - [ˆp(aa) log ˆp(aa)+ ˆp(ab)logpˆ(ab)+ ˆp(bb)log ˆp(bb)]
      (2)

      For the same genotype frequencies listed above, this genotypic entropy would be estimated as 1.46 bits.

    3. Sample entropy Or, we might consider the sample of genotypes as a series of n repeated trials. That would make the sample entropy in the example 1.46n bits. We might also calculate a sample entropy considering the gametes instead of the genotypes—but this would work out to be the same, as long as we don’t know which gametes came from which parents of heterozygotes.

    To understand this last point, imagine that the sample is a deck of shuffled cards. Each card may be red with probability p(a) or black with probability p(b). If we drew cards one at a time, we could record a sequence (red, red, black,…) of 2n cards. Each card drawn thereby has an exact position in the sequence. If p(a) = 0.4 and p(b) = 0.6 as above, then this entropy will be 1.94n bits. Now, imagine instead that we draw the cards two at a time, and record only these pairs (homozygote red, homozygote black, heterozygote,…). We will have a sequence half as long, and this sequence does not include the exact position of the red and black cards, only their position as part of a pair. The entropy of this sequence of n genotypes is only 1.46n bits. This gives some insight into the nature of the sample entropy as defined above: It is the entropy of a sequence of genotypes.

    By contrast, the allelic entropy is the uncertainty associated with drawing a single copy of the SNP at random from the sample. The genotypic entropy is the uncertainty associated with drawing a genotype (two SNP copies) from the sample. Again, these are empirical estimates derived from the sample; they represent the underlying population just to the extent the sample does.

    Which of these estimates will be useful to us? The sample entropy tells us how many bits of hard drive space we need to store the HapMap data under an efficient coding scheme. That’s interesting, but not necessarily what we have in mind. The other two are sample estimates of an underlying population characteristic—either allele or genotype frequencies. In what follows, we will be interested in quantifying how much our uncertainty about one individual’s SNP genotype is reduced by knowing that same individual’s genotype for some other SNP. It would seem that the appropriate measure for this problem will be the genotypic entropy.

    Mutual information between SNPs

    The joint entropy represented by two SNP loci may be less than the sum of their individual entropies. That is to say, there may be mutual information between the two loci. Mutual information can arise between SNPs for several reasons. Most obviously, if the sample of individuals actually consists of members of two distinct populations with different allele frequencies, then two distinct SNPs may have mutual information because of this hidden population structure. This source of mutual information is exploited by admixture mapping—a process that involves taking people with recent ancestry from two or more populations and finding genomic regions that are linked by virtue of the fact that little recombination has yet had time to reshuffle haplotypes that may be locally common but globally rare. Or two loci may share mutual information because they are physically linked.

    As an example of mutual information estimation, consider two sets of genotypes (the rows of the following table):

    A = 0 1 0 2 0 1 1 0 1 0 1 0 2 1 0 0 1 0 1 0
    B = 2 2 2 0 1 1 2 2 0 2 1 1 0 1 2 1 2 2 1 2

    Each SNP locus has three genotypes; twenty people are represented in the sample, each column represents the genotypes of a single individual. The empirical estimate of genotypic entropy from the first SNP locus (Ĥ(A)), based on 10 zeros, 8 ones and 2 twos, is 1.36 bits per individual. The same estimate for the second SNP locus (Ĥ(B)), based on 3 zeros, 7 ones, and 10 twos, is 1.44 bits per individual. The sum of the individual entropies for the two SNP loci is 2.8 bits. But the joint entropy of the two loci (Ĥ(A,B)), based on three (0,1), seven (0,2), one (1,0), four (1,1), three (1,2), and two (2,0) joint genotypes, is only 2.36 bits. Hence, the two SNP loci share an estimated 0.44 bits of mutual information (e.g., Î(A;B) = 0.44 bits).

    What does this mean? Consider the following contingency table, based on the genotypes above:

    0 1 2
    0 3 7
    1 1 4 3
    2 2

    None of the sampled individuals have the joint genotype (0,0) and none have (2,1) or (2,2). But even more important, a full seven have (0,2) even though only 2.5 would be expected if the two SNPs were independent.

    The contingency table is a clue (for the statistically-minded). The mutual information is telling us something like Pearson’s χ2 test of association. If we know the genotype for SNP A, we can do better than chance predicting the genotype for SNP B.

    Significance testing for mutual information

    The comparison with Pearson’s χ2 test raises another obvious question: How do we test whether a given amount of mutual information is statistically significant? In the example above, we have estimated 0.44 bits of mutual information between SNP A and SNP B. Is this a lot? How much mutual information should we expect between two random samples of data?

    Let’s take an even simpler contingency table — the relation between two coin flips. Each flip may have a 50-50 chance of producing “heads” or “tails”. On average, we expect to see one fourth (H, H), one fourth (T, T), one fourth (H, T) and one fourth (T, H) in our results. But if we perform this experiment (2 coin flips) a finite number of times, we will almost always see some slight divergence from these proportions. Ten sets of coin flips absolutely can’t give us one fourth of each result, because ten doesn’t divide by four evenly. On top of that problem, we might get six or seven (H, H) by chance, even if we only expect 2.5. Any of these cases will give us a positive non-zero estimate of mutual information, even if there is no causal connection between the coin flips.

    Another way of stating this observation is that the estimate of mutual information, Î(A;B) is biased. We should expect the estimate from a small sample to be larger than the true mutual information in the population at large.

    The analogy between mutual information and the χ2 test is more apt than it might appear. In fact, over many trials, the distribution of sample estimates of mutual information will approximate a χ2 distribution, multiplied by a factor 2nlog 2, where n is the number of cases in the sample. Our sample of genotypes of A and B above has 20 individuals and 3 possible genotypes, so 40(log 2)Î(A;B) should be distributed as a χ2 with 4 degrees of freedom.

    Reflecting on the three ways we might estimate the entropy of a locus, above, this is twice the mutual information calculated from the sample entropy, measured in nats instead of bits. Even though we are interested in estimating the population characteristic, the genotypic entropy, the significance of our estimate can only be evaluated by considering the entropy of the sample.

    And indeed, we can perform a randomization of the two sets of genotypes and show the correspondence. Here, I’ve done ten thousand permutations of the genotypes in A with respect to those in B:

    Simulation outcome

    The bars are the histogram representing the 10,000 permuted samples, the curve is the χ2 density function with 4 degrees of freedom. You can see that the permutations show significant clumpiness. With only 20 sampled individuals, some fractional combinations come up quite often while others are impossible. Also, the permutations are more biased than the χ2 would predict—there are too few small values for Î(A;B). This is another small sample effect. Generally, the χ2 approximation fails when there are fewer than 5 observations in a cell, and shouldn’t really be trusted with fewer than 10. Here, we have nine cells and only 20 total observations. But our value of 0.44 bits—equal to a sample mutual information of 12.2 nats on the scale of the figure—is significant according to the (uncorrected) χ2 approximation with p = 0.016, and according to the permutation test with p = 0.014. If we are very concerned about the deviation from the χ2 distribution, we might decide to use Fisher’s Exact Test on the underlying contingency table.

    With more observations, data do tend to converge to a χ2 distribution. For example, here I have run 10,000 permutations of a sample of 10,000 individuals, for two loci, each locus with 10 alleles at equal frequencies. The curve is a χ2 distribution with 81 degrees of freedom:

    Simulation outcome

    Very nice.

    This comparison suggests a couple of things. First, we can always do a permutation test of the hypothesis of independence. Take a sample of paired genotypes, shuffle up one of them, and see if the observed pairs share higher mutual information than a large fraction (say, 95%) of the permuted sets. (Naturally, if at this point you are already thinking of a genome-wide survey, you will need to consider ways to correct for multiple comparisons….)

    Second, we can get approximate results for mutual information using a χ2 test. All we do is multiply the mutual information estimate Î(A;B) by 2nlog 2 and compare it to the appropriate significance level of the χ2 distribution with the appropriate number of degrees of freedom. This approximation will be poor for small samples, including the HapMap samples. Again, if we were testing the hypothesis of independence in those cases, we would likely want to use Fisher’s Exact Test instead.

    But in what follows, we will generally not be testing the hypothesis that two loci are independent; we will be testing the hypothesis that they are linked under neutrality. In that context, these statistical tests of independence will be useful for quickly weeding out genetic regions where linkage is negligible. Then, we can employ different tests for regions where linkage appears to be more substantial, tests that make more effective use of the properties of mutual information.

    Next: Genetic drift reduces mutual information

  • Information theory: a short introduction

    Fri, 2008-09-26 21:37 -- John Hawks

    I lectured this week in my Biology of Mind course about information theory, and in particular the concept of Shannon entropy. I’ve typed up a few notes for my students, and I’m cross-posting them on my own blog because they are relevant to another topic I’ll be writing about: discovery and testing of natural selection in the human genome. You see, the kind of data that are presently being collected as part of the International HapMap , single nucleotide polymorphisms (SNPs), are naturally treated by information theoretic measures. So first, it may help to define the essential concepts of information theory.

    Many readers will have heard of the concept of entropy in connection to the thermodynamics of physical systems. Indeed, one common statement of the Second Law of Thermodynamics is that a closed system must increase in entropy over time. Entropy is a statistical characteristic of a system, related to the probabilities that the particles of a system will be found in given states at any given time. In thermodynamic terms, a system’s entropy is related to our ability to extract work from the system. The Second Law implies that work cannot endlessly be extracted from a closed system without the addition of energy from outside. By 1927, Leó Szilárd (probably my favorite physicist) had shown that the entropy of a physical system can be naturally defined in terms of information. In other words, one way of looking at entropy is in terms of uncertainty about the state of any given particle in a system, and one might apply energy to a system in order to reduce this uncertainty (for example, by concentrating particles in one part of the system.

    Claude Shannon developed the concept of entropy as applied to communication systems. By doing so, he established the field of information theory. Shannon developed several fundamental theorems, including a derivation of the relation between channel capacity (bandwidth) and noise, studies of optimal encoding strategies, and a means of treating continuous as well as finite communications. His most basic definition is that of information entropy, which has also come to be called the Shannon entropy. This definition places entropy as a measure of our uncertainty about the state of a system—in particular, as applied to information, a system of signs.

    Shannon published “A mathematical theory of communication” in 1947, describing his theoretical work. Along with this article, the (UW-Madison) mathematician Warren Weaver wrote a popular treatment of Shannon’s work, titled “Recent contributions to the mathematical theory of communication.” I mention these articles because it is very hard to improve upon them; they are clear in their exposition and notation. The two can be found together in the 1948 book, The Mathematical Theory of Communication, which has been reprinted several times. I can’t recommend this book highly enough.

    Keeping that in mind, I won’t be reiterating the essentials of information theory here; I just want to give a basic understanding that can be applied to other problems—particularly with respect to SNP datasets.

    Entropy and outcomes

    Suppose we have an electron in a box. Its spin may be “left” or “right”, with equal probability. What is our uncertainty about the electron’s spin? One way of looking at the question: We are just as uncertain about the electron as we would be about the flip of a fair coin. The uncertainty in both cases has the same quantity, even though the systems are in other respects totally different from each other. Hence, it is desirable that our definition of uncertainty not depend on the actual physical characteristics of a system, but only upon the probabilities of signs within the system.

    If we were not uncertain at all, the probability of one outcome would be 1 and the other would be zero. For instance, if we had a two-headed coin, we would be absolutely certain of flipping heads. Naturally, a definition of uncertainty should assign zero to the case in which we already know the outcome. But for a fair coin, we have a probability of 0.5 for one outcome, and a probability of 0.5 for the other. We are uncertain, and our measure of uncertainty should have a positive value in this case, whatever unit we may choose.

    Now suppose we have a nucleotide of DNA. It may be adenine, guanine, cytosine or thymine, each with probability 0.25 (1/4). How many coin flips would give us the same amount of uncertainty? The answer is two: Two flips have four possible outcomes (0,0; 0,1; 1,0; 1,1) with equal probability (0.25) for each. Again we have an equivalence between two systems in the amount of uncertainty about the outcome. However, in this case we can see that it takes two trials of one system to attain the same uncertainty as one trial of the other system. It would seem that we should be twice as uncertain about the nucleotide as we are about a single coin flip. Indeed, three nucleotides of DNA (a codon) will give us 64 possible outcomes—the same as six coin flips.

    The number of possible outcomes of a set of trials grows as the exponent of the number of trials. So for example, 5 coin flips yield 25 = 32 possible outcomes; 10 coin flips yield 210 = 1024 possible outcomes. If the coin is fair, then every outcome is equally probable; meaning that the probability of any sequence of 10 coin results might be observed with probability 1/1024. A consideration of this system over a slightly larger scale will give some idea of the power of encoding. With 10 coin flip results, we might choose a room in a 1024-bed hospital at random. With 100 coin flips, we may describe a system of 1.2 × 1030 elements—enough to randomly choose a point on the Earth’s surface to within a millionth of an inch.

    If we are uncertain where we have hidden our microdot, a hyper-GPS could communicate its location anywhere in the world to us with a string of 100 heads or tails. The information that will remove our uncertainty is related to the logarithm of the number of outcomes. This leads to a mathematical definition of uncertainty in terms of logarithms. In particular, for a system X with possible outcomes x1,x2,,xn, the information entropy (H(X)) is:

              ∑n H (X ) = -   p(xi)logp(xi)           i=1
    (1)

    The logarithm is conventionally taken as a base-2 logarithm, so that the measure of entropy is the binary digit, or bit. A single coin flip has two possible outcomes each with probability 0.5. The equation gives us:

    H (coin flip) = - [0.5 log0.5+ 0.5log 0.5] = 1 bit
    (2)

    This equation allows us also to handle systems in which the probabilities of different outcomes are not all equal. For example, suppose there are two outcomes, with probability 0.9 and 0.1. What is the uncertainty?

    H (X ) = - [0.9 log0.9+ 0.1log 0.1] = 0.47 bits
    (3)

    Here we are less than half as uncertain as in the case of a fair coin—and indeed, that is the point. If we had a coin that consistently gave only 10% tails, we would on average be considerably more certain about the outcome. On average, we can communicate two flips of our unfair coin with only one bit. Exactly how this can be done is by using the right kind of encoding. The point for us is that a system with much less uncertainty than a coin flip can be specified with less information than a coin flip.

    Mutual information

    Now, suppose we have two distinct events. If these events are independent, then the joint entropy represented by both is simply the sum of their individual entropies:

    H (X,Y ) = H (X)+ H (Y)
    (4)

    Why is this? Consider two coin flips. In the combined system including both flips, we have four possible outcomes (0,0; 0,1; 1,0; 1,1). If our two flips are independent, then the probability of each combined outcome is the product of the probabilities of the individual outcomes. That is, p(0,1) = p(0)p(1), and log p(0)p(1) = log p(0) + log p(1). Then, equation 4 can be derived from 1 by algebraic manipulation.

    But, if the two events are not independent—that is, if the outcome of one depends on the outcome of the other, then their joint entropy must be less than the sum of their individual entropies.

    H (X,Y ) ≤ H (X)+ H (Y)
    (5)

    And the difference between the joint entropy and the sum of the individual entropies is a measure of the correlation between the two events. We call this difference the mutual information of the two events, and we define it mathematically:

    I(X;Y ) = H (X)+ H (Y)- H (X,Y )
    (6)

    Consider a game of Blackjack at a casino. Most people play probabilities as if every card were equally likely to be dealt in every hand. Under this scenario, the house has a consistent edge—this is, after all, how casinos make money. But in fact every card is not equally likely to be dealt. In particular, there is a serial correlation among the cards dealt in a Blackjack game. If the King of Hearts is dealt, it is rather less likely to occur again very quickly. In other words, there is mutual information between the dealing of a card and the outcomes of later hands: Once the King of Hearts is observed, its probability of being dealt in later hands declines. A clever player with a good memory may make use of this mutual information to guide his bets: putting down more money when the house is less likely to draw face cards, for instance. Players who can “count cards” in this way would be the doom of the casinos if they allowed it to continue. To prevent it, they reduce the extent of serial correlations by dealing cards from boots of four or more decks, and the casinos eject players suspected of counting.

    Redundancy

    Moreover, a system comprised of two distinct events may include redundancy. We may consider a very prominent system in which two independent events, each with two outcomes, give rise to a combined system with only three, not four outcomes. Consider a Mendelian gene A with two alleles a1 and a2. When two heterozygotes (a1a2) mate, their offspring will have one of three genotypes: a1a1, a1a2, or a2a2. If we want to communicate the genotypes of their children, we will need an average of 1.5 bits for each.

    This seems counter-intuitive, considering that each gamete from the parents is a1 or a2 with equal probability. That is, p(a1) = p(a2) = 0.5. The entropy represented by each gamete is a coin flip’s worth—one bit. It might seem that combining two gametes, each derived independently from a different parent, we should have 2 bits of entropy, not only 1.5.

    Yet it is a simple matter to show that we can transmit information about the children’s genotypes using only 1.5 bits. Let’s encode a heterozygote as a “0”, an a1 homozygote as “10” and an a2 homozygote as “11”. In this encoding, a single bit communicates whether the child is a homozygote or heterozygote, and if homozygote is followed by a second bit communicating which of the two alleles. Now, if our couple of heterozygotes have eight children, four of whom are heterozygotes and two of each homozygote, we can transmit their family’s genotypes as “000010101111”. Eight children. Twelve bits. That’s 1.5 bits per genotype. In some unlikely cases (all homozygotes) we will use more bits, in others (all heterozygotes) we will use less.

    In this case, two different outcomes from the point of view of inheritance are in fact not distinguished in the genotype of each child. “Heterozygotes” include two distinct classes: those who inherit a1 from the mother (and a2 from the father) and those who inherit a2 from the mother (and a1 from the father). The system that gives rise to the genotypes is relevant to the probability that each genotype will occur (as Mendel discovered), but it is not very relevant to the way we describe those genotypes. All we know about heterozygotes is that they have two different alleles. When we want to predict phenotypes, we may not care which allele came from which parent. When we collect genotypes (on an Affy or Illumina chip, for example), we are in a poor position to determine which allele came from which parent. Thus, even though two bits of gametes went into the physical system creating the genotypes, only 1.5 bits is sufficient to communicate them.

    What is the import of this redundancy? Clearly, if we are interested in children’s genotypes, then a system that tracks their parents’ gametic contributions is a poor way of encoding the information. That system includes redundant information, with respect to genotypes. But to put the problem another way, knowing the children’s genotypes leaves us with uncertainty about their parents’ gametic contributions. If we want to devise an accurate paternity test, we will sometimes need to know more than the child’s genotype.

    Next: Information theory and mutual information between genetic loci

  • Superstition explained? Signs point to no...

    Sat, 2008-09-13 10:47 -- John Hawks

    I happened to be lecturing about cause-effect relationships in my Biology of Mind course today, and will continue the subject next time. So I was interested when I saw this story from New Scientist writer Evan Callaway:

    The tendency to falsely link cause to effect – a superstition – is occasionally beneficial, says Kevin Foster, an evolutionary biologist at Harvard University.

    For instance, a prehistoric human might associate rustling grass with the approach of a predator and hide. Most of the time, the wind will have caused the sound, but "if a group of lions is coming there’s a huge benefit to not being around," Foster says.

    OK, so this seems fairly obvious on the surface. Sure, if you could avoid the lions by jumping at every sound in the grass, then this kind of "superstition" might pay off, and would be unlikely to hurt. But then, being afraid of wind in the grass is really not a superstition. There is reliable mutual information connecting grass sounds and lions. Jumping at every grass movement might lead to lots of false alarms, but will really help avoid the lions, too.

    That's a different kind of scenario than what we usually mean by "superstition". Like never washing your game socks so that your team won't lose on Saturday. Or paying the witch doctor so that the demons will spare your little girl.

    We can still say that these false beliefs have little fitness cost. After all, the difference between a witch doctor and a real doctor may be important in today's cosmopolitan society, but throughout most of human existence, faith healers, witch doctors, shamans and thaumaturges really didn't face any competition.

    But still, the simple idea of false association doesn't seem to cover these kinds of superstitions. I mean, many of these beliefs are so complicated, with involved series of rules and taboos that must be observed. Human superstitions may involve false hypotheses of causation, but they are not only that. So it seems like there must be some active principle at work, prompting people to learn arbitrary sequences of behaviors on the basis of fictions.

    Moreover, it is not easy to list cases where animals make false inferences of cause and effect in this way. To be sure, we can manipulate animals experimentally, getting them to learn arbitrary correlations (like Pavlov's dogs) and then switch them so that they're false. But that's not the same thing at all! That's like the case where other teams all collude with each other to make sure that your team always wins except when you wash your socks.

    The research study is available in Proceedings of the Royal Society B, by Kevin Foster and Hanna Kokko. The paper defines "superstitious behavior" as behaviors that result from "the incorrect establishment of cause and effect". That definition doesn't include most of what people mean when they refer to "superstitions" -- it's really quite strictly limited to formally correct inferences based on spurious associations. The authors then derive conditions under which such errors may actually increase fitness.

    Foster and Kokko assume a model in which the fitness of an individual is determined by some events in a series that includes many other events that do not affect fitness. If some of those benign events are correlated with fitness-determining events (that is, if there is mutual information linking them), then an individual might increase its fitness by exploiting this mutual information.

    Of course, there is a problem of detecting the correlations. Suppose we observe for three consecutive months that it rains after the full moon. We might be tempted to think there is some causal connection here---that the full moon causes rain in some way. Maybe the moon goddess has an affinity for water. In any event, our observation may induce us to believe that other full moons will also be followed by rain. That's inductive reasoning.

    To that end, Foster and Kokko's model might appear to provide an evolutionary account of the problem of induction. Their model entails a trade-off: A greater degree of precision in making correct inferences will have benefits, but the necessary effort will have costs. "Superstition" reigns below a certain threshold -- the point at which the costs of more accurate inferences are balanced by the benefits of rapid inference-making.

    That's a simple model, and the trade-off is a mathematical necessity, given the assumptions.

    Is that all "superstition" is?

    It seems to me that "false inference" is not a sufficient definition for "superstition." For one thing, superstitions persist even in the face of considerable evidence against them. The player who loses a game may still keep wearing those dirty socks. The family whose first child dies may still employ the witch doctor when the second child falls ill.

    Another thing is that superstitions do not really consist of false inferences. They consist of highly distinctive signs. Those dirty socks, or a rabbit's foot, or oracle bones. Walking under a ladder, broken mirrors, which direction the horseshoe is hanging. "Gesundheit" after a sneeze. You don't learn about them by inference from natural observations; you learn about them from other people who explain the "boundaries" of the signs -- what counts and doesn't count as part of the belief.

    We might compose some sort of analogous argument to that presented by Forster and Kokko. Maybe learning about superstitious beliefs from other people is adaptive because we also learn about valid, true inferences from them, and it's too costly for us to try to tell the difference. There may be some truth to that idea. We accept superstitious beliefs because everyone around us accepts them, and how could so many people be wrong?

    But there are other aspects we should consider. We can't freely take or leave the beliefs that are common in the society around us. Some of them are enforced at the point of a gun. Others are instilled by ritual repetition from early childhood onward. We can call it "social pressure" or "tradition," or simply "culture," but whatever we call it, we have to recognize that we are compelled to accept some beliefs based on what other people around us believe.

    If we call a belief a "superstition," chances are it's already out of style. Belief in ghosts may have been the norm a hundred years ago; today it's a smaller niche. Or a "superstition" may be simply idiosyncratic in some way -- one person's eccentric belief. There are many, many equally silly beliefs that we call much more respectful things, like "laws" or "traditions" --- the difference being that large communities of people share those rituals.

    Any account of superstition has to take into consideration those elements of human learning and sociality. It is far from obvious that any non-human animals have "superstition," even in the sense of false inferences. If we look at a truly human use of the term -- arbitrary sign sequences believed to alter natural phenomena and maintained by social learning -- then it is doubtful that any non-humans have such beliefs. Still, phenomena like the chimpanzee "rain dance" might fit the bill, if we could suitably define the concept of "belief" in a non-linguistic context.

    At any rate, the kind of logic about costs and benefits of inference-making don't seem to describe the kind of phenomenon we really mean when we talk about "superstition." And that's disappointing. If we take the course of defining the term in an overly simplistic way, ignoring its evident social and semiotic components, then we rob ourselves of the depth of human cognitive creativity.

    References:

    Foster KR, Kokko H. 2008. The evolution of superstitious and superstition-like behaviour. Proc Roy Soc Lond B (online) doi:10.1098/rspb.2008.0981

Pages

Subscribe to information theory

Neandertals

For years, I've worked on their bones. Now I'm working on their genes. Read more about the science studying these ancient people.

Denisova

From a finger bone of an ancient human came the record of a completely unexpected population. My lab is working on the science of the Denisova genome.

Acceleration

The advent of agriculture caused natural selection to speed up greatly in humans. We're uncovering some of the ways that populations have rapidly changed during the last 10,000 years.

Malapa

Just outside Johannesburg, the Malapa site is producing some of the most exciting finds in human evolution. This site is the headquarters of the Malapa Soft Tissue Project.