john hawks weblog

paleoanthropology, genetics and evolution

recombination

  • Mutual information between strings of loci

    Sat, 2009-03-28 21:26 -- John Hawks

    Fourth in a series on mutual information and genetic linkage. If you’re happening upon it for the first time, you can find the entire series or the first post, “Information theory: a short introduction”.

    After the last post, you might wonder what the big deal is about these information theoretic measures of linkage. After all, we’ve got lots of other measures of linkage to choose in population genetics, with many years of theory behind them. The basic conclusion about genetic drift was that it adds mutual information to samples over short regions, but that recombination over longer areas washes it out. If the net effect is no linkage, why would we bother to come up with some non-standard linkage measure?

    One answer: If the existing linkage measures were so great for testing neutrality, then we might expect some of the recent genome-wide selection scans to have used them. But they didn’t – instead we have several partially incompatible methods, all of which eschew the usual measures of linkage.

    I’m not going to summarize all the reasons for this, at least not yet. But there is one thing that the current positive selection tests have in common: they all attempt to quantify the decay of linkage across long strings of loci. Each of them—LRH, iHS, LDD—distills into a single number the pattern of reduction of linkage across physical distance.

    After the last post, that may seem odd. If we know the physical distance (or more to the point, the map distance), we can predict the amount of linkage under drift. That really ought to be enough to test neutrality. Yet the existing tests insist on a second dimension: the rate of decline of linkage. Partly, that’s because these tests try to separate positive selection from other things that violate neutrality, like inversions. But it also reflects a real limitation of the methods: they ignore some of the information available in the data, and thereby limit their power to test the hypothesis of genetic drift based on linkage. So they must turn to the rate of linkage decay as a test of drift.

    The trouble with homozygosity

    Massive genotyping projects are not sequencing projects. They provide long lists of genotypes, not sequences. If you’ve got a long list of genotypes, one of the measures of variation immediately available to you is homozygosity. Homozygosity among multiple sites, or joint homozygosity is not mathematically the same as the mutual information between sites, but it also is increased by linkage. To the extent that genetic drift predicts the distribution of long-range linkage, you can test neutrality by looking at the joint homozygosity of two or more sites. For example, the extended haplotype heterozygosity (EHH) is the probability that a pair of gametes that share a single core haplotype will also share identical haplotypes at longer distances. Homozygosity itself is not a test of linkage—for that we have to combine the probabilities of identity of all alleles in some way.

    Homozygosity-based tests of neutrality set up this combination as a ratio: the ratio of homozygosity for one allele or core haplotype versus the homozygosity of other alleles or haplotypes. That procedure throws away a lot of information carried by the fraction of haplotypes that are nonidentical at distance. Since that fraction of non-homozygote mutual information, unlike homozygosity, actually increases with distance, it may in some circumstances provide a more powerful test of neutrality. The use of a ratio of homozygosity also reduces the chance to detect certain classes of non-neutral loci—most obviously, the ones that have two or more selected alleles.

    To put some numbers to these ideas, let’s consider we have two loci, A and B. The alleles of A are a1 and a2, B has b1 and b2. Each of these alleles has an associated frequency, for example p(a1), and the frequency of the haplotype a1b1 will be written p(a1b1). If A and B are independent (unlinked) then the expected value of p(a1b1) would simply be p(a1)p(b1).

    The usual measure of linkage, D, is calculated as

    D = p(a b)p(a b) - p(a b )p(a b )        1 1   2 2     1 2   2 1
    (1)

    It thus involves the frequencies of all four possible haplotypes. At linkage equilibrium, the two terms are expected to be the same, so that D = 0.

    Several other measures of linkage are in common use (and sometimes lead to confusion). There are several reasons why you might want a different measure, and one of the biggest is that D depends on the frequencies of the alleles a1 and b1—low-frequency polymorphisms will have systematically lower linkage values for the same evolutionary scenario. A very good thing about D is that it does not require us to know the frequencies of the individual alleles a1 or b1, only the frequencies of their joint haplotypes. But the expectation that the products in the equation are equal depends on a 2 × 2 contingency table. Three-, four- or more-locus haplotypes demand some other measure.

    Using the same terminology, the homozygosity of the haplotype a1b1 is given as p(a1b1)2. By itself, this tells us nothing at all about linkage. If we combine it with the homozygosity of one or the other allele, (for example with the difference p(a1)2- p(a1b1)2, we will have a measure of the reduction in homozygosity with distance. That reduction in homozygosity still doesn’t tell us about linkage, without considering p(b1). But if we scale the reduction by taking it as a ratio with the reduction in homozygosity of one or more other haplotypes, those other haplotypes give us indirect information about p(b1). So we have an indirect measure of linkage, based on the relative rate of linkage decay.

    Relying on the rate of decay of linkage isn’t such a bad idea if we don’t know the rate of recombination between markers. In that case, we can use the rate of decay of homozygosity of other haplotypes as a scaling factor. But there is a problem lurking: The decay of homozygosity around a marker or core haplotype depends on its frequency. Under drift, higher-frequency haplotypes decay over shorter distances. So our test will be biased in a way that depends on the frequency spectrum of haplotypes.

    It seems much more direct to estimate the mutual information between A and B. We don’t need to use an indirect linkage measure when we have all the frequencies needed to calculate a direct measure. Mutual information is a useful measure because it extends easily to many loci. But it has a disadvantage: it requires us to obtain an independent estimate of the recombination fraction between our markers.

    Maybe that’s not such a problem. High-resolution recombination maps are available for the HapMap markers. If we use those maps, then we can test neutrality with markers at fixed map distances instead of physical distances. That ought to let us easily quantify the genome-wide fraction of mutual information that originated in a given time interval in the past. But we’ll have to remain cautious, since the map distances are worked out under the assumption of neutrality.

    Why multilocus haplotypes are useful

    Why do we care about many loci, when we can test neutrality using only one? Suppose we’re interested in the mutual information across a 500 kb stretch of the genome. We could pick a single marker on each end of the 500 kb, and measure the mutual information between them. That’s basically what I did in the case of drift in the last post, “When genetic drift reduces entropy”.

    That isn’t slicing the sample very finely. Suppose that our two markers both have major alleles at 80 percent. That makes the expected frequency of the most common joint haplotype 64 percent, and the least common 4 percent. In our sample of 120 gametes, we’ll conclude that the two are significantly associated if those values go to around 70 and 10 percent, respectively. That’s like salting the sample with seven or eight copies of the rare haplotype.

    On the other hand, we could salt our sample with more than a dozen copies, evenly split between the two intermediate-frequency haplotypes, and never get a significant result. Even if the expected rare haplotype were completely absent, it wouldn’t be enough to reject the hypothesis of random association, much less the hypothesis of genetic drift. Natural selection can cover its tracks, if two or more linked haplotypes have both increased in frequency.

    Is this likely to be a problem very often? Past some distance, positive selection loses its impact on linked haplotypes, because there’s too much recombination. Near a selected site, linkage is strong enough that there will usually be a single major haplotype hitchhiking upward. In between, there may be minor haplotypes hitchhiking more weakly but we will generally be able to look closer to find a major haplotype. But if there are multiple haplotypes under selection, perhaps because more than one adaptive mutation have occurred, the locus may well look completely neutral.

    An example in human populations may be MC1R. Harding et al. (2000) sequenced the gene in 106 Africans and over 356 Europeans. They found a diverse array of amino acid mutations in Europeans that were absent in their African sample, concluding that purifying selection on skin pigmentation had prevented such mutations from becoming common in Africa. In contrast, purifying selection was relaxed in Europe, allowing several loss-of-function variants to increase in frequency. They found no statistical evidence for positive selection on these loss-of-function variants in Europe. Likewise, no other recent tests for positive selection have shown MC1R to stand out as selected.

    But wait a minute. Where did all those European redheads come from?

    Harding and colleagues found five different coding polymorphisms with frequencies more than 7 percent in Europe, but completely absent in their sample of Africans. That adds up to roughly half the copies of MC1R in Europe, all derived alleles. That kind of emergence of new alleles doesn’t happen by chance, at least not in a population history like Europe’s. But none of the usual tests will reject neutrality when five different alleles have increased under selection. And individually, each of these alleles is too rare to register a rejection of neutrality, at least, if we apply the usual tests.

    What makes this gene so troublesome? There’s no problem estimating the frequency of any single derived allele in Europe, and it’s easy to figure out its extended homozygosity. But if we then compare that to other alleles at the same locus, well, some of them have long haplotypes, too. So none of them have an unusual pattern of LD decay compared to the others.

    What to do? We could try to get a much larger genetic sample, or a large sample of individuals from a part of Europe where one of these alleles is especially common.

    But since I’m lazy, I figure I’ll just consider the mutual information carried by haplotypes of multiple loci. Individually, no single allele may have a frequency high enough to reliably show linkage in a two-locus comparison. Individually, no single allele will reject neutrality by a homozygosity-based test, because the other selected haplotypes decay over very long areas, making the ratio for any single haplotype insignificant. Collectively, the selected haplotypes may carry linkage over long distances. Together they will reject neutrality, even if individually they don’t.

    But first, we need to know the distribution of multilocus mutual information under neutrality.

    Degrees of freedom

    If you can calculate a χ2 and its degrees of freedom, you can skip next four paragraphs.

    In the second post I showed that the mutual information is distributed as a χ2. That’s a statistical distribution with one parameter—the degrees of freedom. The concept of degrees of freedom is one of the trickiest definitions in statistics. Roughly, it means the number of independent factors that contribute to a single observation.

    Let’s take the case of a contingency table, where we have a number of rows, r, and a number of columns, c, where the total number of cells is r ×c. If all the cells were independently distributed, then the number in each cell is expected to be the fraction in its row times the fraction in its column. That’s the expected value. If you take the number you observed in a cell, subtract the number you expected, square that difference, divide it by the expected value, and add up the result for every cell in the table, that’s the χ2 statistic. This statistic is high when the observed numbers are far from the expected ones. It’s low when they are close to expected.

    For a 2 × 2 table, we can ask, how many ways are there for the χ2 statistic to be high? Any one of the cells may have an unusually large number of observations, so we might be tempted to say there are four different ways to get a high value. But it’s obvious that if one value is too high, the value in the same row and the other column must be too low. Likewise, the observed number in the other row and the same column must be too low. And if both those diagonal cells are too low, well then the observed value in the cell diagonal to our high cell must also be high. In other words, all four cells depend on each other. Push one, and you push them all. So there’s really only one axis along which the cells can vary—one degree of freedom.

    With a 3 × 2 table, things are a little different. We can hold one row entirely constant, and just change the proportions in the other two. If one cell has a high observed value, that might be because both of the other rows were shortchanged. Or it might be just a single row that’s low, the other being high as well. That makes two distinct ways that we could get a high χ2 statistic, two degrees of freedom. In general, there are (r - 1)(c - 1) degrees of freedom in a contingency table. It’s always one less than the number of rows or columns because at least one row and column must be shortchanged when a cell is higher than expected.

    Degrees of freedom with comparisons of multiple haplotypes

    From the previous post, you might have anticipated how we could do simulations of drift in samples of multiple markers. For the following, I’m going to examine the mutual information between two sets of three SNP markers. That is, I have a stretch of chromosome roughly 50 kb long from end to end (or more properly, 0.05 cM), with three markers tightly linked at either end. Using the same methods as the previous post, I’m applying the same three-stage population history with genetic drift only to these six markers.

    Here are the results for 10,000 trials:

    PIC

    That doesn’t look very much like the chart that I got in the previous post, using the same population history. Here is that one:

    PIC

    What’s going on? That red line in the second chart is the χ2 distribution with one degree of freedom. The first chart doesn’t match that at all.

    Why not? The second chart shows the mutual information between two markers, with two alleles each. That’s one degree of freedom. The first chart shows the mutual information between two sets of three markers. Three markers give the possibility of eight distinct haplotypes (23). So we have an 8 × 8 contingency table, with 49 degrees of freedom. Here is that distribution, in red, along with the result of our 10,000 simulated samples:

    PIC

    Except, oops, it doesn’t look like 49 degrees of freedom either! What the heck is going on?

    If you’ll remember from the second post in the series, our empirical distribution isn’t going to fit a chi-square unless we have enough observations. Sure, in theory, three markers could give us 8 haplotypes. But in practice, the three markers on each end of our 50 kb region are so tightly linked that we get many fewer haplotypes. Sometimes we get only three or four haplotypes on each end. If each set of three markers were independent of the other, they might fit a chi-square with four degrees of freedom, or nine, or six or twelve. In fact, the mutual information observed in these 10,000 simulations accords with a mixture of distributions. Here are the number of degrees of freedom in the 10,000 simulations, as a histogram:

    PIC

    Most of the trials have six or nine degrees of freedom, with 12 being a substantial proportion as well. No primes, there, except for 2 and 3. Only a tiny fraction of trials have as many as 20 degrees of freedom between the 3-marker sets. None get anywhere close to the theoretical 49, because the three-marker sets are too tightly linked to express all possible haplotypes.

    So if we want a theoretical distribution to match to our simulated one, we are going to need a mixture of different χ2 in proportion to the number represented in the simulated set. Here’s that comparison:

    PIC

    The red line is a mixed distribution, in which different χ2 distributions are blended in the proportions represented by the histogram above. That’s the distribution of mutual information between the two marker sets, conditioned on the haplotypes actually present in each set, under the hypothesis of independence. The simulations lie to the right of this distribution, meaning that small-sample bias and genetic drift have added mutual information to the simulations, compared to the expectation in the absence of linkage.

    The only difference between these simulations and the ones in the previous post is the number of markers. Here, we’re looking at mutual information between two three-marker sets; there, we were looking at mutual information between two one-marker loci. Both examples used the same distance and the same population history. Genetic drift has had a similar effect—in both sets of trials, there is a slight excess of cases with high mutual information. With both, there are around twice as many at the tail as expected if the markers were independent.

    This leads to a couple of conclusions:

    1. Under this population history, genetic drift is not sufficient to cause very large amounts of mutual information to be shared by distant sites.
    2. A slight alteration to the χ2 test—for example, adding some proportion to the critical values—might allow us to test the hypothesis of neutrality for any given case without the necessity of all those simulations.

    If we were to add more and more markers, we would eventually find that the bias in mutual information due to small sample size would get bigger. So there’s some maximum number of markers giving useful information in any given sample. But I just picked 3 markers out of a hat. Adding more markers, up to some point, should increase our ability to see rare haplotypes that might diverge from the expectations of genetic drift.

    Next: Conditional mutual information: finding the haplotypes that explain linkage disequilibrium

    References


       Harding RM, et al. 2000. Evidence for variable selective pressures at MC1R. Am J Hum Genet 66:1351–1361.

  • When genetic drift reduces entropy

    Sun, 2009-03-15 20:09 -- John Hawks

    This is the third in a series on information theory and tests for recent selection. The first post, “Information theory: a short introduction”, covered some of the basics of entropy. The second post, “Information theory and mutual information between genetic loci”, showed that mutual information between independent sites will be distributed as a χ2.

    We tend to think of genetic drift as a random process. Random processes operating repeatedly over time are called “stochastic,” and changes in gene frequency under genetic drift are certainly that.

    Since entropy is a measure of uncertainty, it might seem natural to think that stochastic changes in gene frequency would increase the entropy in a population. After all, the gene frequency in a population under genetic drift will be more and more uncertain over time. So, considering the frequency of a single allele as the system, genetic drift appears to increase entropy over time.

    But even this simple system isn’t quite so simple as it might appear. Sure if you start out knowing the allele frequency, then genetic drift will increase your uncertainty over time. You will become less and less able to say that it lies in any given interval. But what if you don’t start out knowing? What if all you know is that the locus has been subjected to t generations of genetic drift?

    As t increases, the probability of fixation of the locus also increases. The net effect is to reduce the entropy in the system – going from uncertainty about the allele frequency to more and more certainty that it will be either one or zero. The only thing that will stop this process is some other evolutionary force – mutation, migration from other populations, balancing selection. Each of these will have its own distinctive effects on the entropy of the single-locus system.

    Drift and recombination

    We are going to consider a system of two partially linked loci. That means that the two are near enough to each other on a single chromosome that they are usually inherited together, but that a small fraction of individuals will have recombination between the two loci in each generation.

    If the two loci were unlinked, the evolution of allele frequencies at one would have no effect on the other. The mutual information between the two loci would accord with the description in the last post in the series. As described there, the distribution of mutual information estimates taken from different samples of such populations would approximate a chi-square distribution.

    But our two loci will be linked. Genetic drift tends to affect the allele frequencies at the two loci in the same way. Haplotypes consisting of one allele from each locus will increase or decrease under drift. Over time, haplotype frequencies will tend to diverge, some becoming rare, others common. Because the alleles of the two loci are evolving in tandem, the joint entropy of the two-locus system will be lower than expected from the sum of the individual entropies of the two loci. Mutual information will be greater than expected under the hypothesis of independence.

    How much? That depends on the amount of recombination between the two loci and the power of genetic drift. Recombination introduces random noise, by breaking up pairs of alleles and shuffling them. That tends to increase the joint entropy of the two-allele system. The more recombination, the less we can predict the allelic state of one locus from what we observe at the other.

    Genetic drift is more powerful in small populations; less so in large populations. It can also be enhanced by population structure, or diminished by migration.

    So in principle, the reduction in joint entropy from genetic drift opposes the increase in joint entropy from recombination. As the physical distance between the two loci increases, the power of recombination will increase. As the population size increases, the power of genetic drift will decrease. That means we need to know something about local recombination patterns, and something about ancient human demography to get a good estimate of the effects of these processes on human genes.

    But given some knowledge about these things, we should be able to get a pretty accurate idea of the distribution of mutual information that would result from drift alone in ancient human populations.

    Simulating some samples

    These days, the person who wants to investigate the effects of genetic drift on samples of genes has many options. Several well-documented computer programs allow simulations of genetic drift allowing the user to choose parameters like effective population size, recombination rate, marker type and density. Some of these programs use a forward-time approach – that is, starting with a randomized sample and propagating it forward through time. Others use a coalescent approach, which “starts” with a sample of genes taken from the present and propagates their ancestry backward in time. There’s a good diversity to choose from, unlike the old days (10 years ago) when you more or less had to write your own.

    For this series, I’m going to use a package called CoaSim. The package was written by Thomas Mailund and colleagues, and is available under a free software license, including documentation. This software employs a coalescent approach to generate an “ancestral recombination graph” (ARG) among a sample of genes, under a specified model of demographic history. The ARG is a tree of genealogical relationships among the sampled genes, factoring in a history of recombination between them. From the ARG, the software can add markers (such as SNPs or microsatellites), resulting in simulated datasets of sequences that could have evolved under the specified demographic history.

    Similar functionality is available in some other packages. I’m using CoaSim in this case because it includes Python bindings. I use Python for a number of other analytical methods, so this cuts my development time. That’s a faster path from investigation to results.

    So, let’s consider some samples that might result from a long history of genetic drift in a single population. First, I’ll assume that the effective size of my population is 10,000 diploid individuals. I’m going to simulate 10000 samples from populations matching this description. Each sample consists of two SNP markers drawn from 120 gametes – in other words, two copies from each of 60 individuals, with current sample frequencies random between 0.1 and 0.9. If you’ll recall from the last post in the series, we need a decent number of individuals representing each bin for our mutual information estimate to be reasonable. Excluding rare SNPs helps to make sure we have that.

    The mutual information between these SNPs will depend on the rate of recombination between them. To begin with, I’ll assume that the two markers are 50 kilobases apart, and that the rate of recombination is 1cM/Mb, or around 10-8 per base.

    PIC

    The red line represents a chi-square distribution with 1 degree of freedom. That would be the expectation if these two loci were independent, and the histogram is clearly biased to the right compared to that expectation. That means that the mutual information between two linked loci is higher, on average, than that between two unlinked loci. This is a simulation under drift alone, not an empirical distribution. Drift reduces the joint entropy of two linked loci more than it reduces the individual entropy of each locus.

    How much mutual information are we talking about? From the last post, we know that twice the sample mutual information, measured in nats, is approximated by a chi-square. Here on the high end, we have a value of 50, which corresponds to 25 nats per sample. In our sample of 120 gametes, that is around 0.2 nats per gamete, or around 0.3 bits per gamete. That’s not a remarkable amount – it’s about the entropy of a single biallelic locus with allele frequencies of 0.95 and 0.05.

    Now, what happens if we keep recombination the same, but reduce the strength of genetic drift? The following chart shows what happens if I assume an effective size of 100,000, with the rest of the parameters kept the same – 50 kb, 120 chromosomes, frequencies between 0.1 and 0.9.

    PIC

    Here, the histogram is much closer to the expected chi-square distribution under independence. There’s still a slight bias upward, and out of 10,000 samples, a few that have rather high mutual information values. But the effect of genetic drift in this scenario is very slight compared to the case where N = 10000.

    As it turns out, the effect of a tenfold larger population is precisely the same as if I kept population size the same and moved the loci ten times further apart. The mutual information scales with the product of recombination and effective population size. So two loci 500 kb apart in an effective population of 10,000 would on expectation have the same mutual information as two loci 50 kb apart in an effective population of 100,000. Another way of putting it: By reducing the coalescent probability tenfold, recombination has ten times the opportunity to scramble associations between the two loci.

    When population size expands

    Human populations have grown by more than a thousand-fold during the last 40,000 years. Compared to the population of 40,000 years ago, today’s human population should have markedly less mutual information between linked genetic loci. If we want to find out how much, we’ll have to have a model of the recent human demography.

    Here, I’ve done some simulations with changing population size. First, let’s assume that we have two loci 50 kb apart. I’m going to start with a very simplistic model of human demographic history. This model has three stages, or “epochs:”

    1. For the last 400 generations, the effective size is 1,000,000 individuals.
    2. Before that, up to 800 generations ago, the effective size is 100,000 individuals.
    3. Before 800 generations ago, the effective size is 10,000 individuals.

    This is both simplistic and conservative. It’s basically assigning a size of a million to the post-agriculture population, and 100,000 to the population after the Last Glacial Maximum. We know that those early values are way too small for the African population, which did not markedly contract during the LGM and has had an effective size up over 30,000 for at least the last 100,000 years. And obviously, the effective population for the past 10,000 years has been much higher than a million. The European population may have had more of a bottleneck at the LGM, but today’s Europeans have substantial ancestry in West Asia, so 10,000 is conservative there as well. The point is not that these values are realistic; it is that they are almost certainly too small.

    I’ve picked the simple scenario to make it a little more transparent what is going on with the results. Here they are:

    PIC

    With this population history, two loci 50 kb apart have nearly the same distribution of mutual information as two independent loci. There’s a slight bias toward higher values, but very little. Out at the tail, it looks like there are around twice as many loci as predicted by the χ2 approximation, and this is still a very small number.

    I thought it might be useful to consider what the mutual information would be under the same population history for two loci that are closer together. Here’s the same scenario, except the loci are 5 kb apart:

    PIC

    With the markers ten times closer here, there is only a tenth as much recombination between them. That makes the result a lot more like the first case above, where the markers were 50 kb apart and the effective size was 10,000 back to infinity. Markers 5 kb apart should have a good chance of sharing significant mutual information under drift alone in a human-like population history. However, as in the first case above, the mutual information generated by drift is still fairly slight.

    Growth and recombination

    Let’s step back and consider what’s going on with these examples. Mutual information is determined by the balance between recombination and coalescence. Increase the effective size, and you reduce the probability of coalescence between lineages. That gives recombination more of a chance to scramble the loci, eliminating mutual information between them.

    A human-like population history has a very large effective size within the last several thousand years. The large effective size reduces the probability of coalescence down to almost zero across that time period. So the ancestry of almost every gene copy has more than 400—and often up to 1000—generations all to itself. If recombination is ticking along at 1cM/Mb, then physical distances out to around 50 kb start to saturate for recombinants across the last 20,000 years.

    Consider: two lineages that coalesce 400 generations in the past are separated by 800 generations of time. The two lineages were initially identical. Our two markers are 50 kb apart, given them a per-generation probability of recombination of 0.0005. The probability that neither undergoes recombination during those 400 generations is:

    (1- 0.0005)800 = 0.67

    Two thirds remain the same; one third of pairs will have recombined with other lineages. Half of pairs that coalesce 700 generations ago will have recombined. In contrast, after 700 generations, only seven percent of pairs separated by 5 kb will have recombined. Pairs that remain identical carry mutual information; each pair that recombines loses this mutual information.

    Under drift alone, the mutual information builds slowly. From the first chart we know that the mutual information at this physical distance in a small population will be significant, but not very high. Most of this information is still retained in the samples represented by the last chart, where they are separated by only 5 kb. But over longer distance this mutual information dissipates rapidly in a growing population.

    So if we observe substantial mutual information between loci in today’s populations, we need some explanation other than genetic drift. Selection is one possible explanation—but to consider that more closely, we will want to consider other information theoretic features of gene samples.

    Next: Extending to strings of loci

Subscribe to recombination

Neandertals

For years, I've worked on their bones. Now I'm working on their genes. Read more about the science studying these ancient people.

Denisova

From a finger bone of an ancient human came the record of a completely unexpected population. My lab is working on the science of the Denisova genome.

Acceleration

The advent of agriculture caused natural selection to speed up greatly in humans. We're uncovering some of the ways that populations have rapidly changed during the last 10,000 years.

Malapa

Just outside Johannesburg, the Malapa site is producing some of the most exciting finds in human evolution. This site is the headquarters of the Malapa Soft Tissue Project.