john hawks weblog

paleoanthropology, genetics and evolution

simulations

  • Quote: Peter Turchin on the "bugbear" of randomness

    Sun, 2009-08-30 16:18 -- John Hawks

    I'll probably have some more material on quantitative analysis of dispersal in the few days. Here's a quote from Peter Turchin (1998:17-18):

    Of course, we do not know that animals truly move at random, like flipping coins to decide whether to turn right or left. Each individual could be a perfect automaton, rigidly reacting to environmental cues and its internatl states in accordance with some set of behavioral rules. However, even if this were true, we might still choose to model behavior of such animals stochastically, because we would not have the perfect knowledge of all the deterministic rules driving these animals. Even if we did, we might not want to include them all in our dispersal model, since such a model would have an enormous number of parameters and would require a very accurate representation of all environmental "micro-cues." The point is that randomness is a modeling convention. Because it is impractical, and not even helpful, to attempt to model individual movement deterministically, we use a more parsimonious probabilistic model.

    I'm pausing the quote to point out my boldface. It has become computationally feasible in the last few years to model enormously complicated scenarios with individuals acting pseudo-deterministically. The most popular use of such modeling is to try to constrain dispersal models by some geographic conditions, such as local habitat richness, rainfall, or altitude (see also, "One model, hold the extra parameters"). Of course, animals really do disperse in ways that depend on such geographic parameters. The question is whether any datasets are sufficient to test models involving so many parameters.

    This approach is aptly termed behavioral minimalism (Lima and Zollner 1996). In essence, we adopt a thermodynamic approach: the behavior of individuals is erratic, or irregular, but the redistibution process at the population level has many regular features. There is a direct analogy with with thermodynamic theory. The motion of each gas molecule is chaotic and essentially unpredictable, and can only be described probabilistically. When dealing with large numbers of molecules, however, the laws at the aggregate level are for all intents and purposes deterministic. Similarly, the problem of biological dispersal can be treated by starting with a probabilistic description of individual movements (in other words, formulating the problem as a random walk), and then approximating the redistribution process of the ensemble of individuals with a deterministic equation, diffusion.

    The effective scale of stochastic versus deterministic processes is important. I'm chiefly interested in the dispersal of adaptive genes in human populations, for which the deterministic approximation may be considered to have become more and more relevant over time, as the population sizes of regional populations grew. Still, the present pattern in many cases may reflect the stochasticity of populations from earlier time periods, when they were smaller. And formerly important deterministic processes, such as the adoption of agriculture, may no longer be directly observable. So how do we model variance?

    The thermodynamic approach to dispersal does not have to assume that the movement of each "particle" is completely random. The important feature of this approach is that we can control the degree of realism in the model. Environmental factors that have strong effects on movement can be included explicitly in the model, while other factors that have weak effects (or about which we have no information) are included in the stochastic component.

    This would incorporate the geographic modeling approaches mentioned above -- deterministic processes related to spatial variance of habitat or dispersal potential. But then the important step must be to find a minimal deterministic model to account for the data, and then test it with other observations -- such as more extensive genetic sampling, archaeological information, or historical documentation.

    References:

    Turchin P. 1998. Quantitative Analysis of Movement. Sinauer, Sunderland MA.

  • Simulations bubbling like a stew

    Thu, 2009-04-16 11:40 -- John Hawks

    Peter Turchin writes very effectively about quantitative modeling and analytical methods in biology. So every so often I like to post an illuminative quote. Here's his description of maximum likelihood estimation, from Quantitative Analysis of Movement:

    Simpler, more direct analyses may make unwarranted assumptions, but they are better at revealing important patterns in the data, and their results can suggest what variables and functional forms to use in the modeling of data. Eventually, however, direct methods of analysis get beyond the bounds of their competence. The general approach discussed in this section can in principle estimate parameters of any model, given infinite amounts of informative data and infinite computer power.

    The basic approach is to construct a detailed simulation model (better even, a series of models) and fit it to the data using nonlinear estimation techniques. Jon Schnute colorfully describes a detailed simulation as a "stew" of calculations from which observable quantities (to be compared with the actual data) bubble up to the surface (quoted from Hilborn and Mangel 1997). Nonlinear estimation is the process of adjusting the parameters of the stew (adding more or less salt, increasing or decreasing temperature, etc.) until the stuff that bubbles up resembles the actual data the best. The crudest approach is to change parameters in the simulation by the method of trail [sic] and error and to compare the simulation results to data by eye. A more refined approach is to use some quantitative measure of goodness of fit and a nonlinear minimization routine to search for the best fit automatically (Turchin 1998:295).

    The quote has some relevance to yesterday's discussion of the Neandertal population structure paper. I'm philosophically reluctant to turn to simulations until I exhaust my analytical options. This is a matter of trusting myself -- if I really had a lot of confidence in my ability to choose the right assumptions to underlie my simulations, I might turn to them first. But assumptions are tricky. Analytical models have their own assumptions, but those have the advantage of transparency -- I didn't pick them, they are fundamental to the models.

    Still, in some cases it doesn't take long to exhaust the analytical options. So we let the observable quantities "bubble to the surface" of simulations.

  • The utility of theoretical models

    Tue, 2008-10-21 16:33 -- John Hawks

    I'm reading through Peter Turchin's 1998 book, Quantitative Analysis of Movement, for a project I'm working on. I found that his second chapter gives a very nice introduction to the reasons why biology depends on formal mathematical models. This is a topic I often review in my courses, so I'll quote some of his discussion.

    He lists six objectives for model-building on pp. 33-35, each with some explanatory text. This amounts to a paragraph or so for each reason; I'm only giving one or two sentences of each, with much omitted.

    Formal statement of the problem ...The necessity of stating the assumptions of the model is another benefit. A mathematical description of a problem forces one to be very clear about what the different variables and parameters in the model are, and how they are interrelated.

    Identifying knowledge gaps ...It may turn out that good quantitative data are available to estimate some functions and parameters but not others, immediately suggesting a focus for the empirical program. When there are many gaps, one has to decide which parameters need to be estimated precisely, and for which parameters ``guesstimates'' will do....

    Gaining theoretical insights There is a large class of models that are never intended to be directly confronted with data.... The purpose of such models is to gain insights into possible causal interconnections between various factors and, in general, extend our intuition...

    Quantitative tests of theory ...A qualitative prediction allows one to test the theory that generated it, but it does not provide a very strong test. Because there are only a few possible outcomes in a qualitative situation (e.g., factor X will either increase, stay the same, or decrease), the probability that the ``correct'' outcome will happen by chance is correspondingly high. A quantitative prediction, on the other hand, can be a much stronger test of the theory, because it will not only say that X will increase, but how much...

    Interpreting the data Sometimes an investigator is motivated not by a desire to test general theory, but by the necessity of measuring some specific quantity [that would be impossible to measure directly]...

    Forecasting and prediction ...Forecasting is weaker than prediction, and uses the knowledge of the past behavior of the system to forecast its future state. Forecasting does not necessarily require an in-depth understanding of the system's dynamics, and can be done at the phenomenological level. However, forecasting will most likely fail if the system's dynamics change. I use prediction in its strongest sense: that is, to predict a situation that was not encountered in the past. For example, it may be necessary to generate predictions about how a system's behavior will change as a result of a certain human intervention. Prediction, in general, requires a mechanistic understanding of the system....

    I especially appreciate the point about quantitative tests --- one that has eluded many paleontologists who are content with categorical statements that are essentially untestable, because they only assert that something should happen ``regularly'' or ``more often'' than something else.

    Also, the final point, about forecasting and prediction, is valuable -- although perhaps idiosyncratic, as I have not seen that distinction made elsewhere. Still, it applies far beyond theoretical biology and into historical science generally. If we consider our state of knowledge about climate change in response to human activity, clearly this is an example where the distinction between forecasting and prediction is relevant. We can have confidence in a prediction only if it entails a suitable understanding of the mechanisms of change in the system, whereas forecasting is accurate only to the extent that we can depend on a uniformitarian assumption -- that the conditions observed in the past followed the same mechanistic relations that will be relevant to the future.

    I tend to lecture about genetic models, for which there is a great value in simplicity (point 3), but which may require quite complicated extensions to handle reasonable biological populations (point 2). In that connection, some reasonable people go to extremes of interpretation -- sometimes claiming that the data necessitate some assumption on the basis of a very simplified model, and in other cases claiming that no model can apply to the complex history of the population. It is our task (my task) to determine which factors are important and conceivably affect results, and which will always be too weak to influence the interpretation of the data (point 1). And the end will often be to discover evidence for values in past human populations for which we have no direct means of estimating aside from genetic variation (point 5).

    References:

    Turchin P. 1998. Quantitative analysis of movement. Sinauer Associates, Sunderland MA.

  • Some genetic drift graphs with Mathematica

    Mon, 2008-09-15 00:16 -- John Hawks

    The first thing to come up in my lectures is genetic drift. Pretty much everyone who lectures about drift needs a figure showing the results of simple Monte Carlo simulations of sampling drift in a finite population. You start a population with two alleles, sample it randomly in each generation until one or the other alleles disappears. I tend to start with a "population size" of 1000 gene copies, and an initial allele frequency of 50%.

    We can do this kind of simulation in Mathematica pretty easily. We'll work with three variables:

    • popSize, which we'll set equal to 1000;
    • a, the number of gene copies with one of the two alleles, initially equal to 500;
    • frequencyList, which will hold the value of a for each generation. Initially, frequencyList holds the first value of a, 500.

    Now, we're ready to code the simulation. We set the three variables, and then set up a While[] loop that samples a random binomial deviate in each generation, based on the previous generation's allele frequency (a/popSize):

    a = N[500];
    popSize = 1000;
    frequencyList = {a};
    While[a 0,
    a = N[RandomInteger[BinomialDistribution[popSize, a/popSize]]];
    AppendTo[frequencyList, a]]

    The last line appends the value a to the list. Nesting BinomialDistribution[] inside RandomInteger[] gives us a binomial deviate, based on popSize trials and frequency a/popSize. The loop executes until a reaches either zero or 1000, at which point the simulation stops.

    OK, that gives us a list, frequencyList, which contains the number of gene copies with one of the two alleles over time. Now, we want to plot that list. We can try:

    ListPlot[
    frequencyList,
    PlotJoined -> True]

    ...which gives us:

    Monte Carlo simulation of genetic drift

    That's servicable, but a bit simple. And it's confusing where the axis cuts across. It would be better to have the plot bounded at 0 and 1000, the min and max possible. Also, we need a better font than Times.

    Let's try:

    ListPlot[
    {frequencyList},
    PlotJoined -> True,
    Frame -> True,
    AxesOrigin -> {0, 0},
    PlotStyle -> Thick,
    FrameLabel -> {"Time (generations)", Frequency},
    TextStyle -> {FontFamily -> "Myriad Pro", FontSize -> 12},
    Filling -> 500]
    Monte Carlo simulation of genetic drift

    That's a bit more like it. The Filling -> 500 gives us shading everywhere between our frequency line and the 500 mark, a nice cue reminding us where the simulation started.

    Now, we can add a second run in the same plot:

    Monte Carlo simulation of genetic drift, two populations

    Oh, well that shifted the x-axis. No harm, though. And we continue...

    Monte Carlo simulation of genetic drift, two populations

    Not a bad representation of the process, with fixation times ranging from very short to quite long, and one going to fixation at zero. The different color shading tends to confuse the picture a bit, and doing it in a larger size for the projector will take some tweaks, but so far it's much more attractive than the Excel version.

    We could run a few more simulations to substitute, just in case one made the overall picture more clear (for example, by moving the lines apart in the early phase). But here we have the benefit of honesty --- these are the first four sims I ran.

    UPDATE (2009-09-13): I'm revisiting this post as I work on a Mathematica Demonstration of genetic drift. It's much simpler to implement the central drift algorithm with a NestList or a NestWhileList. For example:

    NestList[RandomInteger[BinomialDistribution[1000, #/1000]] &, 500, 4000]

    gives you the trajectory over 4000 generations. This:

    NestWhileList[
    RandomInteger[
    BinomialDistribution[1000, #/1000]] &, 500, # 0 &]

    replicates the output above.

Subscribe to simulations

Neandertals

For years, I've worked on their bones. Now I'm working on their genes. Read more about the science studying these ancient people.

Denisova

From a finger bone of an ancient human came the record of a completely unexpected population. My lab is working on the science of the Denisova genome.

Acceleration

The advent of agriculture caused natural selection to speed up greatly in humans. We're uncovering some of the ways that populations have rapidly changed during the last 10,000 years.

Malapa

Just outside Johannesburg, the Malapa site is producing some of the most exciting finds in human evolution. This site is the headquarters of the Malapa Soft Tissue Project.