No brain expansion in Australopithecus boisei

The endocranial volumes estimated for late Australopithecus boisei specimens (e.g., after 1.8 Ma) are larger than those of earlier specimens. Elton et al (2001) Elton:2001 found that this trend is statistically significant, arguing for the evolution of larger brains over time. Such a trend bears on the ecology and social behavior of A. boisei, and casts some doubt on the idea that brain size evolution in early Homo was exceptional Elton:2001.

But the A. boisei sample has some unusual aspects that may complicate the test of a trend. One question is whether the early KNM-WT 17000 specimen represents A. boisei or another species (possibly Australopithecus aethiopicus). Another question arises from the very small variation of estimated endocranial volumes in the A. boisei sample. Even including the small KNM-WT 17000 volume estimate, the coefficient of variation (CV) in the sample examined by Elton et al (2001) Elton:2001 is only 8.2 percent. Excluding KNM-WT 17000, the within-sample CV is 6.8 percent. By comparison, Tobias (1971) Tobias:brain:1971 reported data on endocranial volumes of hominoids. The great ape CVs are 9.7 percent for chimpanzees, 10.9 percent for orangutans, and 13.1 percent for gorillas. According to these estimates, A. boisei had less variation than any living hominoid, even though its craniodental variation was as great as that of gorillas or orangutans Silverman:2001.
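As a point of reference for the CV figures quoted above, the following is a minimal sketch of the calculation. It assumes the sample standard deviation (n - 1 denominator); the cited sources may have used a different convention. The function name and the example volumes are hypothetical and are not taken from any of the cited samples.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV as a percentage: 100 * sample SD / mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return 100.0 * sd / mean


# Hypothetical endocranial volumes in ml, for illustration only.
example_volumes = [410, 450, 480, 500, 520]
print(f"CV = {coefficient_of_variation(example_volumes):.1f} percent")
```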

There are several possible interpretations for the low variation of the A. boisei sample: (1) A. boisei actually had very low size dimorphism; (2) its endocranial variation has been greatly undersampled; or (3) the sample has been biased by estimation error. Other characters of the A. boisei sample show extensive variability compared to extant hominoids Silverman:2001, so monomorphism for this species seems unlikely. Low sample variance is a special concern because estimation error might lead to false positive results in a test of trend.

Here, I conduct three new tests of the null hypothesis of stasis of endocranial volume in A. boisei. These tests explore the effect of estimation error on the appearance of a trend in the sample, as well as the effect of low sample variation and small sample size. None of these tests finds a statistically significant trend in the sample.

Materials and methods

Fossil specimens

Estimating endocranial volume can be challenging even for relatively complete specimens, considering the subtle distortion exhibited by many fossils. For more fragmentary cranial remains, the estimation of endocranial volume requires not only the correction of distortions but also the reconstruction of missing portions.

Figure 1. Endocranial volume estimates for specimens of A. boisei plotted against time. The sample is that used in this study, excluding Omo 323.

The eleven cranial specimens of Australopithecus boisei listed below vary in their completeness and preservation of relevant anatomy. There is no explicit way of statistically controlling for error in the estimation of endocranial volume, considering the diversity of methods of reconstruction. In several cases, different workers have provided competing estimates. For less complete specimens, choosing one estimate above another must involve a close critique of anatomical details. The following list reviews the anatomical condition of each of these specimens. It is not an exhaustive list of volume estimates, but focuses on the range between credible extremes for the more disputed specimens. This gives an impression of the boundary conditions for measurement accuracy for each specimen.

  1. KNM-WT 17000 is a well preserved skull with relatively small vault fragments missing. Walker et al. (1986) wt17000 estimated the volume as 410 ml.
  2. Omo L338-y6 is a juvenile cranium of uncertain age. Holloway (1981) Holloway:Omo:1981 estimated its volume at 427 ml. Elton et al (2001) Elton:2001 estimated an adult volume 4% higher, or 444 ml.
  3. The Omo 323-1976-896 cranial remains are exceedingly fragmentary. One side of the posterior cranial base is preserved, allowing a relatively good estimate of the posterior endocast breadth. The preserved frontal and parietal elements do not join with each other or the temporal; their small size and unknown positions do not allow an accurate estimate of endocast volume. Brown et al. (1993) Brown:1993 reported an estimate of "about 490" based on similarity with the 491 ml KNM-ER 23000. Falk et al (2000) Falk:2000 considered it too fragmentary for an accurate estimate. I concur; the available estimate cannot be considered independent of other endocasts on which it may have been based.
  4. KNM-WT 17400 preserves only the anterior third of the endocast, consisting mainly of the frontal lobes. Brown et al. (1993) Brown:1993 gave an estimate of 500 ml by modeling missing portions after the more complete KNM-ER 23000, but Holloway (1988) Holloway:robust:1988 put the volume between 390 and 400 ml, and Falk et al. (2000) Falk:2000 adopted an estimate of 390 ml.
  5. OH 5 has good preservation of the endocast, but an uncertain join between the anterior and posterior portions of the vault. This discontinuity has caused a disparity in estimates of its volume, including a low 500 ml estimate by Falk et al (2000) Falk:2000 and a high 530 ml estimate by Tobias (1963) Tobias:1963. The range of estimates on this well-preserved specimen covers nearly a quarter of the range of variation cited for A. boisei as a whole.
  6. KNM-ER 13750 preserves only the superior vault, accounting for under half of the total endocranial contour. The range of estimates provided by Falk et al (2000) Falk:2000, from 450 to 480 ml, again covers roughly a quarter of the range attributable to the species. Brown et al. (1993) Brown:1993 reported a higher estimate of 500 ml.
  7. KNM-ER 23000 is a nearly complete vault missing the midline cranial base. Its endocranial volume of 491 ml Brown:1993 may be the most accurate assigned to A. boisei.
  8. KNM-ER 406 is also well-preserved Wood:monograph:1991. Its volume estimate of 525 ml is uncontroversial Holloway:1988.
  9. KNM-ER 407 is missing several vault sections including those enclosing the frontal lobe. Holloway (1988) Holloway:1988 estimated the volume at 510 ml; Falk et al (2000) Falk:2000 prepared a new reconstruction with a volume estimate of 438 ml. The difference between these two estimates covers nearly 50 percent of the total range of the sample.
  10. KNM-ER 732 has good preservation of the left side of the vault, but is not complete across the rear of the cranium or basicranium, making a mirror reconstruction problematic. Holloway (1988) Holloway:robust:1988 estimated the endocast volume at 500 ml; Falk et al (2000) Falk:2000 at 466 ml.
  11. KGA 10-525 lacks most of the frontal and anterior cranial base. Suwa et al. (1997) Suwa:1997 estimated its volume at 545 ml.

The damaged or missing frontals of many specimens have added to ambiguity about their reconstructed volume. Robust endocasts that preserve this region, such as KNM-WT 17400, differ in their anatomy from other taxa, especially early Homo. Falk et al (2000) Falk:2000 reconstructed specimens with missing or incomplete frontal endocasts using more complete robust australopithecine endocasts as models; this resulted in substantially smaller endocranial estimates for OH 5, KNM-ER 732 and KNM-ER 407.

Tests of temporal trends

Most A. boisei specimens with EV estimates date to the approximate center of the species' temporal span. The reason for the appearance of a trend is quite clear: there is little variation in the center of the species' temporal range; the latest two specimens are also the two largest; the earliest two specimens include two of the three smallest (Figure 1).

A test of a temporal trend might be conducted in several ways. A simple linear regression of endocranial volumes against time will test for a trend, but may be confounded by small numbers of specimens at early and late temporal extremes. Testing for a difference in means among temporal subsamples may address this problem. Treating each specimen as its own temporal subsample yields Spearman's rank-order correlation (ρ), which Elton et al (2001) Elton:2001 reported as significant for their sample of A. boisei EV estimates.
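For readers who want to reproduce a rank-order correlation of this kind, a minimal sketch follows. It computes Spearman's ρ as the product-moment correlation of ranks, with tied values given the mean of the ranks they span. The function names and the example ages and volumes are hypothetical; this is not the code used in any of the cited studies.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: the product-moment correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ages (Ma) and volumes (ml), for illustration only.
example_ages = [2.5, 2.3, 2.1, 1.9, 1.7, 1.4]
example_volumes = [410, 440, 470, 455, 500, 545]
# Ages are negated so that a positive rho means increase toward the present.
print(spearman_rho([-a for a in example_ages], example_volumes))
```

Negating the ages is only a convention for the sign of ρ; the magnitude and any significance test are unaffected.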

Also, following Leigh (1992) Leigh:1992 and Konigsberg (1990) Konigsberg:1990, Elton et al (2001) Elton:2001 applied the "Hubert test" Hubert:1985, sometimes simply called the "Gamma" (Γ) test Lockwood:2000 Wood:stasis:1994. This test is a randomization test of association of one continuous and one ranked variable, involving four steps:

  1. The age of each specimen is converted to a rank within the sample. For a two-tailed significance test, ranks are standardized with a mean of zero.
  2. The endocranial volume of each specimen is multiplied by its temporal rank, and all the values thus obtained are summed. This is equivalent to calculating the dot product of a vector of endocranial volumes with a vector of ranks.
  3. The sample is reordered at random an arbitrarily large number of times, each time obtaining the dot product of endocranial volume and rank vectors.
  4. The statistic Γ is estimated to be (M+1)/(N+1), where M is the number of permutations with dot products greater than or equal to that of the observed sample, and N is the number of permutations examined. A Γ ≤ 0.05 is taken as a significant rejection of the null hypothesis of no trend.
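The four steps above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code of any cited study: ranks run from oldest to youngest and are centered on zero, tied ages share a mean rank, and the two-tailed option compares absolute dot products. The function names, the seed parameter, and the example data are hypothetical.

```python
import random


def zero_centered_ranks(ages_ma):
    """Rank specimens from oldest (lowest) to youngest, centered on zero.

    Specimens of equal age share the mean of the ranks they span.
    """
    ranks = []
    for age in ages_ma:
        older = sum(1 for other in ages_ma if other > age)
        same = sum(1 for other in ages_ma if other == age)
        ranks.append(older + (same + 1) / 2.0)
    mean_rank = sum(ranks) / len(ranks)
    return [r - mean_rank for r in ranks]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hubert_test(volumes, ages_ma, n_perm=10000, two_tailed=True, seed=0):
    """Return (M + 1) / (N + 1) for a permutation test of trend."""
    rng = random.Random(seed)
    ranks = zero_centered_ranks(ages_ma)   # step 1
    observed = dot(volumes, ranks)         # step 2
    shuffled = list(volumes)
    m = 0
    for _ in range(n_perm):                # step 3
        rng.shuffle(shuffled)
        stat = dot(shuffled, ranks)
        if two_tailed:
            extreme = abs(stat) >= abs(observed)
        else:
            extreme = stat >= observed
        if extreme:
            m += 1
    return (m + 1) / (n_perm + 1)          # step 4


# Hypothetical ages (Ma) and volumes (ml), for illustration only.
example_ages = [2.5, 2.3, 2.1, 1.9, 1.9, 1.7, 1.5, 1.4]
example_volumes = [410, 440, 470, 455, 500, 480, 520, 545]
print(hubert_test(example_volumes, example_ages, n_perm=2000))
```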

It is perhaps of interest that although the Hubert test uses the dot product of the two vectors, the use of the product-moment correlation yields precisely the same Γ (shown in Appendix). Samples for which the dot product shows a significant trend are samples that have significant correlations between EV and temporal ranks. This suggests a weakness of the test, since a correlation is a measure not of change over time, but of fit to a linear model. A sample may have a significant correlation with very little change, if its variance is also very low. Hence, the interpretation of the test depends on whether the variance is biologically realistic. Since A. boisei appears to be relatively invariant in endocranial volume compared to sexually dimorphic hominoids, the test might be confounded by error in the sample of EV estimates.

The Hubert test has been applied in the anthropological literature in two partially incompatible ways. As applied by Konigsberg (1990) Konigsberg:1990, following Hubert (1985) Hubert:1985, the vector of temporal ranks is centered on zero (i.e., the values are ... -2, -1, 0, 1, 2 ...). But as applied by Leigh (1992) Leigh:1992 and Elton et al (2001) Elton:2001, the temporal ranks are simple ordinal ranks (i.e., 1, 2, 3, ...). These two alternatives are mathematically equivalent for performing a one-tailed test. But while the first alternative (zero-centered ranks) readily admits a two-tailed test, the second alternative requires a bit more algorithmic complexity for a two-tailed test. Elton et al (2001) Elton:2001 and Leigh (1992) Leigh:1992 did not report whether their tests are one- or two-tailed; following the procedures they described will result in a one-tailed test. Wood et al. (1994) Wood:stasis:1994 also applied the Hubert test to test for trends in dental characters of A. boisei, citing Leigh (1992) Leigh:1992; these authors also did not specify whether they performed one-tailed or two-tailed tests. Lockwood (2000) Lockwood:2000 employed the Hubert test (there called the Γ statistic), and explicitly described a two-tailed approach. One-tailed tests ignore the strength of any negative associations in the permuted samples, and therefore lead to incorrect assessments of statistical significance. The current study applies only two-tailed tests of the null hypothesis of no trend.
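The equivalence of the two rank conventions for a one-tailed test can be made explicit with a short derivation. The notation here is introduced only for illustration: v is the vector of endocranial volumes, r_i are ordinal ranks with mean (n+1)/2, c_i are the corresponding zero-centered ranks, and π is any permutation of the specimens.

```latex
\begin{align*}
r_i &= c_i + \frac{n+1}{2}, \qquad \sum_{i=1}^{n} c_i = 0, \\
\sum_{i=1}^{n} v_{\pi(i)}\, r_i
  &= \sum_{i=1}^{n} v_{\pi(i)}\, c_i + \frac{n+1}{2} \sum_{i=1}^{n} v_i .
\end{align*}
```

The last term is the same for every permutation, so ordinal and centered ranks order the permuted dot products identically, and a one-tailed count of permutations at or above the observed value is unchanged. A two-tailed count based on absolute values is not preserved under this shift, because with ordinal ranks the permutation distribution is centered on that constant rather than on zero. The constant would have to be subtracted first, which appears to be the extra algorithmic step referred to above.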

Test 1: Lower estimate for KNM-WT 17400

Falk et al (2000) Falk:2000 argued that smaller estimates are more accurate for several robust australopithecine specimens, and the smaller estimates were generally used by Elton et al (2001) Elton:2001. One exception is KNM-WT 17400, for which Elton et al (2001) Elton:2001 used the highest estimate of 500 ml Brown:1993, even though both Holloway (1988) Holloway:robust:1988 and Falk et al (2000) Falk:2000 adopted much lower estimates, between 390 and 400 ml. This smaller estimate would make KNM-WT 17400 the smallest member of the sample. A small size for this specimen at the center of the species' time range increases overall sample variability and decreases the relative contribution of early specimens to that variability. This makes KNM-WT 17400 very important to any test of a trend.

As a preliminary step, I recalculated Spearman's ρ and the Hubert test statistic Γ for the sample of Elton et al (2001) Elton:2001, using the smaller 390 ml estimate for KNM-WT 17400. This replicates the methods of that study, except for the change in size of the single KNM-WT 17400 specimen.

Test 2: Model-based simulation values

A difficulty of the A. boisei sample is the non-independence of estimates. Less complete specimens have been reconstructed using explicit information from more complete endocasts, chiefly Sts 5 and OH 5. The sample should therefore have reduced variation compared to a sample of intact crania. A reduced variance may increase the chance that a null hypothesis of stasis will be falsely rejected. This is a context in which randomization tests are potentially invalid: they do not assume a statistical distribution, but they do assume independence.

An additional aspect of the problem is that the state of preservation of fossils may be autocorrelated with time. In the present sample, the early and late specimens are relatively complete, while the middle of the time range is dominated by incomplete specimens. This situation arises frequently in paleontology, because species abundance is often highest at the center of a species' temporal range. Early and late specimens will be more likely attributed to a species if their anatomy is unambiguous --- which is more likely if they are more complete. Early or late specimens may be represented at different fossil localities than the majority of specimens, again requiring more complete specimens for confident assignment. In a Holocene context, specimens are likely to be more fragmentary and rarer earlier in time. These situations present the possibility of finding spurious trends due to differential preservation.

To attempt to correct for these issues, it is necessary to employ tests that rely on an explicit model of sample variability, instead of randomization of the sample values themselves. A simple model-based test replaces the sample EV estimates with new random deviates from a normal distribution. A normal distribution takes two parameters: the population mean and standard deviation. Deviates drawn from this distribution are independent; an arbitrary number of simulated samples may be obtained by repeatedly drawing new values to replace the sample values.

Here, the model-based sampling technique was used to generate samples with the same temporal ranks as the observed data, but with new EV values. In cases where the observed sample has two specimens of the same date, two specimens in all simulated samples were assigned the same temporal rank. The observed A. boisei sample has two such pairs of specimens. As in the Hubert test, the computer generated an arbitrarily large number of simulated samples (in this study, 100,000). The dot product of EV and temporal rank vectors in each simulated sample is compared to the dot product of the observed sample. The significance measure is taken as (M+1)/(N+1), where N is the number of simulated samples, and M is the number of those samples in which the absolute value of the dot product is more extreme than the observed value. This is a two-tailed test of the null hypothesis of no trend. I refer to the test below as the "model-based Hubert test."
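A minimal sketch of such a model-based test is given below, assuming independent normal deviates and zero-centered temporal ranks in which tied ages share a mean rank. The function names, the default number of simulations, the seed parameter, and the example data are hypothetical; the sketch illustrates the procedure described here rather than reproducing the original program.

```python
import random


def zero_centered_ranks(ages_ma):
    """Rank specimens from oldest to youngest, centered on zero.

    Specimens of equal age share the mean of the ranks they span.
    """
    ranks = []
    for age in ages_ma:
        older = sum(1 for other in ages_ma if other > age)
        same = sum(1 for other in ages_ma if other == age)
        ranks.append(older + (same + 1) / 2.0)
    mean_rank = sum(ranks) / len(ranks)
    return [r - mean_rank for r in ranks]


def model_based_hubert(volumes, ages_ma, mean, sd, n_sim=20000, seed=0):
    """Two-tailed (M + 1) / (N + 1) using simulated normal EV values."""
    rng = random.Random(seed)
    ranks = zero_centered_ranks(ages_ma)
    observed = abs(sum(v * r for v, r in zip(volumes, ranks)))
    m = 0
    for _ in range(n_sim):
        # Each simulated specimen is an independent draw from the model.
        stat = sum(rng.gauss(mean, sd) * r for r in ranks)
        # Exact ties have probability zero for continuous draws, so ">="
        # and ">" agree in practice.
        if abs(stat) >= observed:
            m += 1
    return (m + 1) / (n_sim + 1)


# Hypothetical ages (Ma) and volumes (ml), for illustration only.
example_ages = [2.5, 2.3, 2.1, 1.9, 1.9, 1.7, 1.5, 1.4]
example_volumes = [410, 440, 470, 455, 500, 480, 520, 545]
print(model_based_hubert(example_volumes, example_ages, mean=478.0, sd=45.0))
```

Because the centered ranks sum to zero, the model mean cancels out of the dot product, so in this sketch the result depends on the model only through its standard deviation.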

This test was applied to the A. boisei sample described above, including KNM-WT 17000, excluding the extremely fragmentary Omo 323-1976-896, and employing an estimate of 390 ml for KNM-WT 17400. Simulated samples were generated using the observed sample mean (468 ml) and standard deviation (49.1).

Test 3: Arbitrary variation

The model-based Hubert test described above is not limited to the observed sample variation. It can also be applied using a different value for the population standard deviation.

This option is relevant to the A. boisei endocranial volume sample, because the sample of estimates may have lower variation than the population from which the specimens were drawn. Even with the lower estimate of 390 ml for KNM-WT 17400, the CV of the observed A. boisei sample is still only 10.3 percent, between chimpanzees (9.7) and orangutans (10.9). This value might be uncharacteristic of the A. boisei population if its sexual dimorphism or temporal variability is undersampled by available EV estimates. Since the test described here derives its simulated EV estimates from a model distribution, it is easy to apply a more variable model, for example one matching the CV of gorillas at 13.1 percent Tobias:brain:1971. As a further example, I varied the population CV parameter of the model-based test, covering the entire range from 4 percent to 15 percent. This range encompasses the CVs of all extant hominoids. In all cases I assumed a mean equal to the A. boisei sample mean (468 ml). Using this procedure, it is possible to evaluate whether possible underestimation of variability in the observed sample may affect the significance of the test of no trend.
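The CV sweep can be sketched as a loop over model standard deviations, each derived as mean × CV / 100. This is an illustration under the same assumptions as the model-based test (independent normal deviates, zero-centered ranks). The function names, the number of simulations, and the example data are hypothetical.

```python
import random


def trend_p_value(observed_stat, ranks, mean, cv_percent, n_sim, rng):
    """Two-tailed (M + 1) / (N + 1) for one assumed population CV."""
    sd = mean * cv_percent / 100.0
    m = 0
    for _ in range(n_sim):
        stat = sum(rng.gauss(mean, sd) * r for r in ranks)
        if abs(stat) >= abs(observed_stat):
            m += 1
    return (m + 1) / (n_sim + 1)


def sweep_cv(volumes, ranks, cv_values, n_sim=20000, seed=0):
    """Map each assumed CV (percent) to the significance of the trend."""
    rng = random.Random(seed)
    mean = sum(volumes) / len(volumes)
    observed = sum(v * r for v, r in zip(volumes, ranks))
    return {cv: trend_p_value(observed, ranks, mean, cv, n_sim, rng)
            for cv in cv_values}


# Hypothetical volumes (ml) with zero-centered temporal ranks,
# for illustration only.
example_volumes = [440, 455, 470, 480, 500]
example_ranks = [-2.0, -1.0, 0.0, 1.0, 2.0]
for cv, p in sweep_cv(example_volumes, example_ranks, range(4, 16), n_sim=5000).items():
    print(cv, round(p, 3))
```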


Results

Test 1: Lower estimate for KNM-WT 17400

The first tests performed were on the A. boisei sensu lato sample of Elton et al (2001) Elton:2001, with the exception of a lower estimate of 390 ml for KNM-WT 17400. With this estimate, the nonparametric Spearman's correlation ρ = 0.52, which is nonsignificant (p>0.10, two-tailed). For the two-tailed Hubert test on the sample, p=0.10. For both tests, the lower estimate for KNM-WT 17400 causes the significance of a temporal trend in A. boisei to completely disappear. This low estimate currently appears to be a consensus for the specimen, although it must be treated cautiously since the endocast is less than 50 percent complete. This single specimen illustrates well the importance of accurate estimates.

Sample                     Test               p-value
Including Omo 323          Spearman's ρ       p>0.10 (ns)
                           Hubert test        p=0.10 (ns)
This study (no Omo 323)    Spearman's ρ       p>0.05 (ns)
                           Hubert test        p=0.07 (ns)
                           Model-based test   p=0.07 (ns)

Results of Tests 1 and 2.

Test 2: Model-based simulated values

The removal from the sample of the 490 ml estimate for Omo 323-1976-896 actually enhances the appearance of a trend. This is reflected by the Hubert test result, with p=0.07 (compared to p=0.10 when Omo 323 is included). Spearman's nonparametric correlation for the sample was 0.58, again nonsignificant (p>0.05, two-tailed). The model-based test described in this paper came to a very similar result on this sample, with p=0.07. Both these tests failed to reject the null hypothesis of no trend for the A. boisei sample.

Further examination of the simulated samples gave some indication of the relationship between sample variability and the appearance of a trend. One hypothesis might be that the early KNM-WT 17000 specimen is unusually small, and the late KGA 10-525 specimen unusually large, resulting in the appearance of a steady expansion from smallest to largest through the sample. The simulated samples, in which specimens are drawn from a population with a standard deviation equal to that of the A. boisei sample (49.1), rejected this hypothesis. Forty-four percent of the simulated samples had at least one specimen smaller than 390 ml, the smallest in the observed sample. Forty-six percent had at least one specimen larger than 545 ml, and 19 percent of simulated samples had specimens more extreme than both the largest and smallest of the observed sample.
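These proportions can be cross-checked analytically rather than by simulation. The sketch below assumes ten independent normal draws per sample (the eleven listed specimens less Omo 323), with the mean of 468 ml and standard deviation of 49.1 given in the text. The function name is hypothetical, and small differences from the simulated percentages are expected from rounding and simulation noise.

```python
from statistics import NormalDist


def extreme_probabilities(n, mean, sd, low, high):
    """Chance that a sample of n normal draws contains extreme values.

    Returns the probabilities of (at least one value below `low`,
    at least one value above `high`, both at once).
    """
    dist = NormalDist(mean, sd)
    p_below = dist.cdf(low)           # one draw falls below `low`
    p_above = 1.0 - dist.cdf(high)    # one draw falls above `high`
    none_low = (1.0 - p_below) ** n
    none_high = (1.0 - p_above) ** n
    neither = (1.0 - p_below - p_above) ** n
    any_low = 1.0 - none_low
    any_high = 1.0 - none_high
    both = 1.0 - none_low - none_high + neither  # inclusion-exclusion
    return any_low, any_high, both


any_low, any_high, both = extreme_probabilities(10, 468.0, 49.1, 390.0, 545.0)
print(f"below 390 ml: {any_low:.2f}, above 545 ml: {any_high:.2f}, both: {both:.2f}")
```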

Test 3: Arbitrary variation

Figure 2. Result of Test 3, testing the significance of a trend in A. boisei with a range of models for population CV. Each point represents 100,000 simulated samples with a mean equal to that of the A. boisei sample and the CV given on the x-axis. The greater the assumed variation in the underlying population, the greater the chance that an increase over time equal to or greater than that in the A. boisei sample will be observed. There is no significant trend for any model of variation within the range of living great apes and humans.

An alternative hypothesis is that the appearance of a trend is due to low sample variability, increasing the correlation of EV and temporal rank. The result of the model-based test applied to a range of model CV between 4% and 15% shows the close relationship between the significance of the A. boisei trend and population variation. Briefly, the greater the variation in the population, the more likely each simulated sample will present a trend at least as great as that in the observed sample. If the A. boisei sample was drawn from a population with greater EV variability, then the level of correlation of EV with time is less surprising. If the A. boisei population were as variable in endocranial volume as extant gorillas, then 15.1% of randomly drawn samples would exhibit an apparent trend as strong as or stronger than the observed sample. With the extant sample, it is not possible to confirm this hypothesis of underrepresentation; in particular, body size dimorphism does not necessarily follow from cranial and masticatory variability.


Discussion

The problem with testing a trend in any early hominid species is similar in form to the problems discussed by Holloway (1970) Holloway:Nature:1970. All reconstructions are based on relevant knowledge of the anatomy of other specimens. Whether reconstructions are done on crania, endocasts, or CT data, they all rely on knowledge of more complete specimens; for A. boisei endocasts, these models include OH 5 and KNM-ER 23000, and the well-known endocast Sts 5. When we test hypotheses using samples of reconstructions, we are to some extent including multiple instances of these well-known specimens, spread through many semi-independent reconstructions. There is no ready statistical model to incorporate the effects of estimation error from fragmentary specimens. These estimates are likely to be biased by the use of more complete specimens as models, the more frequent preservation of some parts of the cranial surface as opposed to others, or unrecognized sex differences in fossil individuals. In other words, one effect of estimation error is to reduce the variation within the fossil sample.

Estimation error may also tend to elevate the between-species differences among early hominins. Presently, samples assigned to different early hominid species exhibit some anatomical differences. These differences may result from differing neuroanatomical adaptations in these different species. If so, then it would be anatomically misleading to use a specimen of A. africanus like Sts 5 as a model for the reconstruction of an incomplete A. boisei specimen. On the other hand, differences are observed between very small samples, and may be idiosyncratic rather than systematic. Instead of distinctive adaptations, they may represent only chance differences between small samples. In this case, the use of only other A. boisei specimens as models for incomplete A. boisei reconstructions would tend to artificially inflate the differences between A. boisei and A. africanus, as well as artificially reducing variation within A. boisei. The smaller the sample, the more likely that between-species differences will be inflated by reconstruction and within-species differences minimized.

Even with a CV of 10.3%, the variation in A. boisei is likely undersampled. The extant sample is apparently male-biased, with only 3 presumed females (KNM-ER 732, KNM-WT 17400, and KNM-ER 407). Incomplete specimens have been reconstructed by modeling after more complete crania, reducing variation from anatomical differences. Beyond this, temporal fluctuations should tend to inflate variability with or without a directional trend.

All of these factors also must affect the samples currently assigned to Homo habilis (including KNM-ER 1470), which taken together have an endocranial volume CV of 12.6%. Endocranial volume has a disproportionately important role in differentiating between smaller and larger Plio-Pleistocene Homo morphs, and this may bias the consideration of evolutionary trends in early Homo.

The only solution for these problems is the discovery of more specimens. But in the meantime, it would be appropriate to exercise caution in the interpretation of variability within and among species. Significant differences among species are tested with reference to within-species variation. For estimated characters like endocast volume, within-species variation is potentially biased by estimation error. This bias may often tend to inflate between-species differences and reduce within-species variation attributed to fossil samples.


Appendix

The dot product is commonly used in vector transformations, but interpreting it in the context of a temporal trend may not be intuitive. The dot product of two vectors x and y of n elements each is the sum of the products of their respective elements:

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i$$

This product is a measure of the projection of one vector onto the other; it increases as the angle between the vectors (taken from the origin) decreases. The dot product of two perpendicular vectors is zero.

The product-moment correlation between two vectors is:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}$$

where $z_{x_i}$ and $z_{y_i}$ are standardized values of $x_i$ and $y_i$, respectively, and $n$ is the number of elements in each vector. Thus, the product-moment correlation is the dot product of two standardized vectors divided by $n - 1$.

In a randomization test, the different values of x and y are scrambled with respect to each other. However, the sample means $\bar{x}$ and $\bar{y}$ and the sample standard deviations $s_x$ and $s_y$ are constant in all of these randomized samples, because each includes exactly the same specimens. Thus, within any random set of permutations of a sample, the product-moment correlation can be obtained by a simple linear transformation from the dot product:

$$r = \frac{\mathbf{x} \cdot \mathbf{y} - n \bar{x} \bar{y}}{(n-1)\, s_x s_y}$$

where $n$ is the number of specimens.
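The linear relation between the dot product and the correlation can also be checked numerically. The sketch below shuffles one vector repeatedly and compares the correlation computed directly with the value recovered from the dot product, using means and sample standard deviations computed once from the unshuffled data. The function names and the example data are hypothetical.

```python
import random
import statistics


def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def pearson_r(x, y):
    """Product-moment correlation using sample (n - 1) standard deviations."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)


def r_from_dot(dot_xy, n, mean_x, mean_y, sd_x, sd_y):
    """Recover r from the dot product: (x.y - n*mx*my) / ((n - 1)*sx*sy)."""
    return (dot_xy - n * mean_x * mean_y) / ((n - 1) * sd_x * sd_y)


def max_discrepancy(x, y, n_perm=1000, seed=0):
    """Largest gap between the two routes to r over random permutations."""
    rng = random.Random(seed)
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    shuffled = list(x)
    worst = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        direct = pearson_r(shuffled, y)
        via_dot = r_from_dot(dot(shuffled, y), n, mx, my, sx, sy)
        worst = max(worst, abs(direct - via_dot))
    return worst


# Hypothetical volumes (ml) and ordinal temporal ranks, for illustration only.
example_volumes = [410, 440, 470, 455, 500, 480, 520, 545]
example_ranks = [1, 2, 3, 4, 5, 6, 7, 8]
print(max_discrepancy(example_volumes, example_ranks))
```

Since the transformation is linear with a positive slope, ranking permutations by dot product or by correlation gives the same one-tailed count. When the ranks are centered on zero, the offset term vanishes and the absolute values are proportional as well.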