The endocranial volumes estimated for late *Australopithecus boisei* specimens (e.g., after 1.8 Ma) are larger than those of earlier specimens. Elton et al (2001) *A. boisei*, and lends some doubt to the idea that brain size evolution in early *Homo* was exceptional

But the *A. boisei* sample has some unusual aspects that may complicate the test of a trend. One question is whether the early KNM-WT 17000 specimen represents *A. boisei* or another species (possibly, *Australopithecus aethiopicus*). Another question arises from the very small variation of estimated endocranial volumes in the *A. boisei* sample. Even including the small KNM-WT 17000 volume estimate, the coefficient of variation in the sample examined by Elton et al (2001) *A. boisei* had less variation than any living hominoids, even though its craniodental variation was as great as gorillas or orangutans

There are several possible interpretations for the low variation of the *A. boisei* sample: (1) *A. boisei* actually had very low size dimorphism; (2) its endocranial variation has been greatly undersampled, or (3) the sample has been biased by estimation error. Other characters of the *A. boisei* sample show extensive variability compared to extant hominoids

Here, I conduct three new tests of the null hypothesis of stasis of endocranial volume in *A. boisei*. These tests explore the effect of estimation error on the appearance of a trend in the sample, as well as the effect of low sample variation and small sample size. None of these tests find a statistically significant trend in the sample.

#### Materials and methods

### Fossil specimens

Estimating endocranial volume can be challenging even for relatively complete specimens, considering the subtle distortion exhibited by many fossils. For more fragmentary cranial remains, the estimation of endocranial volume requires not only the correction of distortions but also the reconstruction of missing portions.

The eleven cranial specimens of *Australopithecus boisei* listed below vary in their completeness and preservation of relevant anatomy. There is no explicit way of statistically controlling for error in the estimation of endocranial volume, considering the diversity of methods of reconstruction. In several cases, different workers have provided competing estimates. For less complete specimens, choosing one estimate above another must involve a close critique of anatomical details. The following list reviews the anatomical condition of each of these specimens. It is not an exhaustive list of volume estimates, but focuses on the range between credible extremes for the more disputed specimens. This gives an impression of the boundary conditions for measurement accuracy for each specimen.

- KNM-WT 17000 is a well preserved skull with relatively small vault fragments missing. Walker et al. (1986) estimated the volume as 410 ml.
- Omo L338-y6 is a juvenile cranium of uncertain age. Holloway (1981) estimated its volume at 427 ml. Elton et al. (2001) estimated an adult volume 4% higher, or 444 ml.
- The Omo 323-1976-896 cranial remains are exceedingly fragmentary. One side of the posterior cranial base is preserved, allowing a relatively good estimate of the posterior endocast breadth. The preserved frontal and parietal elements do not join with each other or the temporal; their small size and unknown positions do not allow an accurate estimate of endocast volume. Brown et al. (1993) reported an estimate of "about 490" based on similarity with the 491 ml KNM-ER 23000. Falk et al. (2000) considered it too fragmentary for an accurate estimate. I concur; the available estimate cannot be considered independent of other endocasts on which it may have been based.
- KNM-WT 17400 preserves only the anterior third of the endocast, consisting mainly of the frontal lobes. Brown et al. (1993) gave an estimate of 500 ml by modeling missing portions after the more complete KNM-ER 23000, but Holloway (1988) put the volume between 390 and 400 ml, and Falk et al. (2000) adopted an estimate of 390 ml.
- OH 5 has good preservation of the endocast, but an uncertain join between the anterior and posterior portions of the vault. This discontinuity has caused a disparity in estimates of its volume, including a low 500 ml estimate by Falk et al. (2000) and a high 530 ml estimate by Tobias (1963). The range of estimates on this well-preserved specimen covers nearly a quarter of the range of variation cited for *A. boisei* as a whole.
- KNM-ER 13750 preserves only the superior vault, accounting for under half of the total endocranial contour. The range of estimates provided by Falk et al. (2000), from 450 to 480 ml, again covers roughly a quarter of the range attributable to the species. Brown (1993) reported a higher estimate of 500 ml.
- KNM-ER 23000 is a nearly complete vault missing the midline cranial base. Its endocranial volume of 491 ml (Brown et al. 1993) may be the most accurate assigned to *A. boisei*.
- KNM-ER 406 is also well-preserved (Wood 1991). Its volume estimate of 525 ml is uncontroversial (Holloway 1988).
- KNM-ER 407 is missing several vault sections including those enclosing the frontal lobe. Holloway (1988) estimated the volume at 510 ml; Falk et al. (2000) prepared a new reconstruction with a volume estimate of 438 ml. The difference between these two estimates covers nearly 50 percent of the total range of the sample.
- KNM-ER 732 has good preservation of the left side of the vault, but is not complete across the rear of the cranium or basicranium, making a mirror reconstruction problematic. Holloway (1988) estimated the endocast volume at 500 ml; Falk et al. (2000) at 466 ml.
- KGA 10-525 lacks most of the frontal and anterior cranial base. Suwa et al. (1997) estimated its volume at 545 ml.

The damaged or missing frontals of many specimens have added to ambiguity about their reconstructed volume. Robust endocasts that preserve this region, such as KNM-WT 17400, differ in their anatomy from other taxa, especially early *Homo*. Falk et al (2000)

### Tests of temporal trends

Most *A. boisei* specimens with EV estimates date to the approximate center of the species’ temporal span. The reason for the appearance of a trend is quite clear: there is little variation in the center of the species’ temporal range; the latest two specimens are also the two largest; the earliest two specimens include two of the three smallest (Figure 1).

A test of a temporal trend might be conducted in several ways. A simple linear regression of endocranial volumes against time will test for a trend, but may be confounded by small numbers of specimens at early and late temporal extremes. Testing for a difference in means among temporal subsamples may address this problem. Comparing each specimen as a temporal subsample results in Spearman’s rank-order correlation (ρ), which *A. boisei* EV estimates.

Also, following Leigh (1992) **Γ**) test

- The age of each specimen is converted to a rank within the sample. For a two-tailed significance test, ranks are standardized with a mean of zero.
- The endocranial volume of each specimen is multiplied by its temporal rank, and all the values thus obtained are summed. This is equivalent to calculating the dot product of a vector of endocranial volumes with a vector of ranks.
- The sample is reordered at random an arbitrarily large number of times, each time obtaining the dot product of endocranial volume and rank vectors.
- The statistic **Γ** is estimated to be (*M*+1)/(*N*+1), where *M* is the number of permutations with dot products greater than or equal to that of the observed sample, and *N* is the number of permutations examined. A **Γ** ≤ 0.05 is taken as a significant rejection of the null hypothesis of no trend.
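The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names (`midranks`, `hubert_test`), the use of mean ranks for tied dates, the comparison of absolute dot products for the two-tailed version, and the toy values at the bottom are all assumptions made for the example.

```python
import random


def midranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def hubert_test(ev, time_order, n_perm=9999, seed=0):
    """Two-tailed permutation test of no trend; returns (M + 1) / (N + 1).

    ev         -- endocranial volume estimates
    time_order -- any values that increase toward the present
                  (e.g. negative ages in Ma); only their ranks are used
    """
    rng = random.Random(seed)
    ranks = midranks(time_order)
    centre = sum(ranks) / len(ranks)
    ranks = [r - centre for r in ranks]          # ranks with a mean of zero
    observed = abs(dot(ev, ranks))
    shuffled = list(ev)
    m = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                    # reorder the sample at random
        # Small tolerance guards against floating-point rounding in ties.
        if abs(dot(shuffled, ranks)) >= observed - 1e-9:
            m += 1
    return (m + 1) / (n_perm + 1)


# Hypothetical toy values, NOT the fossil sample discussed in the text.
toy_ages_ma = [2.5, 2.3, 2.1, 1.9, 1.9, 1.7, 1.5, 1.4]
toy_ev_ml = [410, 430, 400, 480, 500, 470, 520, 540]
toy_gamma = hubert_test(toy_ev_ml, [-a for a in toy_ages_ma], n_perm=2000)
```

Because the ranks are centered on zero, a strongly negative association produces a dot product as large in absolute value as a strongly positive one, which is what makes the comparison two-tailed.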

It is perhaps of interest that although the Hubert test uses the dot product of the two vectors, the use of the product-moment correlation yields precisely the same **Γ** (shown in Appendix). Samples for which the dot product shows a significant trend are samples that have significant correlations between EV and temporal ranks. This suggests a weakness of the test, since a correlation is a measure not of change over time, but of fit to a linear model. A sample may have a significant correlation with very little change, if its variance is also very low. Hence, the interpretation of the test depends on whether the variance is biologically realistic. Since *A. boisei* appears to be relatively invariant in endocranial volume compared to sexually dimorphic hominoids, the test might be confounded by error in the sample of EV estimates.

The Hubert test has been applied in the anthropological literature in two partially incompatible ways. As applied by Konigsberg (1990) *A. boisei*, citing Leigh (1992) **Γ** statistic), and explicitly described a two-tailed approach. One-tailed tests ignore the strength of any negative associations in the permuted samples, and therefore lead to incorrect assessments of statistical significance. The current study applies only two-tailed tests of the null hypothesis of no trend.

### Test 1: Lower estimate for KNM-WT 17400

Falk et al (2000)

As a preliminary step, I recalculated Spearman’s ρ and the Hubert test statistic **Γ** for the sample of Elton et al (2001)

### Test 2: Model-based simulation values

A difficulty of the *A. boisei* sample is the non-independence of estimates. Less complete specimens have been reconstructed using explicit information from more complete endocasts, chiefly Sts 5 and OH 5. The sample should therefore have reduced variation compared to a sample of intact crania. A reduced variance may increase the chance that a null hypothesis of stasis will be falsely rejected. This is a context in which randomization tests are potentially invalid: they do not assume a statistical distribution, but they do assume independence.

An additional aspect of the problem is that the state of preservation of fossils may be autocorrelated with time. In the present sample, the early and late specimens are relatively complete, while the middle of the time range is dominated by incomplete specimens. This situation arises frequently in paleontology, because species abundance is often highest at the center of a species’ temporal range. Early and late specimens will be more likely attributed to a species if their anatomy is unambiguous — which is more likely if they are more complete. Early or late specimens may be represented at different fossil localities than the majority of specimens, again requiring more complete specimens for confident assignment. In a Holocene context, specimens are likely to be more fragmentary and rarer earlier in time. These situations present the possibility of finding spurious trends due to differential preservation.

To attempt to correct for these issues, it is necessary to employ tests that rely on an explicit model of sample variability, instead of randomization of the sample values themselves. A simple model-based test replaces the sample EV estimates with new random deviates from a normal distribution. A normal distribution takes two parameters: the population mean and standard deviation. Deviates drawn from this distribution are independent; an arbitrary number of simulated samples may be obtained by repeatedly drawing new values to replace the sample values.

Here, the model-based sampling technique was used to generate samples with the same temporal ranks as the observed data, but with new EV values. In cases where the observed sample has two specimens of the same date, two specimens in all simulated samples were assigned the same temporal rank. The observed *A. boisei* sample has two such pairs of specimens. As in the Hubert test, the computer generated an arbitrarily large number of simulated samples (in this study, 100,000). The dot product of EV and temporal rank vectors in each simulated sample is compared to the dot product of the observed sample. The significance measure is taken as (*M*+1)/(*N*+1), where *N* is the number of simulated samples, and *M* is the number of those samples in which the absolute value of the dot product is more extreme than the observed value. This is a two-tailed test of the null hypothesis of no trend. I refer to the test below as the "model-based Hubert test."
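A minimal Python sketch of this procedure follows. It is an illustration under stated assumptions, not the original implementation: the function name `model_based_hubert`, its parameters, and the example values are hypothetical, and the temporal ranks (including any shared ranks for specimens of the same date) are assumed to be supplied by the caller.

```python
import random


def model_based_hubert(ev, ranks, mean, sd, n_sim=100_000, seed=0):
    """Two-tailed model-based test of no trend; returns (M + 1) / (N + 1).

    ev    -- observed endocranial volume estimates
    ranks -- temporal ranks of the specimens (ties share a rank)
    mean  -- mean of the normal model used to simulate volumes
    sd    -- standard deviation of the normal model
    """
    rng = random.Random(seed)
    centre = sum(ranks) / len(ranks)
    centered = [r - centre for r in ranks]       # ranks with a mean of zero
    observed = abs(sum(v * r for v, r in zip(ev, centered)))
    m = 0
    for _ in range(n_sim):
        # Independent normal deviates replace the observed volumes;
        # the temporal ranks stay exactly as in the observed sample.
        sim = [rng.gauss(mean, sd) for _ in centered]
        if abs(sum(v * r for v, r in zip(sim, centered))) >= observed:
            m += 1
    return (m + 1) / (n_sim + 1)


# Hypothetical toy values, NOT the fossil sample discussed in the text.
toy_ranks = [1, 2, 3, 4.5, 4.5, 6, 7, 8]
toy_ev_ml = [410, 430, 400, 480, 500, 470, 520, 540]
toy_p = model_based_hubert(toy_ev_ml, toy_ranks, mean=470, sd=50, n_sim=5000)
```

Since the ranks are centered, the model mean cancels out of the dot product; only the model standard deviation affects the result. Leaving `sd` as a free parameter is what allows the same function to be rerun with a different assumed population variability.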

This test was applied to the *A. boisei* sample described above, including KNM-WT 17000, excluding the extremely fragmentary Omo 323-1976-896, and employing an estimate of 390 ml for KNM-WT 17400. Simulated samples were generated using the observed sample mean (468 ml) and standard deviation (49.1).

### Test 3: Arbitrary variation

The model-based Hubert test described above is not limited to the observed sample variation. It can also be applied using a different value for the population standard deviation.

This option is relevant to the *A. boisei* endocranial volume sample, because the sample of estimates may have lower variation than the population from which the specimens were drawn. Even with the lower estimate of 390 ml for KNM-WT 17400, the CV of the observed *A. boisei* sample is still only 10.3 percent — between chimpanzees (9.7) and orangutans (10.9). This value might be uncharacteristic of the *A. boisei* population, if its sexual dimorphism or temporal variability are undersampled by available EV estimates. Since the test described here derives its simulated EV estimates from a model distribution, it is easy to apply a more variable model — for example, matching the CV of gorillas at 13.1 percent *A. boisei* sample mean (468 ml). Using this procedure, it is possible to evaluate whether possible underestimation of variability in the observed sample may affect the significance of the test of no trend.
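As a worked example of the rescaling, assume the model standard deviation is obtained simply as the target CV multiplied by the sample mean (the symbol $\sigma_{\text{model}}$ is introduced here only for illustration). For a gorilla-like CV of 13.1 percent and the sample mean of 468 ml:

$$\sigma_{\text{model}} = \mathrm{CV} \times \bar{x} = 0.131 \times 468\ \text{ml} \approx 61.3\ \text{ml}$$

This compares with the observed sample standard deviation of 49.1.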

#### Results

### Test 1: Lower estimate for KNM-WT 17400

The first tests performed were on the *A. boisei sensu lato* sample of Elton et al (2001) *p*>0.10, two-tailed). For the two-tailed Hubert test on the sample, *p*=0.10. For both tests, the lower estimate for KNM-WT 17400 causes the significance of a temporal trend in *A. boisei* to completely disappear. This low estimate currently appears to be a consensus for the specimen, although it must be treated cautiously since the endocast is less than 50 percent complete. This single specimen illustrates well the importance of accurate estimates.

| Sample | Test | p-value |
|---|---|---|
| Including Omo 323 | Spearman's ρ | p>0.10 (ns) |
| | Hubert test | p=0.10 (ns) |
| This study (no Omo 323) | Spearman's ρ | p>0.05 (ns) |
| | Hubert test | p=0.07 (ns) |
| | model-based test | p=0.07 (ns) |

### Test 2: Model-based simulated values

The removal from the sample of the 490 ml estimate for Omo 323-1976-896 actually enhances the appearance of a trend. This is reflected by the Hubert test result, with *p*=0.07 (compared to *p*=0.10 when Omo 323 is included). Spearman’s nonparametric correlation for the sample was 0.58, again nonsignificant (*p*>0.05, two-tailed). The model-based test described in this paper came to a very similar result on this sample, with *p*=0.07. Both these tests failed to reject the null hypothesis of no trend for the *A. boisei* sample.

Further examination of the simulated samples gave some indication of the relationship between sample variability and the appearance of a trend. One hypothesis might be that the early KNM-WT 17000 specimen is actually extremely small relative to its population, and the late KGA 10-525 specimen extremely large, resulting in the appearance of a steady expansion from smallest to largest through the sample. The simulated samples, in which specimens are drawn from a population with a standard deviation equal to that of the *A. boisei* sample (49.1), rejected this hypothesis. Forty-four percent of the simulated samples had at least one specimen smaller than 390 ml, the smallest in the observed sample. Forty-six percent had at least one specimen larger than 545 ml, and 19 percent of simulated samples had specimens more extreme than both the largest and smallest of the observed sample.

### Test 3: Arbitrary variation

An alternative hypothesis is that the appearance of a trend is due to low sample variability, increasing the correlation of EV and temporal rank. The result of the model-based test applied to a range of model CV between 4% and 15% shows the close relationship between the significance of the *A. boisei* trend and population variation. Briefly, the greater the variation in the population, the more likely each simulated sample will present a trend at least as great as that in the observed sample. If the *A. boisei* sample was drawn from a population with greater EV variability, then the level of correlation of EV with time is less surprising. If the *A. boisei* population was as variable in endocranial volume as extant gorillas, then 15.1% of randomly drawn samples would exhibit an apparent trend as strong as or stronger than the observed sample. With the extant sample, it is not possible to confirm this hypothesis of underrepresentation; in particular, body size dimorphism does not necessarily follow from cranial and masticatory variability.

#### Discussion

The problem with testing a trend in any early hominid species is similar in form to the problems discussed by Holloway (1970) *A. boisei* endocasts, these models include OH 5 and KNM-ER 23000, and the well-known *A. africanus* endocast Sts 5. When we test hypotheses using samples of reconstructions, we are to some extent including multiple instances of these well-known specimens, spread through many semi-independent reconstructions. There is no ready statistical model to incorporate the effects of estimation error from fragmentary specimens. These estimates are likely to be biased by the use of more complete specimens as models, the more frequent preservation of some parts of the cranial surface as opposed to others, or unrecognized sex differences in fossil individuals. In other words, one effect of estimation error is to reduce the variation within the fossil sample.

Estimation error may also tend to elevate the between-species differences among early hominins. Presently, samples assigned to different early hominid species exhibit some anatomical differences. For example, These differences may result from differing neuroanatomical adaptations in these different species. If so, then it would be anatomically misleading to use a specimen of *A. africanus* like Sts 5 as a model for the reconstruction of an incomplete *A. boisei* specimen. On the other hand, differences are observed between very small samples, and may be idiosyncratic rather than systematic. Instead of distinctive adaptations, they may represent only chance differences between small samples. In this case, the use of *only* other *A. boisei* specimens as models for incomplete *A. boisei* reconstructions would tend to artificially inflate the differences between *A. boisei* and *A. africanus*, as well as artificially reducing variation within *A. boisei*. The smaller the sample, the more likely that between-species differences will be inflated by reconstruction and within-species differences minimized.

Even with a CV of 10.3%, the variation in *A. boisei* is likely undersampled. The extant sample is apparently male-biased, with only 3 presumed females (KNM-ER 732, KNM-WT 17400, and KNM-ER 407). Incomplete specimens have been reconstructed by modeling after more complete crania, reducing variation from anatomical differences. Beyond this, temporal fluctuations should tend to inflate variability with or without a directional trend.

All of these factors also must affect the samples currently assigned to *Homo habilis* (including KNM-ER 1470), which taken together have an endocranial volume CV of 12.6%. Endocranial volume has a disproportionately important role in differentiating between smaller and larger Plio-Pleistocene *Homo* morphs, and this may bias the consideration of evolutionary trends in early *Homo*.

The only solution for these problems is the discovery of more specimens. But in the meantime, it would be appropriate to exercise caution in the interpretation of variability within and among species. Significant differences among species are tested with reference to within-species variation. For estimated characters like endocast volume, within-species variation is potentially biased by estimation error. This bias may often tend to inflate between-species differences and reduce within-species variation attributed to fossil samples.

#### Appendix

The dot product is commonly used in vector transformations, but interpreting it in the context of a temporal trend may not be intuitive. The dot product of two vectors is the sum of the products of their respective elements:

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i$$

This product is a measure of the projection of one vector onto the other; it increases as the angle between the vectors (taken from the origin) decreases. The dot product of two perpendicular vectors is zero.

The product-moment correlation between two vectors is:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}$$

where $z_{x_i}$ and $z_{y_i}$ are standardized values of $x_i$ and $y_i$, respectively. Thus, the product-moment correlation is the dot product of two standardized vectors divided by their length minus one ($n - 1$).

In a randomization test, the different values of $x$ and $y$ are scrambled with respect to each other. However, the sample means $\bar{x}$ and $\bar{y}$ and the sample standard deviations $s_x$ and $s_y$ are constant in all of these randomized samples, because each includes exactly the same specimens. Thus, within any random set of permutations of a sample, the product-moment correlation can be obtained by a simple linear transformation from the dot product:

$$r = \frac{\mathbf{x} \cdot \mathbf{y} - n \bar{x} \bar{y}}{(n-1)\, s_x s_y}$$
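The claim that the correlation is a linear transformation of the dot product across permutations can also be checked numerically. The sketch below is only an illustration: the helper names and the toy vectors are hypothetical, and the transformation is written out as r = (x·y − n·mean(x)·mean(y)) / ((n − 1)·s_x·s_y), which follows from expanding the sum of standardized products.

```python
import random
from statistics import mean, stdev


def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def pearson(x, y):
    """Product-moment correlation from standardized values."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    z_products = (((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y))
    return sum(z_products) / (len(x) - 1)


def pearson_from_dot(d, n, mx, my, sx, sy):
    """Linear transformation of a dot product d into a correlation."""
    return (d - n * mx * my) / ((n - 1) * sx * sy)


# Hypothetical toy vectors, NOT the fossil sample discussed in the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]                # temporal ranks
y = [410.0, 455.0, 430.0, 500.0, 480.0, 530.0]    # volumes in ml

# Means and standard deviations are computed once: they are the same
# for every permutation because each contains the same values.
n, mx, my, sx, sy = len(x), mean(x), mean(y), stdev(x), stdev(y)

rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    perm = y[:]
    rng.shuffle(perm)
    direct = pearson(x, perm)
    via_dot = pearson_from_dot(dot(x, perm), n, mx, my, sx, sy)
    max_gap = max(max_gap, abs(direct - via_dot))
```

Because the transformation has a positive slope, ranking permutations by dot product or by correlation gives the same ordering, and hence the same proportion of permutations at least as extreme as the observed sample.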