The endocranial volumes estimated for late Australopithecus boisei specimens (e.g., after 1.8 Ma) are larger than those of earlier specimens. Elton et al (2001)
But the A. boisei sample has some unusual aspects that may complicate the test of a trend. One question is whether the early KNM-WT 17000 specimen represents A. boisei or another species (possibly, Australopithecus aethiopicus). Another question arises from the very small variation of estimated endocranial volumes in the A. boisei sample. Even including the small KNM-WT 17000 volume estimate, the coefficient of variation in the sample examined by Elton et al. (2001)
There are several possible interpretations for the low variation of the A. boisei sample: (1) A. boisei actually had very low size dimorphism; (2) its endocranial variation has been greatly undersampled, or (3) the sample has been biased by estimation error. Other characters of the A. boisei sample show extensive variability compared to extant hominoids
Here, I conduct three new tests of the null hypothesis of stasis of endocranial volume in A. boisei. These tests explore the effect of estimation error on the appearance of a trend in the sample, as well as the effect of low sample variation and small sample size. None of these tests find a statistically significant trend in the sample.
Materials and methods
Fossil specimens
Estimating endocranial volume can be challenging even for relatively complete specimens, considering the subtle distortion exhibited by many fossils. For more fragmentary cranial remains, the estimation of endocranial volume requires not only the correction of distortions but also the reconstruction of missing portions.
The eleven cranial specimens of Australopithecus boisei listed below vary in their completeness and preservation of relevant anatomy. There is no explicit way of statistically controlling for error in the estimation of endocranial volume, considering the diversity of methods of reconstruction. In several cases, different workers have provided competing estimates. For less complete specimens, choosing one estimate above another must involve a close critique of anatomical details. The following list reviews the anatomical condition of each of these specimens. It is not an exhaustive list of volume estimates, but focuses on the range between credible extremes for the more disputed specimens. This gives an impression of the boundary conditions for measurement accuracy for each specimen.

KNM-WT 17000 is a well-preserved skull with relatively small vault fragments missing. Walker et al. (1986) estimated the volume as 410 ml.
Omo L338y-6 is a juvenile cranium of uncertain age. Holloway (1981) estimated its volume at 427 ml. Elton et al. (2001) estimated an adult volume 4% higher, or 444 ml.
The Omo 323-1976-896 cranial remains are exceedingly fragmentary. One side of the posterior cranial base is preserved, allowing a relatively good estimate of the posterior endocast breadth. The preserved frontal and parietal elements do not join with each other or the temporal; their small size and unknown positions do not allow an accurate estimate of endocast volume. Brown et al. (1993) reported an estimate of "about 490" based on similarity with the 491 ml KNM-ER 23000. Falk et al. (2000) considered it too fragmentary for an accurate estimate. I concur; the available estimate cannot be considered independent of other endocasts on which it may have been based.
KNM-WT 17400 preserves only the anterior third of the endocast, consisting mainly of the frontal lobes. Brown et al. (1993) gave an estimate of 500 ml by modeling missing portions after the more complete KNM-ER 23000, but Holloway (1988) put the volume between 390 and 400 ml, and Falk et al. (2000) adopted an estimate of 390 ml.
OH 5 has good preservation of the endocast, but an uncertain join between the anterior and posterior portions of the vault. This discontinuity has caused a disparity in estimates of its volume, including a low 500 ml estimate by Falk et al. (2000) and a high 530 ml estimate by Tobias (1963). The range of estimates on this well-preserved specimen covers nearly a quarter of the range of variation cited for A. boisei as a whole.
KNM-ER 13750 preserves only the superior vault, accounting for under half of the total endocranial contour. The range of estimates provided by Falk et al. (2000), from 450 to 480 ml, again covers roughly a quarter of the range attributable to the species. Brown et al. (1993) reported a higher estimate of 500 ml.
KNM-ER 23000 is a nearly complete vault missing the midline cranial base. Its endocranial volume of 491 ml (Brown et al., 1993) may be the most accurate assigned to A. boisei.
KNM-ER 406 is also well-preserved (Wood, 1991). Its volume estimate of 525 ml is uncontroversial (Holloway, 1988).
KNM-ER 407 is missing several vault sections including those enclosing the frontal lobe. Holloway (1988) estimated the volume at 510 ml; Falk et al. (2000) prepared a new reconstruction with a volume estimate of 438 ml. The difference between these two estimates covers nearly 50 percent of the total range of the sample.
KNM-ER 732 has good preservation of the left side of the vault, but is not complete across the rear of the cranium or basicranium, making a mirror reconstruction problematic. Holloway (1988) estimated the endocast volume at 500 ml; Falk et al. (2000) at 466 ml.
KGA 10525 lacks most of the frontal and anterior cranial base. Suwa et al. (1997) estimated its volume at 545 ml.
The damaged or missing frontals of many specimens have added to ambiguity about their reconstructed volume. Robust endocasts that preserve this region, such as KNM-WT 17400, differ in their anatomy from other taxa, especially early Homo. Falk et al. (2000)
Tests of temporal trends
Most A. boisei specimens with EV estimates date to the approximate center of the species' temporal span. The reason for the appearance of a trend is quite clear: there is little variation in the center of the species' temporal range; the latest two specimens are also the two largest; the earliest two specimens include two of the three smallest (Figure 1).
A test of a temporal trend might be conducted in several ways. A simple linear regression of endocranial volumes against time will test for a trend, but may be confounded by small numbers of specimens at early and late temporal extremes. Testing for a difference in means among temporal subsamples may address this problem. Comparing each specimen as a temporal subsample results in Spearman's rank-order correlation (ρ), which
Also, following Leigh (1992)
(1) The age of each specimen is converted to a rank within the sample. For a two-tailed significance test, ranks are standardized with a mean of zero.
(2) The endocranial volume of each specimen is multiplied by its temporal rank, and all the values thus obtained are summed. This is equivalent to calculating the dot product of a vector of endocranial volumes with a vector of ranks.
(3) The sample is reordered at random an arbitrarily large number of times, each time obtaining the dot product of endocranial volume and rank vectors.
(4) The statistic Γ is estimated to be (M+1)/(N+1), where M is the number of permutations with dot products greater than or equal to that of the observed sample, and N is the number of permutations examined. A Γ ≤ 0.05 is taken as a significant rejection of the null hypothesis of no trend.
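The permutation procedure above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the analyses reported here: the function names are hypothetical, tied dates are assigned their mean rank, the two-tailed comparison on absolute values of the dot product is an assumption about how the centered ranks are used, and the ages and volumes in the example are invented.

```python
import random

def centered_ranks(ages_ma):
    """Rank specimens from oldest to youngest; ties share a mean rank; mean is zero."""
    order = sorted(range(len(ages_ma)), key=lambda i: -ages_ma[i])
    ranks = [0.0] * len(ages_ma)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and ages_ma[order[j + 1]] == ages_ma[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    center = sum(ranks) / len(ranks)
    return [r - center for r in ranks]

def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def hubert_test(volumes, ages_ma, n_perm=10_000, seed=0):
    """Estimate Gamma = (M + 1) / (N + 1) by random reordering of the volumes."""
    rng = random.Random(seed)
    ranks = centered_ranks(ages_ma)
    observed = abs(dot(volumes, ranks))
    shuffled = list(volumes)
    m = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Assumption: two-tailed test, so compare absolute values.
        if abs(dot(shuffled, ranks)) >= observed:
            m += 1
    return (m + 1) / (n_perm + 1)

# Invented ages (Ma) and volumes (ml), for illustration only.
example_ages = [2.5, 2.3, 2.1, 1.9, 1.9, 1.8, 1.7, 1.7, 1.5, 1.4]
example_volumes = [400, 435, 395, 505, 470, 495, 520, 445, 485, 550]
example_gamma = hubert_test(example_volumes, example_ages)
print(f"Gamma = {example_gamma:.3f}")
```

Fixing the seed makes a run repeatable; a larger `n_perm` reduces the sampling error of the estimated Γ.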
It is perhaps of interest that although the Hubert test uses the dot product of the two vectors, the use of the product-moment correlation yields precisely the same Γ (shown in the Appendix). Samples for which the dot product shows a significant trend are samples that have significant correlations between EV and temporal ranks. This suggests a weakness of the test, since a correlation is a measure not of change over time, but of fit to a linear model. A sample may have a significant correlation with very little change, if its variance is also very low. Hence, the interpretation of the test depends on whether the variance is biologically realistic. Since A. boisei appears to be relatively invariant in endocranial volume compared to sexually dimorphic hominoids, the test might be confounded by error in the sample of EV estimates.
The Hubert test has been applied in the anthropological literature in two partially incompatible ways. As applied by Konigsberg (1990)
Test 1: Lower estimate for KNM-WT 17400
Falk et al (2000)
As a preliminary step, I recalculated Spearman's ρ and the Hubert test statistic Γ for the sample of Elton et al (2001)
Test 2: Model-based simulation values
A difficulty of the A. boisei sample is the nonindependence of estimates. Less complete specimens have been reconstructed using explicit information from more complete endocasts, chiefly Sts 5 and OH 5. The sample should therefore have reduced variation compared to a sample of intact crania. A reduced variance may increase the chance that a null hypothesis of stasis will be falsely rejected. This is a context in which randomization tests are potentially invalid: they do not assume a statistical distribution, but they do assume independence.
An additional aspect of the problem is that the state of preservation of fossils may be autocorrelated with time. In the present sample, the early and late specimens are relatively complete, while the middle of the time range is dominated by incomplete specimens. This situation arises frequently in paleontology, because species abundance is often highest at the center of a species' temporal range. Early and late specimens will be more likely attributed to a species if their anatomy is unambiguous, which is more likely if they are more complete. Early or late specimens may be represented at different fossil localities than the majority of specimens, again requiring more complete specimens for confident assignment. In a Holocene context, specimens are likely to be more fragmentary and rarer earlier in time. These situations present the possibility of finding spurious trends due to differential preservation.
To attempt to correct for these issues, it is necessary to employ tests that rely on an explicit model of sample variability, instead of randomization of the sample values themselves. A simple model-based test replaces the sample EV estimates with new random deviates from a normal distribution. A normal distribution takes two parameters: the population mean and standard deviation. Deviates drawn from this distribution are independent; an arbitrary number of simulated samples may be obtained by repeatedly drawing new values to replace the sample values.
Here, the model-based sampling technique was used to generate samples with the same temporal ranks as the observed data, but with new EV values. In cases where the observed sample has two specimens of the same date, two specimens in all simulated samples were assigned the same temporal rank. The observed A. boisei sample has two such pairs of specimens. As in the Hubert test, the computer generated an arbitrarily large number of simulated samples (in this study, 100,000). The dot product of EV and temporal rank vectors in each simulated sample is compared to the dot product of the observed sample. The significance measure is taken as (M+1)/(N+1), where N is the number of simulated samples, and M is the number of those samples in which the absolute value of the dot product is more extreme than the observed value. This is a two-tailed test of the null hypothesis of no trend. I refer to the test below as the "model-based Hubert test."
This test was applied to the A. boisei sample described above, including KNM-WT 17000, excluding the extremely fragmentary Omo 323-1976-896, and employing an estimate of 390 ml for KNM-WT 17400. Simulated samples were generated using the observed sample mean (468 ml) and standard deviation (49.1).
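A minimal Python sketch of the model-based Hubert test follows. It is an illustration under stated assumptions rather than the program used in this study: the function name is hypothetical, the centered ranks (ten specimens with two tied pairs) and the observed dot product are invented placeholders, and only the mean of 468 ml and standard deviation of 49.1 are taken from the text.

```python
import random

def model_based_hubert(observed_dot, centered_ranks, mean_ev, sd_ev,
                       n_sim=100_000, seed=0):
    """Estimate (M + 1) / (N + 1) from samples drawn from a normal model."""
    rng = random.Random(seed)
    threshold = abs(observed_dot)
    m = 0
    for _ in range(n_sim):
        # Replace every EV estimate with an independent normal deviate,
        # keeping the temporal ranks (including ties) fixed.
        simulated = [rng.gauss(mean_ev, sd_ev) for _ in centered_ranks]
        d = sum(v * r for v, r in zip(simulated, centered_ranks))
        if abs(d) >= threshold:  # two-tailed comparison
            m += 1
    return (m + 1) / (n_sim + 1)

# Hypothetical centered ranks for ten specimens with two tied pairs.
HYPOTHETICAL_RANKS = [-4.5, -3.5, -2.0, -2.0, -0.5, 1.0, 1.0, 2.5, 3.5, 4.5]
# Hypothetical observed dot product; a real analysis would compute it from data.
HYPOTHETICAL_OBSERVED_DOT = 800.0

# The text reports 100,000 simulated samples; fewer are drawn here for speed.
example_p = model_based_hubert(HYPOTHETICAL_OBSERVED_DOT, HYPOTHETICAL_RANKS,
                               mean_ev=468.0, sd_ev=49.1, n_sim=20_000)
print(f"p = {example_p:.3f}")
```

Because the ranks sum to zero, the model mean has no effect on the distribution of the dot product; only the standard deviation and the ranks matter.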
Test 3: Arbitrary variation
The model-based Hubert test described above is not limited to the observed sample variation. It can also be applied using a different value for the population standard deviation.
This option is relevant to the A. boisei endocranial volume sample, because the sample of estimates may have lower variation than the population from which the specimens were drawn. Even with the lower estimate of 390 ml for KNM-WT 17400, the CV of the observed A. boisei sample is still only 10.3 percent, between chimpanzees (9.7) and orangutans (10.9). This value might be uncharacteristic of the A. boisei population, if its sexual dimorphism or temporal variability are undersampled by available EV estimates. Since the test described here derives its simulated EV estimates from a model distribution, it is easy to apply a more variable model, for example, matching the CV of gorillas at 13.1 percent
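The dependence of the test on the assumed variation can be illustrated without simulation. The sketch below swaps in a closed-form shortcut for the simulation described above: when the ranks are centered on zero, the dot product of the ranks with independent normal deviates is itself normally distributed with mean zero and standard deviation σ·sqrt(Σ r_i²), so the two-tailed probability can be computed directly. The function name, the centered ranks, and the observed dot product are hypothetical placeholders; only the mean of 468 ml comes from the text.

```python
from math import sqrt
from statistics import NormalDist

def trend_p_value(observed_dot, centered_ranks, mean_ev, cv):
    """Two-tailed probability of a dot product at least as extreme as observed."""
    sd = mean_ev * cv  # CV expressed as a fraction, e.g. 0.131
    # SD of the dot product under the normal model with zero-sum ranks.
    scale = sd * sqrt(sum(r * r for r in centered_ranks))
    return 2.0 * (1.0 - NormalDist().cdf(abs(observed_dot) / scale))

# Hypothetical centered ranks and observed dot product, for illustration only.
HYPOTHETICAL_RANKS = [-4.5, -3.5, -2.0, -2.0, -0.5, 1.0, 1.0, 2.5, 3.5, 4.5]
HYPOTHETICAL_OBSERVED_DOT = 800.0

# Sweep the model CV from 4 to 15 percent.
p_by_cv = {}
for cv_percent in range(4, 16):
    p_by_cv[cv_percent] = trend_p_value(
        HYPOTHETICAL_OBSERVED_DOT, HYPOTHETICAL_RANKS, 468.0, cv_percent / 100)
    print(f"CV {cv_percent:2d}%  p = {p_by_cv[cv_percent]:.3f}")
```

For a fixed observed dot product, the probability rises steadily with the model CV, which is the relationship the test is meant to expose.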
Results
Test 1: Lower estimate for KNM-WT 17400
The first tests performed were on the A. boisei sensu lato sample of Elton et al (2001)
Sample                   Test              p-value
Including Omo 323        Spearman's ρ      p>0.10 (ns)
Including Omo 323        Hubert test       p=0.10 (ns)
This study (no Omo 323)  Spearman's ρ      p>0.05 (ns)
This study (no Omo 323)  Hubert test       p=0.07 (ns)
This study (no Omo 323)  model-based test  p=0.07 (ns)
Test 2: Model-based simulated values
The removal from the sample of the 490 ml estimate for Omo 323-1976-896 actually enhances the appearance of a trend. This is reflected by the Hubert test result, with p=0.07 (compared to p=0.10 when Omo 323 is included). Spearman's nonparametric correlation for the sample was 0.58, again nonsignificant (p>0.05, two-tailed). The model-based test described in this paper came to a very similar result on this sample, with p=0.07. Both these tests failed to reject the null hypothesis of no trend for the A. boisei sample.
Further examination of the simulated samples gave some indication of the relationship between sample variability and the appearance of a trend. One hypothesis might be that the early KNM-WT 17000 specimen is unusually small and the late KGA 10525 specimen unusually large, resulting in the appearance of a steady expansion from smallest to largest through the sample. The simulated samples, in which specimens are drawn from a population with the same standard deviation (49.1) as the A. boisei sample, rejected this hypothesis. Forty-four percent of the simulated samples had at least one specimen smaller than 390 ml, the smallest in the observed sample. Forty-six percent had at least one specimen larger than 545 ml, and 19 percent of simulated samples had specimens more extreme than both the largest and smallest of the observed sample.
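These proportions can be cross-checked analytically under the same normal model. The sketch below is an illustration with a hypothetical function name; it assumes ten independent specimens per simulated sample (the eleven listed specimens minus Omo 323), which is an inference from the text rather than a stated value.

```python
from statistics import NormalDist

def extreme_probabilities(mean_ev, sd_ev, n, low, high):
    """Chance that a sample of n normal deviates contains values beyond low/high."""
    dist = NormalDist(mean_ev, sd_ev)
    p_low = dist.cdf(low)          # one draw falls below `low`
    p_high = 1.0 - dist.cdf(high)  # one draw falls above `high`
    any_below = 1.0 - (1.0 - p_low) ** n
    any_above = 1.0 - (1.0 - p_high) ** n
    # Inclusion-exclusion: at least one draw below AND at least one above.
    both = (1.0 - (1.0 - p_low) ** n - (1.0 - p_high) ** n
            + (1.0 - p_low - p_high) ** n)
    return any_below, any_above, both

# Mean, SD, and limits from the text; n = 10 is an assumption.
below, above, both = extreme_probabilities(468.0, 49.1, 10, 390.0, 545.0)
print(f"below 390 ml: {below:.2f}  above 545 ml: {above:.2f}  both: {both:.2f}")
```

Agreement between the analytical values and the simulated proportions would indicate that the simulated samples behave as the stated model predicts.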
Test 3: Arbitrary variation
An alternative hypothesis is that the appearance of a trend is due to low sample variability, increasing the correlation of EV and temporal rank. The result of the model-based test applied to a range of model CV between 4% and 15% shows the close relationship between the significance of the A. boisei trend and population variation. Briefly, the greater the variation in the population, the more likely each simulated sample will present a trend at least as great as that in the observed sample. If the A. boisei sample was drawn from a population with greater EV variability, then the level of correlation of EV with time is less surprising. If the A. boisei population was as variable in endocranial volume as extant gorillas, then 15.1% of randomly drawn samples would exhibit an apparent trend as strong or stronger than the observed sample. With the available sample, it is not possible to confirm this hypothesis of underrepresentation; in particular, body size dimorphism does not necessarily follow from cranial and masticatory variability.
Discussion
The problem with testing a trend in any early hominid species is similar in form to the problems discussed by Holloway (1970)
Estimation error may also tend to elevate the between-species differences among early hominins. Presently, samples assigned to different early hominid species exhibit some anatomical differences. For example, These differences may result from differing neuroanatomical adaptations in these different species. If so, then it would be anatomically misleading to use a specimen of A. africanus like Sts 5 as a model for the reconstruction of an incomplete A. boisei specimen. On the other hand, differences are observed between very small samples, and may be idiosyncratic rather than systematic. Instead of distinctive adaptations, they may represent only chance differences between small samples. In this case, the use of only other A. boisei specimens as models for incomplete A. boisei reconstructions would tend to artificially inflate the differences between A. boisei and A. africanus, as well as artificially reducing variation within A. boisei. The smaller the sample, the more likely that between-species differences will be inflated by reconstruction and within-species differences minimized.
Even with a CV of 10.3%, the variation in A. boisei is likely undersampled. The available sample is apparently male-biased, with only 3 presumed females (KNM-ER 732, KNM-WT 17400, and KNM-ER 407). Incomplete specimens have been reconstructed by modeling after more complete crania, reducing variation from anatomical differences. Beyond this, temporal fluctuations should tend to inflate variability with or without a directional trend.
All of these factors also must affect the samples currently assigned to Homo habilis (including KNM-ER 1470), which taken together have an endocranial volume CV of 12.6%. Endocranial volume has a disproportionately important role in differentiating between smaller and larger Plio-Pleistocene Homo morphs, and this may bias the consideration of evolutionary trends in early Homo.
The only solution for these problems is the discovery of more specimens. But in the meantime, it would be appropriate to exercise caution in the interpretation of variability within and among species. Significant differences among species are tested with reference to within-species variation. For estimated characters like endocast volume, within-species variation is potentially biased by estimation error. This bias may often tend to inflate between-species differences and reduce within-species variation attributed to fossil samples.
Appendix
The dot product is commonly used in vector transformations, but interpreting it in the context of a temporal trend may not be intuitive. The dot product of two vectors is the sum of the products of their respective elements:
\[ \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_{i} y_{i} \]
This product is a measure of the projection of one vector onto the other; it increases as the angle between the vectors (taken from the origin) decreases. The dot product of two perpendicular vectors is zero.
The product-moment correlation between two vectors is:
\[ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} z_{xi} z_{yi} \]
where z_{xi} and z_{yi} are standardized values of x_{i} and y_{i}, respectively. Thus, the product-moment correlation is the dot product of two standardized vectors divided by their rank (n - 1).
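In a randomization test, the different values of x and y are scrambled with respect to each other. However, the sample means \bar{x} and \bar{y} and the sample standard deviations s_{x} and s_{y} are constant in all of these randomized samples, because each includes exactly the same specimens. Thus, within any random set of permutations of a sample, the product-moment correlation can be obtained by a simple linear transformation from the dot product:
\[ r_{xy} = \frac{\mathbf{x} \cdot \mathbf{y} - n \bar{x} \bar{y}}{(n-1) s_{x} s_{y}} \]
Because n, \bar{x}, \bar{y}, s_{x} and s_{y} are fixed, r_{xy} is an increasing linear function of the dot product, so the two statistics order the permutations identically and yield the same Γ.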
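The linear relationship between the two statistics, r = (x·y − n·x̄·ȳ) / ((n − 1)·s_x·s_y), can be checked numerically. The Python sketch below is an illustration with hypothetical function names and invented data; it compares the correlation computed directly from standardized values with the value obtained from the dot product across random permutations.

```python
import random
from statistics import mean, stdev

def dot(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def pearson(x, y):
    """Product-moment correlation from standardized values."""
    n, mx, my, sx, sy = len(x), mean(x), mean(y), stdev(x), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

def pearson_from_dot(d, n, mx, my, sx, sy):
    """Linear transformation of a dot product into a correlation."""
    return (d - n * mx * my) / ((n - 1) * sx * sy)

def max_discrepancy(x, y, n_perm=1000, seed=0):
    """Largest difference between the two calculations over random permutations."""
    rng = random.Random(seed)
    # Means and SDs are computed once; permutation does not change them.
    n, mx, my, sx, sy = len(x), mean(x), mean(y), stdev(x), stdev(y)
    shuffled = list(y)
    worst = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        direct = pearson(x, shuffled)
        via_dot = pearson_from_dot(dot(x, shuffled), n, mx, my, sx, sy)
        worst = max(worst, abs(direct - via_dot))
    return worst

# Invented ranks and volumes (ml), for illustration only.
example_ranks = [1.0, 2.0, 3.5, 3.5, 5.0, 6.5, 6.5, 8.0, 9.0, 10.0]
example_volumes = [400.0, 435.0, 395.0, 505.0, 470.0, 495.0, 520.0, 445.0, 485.0, 550.0]
worst_case = max_discrepancy(example_ranks, example_volumes)
print(f"largest discrepancy: {worst_case:.2e}")
```

Any discrepancy should be at the level of floating-point rounding, since the two calculations are algebraically identical.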