Razib Khan conveys a list of suggestions from a recent paper by Joseph Simmons and colleagues.

The central point is that every research paper is a product of a course of inquiry that may come to include many kinds of questions, most of which are unanswered or answered negatively by results and data. When scientists report results, they focus on those that meet some statistical threshold. The threshold ostensibly makes results “significant” but the actual probability of seeing such a result depends on how many things the scientists looked at, not only on those they choose to report:

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish statistically significant evidence consistent with any hypothesis. The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both? It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields statistical significance, and to then report only what worked. The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
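
To put rough numbers on that last sentence, here is a minimal sketch of my own, not anything taken from the paper. It assumes every analysis is an independent test of a true null hypothesis. Analyses run on the same data are correlated, so treat the output as an illustration of the direction of the effect rather than an estimate for any real study. The function name and the test counts are hypothetical choices for the example.

```python
def familywise_false_positive_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` analyses comes out "significant"
    when every null hypothesis is true.

    Assumption (mine, for illustration): the tests are independent.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # P(no false positive in any test) = (1 - alpha) ** num_tests,
    # so P(at least one false positive) is the complement.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Hypothetical numbers of analytic alternatives a researcher might try.
    for k in (1, 2, 5, 10, 20):
        rate = familywise_false_positive_rate(k)
        print(f"{k:>2} analyses tried: {rate:.1%} chance of at least one false positive")
```

Under the independence assumption, one analysis gives the nominal 5%, ten give about 40%, and twenty give about 64%. Correlated analyses inflate the rate less steeply, but it still exceeds 5% whenever more than one is tried and only the "winner" is reported.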

“Researcher degrees of freedom” sounds erudite, but all they’re really describing is tinkering. When the data lead you to a result, they do so by leading you along a drunkard’s path of new analytical biases.

Razib has presented the authors’ suggestions for researchers and reviewers, to try to reduce the tinkering bias. I think if we followed those suggestions in paleoanthropology, our discipline would be stronger in some ways, weaker in others. For example, the authors suggest rejecting any paper with fewer than 20 observations in a cell of a test of association. Clearly, if we rigidly enforced such a rule, we’d have a lot more work done on comparative collections, and that would be a good thing. On the other hand, we’d have a lot more papers like the ones I like to write, about how the data are insufficient to test a hypothesis. Our science would shift even further toward description, which would benefit some kinds of research and punish others.

One may object that there are many cases in paleoanthropology where a single observation is fundamentally important. I would just point out that such cases are most evident where the single observation is a many-sigma outlier to some pre-existing hypothesis. If we have a new radiocarbon date that’s a three-sigma outlier above previous dates, it will either cause us to change our hypothesis or challenge the date’s accuracy. There are biases even so – for example, when we find outlier radiocarbon dates on otherwise-uncontroversial things, we tend to just ignore the outliers.

What I most liked about this paper was that the authors anticipated various objections. For example, many researchers would claim that a Bayesian statistical approach would eliminate or reduce the bias from “researcher degrees of freedom”. Here’s the authors’ response:

Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom.

That’s my observation as well. Researchers often adopt Bayesian methods because they offer more ways to tinker. I also appreciate this comment:

We are strongly supportive of all journals requiring authors to make their original materials and data publicly available. However, this is not likely to address the problem of interest, as this policy would impose too high a cost on readers and reviewers to examine, in real time, the credibility of a particular claim. Readers should not need to download data, load it into their statistical packages, and start running analyses to learn the importance of controlling for father’s age; nor should they need to read pages of additional materials to learn that the researchers simply dropped the Hot Potato condition. Furthermore, if a journal allows the redaction of a condition from the report, for example, it would presumably also allow its redaction from the raw data and original materials, making the entire transparency effort futile.

All in all, the article is a good reminder of Feynman’s first principle, “You must not fool yourself, and you are the easiest person to fool.”