My God, it’s full of stars!

It must have been a reviewer error

There’s a new paper out which examines reported levels of statistical significance in (mainly) empirical economics papers from the top three journals:

Journals favor rejections of the null hypothesis. This selection upon results may distort the behavior of researchers. Using 50,000 tests published between 2005 and 2011 in the AER, JPE and QJE, we identify a residual in the distribution of tests that cannot be explained by selection. The distribution of p-values exhibits a camel shape with abundant p-values above .25, a valley between .25 and .10 and a bump slightly under .05. Missing tests are those which would have been accepted but close to being rejected (p-values between .25 and .10). We show that this pattern corresponds to a shift in the distribution of p-values: between 10% and 20% of marginally rejected tests are misallocated. Our interpretation is that researchers might be tempted to inflate the value of their tests by choosing the specification that provides the highest statistics. Note that Inflation is larger in articles where stars are used in order to highlight statistical significance and lower in articles with theoretical models.

Basically, if results were “unbiased”, a graph of the distribution of the observed results (or in this case, observed p-values of significance) should be relatively smooth and monotonic. Here’s what the distribution looks like (taken from the paper):

Do you see that second little hump? That’s just below the p = 0.05 threshold, the magic and totally arbitrary rule of thumb for whether a statistical result is worthwhile or not (although in my experience p = 0.10 is becoming the new norm). This suggests an abnormal grouping just below the threshold. Now if this were only a result of systematic selection bias, with academic journals only accepting results which were significant above this threshold, we’d expect to see abnormal grouping to the right of the threshold. However, this doesn’t explain why the distribution is bimodal: results which are nearly significant are less frequent than those that are much farther away. This suggests something more nefarious than publisher bias: that researchers with results that are nearly significant are doing things to nudge their results into the just-significant category.
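To make the "valley and bump" story concrete, here’s a rough simulation sketch. Everything in it is my own assumption rather than anything taken from the paper: the honest test statistics are drawn from a normal distribution with standard deviation 2, `inflate_share` is a made-up fraction of nearly-significant tests that get nudged, and the nudged tests land uniformly between p = 0.01 and p = 0.05.

```python
import random
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n=20000, inflate_share=0.15, seed=0):
    """Return (honest, nudged) lists of p-values.

    Assumptions (hypothetical, for illustration only):
    - honest z-statistics are drawn from N(0, 2)
    - a share `inflate_share` of tests with 0.10 < p < 0.25 is re-specified
      until it lands somewhere between p = 0.01 and p = 0.05
    """
    rng = random.Random(seed)
    honest = [two_sided_p(rng.gauss(0, 2)) for _ in range(n)]
    nudged = []
    for p in honest:
        if 0.10 < p < 0.25 and rng.random() < inflate_share:
            p = rng.uniform(0.01, 0.05)  # nudged just under the threshold
        nudged.append(p)
    return honest, nudged


def count_between(ps, lo, hi):
    """Number of p-values with lo <= p < hi."""
    return sum(lo <= p < hi for p in ps)


if __name__ == "__main__":
    honest, nudged = simulate()
    for label, ps in (("honest", honest), ("nudged", nudged)):
        valley = count_between(ps, 0.10, 0.25)
        bump = count_between(ps, 0.01, 0.05)
        print(f"{label}: valley (0.10-0.25) = {valley}, bump (0.01-0.05) = {bump}")
```

The nudged sample has fewer tests in the 0.10 to 0.25 range and more just under 0.05, which is the kind of pattern that plain selection by journals would not produce on its own.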

I think someone is assuming we should be scared and outraged by all this – but I don’t think we should. Here’s why:

These results suggest that researchers care very deeply about getting under that p = 0.05 threshold. They do this because we seem to attach some value to the presence of “stars” (typical way of highlighting significant results in econ papers). But our weighting of results shouldn’t be binary – it should be continuous. We should give results which are just barely insignificant about the same weight as those which are just barely significant.

So even if the rest of the academic establishment has decided to be irrational, resulting in a shift in the average result from a p = 0.052 to a p = 0.048, we shouldn’t be bothered by these small shifts, because the change in our interpretation of those results should be very minor.
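To see how small that shift is, here’s a quick sketch that converts p-values back into the z-statistics behind them. It assumes a two-sided test against a standard normal, which is my simplification; `z_from_p` is just a helper name I made up.

```python
from statistics import NormalDist


def z_from_p(p):
    """Absolute z-statistic implied by a two-sided p-value (normal null assumed)."""
    return NormalDist().inv_cdf(1 - p / 2)


if __name__ == "__main__":
    for p in (0.052, 0.050, 0.048):
        print(f"p = {p:.3f}  ->  |z| = {z_from_p(p):.3f}")
```

The two z-statistics on either side of the threshold differ by only a few hundredths, so the underlying evidence is nearly identical even though one result gets a star and the other doesn’t.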

Hat tip to Marginal Revolution for the paper link.
