Machine learning and the replication struggle

“Of course you can have my replication code. Good luck with that.”

I was in a seminar a few months ago where someone presented results from a panel study that had a significant attrition problem. This can be a serious issue in empirical work, as the surviving sample could be different in unobservable ways that might also be correlated with the intervention the authors are studying. While there are already a large number of ways that one can tackle the attrition problem, the authors of this study had done something novel: they had handed that surviving sample over to a machine learning algorithm which used [insert black box here] to re-weight the sample so it matched the original study population as closely as possible.

The author presenting the results explained that they had made this choice because the machine learning algorithm was arguably neutral. Normally, the researchers might have consciously or unconsciously chosen weights that gave them the result that they wanted. But by handing things over to an algorithm that used its own sufficiently inscrutable method of determining the appropriate sample weights, the authors could not be accused of producing results that were the product of researcher degrees of freedom.
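For contrast with the black box, here is a minimal sketch of what transparent re-weighting can look like. It is not the authors' method: it is plain post-stratification on a single observed characteristic, and the strata labels and counts are hypothetical. Each survivor gets a weight equal to their stratum's share in the baseline sample divided by its share among survivors.

```python
from collections import Counter


def poststratification_weights(baseline_strata, survivor_strata):
    """Weight each survivor so stratum shares match the baseline sample.

    weight(s) = (baseline share of stratum s) / (survivor share of stratum s)
    """
    base_counts = Counter(baseline_strata)
    surv_counts = Counter(survivor_strata)
    n_base, n_surv = len(baseline_strata), len(survivor_strata)

    weights = []
    for s in survivor_strata:
        base_share = base_counts[s] / n_base
        surv_share = surv_counts[s] / n_surv
        weights.append(base_share / surv_share)
    return weights


# Hypothetical data: rural households drop out more often than urban ones.
baseline = ["rural"] * 60 + ["urban"] * 40
survivors = ["rural"] * 30 + ["urban"] * 40
weights = poststratification_weights(baseline, survivors)

# Share of the weighted surviving sample that is rural.
rural_weight = sum(w for w, s in zip(weights, survivors) if s == "rural")
weighted_rural_share = rural_weight / sum(weights)
```

In this made-up example the weighted rural share returns to its baseline value of 60 percent. Every choice is visible, including which characteristic to stratify on, which is exactly where researcher degrees of freedom creep in. Note that weights of this kind only correct for differences in observed characteristics; they do nothing about attrition driven by unobservables.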

Machine learning is rapidly becoming a tool that empirical economists rely on. It seems to be most useful for measurement and generating new data, from creating new spatial measures of poverty to measuring conflict, but it is also increasingly being used for estimation issues like the one the authors above were facing. The rise of the machines is undoubtedly going to be incredibly useful for our work, drastically widening our ability to tackle difficult, computationally intensive problems. But I wonder if it also has implications for the way we check and challenge each other’s work.

So far economics has been spared the full brunt of the empirical replication crisis, but there are occasional warning signs that our own reckoning might not be far off. To ensure that research in empirical econ remains credible, there needs to be a general assurance that published results can be replicated. The term “replicate” means different things to different people, but I find Michael Clemens’s proposed definitions to be helpful. There are two types of replication that I think are likely to be affected by machine learning. The first is what Clemens refers to as ‘verification,’ the ability of a third party to take the same data as the original researchers, run the same code and generate the same results (and check that there are no errors in any of these processes). The second is a ‘reanalysis,’ where a third party takes the same data as the original researchers, but investigates whether the results hold up to different ways of analyzing the data (such as different estimation methods, assumptions, perturbations of the data, etc.).

Wading through someone else’s Stata or R code in an effort to verify another researcher’s findings is not much fun, but it is manageable. PhD students and younger researchers may not have the same resources as their senior colleagues, but they usually have the time and diligence to go through others’ work and figure out whether the results really are there. But imagine a future where you first need to secure a sizable amount of server time and computing power before you can even think about a basic replication. In the above paper, the presenter noted that it took several days for them to run their re-weighting algorithm. Things seem even more daunting when it comes time for reanalysis, in this case changing the basic structure of the original algorithm (or re-training it on a different set of data). As methods grow more complex, results may be harder for a replicating party to parse. It would seem easy enough for the authors to hand over their algorithms and for third parties to run them, but significantly more difficult for the replicators to understand what the precise limitations of a particular approach might be.

Despite these concerns, I suspect these problems will be transitory. Empirical economics is continuously going through waves of methodological innovation. Each wave begins with a few pioneers using a new tool or a new set of data, but the in-depth expertise to really critique those new methods always lags behind by a few years. More and more econ students are learning to code along the way. New norms around replicating results (such as posting data and code in an online repository like GitHub) are coalescing.

And if the abilities of the replicators can’t keep up with the growing sophistication of algorithms, then maybe the same technology can be used to make the replicator’s life easier. Researchers in psychology have already used code to check large numbers of published papers for basic mathematical errors. Maybe in ten years, a normal part of the peer review process will entail turning over your results and code to the machines, so they can check them for errors and run an automated reanalysis.
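To give a flavor of what such automated checking involves, here is a minimal sketch, not the psychologists' actual tool. It assumes results are reported in a hypothetical "z = ..., p = ..." format, recomputes the two-sided p-value from each test statistic, and flags any pair that disagrees by more than a crude rounding tolerance. The function names, the pattern and the tolerance are all assumptions made for illustration.

```python
import re
from statistics import NormalDist

# Hypothetical reporting format, e.g. "z = 2.10, p = .036".
PATTERN = re.compile(r"z\s*=\s*(-?\d*\.?\d+),\s*p\s*=\s*(\d*\.\d+)")


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def check_reported_stats(text: str, tol: float = 0.005):
    """Return (z, reported_p, recomputed_p, consistent) for each match.

    The tolerance is a simplification: it allows for rounding of the
    reported p-value but ignores rounding of the test statistic itself.
    """
    results = []
    for z_str, p_str in PATTERN.findall(text):
        z, reported = float(z_str), float(p_str)
        recomputed = two_sided_p(z)
        consistent = abs(recomputed - reported) <= tol
        results.append((z, reported, recomputed, consistent))
    return results


# Made-up sentence from an imaginary paper: the second result is inconsistent.
sample = "Effect A (z = 1.96, p = .05); effect B (z = 1.20, p = .03)."
report = check_reported_stats(sample)
```

A real checker would need to handle t, F and chi-squared statistics, degrees of freedom, and one-sided tests. The point is only that this kind of error hunting is mechanical enough to hand to a script.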

As if dealing with peer reviewers wasn’t harrowing enough. Imagine referee reports that go something like this: