Machine learning and the replication struggle

“Of course you can have my replication code. Good luck with that.”

I was in a seminar a few months ago where someone presented results from a panel study with a significant attrition problem. Attrition can be a serious issue in empirical work, as the surviving sample could differ in unobservable ways that are also correlated with the intervention the authors are studying. While there are already many ways to tackle the attrition problem, the authors of this study had done something novel: they had handed the surviving sample over to a machine learning algorithm, which used [insert black box here] to re-weight the sample so it matched the original study population as closely as possible.
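Setting the black box aside, the underlying goal – choosing weights so that the survivors' covariate distribution matches that of the original sample – can be illustrated with a simple post-stratification sketch. This is a hypothetical stand-in for whatever the authors' algorithm actually does, using a single categorical covariate:

```python
from collections import Counter

def poststratification_weights(original, survivors):
    """Weight each survivor so that the weighted distribution of a
    categorical covariate matches the original sample's distribution."""
    orig_counts = Counter(original)
    surv_counts = Counter(survivors)
    n_orig, n_surv = len(original), len(survivors)
    # Weight for cell c = (original share of c) / (surviving share of c),
    # so over-represented cells are down-weighted and vice versa.
    return [(orig_counts[c] / n_orig) / (surv_counts[c] / n_surv)
            for c in survivors]

# Original sample was 50% urban / 50% rural; attrition hit rural
# households harder, so the survivors skew urban.
original = ["urban"] * 50 + ["rural"] * 50
survivors = ["urban"] * 40 + ["rural"] * 20
weights = poststratification_weights(original, survivors)
# Urban survivors get weight 0.75, rural survivors 1.5, restoring
# the 50/50 split in the weighted sample.
```

An actual ML-based approach would generalize this to many covariates at once, for instance by training a classifier to predict survival and weighting each survivor by the inverse of its predicted survival probability, but the mechanics above capture the intuition.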

The author presenting the results explained that they had made this choice because the machine learning algorithm was arguably neutral. Normally, the researchers might have consciously or unconsciously chosen weights that gave them the result they wanted. But by handing things over to an algorithm that used its own sufficiently-inscrutable method of determining the appropriate sample weights, the authors could not be accused of producing results that were the product of researcher degrees of freedom.

Machine learning is rapidly becoming a tool that empirical economists rely on. It seems to be most useful for measurement and generating new data, ranging from creating new spatial measures of poverty to measuring conflict, but it is also increasingly being used for estimation problems like the one the authors above were facing. The rise of the machines is undoubtedly going to be incredibly useful for our work, drastically widening our ability to tackle difficult, computationally-intensive problems. But I wonder if it also has implications for the way we check and challenge each other’s work.

So far economics has been spared the full brunt of the empirical replication crisis, but there are occasional warning signs that our own reckoning might not be far off. To ensure that research in empirical econ remains credible, there needs to be a general assurance that published results can be replicated. The term “replicate” means different things to different people, but I find Michael Clemens’s proposed definitions to be helpful. There are two types of replication that I think are likely to be affected by machine learning. The first is what Clemens refers to as `verification,’ the ability of a third party to take the same data as the original researchers, run the same code and generate the same results (and check that there are no errors in any of these processes). The second is a `reanalysis’, where a third party takes the same data as the original researchers, but investigates whether the results hold up to different ways of analyzing the data (such as different estimation methods, assumptions, perturbations of the data, etc.).

Wading through someone else’s Stata or R code in an effort to verify another researcher’s findings is not much fun, but it is manageable. PhD students and younger researchers may not have the same resources as their senior colleagues, but they usually have the time and diligence to go through others’ work and figure out whether the results really are there. But imagine a future where you first need to secure a sizable amount of server time and computing power before you can even think about a basic replication. In the paper above, the presenter noted that it took several days to run their re-weighting algorithm. Things seem even more daunting when it comes time for reanalysis, in this case changing the basic structure of the original algorithm (or re-training it on a different set of data). As methods grow more complex, results may be harder for a replicating party to parse. It would seem easy enough for the authors to hand over their algorithms and for third parties to run them, but significantly more difficult for the replicators to understand the precise limitations of a particular approach.

Despite these concerns, I suspect these problems will be transitory. Empirical economics is continuously going through waves of methodological innovation. Each wave begins with a few pioneers using a new tool or a new set of data, but the in-depth expertise needed to really critique those new methods always lags behind by a few years. More and more econ students are learning to code along the way. New norms around replicating results (such as posting data and code in an online repository like Github) are coalescing.

And if the abilities of replicators can’t keep up with the growing sophistication of algorithms, then maybe the same technology can be used to make the replicator’s life easier. Researchers in psychology have already used code to check large numbers of published papers for basic mathematical errors. Maybe in ten years, a normal part of the peer review process will entail turning over your results and code to the machines, so they can check them for errors and run an automated re-analysis.
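That psychology effort essentially boils down to recomputing reported p-values from the reported test statistics and flagging mismatches. A stripped-down sketch of the idea, covering only two-tailed z-tests (the real tools handle t, F and chi-square statistics as well, and the regex here is deliberately naive):

```python
import math
import re

def check_z_tests(text, tol=0.001):
    """Find reported 'z = X, p = .Y' pairs, recompute the two-tailed
    p-value as erfc(|z| / sqrt(2)), and flag inconsistent reports."""
    flagged = []
    for m in re.finditer(r"z\s*=\s*(-?\d+(?:\.\d+)?),\s*p\s*=\s*(\.\d+)", text):
        z, p_reported = float(m.group(1)), float(m.group(2))
        p_actual = math.erfc(abs(z) / math.sqrt(2))
        if abs(p_actual - p_reported) > tol:
            flagged.append((m.group(0), round(p_actual, 4)))
    return flagged

report = ("Treated firms grew faster (z = 2.10, p = .036); "
          "attrition was balanced across arms (z = 1.20, p = .010).")
flagged = check_z_tests(report)
# Only the second report is flagged: for z = 1.20 the true
# two-tailed p-value is about 0.23, not .010.
```

Scaling this up to scan a full PDF, or to re-run an entire analysis pipeline, is exactly the kind of automation that could make machine-assisted refereeing plausible.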

As if dealing with peer reviewers wasn’t harrowing enough. Imagine referee reports that go something like this:





A Checklist for the Modern Development Bureaucrat

If you are a fan of podcasts then you really should be listening to NPR’s Hidden Brain. This week’s episode is a fascinating look at the impact of “checklists” or “to-do” lists used across a number of different professions to offset human error. It recounts how, during the development of the Boeing B-17 “Flying Fortress” bomber in the late 1930s, a fatal crash of a prototype plane led the US Air Force to mandate the practice of running through checklists prior to future flights. While experts in high-skill professions like piloting or surgery typically feel confident in their abilities to get the job done, the addition of routine (and mundane) checks forces them to guard against unlikely but high-cost events.

In another recent podcast, the journalist Sarah Kliff frames this as treating mistakes as plane crashes rather than car crashes, with the former requiring a full rethink of how procedures are performed. She recounts how hospitals in the US began using checklists to reduce the incidence of central line infections. Even though medical staff are highly-trained professionals who should know the correct procedures for reducing the chance of contamination, infection rates plummeted once staff were required to use checklists prior to a procedure (there were, of course, other complementary changes in policy).

This led me to wonder whether `checklist culture’ is reflected in modern development policy. One might argue that in some ways policy has become too checklist-oriented. Many reform agendas – ranging from anti-money laundering standards to private sector reforms – rely on a simple list of indicators or policy changes that a country needs to check off in order to be compliant. While many of these agendas are centered around outcomes or processes worth achieving anyway, many are shallow in nature. The result is brittle institutions that look good on paper but are incapable of doing much beyond that, a point laid out long ago by Lant Pritchett when he first spoke of isomorphic mimicry.

But there may be some ways in which the checklist mentality could be useful for decision makers in the development space. We know from psychology and behavioral economics that people often exhibit cognitive biases in their decision-making. These biases lead them to make decisions that are bad for them in the long term, but while the ramifications can be substantial, they are largely confined to the individual level.

The stakes are potentially a lot higher when those cognitive biases and errors are being made by people whose decisions affect hundreds, thousands or millions of others. It would then seem important that development professionals be able to act as impartial, rational decision makers; alas, there is evidence that we’re just as flawed as the rest of humankind. A recent working paper by researchers from the World Bank and the Universities of East Anglia and Oxford recruited development professionals from DFID and the World Bank to investigate, aaaaaaand the results ain’t too pretty. From the paper’s abstract:

“Experiments conducted on a novel subject pool of development policy professionals (public servants of the World Bank and the Department for International Development in the United Kingdom) show that policy professionals are indeed subject to decision making traps, including sunk cost bias, the framing of losses and gains, frame-dependent risk-aversion, and, most strikingly, confirmation bias correlated with ideological priors, despite having an explicit mission to promote evidence-informed and impartial decision making. These findings should worry policy professionals and their principals in governments and large organizations, as well as citizens themselves.”

Thankfully, development professionals are not unchecked autocrats; our decisions are confined by the structures of the institutions we work for. But what we don’t know is whether those institutions mitigate or amplify our biases or priors – I think cases can be made in either direction. Development bureaucrats certainly have to clear a lot of hurdles to get their projects off the ground – but those `checklists’ are largely about mitigating risk, ensuring a proposal has been properly vetted and that it is likely development-friendly. There is some evidence from the above paper that deliberation is effective in reducing these biases, but one wonders whether the type of deliberation that the subjects (in this case DFID economists) participated in mirrors at all the kind of peer-review or administrative checks that the average bureaucrat at DFID or the World Bank goes through.

So maybe we need checklists specifically to offset our biases. I’ll start with a few ideas for both bureaucrats and researchy-economists working on a proposal or note or paper, but would be interested to hear what yours would be.

  1. Is there any rigorous evidence supporting the argument I am making?
  2. Have I sat down and examined whether my beliefs are based on emotion or reasoning?
  3. Would an ordinary person who doesn’t study development or economics understand what I am saying?
  4. Have I made the case that my proposal addresses a development/poverty question, rather than justifying its existence through internal or external politics or momentum around some issue?
  5. Have I listed, at least in my own head, the reasons why I might be wrong about this?
  6. Would someone in another team/department/institution make better use of these resources that I control?
  7. Have I written down a contingency plan for when things go wrong?
  8. Have I thought about how I will know if something has gone wrong?

By the way, you can find the podcasts I mentioned here: