Machine learning and the replication struggle

“Of course you can have my replication code. Good luck with that.”

I was in a seminar a few months ago where someone presented results from a panel study that had a significant attrition problem. This can be a serious issue in empirical work, as the surviving sample could be different in unobservable ways that might also be correlated with the intervention the authors are studying. While there are already a large number of ways that one can tackle the attrition problem, the authors of this study had done something novel: they had handed that surviving sample over to a machine learning algorithm which used [insert black box here] to re-weight the sample so it matched the original study population as closely as possible.

The author presenting the results explained that they had made this choice because the machine learning algorithm was arguably neutral. Normally, the researchers might have consciously or unconsciously chosen weights that gave them the result that they wanted. But by handing things over to an algorithm that used its own sufficiently-inscrutable method of determining the appropriate sample weights, the authors could not be accused of producing results that were the product of researcher degrees of freedom.
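For readers who have not seen re-weighting before, here is a rough sketch of the simplest version of the idea. It is not the authors' algorithm (the presentation left that as a black box). It is plain cell-based re-weighting on one observed characteristic, and the function name `cell_weights` and the rural/urban counts are invented purely for illustration.

```python
from collections import Counter


def cell_weights(baseline_cells, survivor_cells):
    """Return one weight per survivor so that weighted survivor shares
    match the baseline shares of each cell (hypothetical helper)."""
    base = Counter(baseline_cells)
    surv = Counter(survivor_cells)
    n_base = len(baseline_cells)
    n_surv = len(survivor_cells)
    weights = []
    for cell in survivor_cells:
        base_share = base[cell] / n_base   # share of this cell at baseline
        surv_share = surv[cell] / n_surv   # share of this cell among survivors
        weights.append(base_share / surv_share)
    return weights


# Made-up example: 60/40 rural/urban at baseline, heavier rural attrition.
baseline = ["rural"] * 60 + ["urban"] * 40
survivors = ["rural"] * 30 + ["urban"] * 35

w = cell_weights(baseline, survivors)
weighted_rural_share = (
    sum(wi for wi, c in zip(w, survivors) if c == "rural") / sum(w)
)
print(round(weighted_rural_share, 3))
```

Weights like these only balance characteristics you can observe. They do nothing about the unobservable differences that make attrition worrying in the first place, and a cell that disappears entirely from the surviving sample cannot be weighted back in.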

Machine learning is rapidly becoming a tool that empirical economists are relying on. It seems to be most useful for measurement and generating new data, from creating new spatial measures of poverty to measuring conflict, but it is also more frequently being used for estimation issues like the ones the authors above were facing. The rise of the machines is undoubtedly going to be incredibly useful for our work, drastically widening our ability to tackle difficult, computationally-intense problems. But I wonder if it also has implications for the way we check and challenge each other’s work.

So far economics has been spared the full brunt of the empirical replication crisis, but there are also occasionally warning signs that our own reckoning might not be far off. To ensure that research in empirical econ remains credible, there needs to be a general assurance that published results can be replicated. The term “replicate” means different things to different people, but I find Michael Clemens’s proposed definitions to be helpful. There are two types of replication that I think are likely to be affected by machine learning. The first is what Clemens refers to as ‘verification,’ the ability of a third party to take the same data as the original researchers, run the same code and generate the same results (and check that there are no errors in any of these processes). The second is a ‘reanalysis’, where a third party takes the same data as the original researchers, but investigates whether the results hold up to different ways of analyzing the data (such as different estimation methods, assumptions, perturbations of the data, etc).

Wading through someone else’s Stata or R code in an effort to verify another researcher’s findings is not much fun, but it is manageable. PhD students and younger researchers may not have the same resources as their senior colleagues, but they usually have the time and diligence to go through others’ work and figure out whether the results really are there. But imagine a future where you first need to secure a sizable amount of server time and computing power before you can even think about a basic replication. In the above paper, the presenter noted that it took several days for them to run their re-weighting algorithm. Things seem even more daunting when it comes time for reanalysis, in this case changing the basic structure of the original algorithm (or re-training it on a different set of data). As methods grow more complex, results may be harder for a replicating party to parse. It would seem easy enough for the authors to hand over their algorithms and for third parties to run them, but significantly more difficult for the replicators to understand what the precise limitations of a particular approach might be.

Despite these concerns, I suspect these problems will be transitory. Empirical economics is continuously going through waves of methodological innovation. Each wave begins with a few pioneers using a new tool or a new set of data, but the in-depth expertise to really critique those new methods always lags behind by a few years. More and more econ students are learning to code along the way. New norms around replicating results (such as posting data and code in an online repository like GitHub) are coalescing.

And if the abilities of the replicators can’t keep up with the growing sophistication of algorithms, then maybe the same technology can be used to make the replicator’s life easy. Researchers in psychology have already used code to check large numbers of published papers for basic mathematical errors. Maybe in ten years, a normal part of the peer review process will entail turning over your results and code to the machines, so they can check it for errors and run an automated re-analysis.
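To give a flavour of what an automated check of “basic mathematical errors” could look like, here is a minimal sketch. It assumes the paper reports a z-statistic and a two-sided p-value rounded to a fixed number of decimals. It is not a description of the tool the psychologists actually used, and the function names are hypothetical.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def is_consistent(z, reported_p, decimals=2):
    """True if the reported p-value matches the recomputed one,
    allowing for rounding to `decimals` places (hypothetical helper)."""
    recomputed = two_sided_p_from_z(z)
    tolerance = 0.5 * 10 ** (-decimals) + 1e-12
    return abs(recomputed - reported_p) <= tolerance


print(is_consistent(1.96, 0.05))  # the statistic and p-value agree
print(is_consistent(1.50, 0.04))  # the reported p-value is too small
```

A real checker would also need to parse the statistics out of the text and handle t, F and chi-squared tests, which is where most of the work lies.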

As if dealing with peer reviewers wasn’t harrowing enough. Imagine referee reports that go something like this:





A Checklist for the Modern Development Bureaucrat

If you are a fan of podcasts then you really should be listening to NPR’s Hidden Brain. This week’s episode is a fascinating look at the impact of “checklists” or “to-do” lists used across a number of different professions to offset human error. It recounts how, during the development of the Boeing B-17 “Flying Fortress” bomber in the late 1930s, a fatal crash of a prototype plane led the US Air Force to mandate the practice of running through checklists prior to future flights. While experts in high-skill professions like piloting or surgery typically feel confident in their abilities to get the job done, the addition of routine (and mundane) checks forces them to guard against unlikely but high-cost events.

In another recent podcast the journalist Sarah Kliff frames this as treating mistakes as plane crashes rather than car crashes, the former requiring a full rethink of how procedures are performed. She recounts how hospitals in the US began using checklists to reduce the incidence of central line infections. Even though medical staff are highly-trained professionals who should know the correct procedures to reduce the chance of contamination, infection rates plummeted once they were forced to use checklists prior to a procedure (there were of course other complementary changes in policy).

This led me to wonder whether or not ‘checklist culture’ is reflected in modern development policy. One might argue that in some ways policy has become too checklist oriented. Many reform agendas – ranging from anti-money laundering standards to private sector reforms – rely on a simple list of indicators or policy changes that a country needs to check off in order to be compliant. While many of these agendas are centered around outcomes or processes worth achieving anyway, many are shallow in nature. This results in brittle institutions that look good on paper, but are incapable of doing much beyond that, a point laid out a long time ago by Lant Pritchett when he first spoke of isomorphic mimicry.

But there may be some ways in which the checklist mentality might be useful for decision makers in the development space. We know from psychology and behavioral economics that people often exhibit cognitive biases in their decision-making. This leads them to make decisions that are bad for them in the long term, but while the ramifications can be substantial, they are largely confined to the individual level.

The stakes are potentially a lot higher when those cognitive biases and errors are being made by those who make decisions that affect hundreds, thousands or millions of other people. It would then seem important that development professionals be able to act as impartial, rational decision makers. Alas, there is evidence that we’re just as flawed as the rest of humankind. A recent working paper by researchers from the World Bank and the Universities of East Anglia and Oxford brought in development professionals from DFID and the WB to investigate, aaaaaaand the results ain’t too pretty. From the paper’s abstract:

“Experiments conducted on a novel subject pool of development policy professionals (public servants of the World Bank and the Department for International Development in the United Kingdom) show that policy professionals are indeed subject to decision making traps, including sunk cost bias, the framing of losses and gains, frame-dependent risk-aversion, and, most strikingly, confirmation bias correlated with ideological priors, despite having an explicit mission to promote evidence-informed and impartial decision making. These findings should worry policy professionals and their principals in governments and large organizations, as well as citizens themselves.”

Thankfully, development professionals are not unchecked autocrats; our decisions are confined by the structures of the institutions we work for. But what we don’t know is whether those institutions mitigate or amplify our biases or priors – I think cases can be made in either direction. Development bureaucrats certainly have to jump through a lot of hurdles to get their projects off the ground – but those ‘checklists’ are largely around mitigating risk, ensuring a proposal has been properly vetted and that it is likely development-friendly. There is some evidence from the above paper that deliberation is effective in reducing these biases, but one wonders whether the type of deliberation that the subjects (in this case DFID economists) participated in mirrors at all the kind of peer-review or administrative checks that the average bureaucrat at DFID or the World Bank goes through.

So maybe we need checklists specifically to offset our biases. I’ll start with a few ideas for both bureaucrats and researchy-economists working on a proposal or note or paper, but would be interested to hear what yours would be.

  1. Is there any rigorous evidence supporting the argument I am making?
  2. Have I sat down and examined whether my beliefs are based on emotion or reasoning?
  3. Would an ordinary person who doesn’t study development or economics understand what I am saying?
  4. Have I made the case that my proposal addresses a development/poverty question, rather than justifying its existence through internal or external politics or momentum around some issue?
  5. Have I listed, at least in my own head, the reasons why I might be wrong about this?
  6. Would someone in another team/department/institution make better use of these resources that I control?
  7. Have I written down a contingency plan for when things go wrong?
  8. Have I thought about how I will know if something has gone wrong?

By the way, you can find the podcasts I mentioned here:

Next week I’ll be unemployed (for a week)

Sooner or later, everybody goes through McCann-Erickson

I’m changing jobs! Next week I’ll cease to work at CGD and will move to Washington D.C. to join the World Bank. I am joining the Bank’s new Global Tax Team to help with work ranging from international tax issues (profit shifting, tax havens) to domestic resource issues (tax compliance/morale).

I’ve been at CGD for nearly three years. It has been an amazing place to work and a difficult place to leave.

More difficult will be the reverse culture shock of returning to the US after more than a decade away. Then again, D.C. is the one and only place that I’ve ever been recognized by a complete stranger who reads the blog (which was amazing and something I cling to in my blogging obsolescence). Given the current situation in the UK, this JPEG accurately captures how I feel right now (although I could be eating those words next week):


Speaking of the blog – the plan is to keep things going, even if the rate has been substantially lower than in our years of glory.

The rise of empirical econ, in one chart

Goddammit I need more memory

I apologize for the click-baiting title, but this is pretty cool. John C. McCallum has assembled a (rough) estimate of the price of computer memory (mainly RAM) over time. I’ve adjusted the prices for inflation and graphed them over time. The results are pretty amazing (keep in mind the y-axis is log-scale).
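For anyone wanting to reproduce this kind of chart, the two steps (deflating nominal prices, then moving to a log scale) can be sketched as below. The rows are placeholder numbers made up for the example, not McCallum's data, and the use of a CPI-style index with a chosen base year is an assumption about how the adjustment was done.

```python
import math


def to_real(nominal_price, index_in_year, index_in_base_year):
    """Convert a nominal price into base-year prices using a price index."""
    return nominal_price * index_in_base_year / index_in_year


# (year label, nominal price per MB, price index) -- PLACEHOLDER values only.
rows = [
    ("early", 1000.0, 50.0),
    ("middle", 10.0, 100.0),
    ("late", 0.1, 200.0),
]
base_index = 200.0  # express everything in "late" prices

real_prices = [(label, to_real(p, idx, base_index)) for label, p, idx in rows]

# On a log10 scale each unit step is a tenfold change in price.
log_prices = [(label, math.log10(price)) for label, price in real_prices]

for label, value in log_prices:
    print(label, round(value, 2))
```

The log scale matters here because the series spans many orders of magnitude; on a linear axis everything after the first few observations would be flattened against zero.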


Without cheap memory, you can say goodbye to the big data sets and complex calculations which really enabled empirical econ to take off. Sure, CPU speed matters as well and the RCT folks were always a little less reliant on large data sets, but can you imagine having to bootstrap those standard errors with 2 MB of RAM? File this under “things are getting better.” Hat tip to Data is Plural, a newsletter you really should subscribe to if you like random data sets.
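In case the bootstrap reference is unfamiliar, here is a minimal sketch of a nonparametric bootstrap for the standard error of a mean. The function name, the toy sample and the number of replications are all illustrative choices, not anything taken from a particular paper.

```python
import math
import random
import statistics


def bootstrap_se(sample, n_reps=2000, seed=0):
    """Bootstrap standard error of the sample mean (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(sample)
    means = []
    for _ in range(n_reps):
        # Draw n observations with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    # The spread of the resampled means estimates the standard error.
    return statistics.stdev(means)


sample = [float(x) for x in range(1, 101)]  # toy data: 1, 2, ..., 100
se_boot = bootstrap_se(sample)
se_analytic = statistics.stdev(sample) / math.sqrt(len(sample))
print(round(se_boot, 2), round(se_analytic, 2))
```

Every replication builds a full copy of the sample, so memory use scales with the size of the data set, which is the point of the joke above.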

I’m upset that my football team lost, so I’m going to have to ask you to leave the country

I AM THE LAW! Except when the Broncos lose. Then I just turn to jelly.

In the NYT, immigration judges contemplate how biases might creep into the decisions they make:

In all, 336 people from 13 countries and even more ethnic backgrounds appeared in San Francisco’s immigration court recently over three days. All of them were facing possible deportation, because they either were in the United States illegally or had committed crimes serious enough to jeopardize their legal presence as noncitizens. One challenge facing Judge Marks was deciding whether to deport some of them immediately after they had testified. Another challenge was her own biases.

“You have to go through some hypotheticals in your brain,” said Judge Marks, wrestling with the weighty decisions she must make, the little time she has to make them and all the impressions she and her judicial colleagues form from the bench about the immigrants before them.

“Would I treat a young person the same way I’m treating this old person?” she said. “Would I treat a black person the same way I’m treating this white person? This situation of rush, rush, rush as fast as we can go, it’s not conducive to doing that.”

The solution? Anti-bias training:

Now, as the country struggles with how these instinctive judgments shape our lives, the Justice Department is trying to minimize the role of bias in law enforcement and the courts. More than 250 federal immigration judges attended a mandatory anti-bias training session in August, and this summer the Justice Department announced that 28,000 more employees would go through a similar exercise.

This seems reasonable, but what about factors that influence decisions that go beyond the characteristics of the immigrant? Enter a recent (unpublished) paper by Daniel Chen:

I detect intra-judge variation in judicial decisions driven by factors completely unrelated to the merits of the case, or to any case characteristic for that matter. Concretely, I show that asylum grant rates in U.S. immigration courts differ by the success of the court city’s NFL team on the night before, and by the city’s weather on the day of, the decision. My data including half a million decisions spanning two decades allows me to exclude confounding factors, such as scheduling and seasonal effects. Most importantly, my design holds the identity of the judge constant. On average, U.S. immigration judges grant an additional 1.5% of asylum petitions on the day after their city’s NFL team won, relative to days after the team lost. Bad weather on the day of the decision has approximately the opposite effect. By way of comparison, the average grant rate is 39%. In contrast, I do not find comparable effects in sentencing decisions of U.S. District Courts, and speculate that this may be due to higher quality of the federal judges, more time for deliberation, or the constraining effect of the federal sentencing guidelines.

Yikes. If it’s true, then there are all sorts of external factors which affect the fates of thousands of asylum seekers, some of whom are turned away because the judge is just having a bad day. This wouldn’t be the first paper to find that irrelevant, external factors influence judicial decisions. A recent paper by Ozkan Eren and Naci Mocan finds similar effects (this time via college football – go figure) on decisions in juvenile courts. Others have found that judges are less likely to rule in the defendant’s favour when they are hangry.

Maybe judges should take some sort of mood test before they are allowed to review cases. Or maybe, despite what the folks at ProPublica think, it’s time to let the machines do the work for us.

Hat tip to Charles Kenny for the Chen paper.

The European Union wants to use aid to kill people. Seriously.


Here we go. Deep breaths. From the Guardian:

When international donors and the Afghan government convene in Brussels next week, the EU secretly plans to threaten Afghanistan with a reduction in aid if the war-torn country does not accept at least 80,000 deported asylum seekers.

According to a leaked restricted memo (pdf), the EU will make some of its aid “migration sensitive”, even while acknowledging that security in Afghanistan is worsening.

Let’s be absolutely clear here: deporting people from Europe to Afghanistan harms them. The evidence of the enormous individual benefits of migration is – at this point – pretty irrefutable. At the very least, sending people to a poor, conflict-ridden country condemns them to a lower lifetime income, fewer opportunities, worse health outcomes and shorter lives (the average life expectancy – unweighted – in Europe is 78, in Afghanistan it is 60). Even worse, deported people face persecution and violent deaths. If we send 80,000 people back to Afghanistan, some of them will die unnecessarily and many more will suffer.

I understand that countries have to deport people sometimes. But when these nasty, cynical policies are tied to development aid, those aid policies are forever tainted. It is another worrying sign that EU aid is being used to harm and endanger the very people it should be helping.

When I saw Alfonso Cuarón’s adaptation of PD James’s “Children of Men” six years ago, I worried it might some day come true. I worry more now. Shame on these people. Shame on all of them.

The sum of parts

You’re doing it wrong

The IMF has a new paper out on gender budgeting efforts in sub-Saharan African countries:

Gender budgeting is an initiative to use fiscal policy and administration to address gender inequality and women’s advancement. A large number of sub-Saharan African countries have adopted gender budgeting. Two countries that have achieved notable success in their efforts are Uganda and Rwanda, both of which have integrated gender-oriented goals into budget policies, programs, and processes in fundamental ways. Other countries have made more limited progress in introducing gender budgeting into their budget-making. Leadership by the ministry of finance is critical for enduring effects, although nongovernmental organizations and parliamentary bodies in sub-Saharan Africa play an essential role in advocating for gender budgeting.

These sorts of efforts have certainly improved in both scope and sophistication. Back when I worked in the budget division of the Malawian Ministry of Finance, I was only asked once to perform any sort of analysis of the gender focus of the budget. The request that landed on my desk had come from the Commonwealth, who wanted to know how many times the word “gender” had been used in any of the previous presentations of the national budget to parliament. After fishing out the transcripts from the Ministry’s library, I eventually discovered the answer was “zero.”

Crowdfunding lives

Crowds sometimes choose inefficient ways to save lives.

A week ago, two climbers from Utah disappeared in a Pakistan mountain range. The two had already attempted to climb that particular peak last year, but a nearly-fatal accident prevented them from reaching the summit.

Since their disappearance, the internet has successfully crowdsourced over $100,000 to mount a rescue mission.

According to the impact calculator at The Life You Can Save, 100 grand could at the very least save dozens of lives if donated to the right charity. But we haven’t seemed to figure out how to make these causes feel quite as urgent as two blokes stuck on top of a mountain.

Economists aren’t really supposed to judge people’s preferences, and it’s unlikely that the plight of the mountaineers is displacing money that otherwise could have been used to de-worm people or give them cash transfers. In fact, there is some (hopefully soon to be released) evidence that even big charity appeals lead to a net increase in people’s propensity to give. But it would be nice if that urgent, empathetic urge to give could be activated in the deeply impersonal world of effective altruism.

Help me test a (very silly) hypothesis by answering a few questions


I’ve held a silly hypothesis in my head ever since I was a grad student, but never had the time/resources to test it. I just recently came across a publication which drastically reduced the costs of testing the idea. It will almost certainly result in a “jokey” paper, but a fun one nonetheless.

But I could use your help. I have constructed an online survey displaying photos of people, and I need respondents to tell me whether these people are smiling, frowning or have neutral expressions. There are over 170 questions, but they are randomized, so even if you only manage to answer a few (and then close the window), it still would help a lot!

I can’t tell you what the idea is just yet, because it might spoil how you answer the questions. More information to follow, once enough people have answered the survey.

Click through here to enter the survey.

On Hirschman, measurement and RCTs

Every intervention is unique, perhaps unintentionally so.

Recently, Dan Honig of Johns Hopkins forwarded Ranil and me some thoughts he had in reaction to an Albert Hirschman book on development projects that he felt was pertinent to the discussion on the pros and cons of RCTs. What follows is a discussion (rant) between Dan, Ranil and me. I’ve edited out the e-maily bits for clarity:


Matt, just after I hit send on this I realized I should have included you on this – I generally think you’re right on RCTs and the stale-ness of the conversation (and discussed this with Ranil a few months back some 2 days after you had dinner with him, hence the cc to Ranil) but feel like I’ve never seen this Hirschman frame and wondering if it struck you as interesting. And yes, basically I’m trying to catalyze you writing something cool on this so I can quote/reference it down the road.

Reading Hirschman’s Development Projects Observed for the first time, and as I read it he’s with [Lant Pritchett and Michael Woolcock] on RCTs and causal density in international development projects. The quote below is from page 186 of the 1967 edition; italics are his, brackets mine; just before this he suggests we may not be able to identify good indicators of effects ex-ante and thus presumably couldn’t be pre-specified in a trial, meaning presumably we would be ill served by an RCT on a particular intervention even if we ignored external validity concerns.

“The indirect effects [of development projects] are so varied as to escape detection by one or even several criteria uniformly applied to all projects. Upon inspection, each project turns out to represent a unique constellation of experiences and consequences, of direct and indirect effects.”



Hey Dan, that’s a really interesting quote by Hirschman. If my interpretation is correct, it seems to be more damning for empirical evaluation in general than for RCTs in particular.

I’m not sure how I feel about this. Even if you move away from a simple, reduced form causal framework, Hirschman’s critique seems like it would apply. Even if development is a messy, complex thing that can’t really be boiled down in an impact evaluation framework, we still rely on measurement when we talk about development, and any given set of measurements is going to leave out things which might matter which are unmeasured. We can point at improving test scores but leave out student stress, etc, and the set of things that we leave out that might be important will change depending on the context. I guess I see this as a problem of measurement rather than as a problem for RCTs.

I also wonder what this means for how an empirical researcher operates. Over the last few years, I have become incredibly suspicious of surprising, counter-intuitive results, where a researcher measures something outside of the standard set of outcomes and finds a result. In a world of multiple hypothesis tests, expanding the set of outcomes to include as much of Hirschman’s unique constellation as possible will open up the door to a lot of false positives which will end up getting written up and published.

So that was a rant. Um, what do you think Ranil?
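The multiple-testing worry in the reply above can be made concrete with a little arithmetic. The sketch below assumes every outcome is tested independently at the same significance level and that none of the true effects is non-zero; both are simplifying assumptions, and the function names are hypothetical.

```python
def familywise_error_rate(n_outcomes, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_outcomes


def bonferroni_threshold(n_outcomes, alpha=0.05):
    """Per-test significance level that caps the family-wise rate at alpha."""
    return alpha / n_outcomes


for k in (1, 5, 14, 50):
    print(k, round(familywise_error_rate(k), 3), bonferroni_threshold(k))
```

Under these assumptions, a study that tests 14 unrelated outcomes at the 5% level is more likely than not to turn up at least one “significant” result by chance alone. Corrections like Bonferroni guard against that, at the cost of making it harder to detect the real indirect effects Hirschman had in mind.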
