In 2016, a team of 18 economists led by Colin Camerer replicated 18 experimental studies published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014. Only 11 of the 18 studies (61%) produced a statistically significant effect in the same direction as the original. The replicated effects were, on average, just 66% the size of those originally reported. This is the empirical face of what is now called the replication crisis in economics.
A 2018 replication of 21 social science experiments published in Nature and Science found a similar 62% replication rate, with effect sizes about half the original. These numbers are best read as a sign of the discipline’s growing scientific maturity. Economics is testing its own claims, opening its data, and rebuilding the credibility of empirical work from the ground up.
Replication in Economics
Replication in economics has three flavours, and confusing them generates most of the heat in this debate.
Computational reproducibility asks the simplest question: if a researcher runs the original code on the original data, do they get the numbers in the published table? Direct replication goes further: it collects new data using the original study’s procedures and checks whether the same result appears. Conceptual replication tests the same underlying hypothesis with different methods, samples, or settings. A finding that survives all three is robust. One that survives only the first is merely accurate, not necessarily true.
The crisis label began in psychology. In 2015, the Open Science Collaboration published a landmark study replicating 100 experiments from three top psychology journals. Although 97% of the original studies had reported statistically significant results, only 36% of the replications did, and replication effect sizes averaged half the original magnitude. The shock waves reached economics quickly. Within a year, Camerer’s team had published its experimental economics replication. By 2017, Andrew Chang and Phillip Li at the Federal Reserve Board had reported that they could reproduce only about half of 60 macroeconomics papers from 13 journals when working with the original data and code.
The work has only intensified. The Institute for Replication, founded by Abel Brodeur, now coordinates teams of replicators who systematically reanalyse papers from leading economics and political science journals. A 2024 IZA discussion paper by Brodeur and over 350 coauthors examined 110 articles from leading journals. More than 85% were fully computationally reproducible, but coding errors appeared in roughly a quarter of them. About 70% of independent robustness tests remained statistically significant, with t and z values about 20% lower on average than in the original papers.
Read together, these projects sketch a consistent picture. Economics does well on the narrowest definition of reproducibility, less well on direct replication, and shows meaningful effect-size shrinkage almost everywhere. The discipline is more transparent than it was a decade ago. It is also nowhere near as airtight as economists once liked to imagine.
Why Findings Fail to Replicate
If most published economists are competent and honest, why do so many findings shrink or vanish on retest? The answer lies in the incentive structure of academic publishing and a handful of statistical traps that even careful researchers can fall into.
The first culprit is publication bias. Top journals overwhelmingly publish papers that report a statistically significant result, especially one with a surprising sign or magnitude. Studies that find no effect are far harder to publish, even when they ask important questions and are well executed. The result is a literature systematically tilted toward positive findings. If only the most striking estimates from a research programme make it into print, the published average will overstate the true effect. Brodeur, Cook, and Neisser document this pattern in The Economic Journal, showing that the distribution of t-statistics in published economics papers has a suspicious bunch just past the conventional significance threshold.
Closely related is p-hacking: the conscious or unconscious search through specifications, control variables, sample restrictions, and outcome definitions until one combination yields p < 0.05. With modern data and software, a researcher faces hundreds of plausible analytical paths. If twenty of them are run and the one that produces stars is reported, the resulting p-value no longer means what it claims to mean. The probability of a false positive under the null is no longer 5% but much higher.
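The arithmetic behind that claim can be sketched in a few lines. This is an illustration, not a calculation taken from any of the studies cited. It assumes the specifications are statistically independent, which specifications run on the same data are not, so it should be read as a benchmark rather than a precise rate. The function name `familywise_false_positive_rate` is made up for this sketch.

```python
def familywise_false_positive_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n_tests null effects looks significant.

    Assumes the tests are independent (an idealisation: specifications run
    on the same data are correlated, so the real rate is typically lower,
    though still above alpha).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# How the false-positive risk grows with the number of specifications tried.
for n_tests in (1, 5, 20, 100):
    rate = familywise_false_positive_rate(n_tests)
    print(f"{n_tests:>3} specifications -> {rate:.1%} chance of at least one false positive")
```

With twenty independent specifications, the chance of at least one spurious "significant" result is roughly 64%, not 5%.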
A cousin of p-hacking is HARKing, or hypothesising after the results are known. A researcher runs an exploratory analysis, finds an unexpected pattern, and writes the paper as though that pattern were the predicted hypothesis from the start. The narrative becomes airtight. The statistical inference is now invalid, because exploratory findings have not been tested against new data.
Sample sizes and statistical power matter too. Many influential studies in economics, particularly older ones in development, behavioural economics, and labour, were run on samples too small to detect realistic effect sizes reliably. A finding that appears with p = 0.04 in a small underpowered study is likely to be inflated, because only the noisiest, largest estimates clear the significance bar. When the same effect is tested in a larger sample, the estimate shrinks toward the truth, and the original looks like a false alarm.
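What "underpowered" means can be made concrete with a short power calculation. The sketch below assumes a two-sided z-test on a difference in means between two equal-sized groups, with outcomes measured in standard-deviation units. The effect size of 0.2 and the sample sizes are hypothetical, and the helper names are invented for illustration.

```python
from statistics import NormalDist


def power_two_sided(true_effect: float, std_error: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test when the estimate is normal around true_effect."""
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    shift = true_effect / std_error                    # true effect in standard-error units
    # Probability that the z-statistic lands beyond either critical value.
    return (1 - standard_normal.cdf(critical - shift)) + standard_normal.cdf(-critical - shift)


def std_error_two_arm(n_per_arm: int, outcome_sd: float = 1.0) -> float:
    """Standard error of a difference in means between two equal-sized arms."""
    return outcome_sd * (2 / n_per_arm) ** 0.5


# Hypothetical effect of 0.2 standard deviations: small versus large study.
for n_per_arm in (50, 400):
    power = power_two_sided(0.2, std_error_two_arm(n_per_arm))
    print(f"n = {n_per_arm} per arm: power = {power:.0%}")
```

Under these assumptions, a study with 50 observations per arm detects the effect only about 17% of the time, while one with 400 per arm detects it roughly 80% of the time. In the small study, an estimate has to be about twice the true effect just to reach significance.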
Finally, the file drawer problem haunts every meta-analysis. Null results often never see the light of day. They sit on hard drives, in unfinished drafts, in failed dissertations. The published literature thus represents not the universe of attempts to study a question but a curated subset that survived editorial selection. A meta-analysis of published studies will inherit this bias unless it explicitly corrects for it.
The graphic intuition is simple. Picture a true effect that is small and positive. Random sampling produces estimates scattered around it. Add a publication filter that only lets through estimates with p < 0.05, and the average of what gets published will be larger than the truth. Add researcher flexibility, and even null effects can produce a steady stream of false positives. Replication is the corrective: when a new team draws a fresh sample and tests the same hypothesis, the truth tends to reveal itself, and the original estimate looks inflated or wrong.
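The same intuition can be checked with a small simulation. The sketch below is a stylised model, not a reconstruction of any real literature. It assumes normally distributed estimates, a true effect of 0.1, original studies of 50 observations, replications of 500, and a filter that publishes only results with two-sided p < 0.05. All names and parameter values are hypothetical. It models the publication filter only, not researcher flexibility.

```python
import random
import statistics


def simulate_literature(true_effect=0.1, n_original=50, n_replication=500,
                        n_studies=10_000, seed=0):
    """Simulate many studies of one effect, a significance filter, and replications."""
    rng = random.Random(seed)
    se_original = 1 / n_original ** 0.5        # standard error with unit-variance outcomes
    se_replication = 1 / n_replication ** 0.5

    # Every study attempted: estimates scatter around the true effect.
    attempted = [rng.gauss(true_effect, se_original) for _ in range(n_studies)]

    # Publication filter: only two-sided p < 0.05 (|z| > 1.96) gets into print.
    published = [est for est in attempted if abs(est) / se_original > 1.96]

    # Replication: one fresh, larger sample per published study.
    replications = [rng.gauss(true_effect, se_replication) for _ in published]

    return {
        "mean_attempted": statistics.mean(attempted),
        "mean_published": statistics.mean(published),
        "mean_replication": statistics.mean(replications),
        "share_published": len(published) / n_studies,
    }


results = simulate_literature()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the average published estimate comes out well above the true effect of 0.1, while the average replication estimate sits close to it. That is the shrinkage pattern the replication projects report.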

Notable Replication Failures
Three high-profile cases illustrate how non-replications reshape policy debates.
The most famous is Carmen Reinhart and Kenneth Rogoff’s 2010 paper “Growth in a Time of Debt,” which reported that countries with public debt above 90% of GDP suffered an average real GDP growth rate of -0.1%, compared with 3-4% for less indebted countries. The 90% threshold became a rallying point for austerity advocates in the United States and Europe. Paul Ryan cited it in his “Path to Prosperity” budget. UK Chancellor George Osborne and European Commission officials referred to it repeatedly. In April 2013, doctoral student Thomas Herndon and his advisors, Michael Ash and Robert Pollin, obtained the original spreadsheet and found three problems: an Excel formula that failed to include Australia, Austria, Belgium, Canada, and Denmark in a key average; selective exclusion of certain country-year observations; and an unconventional weighting scheme. After correction, average growth at debt-to-GDP ratios above 90% rose from -0.1% to 2.2%. The dramatic growth cliff disappeared. A relationship between high debt and slower growth remained, but it was a far less dramatic story than the one that had been driving fiscal policy.
The second case concerns the employment effects of minimum wage increases. David Card and Alan Krueger’s 1994 study of fast-food restaurants in New Jersey and Pennsylvania, conducted via telephone survey, found no evidence that a higher minimum wage reduced employment, contradicting basic supply-and-demand intuition and reshaping labour economics. David Neumark and William Wascher (2000) re-examined the same natural experiment using payroll records from 230 restaurants and reached the opposite conclusion: employment in New Jersey fell relative to Pennsylvania. Card and Krueger replied with a defence based on Bureau of Labor Statistics data. Decades of follow-up work have produced a more textured picture, in which moderate minimum wage increases have small effects on employment that vary by region, industry, and the size of the increase. The episode is less a clean failure than a demonstration that data quality and measurement matter as much as identification.
A third example is microfinance. In the 2000s, a wave of enthusiasm presented small-loan programmes as a powerful tool for lifting families out of poverty. Subsequent randomised evaluations across six countries, summarised by Esther Duflo, Abhijit Banerjee, and coauthors, found modest and varied effects on business activity but no transformative impact on household income, consumption, or poverty in most settings. The original case for microcredit relied heavily on early non-experimental studies whose causal claims did not survive rigorous replication. The field corrected itself, but only after billions in donor funding had flowed on the strength of the original narrative.
What these episodes have in common is not academic embarrassment but a lag. Each finding shaped policy for years before the replication arrived. Austerity programmes in Europe, minimum wage debates in the United States, and microfinance expansion in South Asia all moved on the basis of empirical claims that later proved fragile. The lesson is not that economics cannot guide policy. The lesson is that policy should not move on a single paper, however prestigious the journal.

Reforms in Economics
Economics has responded to the crisis with the most aggressive reform agenda in social science. The changes are practical and fast.
Pre-registration requires researchers to record their hypotheses, data sources, sample sizes, and analytical specifications before collecting or analysing data. The American Economic Association maintains a Randomized Controlled Trial Registry, and AEA journals require all field experiments to be registered prior to submission. The trend is toward pre-registration of observational studies as well, especially in development economics.
Pre-analysis plans go beyond registration. They specify exactly which regressions will be run, which controls included, which subgroups examined, and how missing data will be handled. The plan is filed before the data are touched. Deviations are allowed but must be flagged. The point is to remove the researcher’s degrees of freedom before they can be exploited.
Registered reports, pioneered in economics by the Journal of Development Economics, take pre-registration to its logical conclusion. Journals review the research question and methods first, before any results are known, and commit to publishing the paper based on the quality of the design rather than the statistical significance of the findings. A registered report cannot be killed because it produced a null result.
Data and code sharing is now mandatory at most leading economics journals. The American Economic Association’s Data and Code Availability Policy requires authors of accepted papers to deposit their full replication package, including raw data, cleaning code, and analysis code, in the AEA Data and Code Repository before publication. A dedicated AEA Data Editor verifies that the code actually reproduces the published numbers. The policy has dramatically increased the share of empirical economics papers that can be replicated computationally, even if direct replication remains difficult when datasets are proprietary.
Meta-analysis has matured into a standard tool for synthesising evidence across studies. By combining estimates from many papers, often after correcting for publication bias using techniques such as the funnel plot, trim-and-fill, or selection models, meta-analysts can recover effect sizes closer to the truth than any single study would suggest. The minimum wage literature, the elasticity of taxable income, and the returns to schooling have all been substantially recalibrated by careful meta-analytic work in the past decade.
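The core pooling step can be sketched as follows. This is a minimal fixed-effect, inverse-variance average. It is an assumption about the simplest way to combine estimates and not the method of any particular meta-analysis mentioned here. It applies none of the publication-bias corrections listed above (funnel plots, trim-and-fill, selection models). The three study estimates are hypothetical.

```python
import math


def pooled_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of study estimates (fixed-effect model).

    Each study is weighted by 1 / SE^2, so precise studies count for more.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical literature: the smallest, noisiest study reports the largest effect.
study_estimates = [0.40, 0.25, 0.10]
study_std_errors = [0.20, 0.10, 0.05]

pooled, pooled_se = pooled_fixed_effect(study_estimates, study_std_errors)
simple_mean = sum(study_estimates) / len(study_estimates)
print(f"simple mean of estimates: {simple_mean:.3f}")
print(f"inverse-variance pooled:  {pooled:.3f} (SE {pooled_se:.3f})")
```

Because the weights favour precise studies, the pooled estimate is pulled toward the large, low-noise study and away from the small study with the striking result. If the true effect differs across settings, a random-effects model is the more usual choice.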
Behind these tools is a broader culture shift. Replication, once a thankless and unpublishable activity, now has dedicated outlets, including the I4R Discussion Paper Series and the new generation of journals that specialise in replications. Top departments are hiring economists who do meta-research. The career incentives are not yet aligned, but they are bending in the right direction.
Replication Across Fields
The chart below compares replication success rates across major fields based on the most-cited large-scale projects. Economics looks comparatively healthy on this measure, though the comparison is rough because each project used different criteria, journals, and time windows.
[Chart: replication success rates by field]
Sources: Camerer et al. (2016), Camerer et al. (2018), Open Science Collaboration (2015), Brodeur et al. (2024). Comparisons are approximate due to differing replication criteria across projects.
The 85% figure for the Brodeur et al. project measures computational reproducibility, the easiest bar to clear, while the 36% for psychology counts the share of replications that reached statistical significance, the hardest. A fair reading is that, when the same significance standard is applied, experimental economics (61%) performs on a par with the social science experiments in Nature and Science (62%) and ahead of psychology (36%), and that economics does better than most fields in making code and data available so that any check is even possible.
The table below summarises the major replication initiatives that have shaped the modern economics conversation.
| Project | Year | Studies Examined | Headline Result | Key Finding |
|---|---|---|---|---|
| Camerer et al. (Experimental Economics Replication Project) | 2016 | 18 lab experiments from AER and QJE (2011-2014) | 61% replicated significantly | Replicated effect sizes averaged 66% of the original |
| Camerer et al. (Social Science Replication Project) | 2018 | 21 social science experiments in Nature and Science (2010-2015) | 62% replicated significantly | Replication effect sizes averaged about 50% of the original |
| Chang & Li (Federal Reserve Board) | 2017 | 60 macroeconomics papers from 13 journals | ~50% computationally reproducible | Many failures stemmed from missing or incomplete code, not contested results |
| Institute for Replication (Brodeur et al.) | 2024 | 110 articles from leading economics and political science journals | >85% computationally reproducible; ~70% of robustness tests significant | Coding errors in 25% of studies; t/z values 20% lower on robustness checks |
| AEA Data and Code Availability Policy | 2019 (revised 2024) | All accepted AEA journal papers | Mandatory pre-publication verification | Sharp rise in computational reproducibility for AEA-published work |
Critiques of the Replication Crisis
Not every economist accepts the crisis framing, and the counter-arguments deserve a hearing.
First, replication failure is not the same as fraud or even error. A failed replication can mean the original was wrong, the replication was wrong, the effect is real but smaller than reported, or the effect is real but context-dependent. Many findings in economics depend on specific institutional, cultural, or temporal settings. A minimum wage effect that holds in 1990s New Jersey may not hold in 2020s California, not because either study is flawed, but because the underlying economy is different. Conceptual replication failure can be evidence of generalisation limits, which is useful science.
Second, the headline numbers can be cherry-picked. Camerer’s 2016 project examined only 18 experiments from two top journals over a four-year window. Brodeur’s 2024 project found a much higher computational reproducibility rate when journals enforce data and code policies. The Institute for Replication itself has shown that about 70% of independent robustness tests of published papers remain statistically significant, with t and z values only modestly attenuated. That is not a disaster. That is what a healthy empirical science should look like when its claims are stress-tested.
Third, the correction for publication bias may itself overshoot. If everyone now expects published findings to be inflated, replications may receive disproportionate attention when they fail and little when they succeed. Successful replications rarely make headlines. Failed ones become viral threads. The selection bias that distorts the original literature can also distort the replication literature.
Fourth, economics has been more transparent than many of its critics admit. The discipline adopted formal data and code policies before psychology did. The AEA Data Editor’s office routinely catches errors before publication that would never have been found in earlier decades. Restricted-access data, which are increasingly common at top journals, make external replication harder, but the trade-off is that the data are richer and more policy-relevant.
The mature view is that economics, like any empirical science, accumulates evidence rather than producing definitive single-paper truths. A finding that appears with the wrong sign in one replication and the right sign in another is not a failure of science. It is a signal that the effect is small, conditional, or context-dependent, all of which are worth knowing. The replication movement is not destroying economics. It is making economics behave more like the science it has long claimed to be.
Conclusion
The replication crisis in economics is real, but it is not a verdict against the discipline. The Camerer projects, the Open Science Collaboration, and the Institute for Replication have all shown that a meaningful share of published findings shrink, weaken, or disappear when retested with more rigorous methods. The Reinhart-Rogoff spreadsheet error, the contested minimum wage results, and the recalibration of microfinance expectations are reminders that single papers should not drive policy. At the same time, economics has responded with the most ambitious transparency reforms of any social science: mandatory data and code sharing, pre-registration, registered reports, dedicated data editors, and a growing literature on meta-analysis and robustness reproduction. Roughly 85% of recent papers from leading journals are now computationally reproducible, and around 70% of independent robustness tests survive scrutiny. The discipline is treating its own findings the way it has long treated other people’s claims: with scepticism, replication, and a willingness to update.