New Research

Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results

The massive project shows that reproducibility problems plague even top scientific journals

Brian Handwerk

Science Correspondent


Academic journals and the press regularly serve up fresh helpings of fascinating psychological research findings. But how many of those experiments would produce the same results a second time around?

According to work presented today in Science, fewer than half of 100 studies published in 2008 in three top psychology journals could be replicated successfully. The international effort included 270 scientists who re-ran other people's studies as part of the Reproducibility Project: Psychology, led by Brian Nosek of the University of Virginia.

The eye-opening results don't necessarily mean that those original findings were incorrect or that the scientific process is flawed. When one study finds an effect that a second study can't replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A's result may be false, or Study B's results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.

“This project is not evidence that anything is broken. Rather, it's an example of science doing what science does,” says Christopherson. “It's impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”

Across the sciences, research is considered reproducible when an independent team can conduct a published experiment, following the original methods as closely as possible, and get the same results. It's one key part of the process for building evidence to support theories. Even today, 100 years after Albert Einstein presented his general theory of relativity, scientists regularly repeat tests of its predictions and look for cases where his famous description of gravity does not apply.

"Scientific evidence does not rely on trusting the authority of the person who made the discovery," team member Angela Attwood , a psychology professor at the University of Bristol, said in a statement "Rather, credibility accumulates through independent replication and elaboration of the ideas and evidence."

The Reproducibility Project, a community-based crowdsourcing effort, kicked off in 2011 to test how well this measure of credibility applies to recent research in psychology. Scientists, some recruited and some volunteers, reviewed a pool of studies and selected one for replication that matched their own interest and expertise. Their data and results were shared online and reviewed and analyzed by other participating scientists for inclusion in the large Science study.

To help improve future research, the project analysis attempted to determine which kinds of studies fared the best, and why. The team found that surprising results were the hardest to reproduce, and that the experience or expertise of the scientists who conducted the original experiments had little to do with successful replication.

The findings also offered some support for the oft-criticized statistical tool known as the P value, which estimates how likely it is that a result at least as extreme would have arisen by chance alone if there were no real effect. A higher value means the result could easily have arisen by chance, while a lower value means a chance explanation is less plausible and the result is conventionally deemed statistically significant.

The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated.
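The logic behind that pattern can be seen in a toy simulation, sketched below. This is a minimal illustration, not the project's own analysis; the sample size, the spread of true effects, and the publication cutoff are all assumptions chosen for the example. When true effect sizes vary from study to study, an original result with a very small P value usually reflects a larger underlying effect, so an identically run replication is more likely to succeed.

```python
# Toy simulation: why originals with very small p-values tend to replicate more often.
# All numbers below are illustrative assumptions, not values from the project.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30  # participants per group, same in the original and the replication

def study_p_value(effect):
    """Two-sided p-value for a two-group comparison with a given true effect."""
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    return stats.ttest_ind(a, b).pvalue

originals, replications = [], []
for _ in range(20000):
    effect = rng.uniform(0.0, 1.0)       # true effects differ across studies
    p_orig = study_p_value(effect)
    if p_orig < 0.05:                    # only significant originals get "published"
        originals.append(p_orig)
        replications.append(study_p_value(effect))

originals, replications = np.array(originals), np.array(replications)
strong = originals < 0.001               # originals with very small p-values
weak = originals > 0.04                  # originals that barely cleared the cutoff
print("replication rate when original p < 0.001:", round((replications[strong] < 0.05).mean(), 2))
print("replication rate when original p > 0.04: ", round((replications[weak] < 0.05).mean(), 2))
```

In a setup like this, replications of the p < 0.001 originals succeed far more often than replications of the marginal ones, echoing the pattern reported above.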

But Christopherson suspects that most of his co-authors would not want the study to be taken as a ringing endorsement of P values, because they recognize the tool's limitations. And at least one P value problem was highlighted in the research: The original studies had relatively little variability in P value, because most journals have established a cutoff of 0.05 for publication. The trouble is that value can be reached by being selective about data sets, which means scientists looking to replicate a result should also carefully consider the methods and the data used in the original study.

It's also not yet clear whether psychology might be a particularly difficult field for reproducibility—a similar study is currently underway on cancer biology research. In the meantime, Christopherson hopes that the massive effort will spur more such double-checks and revisitations of past research to aid the scientific process.

“Getting it right means regularly revisiting past assumptions and past results and finding new ways to test them. The only way science is successful and credible is if it is self-critical,” he notes. 

Unfortunately there are disincentives to pursuing this kind of research, he says: “To get hired and promoted in academia, you must publish original research, so direct replications are rarer. I hope going forward that the universities and funding agencies responsible for incentivizing this research—and the media outlets covering them—will realize that they've been part of the problem, and that devaluing replication in this way has created a less stable literature than we'd like.”


Brian Handwerk is a science correspondent based in Amherst, New Hampshire.


What Is Replication in Psychology Research?


Replication refers to the repetition of a research study, generally with different situations and subjects, to determine if the basic findings of the original study can be applied to other participants and circumstances.

In other words, when researchers replicate a study, it means they reproduce the experiment to see if they can obtain the same outcomes.

Once a study has been conducted, researchers might be interested in determining if the results hold true in other settings or for other populations. In other cases, scientists may want to replicate the experiment to further demonstrate the results.

At a Glance

In psychology, replication is defined as reproducing a study to see if you get the same results. It's an important part of the research process that strengthens our understanding of human behavior. It's not always a perfect process, however, and extraneous variables and other factors can interfere with results.

For example, imagine that health psychologists perform an experiment showing that hypnosis can be effective in helping middle-aged smokers kick their nicotine habit. Other researchers might want to replicate the same study with younger smokers to see if they reach the same result.

Exact replication is not always possible. Ethical standards may prevent modern researchers from replicating studies that were conducted in the past, such as Stanley Milgram's infamous obedience experiments.

That doesn't mean that researchers don't perform replications; it just means they have to adapt their methods and procedures. For example, researchers have replicated Milgram's study using lower shock thresholds and improved informed consent and debriefing procedures.

Why Replication Is Important in Psychology

When studies are replicated and achieve the same or similar results as the original study, it gives greater validity to the findings. If a researcher can replicate a study’s results, it is more likely that those results can be generalized to the larger population.

Human behavior can be inconsistent and difficult to study. Even when researchers are cautious about their methods, extraneous variables can still create bias and affect results. 

That's why replication is so essential in psychology. It strengthens findings, helps detect potential problems, and improves our understanding of human behavior.

How Do Scientists Replicate an Experiment?

When conducting a study or experiment, it is essential to have clearly defined operational definitions. In other words, what is the study attempting to measure?

When replicating earlier research, experimenters will follow the same procedures but with a different group of participants. If the researcher obtains the same or similar results in follow-up experiments, it means that the original results are less likely to be a fluke.

The steps involved in replicating a psychology experiment often include the following:

  • Review the original experiment: The goal of replication is to use the exact methods and procedures the researchers used in the original experiment. Reviewing the original study to learn more about the hypothesis, participants, techniques, and methodology is important.
  • Conduct a literature review: Review the existing literature on the subject, including any other replications or previous research. Considering these findings can provide insights into your own research.
  • Perform the experiment: The next step is to conduct the experiment. During this step, keeping your conditions as close as possible to the original experiment is essential. This includes how you select participants, the equipment you use, and the procedures you follow as you collect your data.
  • Analyze the data: As you analyze the data from your experiment, you can better understand how your results compare to the original results (a minimal comparison sketch follows this list).
  • Communicate the results: Finally, you will document your processes and communicate your findings. This is typically done by writing a paper for publication in a professional psychology journal. Be sure to carefully describe your procedures and methods, describe your findings, and discuss how your results compare to the original research.
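As a rough illustration of the analysis step, the sketch below compares a replication against the original study on two simple points: the size of the effect and whether it reaches conventional significance. It is only a sketch under assumed conditions; the group data are randomly generated stand-ins, and a real comparison would use the original study's actual data or reported statistics.

```python
# Illustrative sketch only: compare a replication with an original study.
# The data here are simulated stand-ins for real control/treatment measurements.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference between two independent groups."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd

rng = np.random.default_rng(1)
original = (rng.normal(0.0, 1.0, 40), rng.normal(0.6, 1.0, 40))      # hypothetical original groups
replication = (rng.normal(0.0, 1.0, 80), rng.normal(0.4, 1.0, 80))   # hypothetical replication groups

for label, (control, treatment) in [("original", original), ("replication", replication)]:
    t_stat, p_val = stats.ttest_ind(control, treatment)
    print(f"{label}: d = {cohens_d(control, treatment):.2f}, p = {p_val:.4f}")
```

If the replication's effect is in the same direction, of roughly similar size, and statistically significant, that is evidence the original finding was not a fluke.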

So what happens if the original results cannot be reproduced? Does that mean that the experimenters conducted bad research or that, even worse, they lied or fabricated their data?

In many cases, non-replicated research is caused by differences in the participants or in other extraneous variables that might influence the results of an experiment. Sometimes the differences might not be immediately clear, but other researchers might be able to discern which variables could have impacted the results.

For example, minor differences in things like the way questions are presented, the weather, or even the time of day the study is conducted might have an unexpected impact on the results of an experiment. Researchers might strive to perfectly reproduce the original study, but variations are expected and often impossible to avoid.

Are the Results of Psychology Experiments Hard to Replicate?

In 2015, a group of 271 researchers published the results of their five-year effort to replicate 100 different experimental studies previously published in three top psychology journals. The replicators worked closely with the original researchers of each study in order to replicate the experiments as closely as possible.

The results were less than stellar. Of the 100 experiments in question, 61% could not be replicated with the original results. Of the original studies, 97% of the findings were deemed statistically significant. Only 36% of the replicated studies were able to obtain statistically significant results.

As one might expect, these dismal findings caused quite a stir. You may have heard this referred to as the "replication crisis" in psychology.

Similar replication attempts have produced similar results. Another study published in 2018 replicated 21 social and behavioral science studies. In these studies, the researchers were only able to successfully reproduce the original results about 62% of the time.

So why are psychology results so difficult to replicate? Writing for The Guardian, John Ioannidis suggested that there are a number of reasons why this might happen, including competition for research funds and the powerful pressure to obtain significant results. There is little incentive to retest, so many results obtained purely by chance are simply accepted without further research or scrutiny.

The American Psychological Association suggests that the problem stems partly from the research culture. Academic journals are more likely to publish novel, innovative studies rather than replication research, creating less of an incentive to conduct that type of research.

Reasons Why Research Cannot Be Replicated

The project authors suggest that there are three potential reasons why the original findings could not be replicated.  

  • The original results were a false positive.
  • The replicated results were a false negative.
  • Both studies were correct but differed due to unknown differences in experimental conditions or methodologies.

The Nobel Prize-winning psychologist Daniel Kahneman has suggested that because published studies are often too vague in describing methods used, replications should involve the authors of the original studies to more carefully mirror the methods and procedures used in the original research.

In fact, one investigation found that replication rates are much higher when original researchers are involved.

While some might be tempted to look at the results of such replication projects and assume that psychology is more art than science, many suggest that such findings actually help make psychology a stronger science. Human thought and behavior are remarkably subtle and ever-changing subjects to study.

In other words, it's normal and expected for variations to exist when observing diverse populations and participants.

Some research findings might be wrong, but digging deeper, pointing out the flaws, and designing better experiments helps strengthen the field. The APA notes that replication research represents a great opportunity for students; it can help strengthen research skills and contribute to science in a meaningful way.

Nosek BA, Errington TM. What is replication? PLoS Biol. 2020;18(3):e3000691. doi:10.1371/journal.pbio.3000691

Burger JM. Replicating Milgram: Would people still obey today? Am Psychol. 2009;64(1):1-11. doi:10.1037/a0010932

Makel MC, Plucker JA, Hegarty B. Replications in psychology research: How often do they really occur? Perspect Psychol Sci. 2012;7(6):537-542. doi:10.1177/1745691612460688

Aarts AA, Anderson JE, Anderson CJ, et al. Estimating the reproducibility of psychological science. Science. 2015;349(6251). doi:10.1126/science.aac4716

Camerer CF, Dreber A, Holzmeister F, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637-644. doi:10.1038/s41562-018-0399-z

American Psychological Association. Leaning into the replication crisis: Why you should consider conducting replication research.

Kahneman D. A new etiquette for replication. Soc Psychol. 2014;45(4):310-311.

By Kendra Cherry, MSEd

Kendra Cherry is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."


100 psychology experiments repeated, fewer than half successful

Large-scale effort to replicate scientific studies produces some mixed results.

Cathleen O'Grady - Aug 28, 2015 1:29 pm UTC


Since November 2011, the Center for Open Science has been involved in an ambitious project: to repeat 100 psychology experiments and see whether the results are the same the second time round. The first wave of results will be released in tomorrow’s edition of Science, reporting that fewer than half of the original experiments were successfully replicated.

The studies in question were from social and cognitive psychology, meaning that they don’t have immediate significance for therapeutic or medical treatments. However, the project and its results have huge implications in general for science, scientists, and the public. The key takeaway is that a single study on its own is never going to be the last word, said study coordinator and psychology professor Brian Nosek.

“The reality of science is we're going to get lots of different competing pieces of information as we study difficult problems,” he said in a public statement. “We're studying them because we don't understand them, and so we need to put in a lot of energy in order to figure out what's going on. It's murky for a long time before answers emerge.”

Tuning up science's engines

A lack of replication is a problem  for many scientific disciplines, from psychology to biomedical science and beyond. This is because a single experiment is a very limited thing, with poor abilities to give definitive answers on its own.

Experiments need to operate under extremely tight constraints to keep unexpected influences from skewing the results, which means they look at a question through a very narrow window. Meanwhile, experimenters have to make myriad individual decisions from start to finish: how to find the sample to be studied, what to include and exclude, what methods to use, how to analyse the results, how best to explain the results.

This is why it’s essential for a question to be peered at from every possible angle to get a clear understanding of how it really looks in its entirety, and for each experiment to be replicated: repeated again, and again, and again, to ensure that each result wasn’t a fluke, a mistake, a result of biased reporting or specific decisions—or, in worst-case scenarios, fraud.

And yet, the incentives for replications in scientific institutions are weak. “Novel, positive and tidy results are more likely to survive peer review,” said Nosek. Novel studies have a “wow” factor; replications are less exciting, and so they're less likely to get published.

It’s better for researchers’ careers to conduct and publish original research, rather than repeating studies someone else has already done. When grant money is scarce, it’s also difficult to direct it towards replications. With scientific journals more likely to accept novel research than replications, the incentives for researchers to participate in replication efforts diminish.

At the same time, studies that found what they set out to find—called a positive effect—are also more likely to be published, while less exciting results are more likely to languish in a file drawer. Over time, these problems combine to make “the published literature … more beautiful than the reality,” Nosek explained.

The more blemished reality is that it's impossible for all hunches to be correct. Many experiments will likely turn up nothing interesting, or show the opposite effect from what was expected, but these results are important in themselves. It helps researchers to know if someone else has already tried what they’re about to do, and found that it doesn’t work. And of course, if there are five published experiments showing that something works, and eight unpublished experiments showing it doesn’t, the published literature gives a very skewed image overall.

Many researchers are working to combat these problems in different ways, by tackling both the journals and the reward systems in institutions. Some have called for all PhD candidates to be required to conduct at least one replication in order to graduate, although this could run the risk of making replication boring, low-prestige grunt work and do little to enhance its popularity.

Scratching the surface

In 2011, the Reproducibility Project: Psychology, coordinated by the Center for Open Science, started a massive replication effort: 100 psychology experiments from three important psychology journals, replicated by 270 researchers around the world.

As with all experiments, there were complicated decisions to be made along the way. Which experiments were most important to replicate first? How should they decide what level of expertise was necessary for the researchers doing the replicating? And most importantly, what counts as a successful replication?

The last question wasn’t an easy one to answer, so the researchers came up with a multitude of ways to assess it, and applied all the criteria to each replication.

Of the 100 original studies, 97 had results that were statistically significant; only a third of the replications, however, had statistically significant results. Around half of the replications had effect sizes that were roughly comparable to the original studies. The teams conducting the replications also reported whether they considered the effect to have been replicated, and only 39 percent said it had. Taken together, these criteria suggest that fewer than half of the originals were successfully replicated.
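For a concrete sense of how such criteria can be computed, here is a minimal sketch for a single original-replication pair, using correlation coefficients as the effect size. The numbers are hypothetical, and this is not the project's own analysis code. It checks two simple criteria: whether the replication is significant in the same direction, and, as one common way of formalizing "comparable effect size," whether the original effect falls inside the replication's 95 percent confidence interval.

```python
# Illustrative sketch: two criteria for judging a single replication attempt.
# Effect sizes (correlation r) and sample sizes below are hypothetical.
import numpy as np
from scipy import stats

r_orig, n_orig = 0.45, 40   # original study (assumed values)
r_rep, n_rep = 0.18, 90     # replication attempt (assumed values)

def correlation_p_value(r, n):
    """Two-sided p-value for a correlation, via the t distribution."""
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

def fisher_ci(r, n, level=0.95):
    """Confidence interval for r using the Fisher z transformation."""
    z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
    crit = stats.norm.ppf(0.5 + level / 2)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

same_direction = np.sign(r_rep) == np.sign(r_orig)
lo, hi = fisher_ci(r_rep, n_rep)
print("original significant:", correlation_p_value(r_orig, n_orig) < 0.05)
print("replication significant in the same direction:",
      correlation_p_value(r_rep, n_rep) < 0.05 and same_direction)
print("original effect inside replication 95% CI:", lo <= r_orig <= hi)
```

The project applied several such criteria side by side precisely because no single one settles the question on its own.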

So what does this mean? It’s easy to over-simplify what a successful or failed replication implies, the authors of the Science paper write. If the replication worked, all that means is that the original experiment produced a reliable, repeatable result. It doesn’t mean that the explanation for the results is necessarily correct.

There are often multiple different explanations for a particular pattern, and one set of authors might prefer one explanation, while others prefer another. Those questions remain unresolved with a simple replication, and they need different experiments to answer those concerns.

A failed replication, meanwhile, doesn’t necessarily mean that the original result was a false positive, although this is definitely possible. For a start, the replication result could have been a false negative. There’s also the possibility that small changes in the methods used for an experiment could change the results in unforeseen ways.

What’s really needed is multiple replications, as well as tweaks to the experiment to figure out when the effect appears, and when it disappears—this can help to figure out exactly what might be going on. If many different replications, trying different things, find that the original effect can’t be repeated, then it means that we can probably think about scrapping that original finding.


No clear answers, just hints. Obviously.

Part of what the Center for Open Science hoped to demonstrate with this effort is that, despite the incentives for novel research, it is possible to conduct huge replication efforts. In this project, there were incentives for researchers to invest, even if they weren’t the usual ones. “I felt I was taking part in an important groundbreaking effort and this motivated me to invest heavily in the replication study that I conducted,” said E. J. Masicampo, who led one of the replication teams.

Like all experiments, the meta-analysis of replications wasn’t able to answer every possible question at once. For instance, the project provided a list of potential experiments for volunteers to choose from, and it’s likely that there were biases in which experiments were chosen to be replicated. Because funding was thin on the ground, less resource-intensive experiments were likely to be chosen. It’s possible this affected the results in some way.

Another replication effort for another 100 experiments might turn up different results: the sample of original experiments will be different, the project coordinators might make different choices, and the analyses they choose might also change. That’s the point: uncertainty is “the reality of doing science, even if it is not appreciated in daily practice,” write the authors.

“After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true?” they continue. Their answer: zero. We also haven’t established that any of the effects are false. The first round of experiments offered the first bit of evidence; the replications added to that, and further replications will be needed to continue to build on that. And slowly, the joined dots begin to form a picture.

Science, 2015. DOI: 10.1126/science.aac4716


Annual Review of Psychology

Volume 73, 2022, Review Article

Replicability, Robustness, and Reproducibility in Psychological Science

  • Brian A. Nosek 1,2 , Tom E. Hardwicke 3 , Hannah Moshontz 4 , Aurélien Allard 5 , Katherine S. Corker 6 , Anna Dreber 7 , Fiona Fidler 8 , Joe Hilgard 9 , Melissa Kline Struhl 2 , Michèle B. Nuijten 10 , Julia M. Rohrer 11 , Felipe Romero 12 , Anne M. Scheel 13 , Laura D. Scherer 14 , Felix D. Schönbrodt 15 , and Simine Vazire 16
  • Affiliations: 1 Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA; email: [email protected] 2 Center for Open Science, Charlottesville, Virginia 22903, USA 3 Department of Psychology, University of Amsterdam, 1012 ZA Amsterdam, The Netherlands 4 Addiction Research Center, University of Wisconsin–Madison, Madison, Wisconsin 53706, USA 5 Department of Psychology, University of California, Davis, California 95616, USA 6 Psychology Department, Grand Valley State University, Allendale, Michigan 49401, USA 7 Department of Economics, Stockholm School of Economics, 113 83 Stockholm, Sweden 8 School of Biosciences, University of Melbourne, Parkville VIC 3010, Australia 9 Department of Psychology, Illinois State University, Normal, Illinois 61790, USA 10 Meta-Research Center, Tilburg University, 5037 AB Tilburg, The Netherlands 11 Department of Psychology, Leipzig University, 04109 Leipzig, Germany 12 Department of Theoretical Philosophy, University of Groningen, 9712 CP Groningen, The Netherlands 13 Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands 14 University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA 15 Department of Psychology, Ludwig Maximilian University of Munich, 80539 Munich, Germany 16 School of Psychological Sciences, University of Melbourne, Parkville VIC 3052, Australia
  • Vol. 73:719-748 (Volume publication date January 2022) https://doi.org/10.1146/annurev-psych-020821-114157
  • First published as a Review in Advance on October 19, 2021
  • Copyright © 2022 by Annual Reviews. All rights reserved

Replication—an important, uncommon, and misunderstood practice—is gaining appreciation in psychology. Achieving replicability is important for making research progress. If findings are not replicable, then prediction and theory development are stifled. If findings are replicable, then interrogation of their meaning and validity can advance knowledge. Assessing replicability can be productive for generating and testing hypotheses by actively confronting current understandings to identify weaknesses and spur innovation. For psychology, the 2010s might be characterized as a decade of active confrontation. Systematic and multi-site replication projects assessed current understandings and observed surprising failures to replicate many published findings. Replication efforts highlighted sociocultural challenges such as disincentives to conduct replications and a tendency to frame replication as a personal attack rather than a healthy scientific practice, and they raised awareness that replication contributes to self-correction. Nevertheless, innovation in doing and understanding replication and its cousins, reproducibility and robustness, has positioned psychology to improve research practices and accelerate progress.


  • Szucs D , Ioannidis JPA. 2017 . Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biol 15 : 3 e2000797 [Google Scholar]
  • Tiokhin L , Derex M. 2019 . Competition for novelty reduces information sampling in a research game—a registered report. R. Soc. Open Sci. 6 : 5 180934 [Google Scholar]
  • Van Bavel JJ , Mende-Siedlecki P , Brady WJ , Reinero DA 2016 . Contextual sensitivity in scientific reproducibility. PNAS 113 : 23 6454– 59 [Google Scholar]
  • Vazire S. 2018 . Implications of the credibility revolution for productivity, creativity, and progress. Perspect. Psychol. Sci. 13 : 4 411– 17 [Google Scholar]
  • Vazire S , Schiavone SR , Bottesini JG. 2020 . Credibility beyond replicability: improving the four validities in psychological science. PsyArXiv, Oct. 7. https://doi.org/10.31234/osf.io/bu4d3 [Crossref]
  • Verhagen J , Wagenmakers E-J. 2014 . Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol. Gen. 143 : 4 1457– 75 [Google Scholar]
  • Verschuere B , Meijer EH , Jim A , Hoogesteyn K , Orthey R et al. 2018 . Registered Replication Report on Mazar, Amir, and Ariely (2008). Adv. Methods Pract. Psychol. Sci. 1 : 3 299– 317 [Google Scholar]
  • Vosgerau J , Simonsohn U , Nelson LD , Simmons JP 2019 . 99% impossible: a valid, or falsifiable, internal meta-analysis. J. Exp. Psychol. Gen. 148 : 9 1628– 39 [Google Scholar]
  • Wagenmakers E-J , Beek T , Dijkhoff L , Gronau QF , Acosta A et al. 2016 . Registered Replication Report: Strack, Martin, & Stepper (1988). Perspect. Psychol. Sci. 11 : 6 917– 28 [Google Scholar]
  • Wagenmakers E-J , Wetzels R , Borsboom D , van der Maas HL. 2011 . Why psychologists must change the way they analyze their data. The case of psi: comment on Bem (2011). J. Pers. Soc. Psychol. 100 : 3 426– 32 [Google Scholar]
  • Wagenmakers E-J , Wetzels R , Borsboom D , van der Maas HL , Kievit RA. 2012 . An agenda for purely confirmatory research. Perspect. Psychol. Sci. 7 : 6 632– 38 [Google Scholar]
  • Wagge J , Baciu C , Banas K , Nadler JT , Schwarz S et al. 2018 . A demonstration of the Collaborative Replication and Education Project: replication attempts of the red-romance effect. PsyArXiv, June 22. https://doi.org/10.31234/osf.io/chax8 [Crossref]
  • Whitcomb D , Battaly H , Baehr J , Howard-Snyder D. 2017 . Intellectual humility: owning our limitations. Philos. Phenomenol. Res. 94 : 3 509– 39 [Google Scholar]
  • Wicherts JM , Bakker M , Molenaar D. 2011 . Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLOS ONE 6 : 11 e26828 [Google Scholar]
  • Wiktop G. 2020 . Systematizing Confidence in Open Research and Evidence (SCORE). Defense Advanced Research Projects Agency https://www.darpa.mil/program/systematizing-confidence-in-open-research-and-evidence [Google Scholar]
  • Wilkinson MD , Dumontier M , Aalbersberg IJ , Appleton G , Axton M et al. 2016 . The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3 : 1 160018 [Google Scholar]
  • Wilson BM , Harris CR , Wixted JT. 2020 . Science is not a signal detection problem. PNAS 117 : 11 5559– 67 [Google Scholar]
  • Wilson BM , Wixted JT. 2018 . The prior odds of testing a true effect in cognitive and social psychology. Adv. Methods Pract. Psychol. Sci. 1 : 2 186– 97 [Google Scholar]
  • Wintle B , Mody F , Smith E , Hanea A , Wilkinson DP et al. 2021 . Predicting and reasoning about replicability using structured groups. MetaArXiv, May 4. https://doi.org/10.31222/osf.io/vtpmb [Crossref]
  • Yang Y , Youyou W , Uzzi B. 2020 . Estimating the deep replicability of scientific findings using human and artificial intelligence. PNAS 117 : 20 10762– 68 [Google Scholar]
  • Yarkoni T. 2019 . The generalizability crisis. PsyArXiv, Nov. 22. https://doi.org/10.31234/osf.io/jqw35 [Crossref]
  • Yong E 2012 . A failed replication draws a scathing personal attack from a psychology professor. National Geo-graphic March 10. https://www.nationalgeographic.com/science/phenomena/2012/03/10/failed-replication-bargh-psychology-study-doyen/ [Google Scholar]


Psychology, replication & beyond

Keith R. Laws

School of Life and Medical Sciences, University of Hertfordshire, Hatfield, UK


Modern psychology is apparently in crisis, and the prevailing view is that this partly reflects an inability to replicate past findings. If a crisis does exist, then it is some kind of ‘chronic’ crisis, as psychologists have been censuring themselves over replicability for decades. While the debate in psychology is not new, the lack of progress across the decades is disappointing. Recently, though, we have seen a veritable surfeit of debate alongside multiple orchestrated and well-publicised replication initiatives. The spotlight is being shone on certain areas, and although not everyone agrees on how we should interpret the outcomes, the debate is happening and it is impassioned. The issue of reproducibility occupies a central place in our Whig history of psychology.

In the parlance of Karl Popper, the notion of falsification is seductive – some seem to imagine that it identifies an act as opposed to a process . It often carries the misleading implication that hypotheses can be readily discarded in the face of something called a ‘failed’ replication. Popper [ 46 ] was quite transparent when he declared “… a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory . In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated.” (p.203: my italics). Popper’s view might reassure those whose psychological models have recently come under scrutiny through replication initiatives. We cannot, nor should we, close the door on a hypothesis because a study fails to be replicated. The hypothesis is not nullified and ‘nay-saying’ alone is an insufficient response from scientists. Like Popper, we might expect a testable alternative hypothesis that attempts to account for the discrepancy across studies; and one that itself may be subject to testing rather than merely being ad hoc . In other words, a ‘failed’ replication is not, in itself, the answer to a question, but a further question.

Replication, replication, replication

At least two key types of replication exist: direct and conceptual. Conceptual replication generally refers to cases where researchers ‘tweak’ the methods of previous studies [43]; when successful, it may be informative about the boundaries and possible moderators of an effect. When a conceptual replication fails, however, the implications for the original study are less clear because of likely differences in procedure, stimuli and so on. For this reason, increased weight has been given to direct replications.

How often do direct and conceptual replications occur in psychology? Screening 100 of the most-cited psychology journals since 1900, Makel, Plucker & Hegarty [40] found that approximately 1.6 % of all psychology articles used the term replication in the text. A further, more detailed analysis of 500 randomly selected articles revealed that only 68 % of those using the term were actual replications. They calculated an overall replication rate of 1.07 %, and only 18 % of those replications were direct rather than conceptual.

The lack of replication in psychology is systemic and widespread, and the bias against publishing direct replications is particularly marked. In their survey of social science journal editors, Neuliep & Crandall [42] found that almost three quarters preferred to publish novel findings rather than replications. In a parallel survey of reviewers for social science journals, Neuliep & Crandall [43] found that over half (54 %) stated a preference for new findings over replications. Indeed, reviewers stated that replications were “Not newsworthy” or even a “Waste of space”. By contrast, natural science journal editors present a more varied picture, with comments ranging from “Replication without some novelty is not accepted” to “Replication is rarely an issue for us…since we publish them.” [39].

Despite this enduring historical neglect of replication, the tide appears to be turning. Makel et al. [40] found that the replication rate after the year 2000 was 1.84 times higher than for the period between 1950 and 1999. More recently, several large-scale direct replication projects have emerged during the past two years, including: the Many Labs project [33]; a set of preregistered replications published in a special issue of Social Psychology (edited by [44]); the Reproducibility Project of the Open Science Collaboration [45]; and the Pipeline Project by Schweinsberg et al. [50]. In two of these projects (Many Labs [33]; Pipeline Project [50]), a set of research groups replicated a sample of studies, with each group replicating all of the studies. In the two remaining projects, a number of research groups each replicated one study selected from a sample of studies (Registered Reports [44]; Open Science Collaboration [45]). Each project ensured that replications were sufficiently powered (typically in excess of 90 %, thus offering a very good probability of detecting true effects) and, where possible, used the original materials and stimuli as provided by the original authors. It is worth considering each in more detail.
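To make concrete what ‘sufficiently powered’ means here, the sketch below runs a standard a-priori power calculation in Python (statsmodels); the assumed effect size of d = 0.4 is purely illustrative and not taken from any of these projects.

```python
# Minimal sketch: how many participants per group does a direct replication
# need for 90 % power? The effect size (d = 0.4) is an illustrative assumption.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.4,          # assumed Cohen's d
                                   power=0.90,               # target power
                                   alpha=0.05,
                                   alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")

# If the original effect size was overestimated (say d = 0.2), the required
# sample roughly quadruples:
print(f"{analysis.solve_power(effect_size=0.2, power=0.90, alpha=0.05):.0f}")
```

Because published effect sizes tend to be inflated, replication teams often base such calculations on deliberately conservative estimates of the original effect.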

Many Labs involved 36 research groups across 12 countries who replicated 13 psychological studies in over 6,000 participants. Studies of classic and newer effects were selected partly because they had simple designs that could be adapted for online administration. Reassuringly perhaps, 10 of the 13 effects replicated consistently across the 36 different samples, with, of course, some variability in the effect sizes reported compared to the original studies – some smaller but also some larger. One effect received weak support. Only two studies consistently failed to replicate, and both involved what are described as ‘social priming’ phenomena. In the first, ‘accidental’ exposure to a US flag had resulted in increased conservatism amongst Americans [11]: participants viewed four photos and were asked simply to estimate the time of day in each photo – the US flag appeared in two of them. Following this, they completed an 8-item questionnaire assessing their views toward various political issues (e.g., abortion, gun control). In the second priming study, exposure to ‘money’ had resulted in endorsement of the current social system [12]: participants completed demographic questions against a background that showed a faint picture of US $100 bills, or the same background blurred. Each of these two priming experiments produced a single significant p-value (out of 36 replications), and for flag priming it was in the opposite direction to that expected.

Turning to the special issue of Social Psychology edited by Nosek & Lakens [44]: this contained a series of articles replicating important results in social psychology. Important was broadly defined as “…often cited, a topic of intense scholarly or public interest, a challenge to established theories), but should also have uncertain truth value (e.g., few confirmations, imprecise estimates of effect sizes).” One might euphemistically describe the studies as curios. The articles were first submitted as Registered Reports and reviewed prior to data collection, with authors being assured their findings would be published regardless of outcome, as long as they adhered to the registered protocol. Attempted replications included the “Romeo and Juliet effect” – does parental interference lead to increases in love and commitment (Original: [17]; Replication: Sinclair, Hood, & Wright [53]); does experiencing physical warmth (warm therapeutic packs) increase judgments of interpersonal warmth (Original: [58]; Replication: Lynott, Corker, Wortman, Connell, Donnellan, Lucas, & O’Brien [38]); does recalling unethical behavior lead participants to see the room as darker (Original: [3]; Replication: [10]); and does physical cleanliness reduce the severity of moral judgments (Original: [49]; Replication: [28]). In contrast to the high replication rate of Many Labs, the Registered Reports replications failed to confirm the results in 10 of 13 studies.

In the largest crowdsourced effort to date, the OSC Reproducibility Project involved 270 collaborators attempting to replicate 100 findings from three major psychology journals: Psychological Science (PSCI), Journal of Personality and Social Psychology (JPSP), and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP: LMC). While 97 of the 100 original studies reported statistically significant results, only 36 % of the replications did so, with a mean effect size around half that reported in the original studies.

All of the journals exhibited a large reduction of around 50 % in effect sizes, with replications from JPSP particularly affected – shrinking by 75 % from 0.29 to 0.07. The replicability in one domain of psychology (good or poor) in no way guarantees what will happen in another domain. One thing we do know from this project is that “…reproducibility was stronger in studies and journals representing cognitive psychology than social psychology topics. For example, combining across journals, 14 of 55 (25 %) of social psychology effects replicated by the P < 0.05 criterion, whereas 21 of 42 (50 %) of cognitive psychology effects did so.” The reasons for such a difference are debatable, but they provide no licence either to congratulate cognitive psychologists or to berate social psychologists. Indeed, the authors paint a considered and faithful picture of what their findings mean when they conclude “…how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science.” (Open Science Collaboration, p. 4716–7)
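As a purely illustrative aside – this is not the analysis the Open Science Collaboration itself reported – the gap between those two subfield replication rates can be checked with a simple 2 × 2 test on the quoted counts.

```python
# Illustrative 2x2 comparison of the replication rates quoted above
# (14/55 social vs. 21/42 cognitive); not the OSC's own analysis.
from scipy.stats import fisher_exact

social    = [14, 55 - 14]   # replicated, not replicated
cognitive = [21, 42 - 21]

odds_ratio, p_value = fisher_exact([social, cognitive])
print(f"Odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```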

The studies that were not selected for replication are informative – they were described as “…deemed infeasible to replicate because of time, resources, instrumentation, dependence on historical events, or hard-to-access samples… [and some] required specialized samples (such as macaques or people with autism), resources (such as eye tracking machines or functional magnetic resonance imaging), or knowledge making them difficult to match with teams”. Thus, the main drivers of replication are often economic in terms of time, money and human investment. High cost studies are likely to remain castles in the air, leaving us with little insight about replicability rates in some areas such as functional imaging (e.g. [ 9 ]), clinical and health psychology (see Coyne, this issue), and neuropsychology.

The ‘Pipeline Project’ by Schweinsberg et al. [50] intentionally used a non-adversarial approach. They crowdsourced 25 research teams across various countries to replicate a series of 10 unpublished moral-judgment experiments from the lead author’s (Uhlmann’s) lab, i.e., studies still in the pipeline. This speaks directly to Lykken’s [37] proposal from nearly 50 years ago that “…ideally all experiments would be replicated before publication”, although at that time he deemed it ‘impractical’.

Pipeline replications included: the Bigot–misanthrope effect – do participants judge a manager who selectively mistreats racial minorities as a more blameworthy person than a manager who mistreats all of his employees; the Bad-tipper effect – are people who leave a full tip, but entirely in pennies, judged more negatively than someone who leaves less money, but in notes; and the Burn-in-hell effect – do people perceive corporate executives as more likely to burn in hell than members of social categories defined by antisocial behaviour, such as vandals. Six of the ten findings replicated across all of the replication criteria, one further finding replicated but with a significantly smaller effect size than the original, one finding replicated consistently in the original culture but not outside of it (bad tipper replicated in the US but not elsewhere), and two effects were unsupported.

The headline replication rates differed considerably across projects – replication occurred more frequently in Many Labs (77 %) and the Pipeline Project (60 %) than in Registered Reports (30 %) and the Open Science Collaboration (36 %). Why are replication rates lower in the latter two projects? Possible explanations include the choice of likely versus unlikely replication candidates. Amongst the Many Labs studies, some had already been replicated previously and were selected in the knowledge of this. By contrast, the studies in the Pipeline Project had not been previously replicated (indeed, not even previously published). Also important, from a different perspective, is whether each study was replicated only once by one group or multiple times by many groups.

In the Many Labs and Pipeline projects, 36 and 25 separate research groups replicated each of 13 and 10 studies respectively. Multiple replications of the same study lend themselves to meta-analytic techniques and to analysis of the heterogeneity across research groups examining the same effect – the extent to which their effect sizes accord or not. The Many Labs project reported I² values, which estimate the proportion of variation due to heterogeneity rather than chance. In the majority of cases, heterogeneity was small to moderate or even non-existent (e.g. across the 36 replications for both of the social priming studies: flag and money). Indeed, heterogeneity of effect sizes was greater between studies than within studies. Where heterogeneity was greater, it was – perhaps surprisingly – where mean effect sizes were largest. Nonetheless, Many Labs reassuringly shows that some effects are highly replicable across research groups, countries, and presentational differences (online versus face-to-face).
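For reference, I² is conventionally derived from Cochran's Q; the formulation below is the textbook definition rather than anything specific to the Many Labs report, where the estimates and weights come from the k replications of a given effect.

```latex
% Cochran's Q over k replication estimates \hat{\theta}_i with weights w_i = 1/\mathrm{SE}_i^2
Q = \sum_{i=1}^{k} w_i\,(\hat{\theta}_i - \bar{\theta})^2,
\qquad
\bar{\theta} = \frac{\sum_i w_i\,\hat{\theta}_i}{\sum_i w_i},
\qquad
I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\% .
```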

Counter-intuitive and even fanciful psychological hypotheses are not necessarily more likely to be false, but believing them to be so may influence researchers – even implicitly – in terms of how replications are conducted. In their extensive literature search, Makel et al. [40] reported that most direct replications are conducted by authors who proposed the original findings. This raises the thorny question of who should replicate. Almost 50 years ago Bakan [2] sagely warned that “If an investigator attempts to replicate his own investigation at another time, he will inevitably be under the influence of what he has already done…He should challenge, for example, his personal identification with the results he has already obtained, and prepare himself for finding both novelty and contradiction with respect to his earlier investigation” and that “…If one investigator is interested in replicating the investigation of another investigator, he should carefully take into account the possibility of suggestion, or his willingness to accept the results of the earlier investigator. …He should take careful cognizance of possible motivation for showing the earlier investigator to be in error, etc.” (p. 110). The irony is that, as psychologists, we should be acutely aware of such biases – we cannot ignore the psychology of replication in the replication of psychology.

What are we replicating and why?

The cheap and easy

Few areas of psychology have fallen under the replication lens, and where they have, the targets are psychology’s equivalent of take-away meals – easy-to-prepare studies (e.g. those often using online links to questionnaires). Hence, the focus has tended to be on studies from social and cognitive psychology, and not, for example, on developmental or clinical studies, which are more costly and difficult to replicate. Other notable examples exist, such as cognitive neuropsychology, where the single case study has been predominant for decades – how can anyone recreate the brain injury and subsequent cognitive testing in a second patient?

The contentious

We cannot assert that the totality – or even a representative sample – of psychology has been scrutinised for replication. We can also see why some may feel targeted – replication does not (and probably cannot) occur in a random fashion. The vast majority of psychological studies are overlooked. To date, psychologists have targeted the unexpected, the curious and the newsworthy findings, and largely within a narrow range of areas (primarily cognitive and social). As psychologists, the need to sample more widely ought to go without saying; and one corollary of this is that it makes no sense to claim that psychology is in crisis.

Too often perhaps, psychologists have been attracted to replicating contentious topics such as social priming, ego-depletion, psychic ability and so on. Some high-impact journals have become repositories for attention-grabbing, strange, unexpected and unbelievable findings. This goes to the systemic heart of the matter. Hartshorne & Schachner [27], amongst many others, have noted that “…replicability is not systematically considered in measuring paper, researcher, and journal quality. As a result, the current incentive structure rewards the publication of non-replicable findings…” (p. 3, my italics). This is nothing new in science, as the quest for scientific prestige has historically produced a conflict between the goals of science and the personal goals of the scientist (see [47]).

The preposterous

“If there is no ESP, then we want to be able to carry out null experiments and get no effect, otherwise we cannot put much belief in work on small effects in non-ESP situations. If there is ESP, that is exciting. However, thus far it does not look as if it will replace the telephone” (Mosteller [41], p. 396)

From the opposite perspective, Jim Coyne (this issue) maintains that psychology would benefit from some “…provision for screening out candidates for replication for which a consensus could be reached that the research hypotheses were improbable and not warranting the effort and resources required for a replication to establish this.” The frustration of some psychologists is palpable as they peruse apparently improbable hypotheses. Coyne’s concern echoes that of Edwards [18], who half a century ago similarly remarked, “If a hypothesis is preposterous to start with, no amount of bias against it can be too great. On the other hand, if it is preposterous to start with, why test it?” (p. 402). How preposterous can we get? According to Simmons et al. [51], it is “…unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359). Indeed, by manipulating what they describe as researcher degrees of freedom (e.g. ‘data-peeking’, deciding when to stop testing participants, whether to exclude outlying data points), they managed to show that people appear to forget their age and claim to be 1.5 years younger after listening to the Beatles song “When I’m 64”.
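One of those degrees of freedom – ‘data-peeking’ with optional stopping – is easy to demonstrate by simulation. The toy sketch below (an illustration of the general point, not Simmons et al.'s actual procedure) repeatedly tests a true null hypothesis, adding participants in batches and stopping at the first p < 0.05; the nominal 5 % false-positive rate is inflated several-fold.

```python
# Toy simulation of 'data-peeking': test after every batch of participants
# and stop at the first p < .05. Under a true null, this inflates the
# false-positive rate well above the nominal 5 %.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sims, batch, max_n = 2000, 10, 100
false_positives = 0

for _ in range(n_sims):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(0, 1, batch))   # both groups drawn from the
        b.extend(rng.normal(0, 1, batch))   # same population: no true effect
        _, p = ttest_ind(a, b)
        if p < 0.05:                        # 'peek' and stop if significant
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.2%}")
```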

The fact that seemingly incredible findings can be published raises disquiet about the methods normally employed by psychologists, and in some circles this has inflated into concerns about psychology more generally. Within the methodological and statistical frameworks that psychologists normally operate, we have to face the unpalatable possibility that the wriggle room for researchers is unacceptably large. Further, it is implicitly reinforced, as Coyne notes, by the actions of some journals as well as media outlets – and until that is adequately addressed, little will change.

The negative

Interestingly, the four replication projects outlined above almost wholly neglected null findings. To date, replication efforts are invariably aimed at positive findings. Should we not also try to replicate null findings? Given the propensity for positive findings to become nulls, what is the likelihood of reverse effects in more adequately powered studies? The emphasis on replicating positive outcomes betrays the wider bias that psychologists have against null findings per se (Laws [36]). The overwhelming majority of published findings in psychology are positive (93.5 %: [54]), and the aversion to null findings may well be worse in psychology than in other sciences [20]. Intriguingly, we can see a hint of this issue in the OSC Reproducibility Project, which did include 3 % of sampled findings that were initially null – and while two were confirmed as nulls, one did indeed become significant. As psychologists, we might ponder how the bias against publishing null findings finds a clear echo in the bias against replicating null findings.

A conflict between belief and evidence

The wriggle room is fertile ground for psychologists to exploit the disjunction between belief and evidence that seems quite pervasive in psychology. As remarked upon by Francis, “Contrary to its central role in other sciences, it appears that successful replication is sometimes not related to belief about an effect in experimental psychology. A high rate of successful replication is not sufficient to induce belief in an effect [8], nor is a high rate of successful replication necessary for belief [22].” The Bem [8] study documented “experimental evidence for anomalous retroactive influences on cognition and affect” or, in plain language, precognition. Using multiple tasks and nine experiments involving over 1,000 participants, Bem had implausibly demonstrated that the performance of participants reflected what happened after they had made their decision. For example, on a memory test, participants were more likely to remember words that they were later asked to practise, i.e. memory rehearsal seemingly worked back in time. In another task, participants had to select which of two curtains on a computer screen hid an erotic image, and they did so at a level significantly greater than chance, but not when the hidden images were less titillating. Furthermore, Bem and colleagues [7] later meta-analysed 90 previous studies to establish a significant effect size of 0.22.

Bem presents nine replications of a phenomenon and a large meta-analysis, yet we do not believe it, while other phenomena do not so readily replicate (e.g. bystander apathy [22]) but we do believe in them. Francis [23] bleakly concludes, “The scientific method is supposed to be able to reveal truths about the world, and the reliability of empirical findings is supposed to be the final arbiter of science; but this method does not seem to work in experimental psychology as it is currently practiced.” Whether we believe in Bem’s precognition, social priming, or indeed any published psychological finding, researchers are operating within the methodological and statistical wriggle room. The task for psychologists is to view these phenomena like any other scientific question, i.e. as in need of explanation. If they can close down the wriggle room, then we might expect such curios and anomalies to evaporate in a cloud of nonsignificant results.

While some might view the disjunction between belief and evidence as ‘healthy skepticism’, others might describe it as resistance to evidence or even anti-science. A pertinent example comes from Lykken [37], who described a study in which people who see frogs in a Rorschach test – ‘frog responders’ – were more likely to have an eating disorder [48], a finding interpreted as evidence of harboring oral impregnation fantasies and an unconscious belief in anal birth. Lykken asked 20 clinician colleagues to estimate the likelihood of this ‘cloacal theory of birth’ before and after seeing Sapolsky’s evidence. Beforehand, they reported a “…median value of 0.01, which can be interpreted to mean, roughly, ‘I don't believe it’”, and after being shown the confirmatory evidence “…the median unchanged at 0.01. I interpret this consensus to mean, roughly, ‘I still don’t believe it.’” (p. 151–152). Lykken remarked that normally when a prediction is confirmed by experiment, we might expect “…a nontrivial increment in one’s confidence in that theory should result, especially when one’s prior confidence is low… [but that] this rule is wrong not only in a few exceptional instances but as it is routinely applied to the majority of experimental reports in the psychological literature” (p. 152). Often such claims give rise to a version of the maxim that “Extraordinary claims require extraordinary evidence”. The remarkableness of a claim, however, is not necessarily relevant to either the type or the scale of evidence required. Instead of setting different criteria for the ordinary and the extraordinary, we need to continue to close the wriggle room.
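Lykken's clinicians behave much as simple Bayesian updating would predict. With illustrative numbers (the Bayes factor of 3 for a single confirmatory study is an assumption, not anything Lykken reported), a prior probability of 0.01 barely moves:

```latex
\underbrace{\frac{P(H \mid D)}{P(\neg H \mid D)}}_{\text{posterior odds}}
= \underbrace{\frac{P(D \mid H)}{P(D \mid \neg H)}}_{\text{Bayes factor} \,=\, 3}
\times
\underbrace{\frac{P(H)}{P(\neg H)}}_{\text{prior odds} \,=\, 1/99}
= \frac{3}{99},
\qquad
P(H \mid D) = \frac{3}{102} \approx 0.03 .
```

A single confirmation barely shifts a very low prior; by the same arithmetic, a single ‘failed’ replication barely dents a strongly held one.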

Beliefs and the failure to self-correct

“Scientists should not be in the business of simply ignoring literature that they do not like because it contests their view.” [ 30 ]

Taking this to the opposite extreme, some researchers may choose to ignore the findings of meta-analyses in favour of selected individual studies that accord more with their view. Giner-Sorolla [24] maintained that “…meta-analytic validation is not seen as necessary to proclaim an effect reliable. Textbooks, press reports, and narrative reviews often rest conclusions on single influential articles rather than insisting on a replication across independent labs and multiple contexts” (p. 564, my italics).

Stroebe & Strack rightly point out that “Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis” [and] ‘believers’ “…will keep on believing, pointing at the successful replications and derogating the unsuccessful ones, whereas the nonbelievers will maintain their belief system drawing on the failed replications for support of their rejection of the original hypothesis.” (p. 64). Psychology rarely – if ever – proceeds with an unequivocal knock-out blow delivered by a negative finding or even a meta-analysis. Indeed, psychology often has more of the feel of trench warfare, where models and hypotheses are ultimately abandoned largely because researchers lose interest [26].

Jussim et al. [30] provide some interesting examples of precisely how social psychology does not seem to correct itself when big findings fail to replicate. If doubts are raised about an original finding then, as Jussim et al. point out, we might expect citations to reflect this debate and uncertainty; as such, the original study and the unsuccessful replications would be expected to be fairly equally cited.

In a classic study, Darley & Gross [15] found that people applied a social-class stereotype when judging a young girl taking a maths test, having seen her playing in either an affluent or a poor neighbourhood. After obtaining the original materials and following the procedure carefully, Baron et al. [6] published two failed replications using more than twice as many participants. Not only did they fail to replicate, the evidence was in the opposite direction. Such findings ought to encourage debate, with relatively equal attention to the pro and con studies in the literature – alas, no. Jussim et al. reported that “…since 1996, the original study has been cited 852 times, while the failed replications have been cited just 38 times (according to Google Scholar searches conducted on 9/11/15).”

This is not an unusual case, as Jussim et al. report several examples of failed replications not being cited while the original studies continue to be readily cited. The infamous and seminal study by Bargh and colleagues [5] showed that unconsciously priming people with an ‘elderly stereotype’ (unscrambling jumbled sentences that contained words like: old, lonely, bingo, wrinkle) makes them subsequently walk more slowly. However, Doyen et al. [16] failed to replicate the finding using more accurate measures of walking speed. Since 2013, Bargh et al. has been cited 900 times and Doyen et al. 192 times. Similarly, a meta-analysis of 88 studies by Jost et al. [29], which portrayed conservatism as a syndrome characterized by rigidity, dogmatism, prejudice and fear, was not replicated by a larger, better-controlled meta-analysis conducted by Van Hiel and colleagues [57]. Since 2010, the former has been cited 1030 times and the latter a mere 60 by comparison. Jussim et al. suggest that “This pattern of ignoring correctives likely leads social psychology to overstate the extent to which evidence supports the original study’s conclusions… it behooves researchers to grapple with the full literature, not just the studies conducive to their preferred arguments”.

Meta-analysis: rescue remedy or statistical alchemy?

Some view meta-analysis as the closest thing we have to a definitive approach for establishing the veracity and reliability of an effect. In the context of discussing social priming experiments, John Bargh [4] declared that “…In science the way to answer questions about replicability of effects is through statistical techniques such as meta-analysis”. Others are more skeptical: “Meta-analysis is a reasonable way to search for patterns in previously published research. It has serious limitations, however, as a method for confirming hypotheses and for establishing the replicability of experiments” (Hyman, 2010, p. 486). Meta-analysis is not a magic dust that we can sprinkle over primary literatures to elucidate necessary truths. Likewise, totemically accumulating replicated findings does not, in itself, necessarily prove anything (pace Popper). Does it matter whether we replicate a finding once, twice, or 20 times? What ratio of positive to negative outcomes do we find acceptable? Answers or rules of thumb do not exist – it often comes down to our beliefs in psychology.

This special issue of BMC Psychology contains four articles (Taylor & Munafo [56]; Lakens, Hilgard & Staaks [34]; Coppens, Verkoeijen, Bouwmeester & Rikers [13]; Coyne [14]), and in each, meta-analysis occupies a pivotal place. As shown by Taylor & Munafo (current issue), meta-analyses have proliferated, are highly cited and, “…most worryingly, the perceived authority of the conclusions of a meta-analysis means that it has become possible to use a meta-analysis in the hope of having the final word in an academic debate.” As with all methods, meta-analysis has its own limitations, and retrospective validation via meta-analysis is not a substitute for prospective replication using adequately powered trials, but it does have a substantive role to play in the reproducibility question.

Judging the weight of evidence is never straightforward, and whether a finding endures in psychology often reflects our beliefs almost as much as the evidence. Indeed, meta-analysis, rightly or wrongly, enables some ideas to persist despite a lack of support at the level of the individual study or trial. This has certainly been argued in the use of meta-analyses to establish a case for psychic abilities (Storm, Tressoldi & Di Risio [55]), prompting Hyman (2010) to object that “It distorts what scientists mean by confirmatory evidence. It confuses retrospective sanctification with prospective replicability.” (p. 489)

This is a kind of ‘free-lunch’ notion of meta-analysis. Feinstein [21] even stated that “meta-analysis is analogous to statistical alchemy for the 21st century… the main appeal is that it can convert existing things into something better. ‘Significance’ can be attained statistically when small group sizes are pooled into big ones” (p. 71). Undoubtedly, the conclusions of meta-analyses may prove unreliable where small numbers of nonsignificant trials are pooled to produce significant effects [19]. Nonetheless, it is also quite feasible for a literature containing a majority of negative outcomes to produce a reliable and significant overall effect size (e.g. streptokinase: [35]).
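The arithmetic behind this ‘free lunch’ is simply inverse-variance pooling. The sketch below uses invented numbers – not drawn from any published meta-analysis – to show five individually nonsignificant studies combining into a ‘significant’ pooled effect.

```python
# Fixed-effect (inverse-variance) pooling of five small, individually
# nonsignificant studies. Effect sizes and standard errors are invented
# for illustration only.
import numpy as np
from scipy.stats import norm

effects = np.array([0.30, 0.25, 0.35, 0.20, 0.28])   # study effect sizes (d)
ses     = np.array([0.20, 0.22, 0.21, 0.19, 0.20])   # standard errors

z_each = effects / ses
print("Individual p-values:", np.round(2 * norm.sf(np.abs(z_each)), 3))

w = 1 / ses**2                                        # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = 1 / np.sqrt(np.sum(w))
z = pooled / pooled_se
print(f"Pooled d = {pooled:.2f}, SE = {pooled_se:.2f}, p = {2 * norm.sf(abs(z)):.4f}")
```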

Two of the papers presented here (Lakens et al. this issue; Taylor & Munafo this issue) offer extremely good suggestions relating to some of these conflicts in meta-analytic findings. Lakens and colleagues offer 6 recommendations, including permitting others to “re-analyze the data to examine how sensitive the results are to subjective choices such as inclusion criteria” and enabling this by providing links to data files that permit such analysis. Currently, we also need to address data sharing in regular papers. Sampling papers published in one year in the top 50 high-impact journals, Alsheikh-Ali et al. [ 1 ] reported that a substantial proportion of papers published in high-impact journals “…are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals”. Such efforts for transparency are extremely welcome and indeed, echo the posting online of our interactive CBT for schizophrenia meta-analysis database ( http://www.cbtinschizophrenia.com/ ), which has been used by others to test new hypotheses (e.g. [ 25 ]).

Taylor & Munafo (this issue) advise greater triangulation of evidence – in this particular instance, supplementing traditional meta-analysis with P-curve analysis [52]. In passing, Taylor & Munafo also mention “…adversarial collaboration, where primary study authors on both sides of a particular debate contribute to an agreed protocol and work together to interpret the results”. The version of adversarial collaboration proposed by Kahneman [31] urged scientists to engage in a “good-faith effort to conduct debates by carrying out joint research” (p. 729). More recently, he elaborated on this in the context of the furore over failed replications (Kahneman [32]). Coyne covers some aspects of this latest paper on replication etiquette and finds some of it wanting. It may, however, be possible to find some new adversarial middle ground, but it crucially depends upon psychologists being more open. Indeed, some aspects of adversarial collaboration could dovetail with Lakens et al.’s proposal regarding hosting relevant data on web platforms. In such a scenario, opposing camps could test their hypotheses in a public arena using a shared database.

In the context of adversarial collaboration, some uncertainty and difference of opinion exists about how we might accommodate the views of those being replicated. One possibility, which again requires openness, is for those who are replicated to be asked to submit a review; crucially, the review and the replicators’ responses are then published alongside the paper. Indeed, this happened with the paper of Coppens et al. (this issue). They replicated the ‘testing effect’ reported by Carpenter (2009) – that information which has been retrieved from memory is better recalled than information which has simply been studied. Their replications and meta-analysis partially replicate the original findings, and Carpenter was one of the reviewers, whose review is available alongside the paper (along with the author responses). Indeed, from its initiation, BMC Psychology has published all reviews and responses to reviewers alongside published papers. This degree of openness is unusual in psychology journals, but it offers readers a glimpse into the process behind a replication (or any paper), and allows the person being replicated to contribute to and comment on the replication, to reply, and to be published in the same journal at the same time.

Ultimately, the issues that psychologists face over replication are as much about our beliefs, biases and openness as anything else. We are not dispassionate about the outcomes that we measure. Maybe because the substance of our spotlight is people, cognition and brains, we sometimes care too much about the ‘truths’ we choose to declare. They have implications. Similarly, we should not ignore the incentive structures and the conflicts between the personal goals of psychologists and the goals of science. They have implications. Finally, the attitudes of psychologists to the transparency of our science need to change. They have implications.

Competing interests

Keith R. Laws is a Section Editor for BMC Psychology and declares no competing interests.

The role of replication in psychological science

  • Paper in Philosophy of Science in Practice
  • Published: 08 January 2021
  • Volume 11, article number 23 (2021)


  • Samuel C. Fletcher (ORCID: orcid.org/0000-0002-9061-8976)


The replication or reproducibility crisis in psychological science has renewed attention to philosophical aspects of its methodology. I provide herein a new, functional account of the role of replication in a scientific discipline: to undercut the underdetermination of scientific hypotheses from data, typically by hypotheses that connect data with phenomena. These include hypotheses that concern sampling error, experimental control, and operationalization. How a scientific hypothesis could be underdetermined in one of these ways depends on a scientific discipline’s epistemic goals, theoretical development, material constraints, institutional context, and their interconnections. I illustrate how these apply to the case of psychological science. I then contrast this “bottom-up” account with “top-down” accounts, which assume that the role of replication in a particular science, such as psychology, must follow from a uniform role that it plays in science generally. Aside from avoiding unaddressed problems with top-down accounts, my bottom-up account also better explains the variability of importance of replication of various types across different scientific disciplines.



These related events include Daryl Bem’s use of techniques standard in psychology to show evidence for extra-sensory perception ( 2011 ), the revelations of high-profile scientific fraud by Diederik Stapel (Callaway 2011 ) and Marc Hauser (Carpenter 2012 ), and related replication failures involving prominent effects such as ego depletion (Hagger et al. 2016 ).

The quotation reads: “the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed.” See also Popper ( 1959 , p. 45): “Only when certain events recur in accordance with rules or regularities, as in the case of repeatable experiments, can our observations be tested—in principle—by anyone. … Only by such repetition can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.” Zwaan et al. ( 2018 , pp. 1, 2, 4) also quote Dunlap ( 1926 ) (published earlier as Dunlap ( 1925 )) for the same point.

Schmidt ( 2009 , pp. 90–2), citing much the same passages of Popper ( 1959 , p. 45) as the others mentioned, also provides a similar explanation of replication’s importance, appealing to general virtues such as objectivity and reliability. (See the first paragraphs of Schmidt ( 2009 , p. 90; 2017 , p. 236) for especially clear statements, and Machery ( 2020 ) for an account of replication based on its ability to buttress reliability in particular.) But for him, that explanation only motivates why establishing a definition of replication is important in the first place; it plays no role in his definition itself. Thus, by drawing on Schmidt’s account of what replication is, I am not committing to his and others’ stated explanations of why it is important.

For example, it is compatible with modifications or clarifications of how interpretation plays an essential role in determining what data models are or what they represent, either for Suppes’ hierarchy (Leonelli 2019 ) or Bogan and Woodward’s (Harris 2003 ). It is also compatible with interactions between the levels of data and phenomena (or experiment) in the course of a scientific investigation (Bailer-Jones 2009 , Ch. 7).

That’s not to say there is no interesting relationship between low-level underdetermination and the question of scientific realism, only that it is much more indirect. See Laymon ( 1982 ) for a discussion thereof and Brewer and Chinn ( 1994 ) for historical examples from psychology as they bear on the motivation for theory change.

The first function, concerning mistakes in data analysis, does not appear in Schmidt ( 2009 , 2017 ). That said, neither he nor I claim that our lists are exhaustive, but they do seem to enumerate the most common types of low-level underdetermination that arise in the interpretation of the results of psychological studies. One type that occurs more often in the physical sciences concerns the accuracy, precision, and systematic error of an experiment or measurement technique; I hope in future work to address this other function in more detail. It would also be interesting to compare the present perspective to that of Feest ( 2019 ), who, focusing on the “epistemic uncertainty” regarding the third and sixth functions, arrives at a more pessimistic and limiting conclusion about the role of replication in psychological science.

For examples from economics, see Cartwright ( 1991 , pp. 145–6); for examples from gravitational and particle physics, see Franklin and Howson ( 1984 , pp. 56–8).

This is also analogous to the case of the demarcation problem, on which progress might be possible if one helps oneself to discipline-specific information (Hansson 2013 ).

Of course, there is a variety of quantitative and qualitative methods in psychological research, and qualitative methods are not always a good target for statistical analysis. But the question of whether the data are representative of the population of interest is important regardless of whether that data is quantitative or qualitative.

Meehl ( 1967 ) wanted to distinguish this lack of precise predictions from the situation in physics, but perhaps overstated his case: there are many experimental situations in physics in which theory predicts the existence of an effect determined by an unknown parameter, too. Meehl ( 1967 ) was absolutely right, though, that one cannot rest simply with evidence against a non-zero effect size; doing so abdicates responsibility to find just what the aforementioned patterns of human behavior and mental life are .

Online participant services such as Amazon Mechanical Turk and other crowdsourced methods offer a potentially more diverse participant pool at a more modest cost (Uhlmann et al. 2019 ), but come with their own challenges.

“Big science” is a historiographical cluster concept referring to science with one or more of the following characteristics: large budgets, large staff sizes, large or particularly expensive equipment, and complex and expansive laboratories (Galison and Hevly 1992 ).

For secondary sources on MSRP, see Musgrave and Pigden ( 2016 , §§2.2, 3.4)

For more on this, see Musgrave and Pigden ( 2016 , §4).

In what follows, I use my own examples rather than Guttinger’s, with the exception of some overlap in discussion of Leonelli ( 2018 ).

Leonelli ( 2018 ) has argued that this possibility is realized in certain sciences that focus on qualitative data collection, but it is yet unclear whether this is really due to pragmatic limitations on the possibility of replications, rather than a lack of underdetermination, low-level or otherwise.

Bailer-Jones, D.M. (2009). Scientific models in philosophy of science . Pittsburgh: University of Pittsburgh Press.

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature , 533 (7604), 452–454.

Begley, C.G., & Ellis, L.M. (2012). Raise standards for preclinical cancer research: drug development. Nature , 483 (7391), 531–533.

Bem, D.J. (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology , 100 (3), 407.

Benjamin, D.J., Berger, J.O., Johannesson, M., Nosek, B.A., Wagenmakers, E.-J., Berk, R., Bollen, K.A., Brembs, B., Brown, L., Camerer, C., & et al. (2018). Redefine statistical significance. Nature Human Behaviour , 2 (1), 6.

Bird, A. (2018). Understanding the replication crisis as a base rate fallacy. The British Journal for the Philosophy of Science , forthcoming.

Bogen, J., & Woodward, J. (1988). Saving the phenomena. The Philosophical Review , 97 (3), 303–352.

Brewer, W.F., & Chinn, C.A. (1994). Scientists’ responses to anomalous data: Evidence from psychology, history, and philosophy of science. In PSA: Proceedings of the biennial meeting of the philosophy of science association , (Vol. 1 pp. 304–313): Philosophy of Science Association.

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S., & Munafò, M.R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience , 14 (5), 365–376.

Callaway, E. (2011). Report finds massive fraud at Dutch universities. Nature , 479 (7371), 15.

Camerer, C.F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., & et al. (2016). Evaluating replicability of laboratory experiments in economics. Science , 351 (6280), 1433–1436.

Carpenter, S. (2012). Government sanctions Harvard psychologist. Science , 337 (6100), 1283–1283.

Cartwright, N. (1991). Replicability, reproducibility, and robustness: comments on Harry Collins. History of Political Economy , 23 (1), 143–155.

Chen, X. (1994). The rule of reproducibility and its applications in experiment appraisal. Synthese , 99 , 87–109.

Dunlap, K. (1925). The experimental methods of psychology. The Pedagogical Seminary and Journal of Genetic Psychology , 32 (3), 502–522.

Dunlap, K. (1926). The experimental methods of psychology. In Murchison, C. (Ed.) Psychologies of 1925: Powell lectures in psychological theory (pp. 331–351). Worcester: Clark University Press.

Feest, U. (2019). Why replication is overrated. Philosophy of Science , 86 (5), 895–905.

Feyerabend, P. (1970). Consolation for the specialist. In Lakatos, I., & Musgrave, A. (Eds.) Criticism and the growth of knowledge (pp. 197–230). Cambridge: Cambridge University Press.

Feyerabend, P. (1975). Against method . London: New Left Books.

Fidler, F., & Wilcox, J. (2018). Reproducibility of scientific results. In Zalta, E.N. (Ed.) The Stanford encyclopedia of philosophy. Metaphysics Research Lab, Stanford University, winter 2018 edition .

Franklin, A., & Howson, C. (1984). Why do scientists prefer to vary their experiments? Studies in History and Philosophy of Science Part A , 15 (1), 51–62.

Galison, P., & Hevly, B.W. (Eds.). (1992). Big science: the growth of large-scale research . Stanford: Stanford University Press.

Gelman, A. (2018). Don’t characterize replications as successes or failures. Behavioral and Brain Sciences , 41 , e128.

Gillies, D.A. (1971). A falsifying rule for probability statements. The British Journal for the Philosophy of Science , 22 (3), 231–261.

Gómez, O.S., Juristo, N., & Vegas, S. (2010). Replications types in experimental disciplines. In Proceedings of the 2010 ACM-IEEE international symposium on empirical software engineering and measurement, ESEM ’10 . New York: Association for Computing Machinery.

Greenwald, A.G., Pratkanis, A.R., Leippe, M.R., & Baumgardner, M.H. (1986). Under what conditions does theory obstruct research progress? Psychological Review , 93 (2), 216–229.

Guttinger, S. (2020). The limits of replicability. European Journal for Philosophy of Science , 10 (10), 1–17.

Hagger, M.S., Chatzisarantis, N.L., Alberts, H., Anggono, C.O., Batailler, C., Birt, A.R., Brand, R., Brandt, M.J., Brewer, G., Bruyneel, S., & et al. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science , 11 (4), 546–573.

Hansson, S.O. (2013). Defining pseudoscience and science. In Pigliucci, M., & Boudry, M. (Eds.) Philosophy of pseudoscience: reconsidering the demarcation problem (pp. 61–77). Chicago: University of Chicago Press.

Harris, T. (2003). Data models and the acquisition and manipulation of data. Philosophy of Science , 70 (5), 1508–1517.

Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In Lakatos, I., & Musgrave, A. (Eds.) Criticism and the growth of knowledge (pp. 91–196). Cambridge: Cambridge University Press.

Lakens, D., Adolfi, F.G., Albers, C.J., Anvari, F., Apps, M.A., Argamon, S.E., Baguley, T., Becker, R.B., Benning, S.D., Bradford, D.E., & et al. (2018). Justify your alpha. Nature Human Behaviour , 2 (3), 168.

Laudan, L. (1983). The demise of the demarcation problem. In Cohan, R., & Laudan, L. (Eds.) Physics, philosophy, and psychoanalysis (pp. 111–127). Dordrecht: Reidel.

Lawrence, M.S., Stojanov, P., Polak, P., Kryukov, G.V., Cibulskis, K., Sivachenko, A., Carter, S.L., Stewart, C., Mermel, C.H., Roberts, S.A., & et al. (2013). Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature , 499 (7457), 214–218.

Laymon, R. (1982). Scientific realism and the hierarchical counterfactual path from data to theory. In PSA: Proceedings of the biennial meeting of the philosophy of science association , (Vol. 1 pp. 107–121): Philosophy of Science Association.

LeBel, E.P., Berger, D., Campbell, L., & Loving, T.J. (2017). Falsifiability is not optional. Journal of Personality and Social Psychology , 113 (2), 254–261.

Leonelli, S. (2018). Rethinking reproducibility as a criterion for research quality. In Boumans, M., & Chao, H.-K. (Eds.) Including a symposium on Mary Morgan: curiosity, imagination, and surprise, volume 36B of Research in the History of Economic Thought and Methodology (pp. 129–146): Emerald Publishing Ltd.

Leonelli, S. (2019). What distinguishes data from models? European Journal for Philosophy of Science , 9 (2), 22.

Machery, E. (2020). What is a replication? Philosophy of Science , forthcoming.

Meehl, P.E. (1967). Theory-testing in psychology and physics: a methodological paradox. Philosophy of Science , 34 (2), 103–115.

Meehl, P.E. (1990). Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry , 1 (2), 108–141.

Musgrave, A., & Pigden, C. (2016). Imre Lakatos. In Zalta, E.N. (Ed.) The Stanford encyclopedia of philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition .

Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour , 3 (3), 221–229.

Norton, J.D. (2015). Replicability of experiment. THEORIA. Revista de Teoría Historia y Fundamentos de la Ciencia , 30 (2), 229–248.

Nosek, B.A., & Errington, T.M. (2017). Reproducibility in cancer biology: making sense of replications. Elife , 6 , e23383.

Nosek, B.A., & Errington, T.M. (2020). What is replication? PLoS Biology , 18 (3), e3000691.

Nuijten, M.B., Bakker, M., Maassen, E., & Wicherts, J.M. (2018). Verify original results through reanalysis before replicating. Behavioral and Brain Sciences , 41 , e143.

Open Science Collaboration (OSC). (2015). Estimating the reproducibility of psychological science. Science , 349 (6251), aac4716.

Popper, K.R. (1959). The logic of scientific discovery . Oxford: Routledge.

Radder, H. (1992). Experimental reproducibility and the experimenters’ regress. PSA: Proceedings of the biennial meeting of the philosophy of science association (Vol. 1 pp. 63–73). Philosophy of Science Association.

Rosenthal, R. (1990). Replication in behavioral research. In Neuliep, J.W. (Ed.) Handbook of replication research in the behavioral and social sciences, volume 5 of Journal of Social Behavior and Personality (pp. 1–30). Corte Madera: Select Press.

Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology , 13 (2), 90–100.

Schmidt, S. (2017). Replication. In Makel, M.C., & Plucker, J.A. (Eds.) Toward a more perfect psychology: improving trust, accuracy, and transparency in research (pp. 233–253): American Psychological Association.

Simons, D.J. (2014). The value of direct replication. Perspectives on Psychological Science , 9 (1), 76–80.

Simons, D.J., Shoda, Y., & Lindsay, D.S. (2017). Constraints on generality (COG): a proposed addition to all empirical papers. Perspectives on Psychological Science , 12 (6), 1123–1128.

Stanford, K. (2017). Underdetermination of scientific theory. In Zalta, E.N. (Ed.) The Stanford encyclopedia of philosophy. Metaphysics Research Lab, Stanford University, winter 2017 edition .

Suppes, P. (1962). Models of data. In Nagel, E., Suppes, P., & Tarski, A. (Eds.) Logic, methodology and philosophy of science: proceedings of the 1960 international congress (pp. 252–261). Stanford: Stanford University Press.

Suppes, P. (2007). Statistical concepts in philosophy of science. Synthese , 154 (3), 485–496.

Uhlmann, E.L., Ebersole, C.R., Chartier, C.R., Errington, T.M., Kidwell, M.C., Lai, C.K., McCarthy, R.J., Riegelman, A., Silberzahn, R., & Nosek, B.A. (2019). Scientific Utopia III: crowdsourcing science. Perspectives on Psychological Science , 14 (5), 711–733.

Zwaan, R.A., Etz, A., Lucas, R.E., & Donnellan, M.B. (2018). Making replication mainstream. Behavioral and Brain Sciences , 41 , e120.


Acknowledgments

Thanks to audiences in London (UK XPhi 2018), Burlington (Social Science Roundtable 2019), and Geneva (EPSA2019) for their comments on an earlier version, and especially to the Pitt Center for Philosophy of Science Reading Group in Spring 2020: Jean Baccelli, Andrew Buskell, Christian Feldbacher-Escamilla, Marie Gueguen, Paola Hernandez-Chavez, Edouard Machery, Adina Roskies, and Sander Verhaegh.

This research was partially supported by a Single Semester Leave from the University of Minnesota, and a Visiting Fellowship at the Center for Philosophy of Science at the University of Pittsburgh.

Author information

Authors and Affiliations

Department of Philosophy, University of Minnesota, Twin Cities, Minneapolis, MN, USA

Samuel C. Fletcher


Corresponding author

Correspondence to Samuel C. Fletcher .


This article belongs to the Topical Collection: EPSA2019: Selected papers from the biennial conference in Geneva

Guest Editors: Anouk Barberousse, Richard Dawid, Marcel Weber


Fletcher, S.C. The role of replication in psychological science. Euro Jnl Phil Sci 11, 23 (2021). https://doi.org/10.1007/s13194-020-00329-2


Received: 16 June 2020

Accepted: 30 October 2020

Published: 08 January 2021

DOI: https://doi.org/10.1007/s13194-020-00329-2


  • Replication
  • Underdetermination
  • Confirmation
  • Reproducibility

Published: 27 August 2015

Over half of psychology studies fail reproducibility test

Monya Baker

Nature (2015)


Largest replication study to date casts doubt on many published positive results.


Don’t trust everything you read in the psychology literature. In fact, two thirds of it should probably be distrusted.

In the biggest project of its kind, Brian Nosek, a social psychologist and head of the Center for Open Science in Charlottesville, Virginia, and 269 co-authors repeated work reported in 98 original papers from three psychology journals, to see if they independently came up with the same results. 

The studies they took on ranged from whether expressing insecurities perpetuates them to differences in how children and adults respond to fear stimuli, to effective ways to teach arithmetic.

According to the replicators' qualitative assessments, as previously reported by Nature, only 39 of the 100 replication attempts were successful. (There were 100 completed replication attempts on the 98 papers, as in two cases replication efforts were duplicated by separate teams.) But whether a replication attempt is considered successful is not straightforward. Today in Science, the team report the multiple different measures they used to answer this question [1].

The 39% figure derives from the team's subjective assessments of success or failure (see graphic, 'Reliability test'). Another method assessed whether a statistically significant effect could be found, and produced an even bleaker result. Whereas 97% of the original studies found a significant effect, only 36% of replication studies found significant results. The team also found that the average size of the effects found in the replicated studies was only half that reported in the original studies.
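For readers who want to see how those headline numbers are computed, here is a minimal, purely illustrative sketch in Python. The study pairs below are invented, and this is not the Open Science Collaboration's actual analysis code; it simply shows how one would tally the share of replications reaching p < 0.05 and the average ratio of replication to original effect sizes.

```python
# Hypothetical data: (original p, replication p, original effect size, replication effect size).
# These numbers are invented for illustration only.
pairs = [
    (0.001, 0.004, 0.60, 0.35),
    (0.020, 0.300, 0.45, 0.10),
    (0.040, 0.600, 0.30, 0.02),
    (0.003, 0.010, 0.52, 0.40),
]

# Share of replication attempts that reach conventional significance.
share_significant = sum(rep_p < 0.05 for _, rep_p, _, _ in pairs) / len(pairs)

# Average ratio of replication effect size to original effect size.
mean_es_ratio = sum(rep_es / orig_es for _, _, orig_es, rep_es in pairs) / len(pairs)

print(f"Replications significant at p < .05: {share_significant:.0%}")
print(f"Mean replication/original effect-size ratio: {mean_es_ratio:.2f}")
```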


There is no way of knowing whether any individual paper is true or false from this work, says Nosek. Either the original or the replication work could be flawed, or crucial differences between the two might be unappreciated. Overall, however, the project points to widespread publication of work that does not stand up to scrutiny.

Although Nosek is quick to say that most resources should be funnelled towards new research, he suggests that a mere 3% of scientific funding devoted to replication could make a big difference. The current amount, he says, is near-zero.

Replication failure

The work is part of the Reproducibility Project, launched in 2011 amid high-profile reports of fraud and faulty statistical analysis that led to an identity crisis in psychology.

John Ioannidis, an epidemiologist at Stanford University in California, says that the true replication-failure rate could exceed 80%, even higher than Nosek's study suggests. This is because the Reproducibility Project targeted work in highly respected journals, the original scientists worked closely with the replicators, and replicating teams generally opted for papers employing relatively easy methods — all things that should have made replication easier.

But, he adds, “We can really use it to improve the situation rather than just lament the situation. The mere fact that that collaboration happened at such a large scale suggests that scientists are willing to move in the direction of improving.”

The work published in Science is different from previous papers on replication because the team actually replicated such a large swathe of experiments, says Andrew Gelman, a statistician at Columbia University in New York. In the past, some researchers dismissed indications of widespread problems because they involved small replication efforts or were based on statistical simulations.

But they will have a harder time shrugging off the latest study, says Gelman. “This is empirical evidence, not a theoretical argument. The value of this project is that hopefully people will be less confident about their claims.”

Publication bias

The point, says Nosek, is not to critique individual papers but to gauge just how much bias drives publication in psychology. For instance, boring but accurate studies may never get published, or researchers may achieve intriguing results less by documenting true effects than by hitting the statistical jackpot: finding a significant result by sheer luck, or trying various analytical methods until something pans out.

Nosek believes that other scientific fields are likely to have much in common with psychology. One analysis found that only 6 of 53 high-profile papers in cancer biology could be reproduced [2], and a related reproducibility project in cancer biology is currently under way. The incentives to find results worthy of high-profile publications are very strong in all fields, and can spur people to lose objectivity. “If this occurs on a broad scale, then the published literature may be more beautiful than reality,” says Nosek.

The results published today should spark a broader debate about optimal scientific practice and publishing, says Betsy Levy Paluck, a social psychologist at Princeton University in New Jersey. “It says we don't know the balance between innovation and replication.”

The fact that the study was published in a prestigious journal will encourage further scholarship, she says, and shows that now “replication is being promoted as a responsible and interesting line of enquiry”.

1. Open Science Collaboration. Science http://dx.doi.org/10.1126/science.aac4716 (2015).

2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).


Related links


First results from psychology’s largest reproducibility test 2015-Apr-30

Replication studies: Bad copy 2012-May-16

Related external links

Reproducibility Project: Psychology


Baker, M. Over half of psychology studies fail reproducibility test. Nature (2015). https://doi.org/10.1038/nature.2015.18248


Published: 27 August 2015

DOI: https://doi.org/10.1038/nature.2015.18248




Center for Open Science


28 classic and contemporary psychology findings replicated in more than 60 laboratories each across three dozen nations and territories

Summary : A team of 186 researchers conducted replications of 28 classic and contemporary findings in psychology.  Overall, 14 of the 28 findings failed to replicate despite the massive sample size with more than 60 laboratories contributing samples from all over the world to test each finding.  The study examined the extent to which variability in replication success can be attributed to the study sample. If a finding replicated, it replicated in most samples with occasional variation in the magnitude of the findings.  If a finding was not replicated, it failed to replicate with little variation across samples and contexts. This evidence is inconsistent with a popular explanation that failures to replicate in psychology are likely due to changes in the sample between the original and replication study.  Ongoing efforts to improve research rigor such as preregistration and transparency standards may be the best opportunities to improve reproducibility.

The results of a massive replication project in psychology are published today in Advances in Methods and Practices in Psychological Science. The paper "Many Labs 2: Investigating Variation in Replicability Across Sample and Setting" represents the efforts of 186 researchers from 36 nations and territories to replicate 28 classic and contemporary psychology findings. Compared to other large replication efforts, the unique quality of this study was that each of the 28 findings was repeated in more than 60 laboratories all over the world, resulting in a median sample size of 7,157, roughly 64 times larger than the median sample size of 112 of the original studies. This provided two important features for evaluating replicability: (1) extremely sensitive tests for whether the effect could be replicated, and (2) insight into whether the replicability of the findings varied based on the sample.
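To get a feel for why a median sample of roughly 7,157 yields an extremely sensitive test, here is a rough, back-of-the-envelope power calculation for a two-group comparison of a small standardized effect (d = 0.2), using a normal approximation. This is only a sketch under those assumptions, not anything taken from the Many Labs 2 analysis plan.

```python
from math import sqrt
from scipy.stats import norm

def approx_power(total_n, d=0.2, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    n_per_group = total_n / 2
    ncp = d * sqrt(n_per_group / 2)          # expected value of the z statistic
    z_crit = norm.ppf(1 - alpha / 2)         # two-sided critical value
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

# Median total sample of the original studies (~112) versus Many Labs 2 (~7,157).
for n in (112, 7157):
    print(f"total N = {n:>5}: power to detect d = 0.2 is about {approx_power(n):.2f}")
```

Under these assumptions, the typical original sample has power of under 20 percent for an effect of that size, while the pooled Many Labs 2 sample detects it essentially every time.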

Overall, 14 of the 28 findings (50%) replicated successfully.  The effect sizes of the replication studies were less than half the size of the original studies on average.  With samples from all six populated continents, the study tested a popular argument that some psychology findings fail to replicate because the original and replication samples were different.  Across these 28 findings, there was not much evidence that replicability was highly dependent on the sample. For studies that failed to replicate, the finding failed to replicate in almost all the samples.  That is, they failed to exceed replication success that would be expected by chance if there was no effect to detect. For studies that replicated successfully, particularly those with large effect sizes, the replications were successful in almost all the samples.  There was some heterogeneity among the larger effects across samples but, for these cases, that variability seemed to indicate that the effect magnitude was larger in some samples than others, and not that the finding was present in some samples and absent in others.  “We were surprised that our diversity in our samples from around the world did not result in substantial diversity in the findings that we observed” said Rick Klein, one of the project leaders and a postdoctoral associate at the University of Grenoble Alpes in France. “Among these studies at least, if the finding was replicable, we observed it in most samples, and the magnitude of the finding only varied somewhat depending on the sample.”

This paper is the latest of six major replication projects in the social and behavioral sciences published since 2014.  These projects are a response to collective concern that the reproducibility of published findings may not be as robust as is assumed, particularly because of publication pressures that may lead to publication bias in which studies and findings with negative results are ignored or unpublished.  Such biases could distort the evidence in the published literature implying that the findings are stronger than the existing evidence suggests and failing to identify boundary conditions on when the findings will be observed. Across the six major replication projects, 90 of 190 findings (47%) have been replicated successfully according to each study’s primary evaluation criterion.  “The cumulative evidence suggests that there is a lot of room to improve the reproducibility of findings in the social and behavioral sciences” said Fred Hasselman, one of the project leaders and an assistant professor at Radboud University in Nijmegen.

The Many Labs 2 project addressed some of the common criticisms that skeptics have offered for why studies may fail to replicate.  First, the studies had massive samples ensuring sufficient power to detect the original findings. Second, the replication team obtained the original materials to ensure faithful replications of the original findings.  Third, all 28 studies underwent formal peer review at the journal prior to conducting the studies, a process called   Registered Reports .  This ensured that original authors and other experts could provide critical feedback on how to improve the study design to ensure that it would be an adequate test of the original finding.  Fourth, all of the studies were preregistered at OSF ( http://osf.io/ ) to ensure strong confirmatory tests of the findings, and all data, materials, and analytic code for the projects are archived and openly available on the OSF for others to review and reproduce the findings ( https://osf.io/8cd4r/ ).  And, fifth, the study directly evaluated whether sample characteristics made a meaningful difference in the likelihood of observing the original finding and, in most cases, it did not.  Michelangelo Vianello, one of the project leads and a Professor at the University of Padua concluded “We pursued the most rigorous tests of the original findings that we could. It was surprising that even with these efforts we were only able to obtain support for the original findings for half of the studies.  These results do not definitively mean that the original findings were wrong, but they do suggest that they are not as robust as might have been assumed. More research is needed to identify whether there are conditions in which the unreplicated findings can be observed. Many Labs 2 suggests that diversity in samples and settings may not be one of them.”

A second paper “ Predicting Replication Outcomes in the Many Labs 2 Study ” is scheduled to appear in the Journal of Economic Psychology and has also been released.  This paper reported evidence that researchers participating in surveys and prediction markets about the Many Labs 2 findings could predict which of the studies were likely to replicate and which were not.  In prediction markets, each share for a finding that successfully replicates is worth $1 and each share for a finding that fails to replicate is worth nothing. Researchers then buy and sell shares in each finding to predict which ones will succeed and fail to replicate.  The final market price is interpretable as the predicted probability that the original finding will replicate. Anna Dreber, senior author of the prediction market paper, and Professor at the Stockholm School of Economics and University of Innsbruck said “We now have four studies successfully demonstrating that researchers can predict whether findings will replicate or not in surveys and prediction markets with pretty good accuracy. This suggests potential reforms in peer review of grants and papers to help identify findings that are exciting but highly uncertain to invest resources to see if they are replicable.”
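As a purely hypothetical illustration of how final market prices can be scored once outcomes are known, the sketch below compares invented prices with invented replication outcomes using a Brier score and a simple 0.5 cutoff; none of these numbers come from the actual prediction-market studies.

```python
# Invented final market prices (interpreted as predicted probability of replication)
# and invented outcomes (1 = replicated, 0 = did not). For illustration only.
prices = [0.85, 0.72, 0.40, 0.15, 0.60, 0.25]
outcomes = [1, 1, 0, 0, 1, 0]

# Brier score: mean squared gap between predicted probability and outcome (lower is better).
brier = sum((p - o) ** 2 for p, o in zip(prices, outcomes)) / len(prices)

# Simple hit rate: how often a price above 0.5 corresponds to a successful replication.
hit_rate = sum((p > 0.5) == bool(o) for p, o in zip(prices, outcomes)) / len(prices)

print(f"Brier score: {brier:.3f}")
print(f"Correctly classified at the 0.5 threshold: {hit_rate:.0%}")
```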

Failure to replicate is part of ordinary science.  Researchers are investigating the unknown and there will be many false starts in the generation of new knowledge.  Nevertheless, prominent failures to replicate findings in psychology and other scientific disciplines have increased concerns that the published literature is not as reproducible as expected.  The scientific community is in the midst of a reformation that involves self-critical review of the reproducibility of research, evaluation of the cultural incentives that may lead to irreproducibility and inefficiency in discovery, and testing of solutions to improve reproducibility and accelerate science.  The Many Labs 2 project embodies many of the reforms that are spreading across disciplines including   preregistration , use of   Registered Reports   in partnership with a journal, and open sharing of all   data, materials, and code .  The investigation of irreproducibility itself serves as a mechanism for improving reproducibility and the pace of discovering knowledge, solutions, and cures.


Psychology’s Replication Crisis Is Running Out of Excuses

Another big project has found that only half of studies can be repeated. And this time, the usual explanations fall flat.

"The Thinker," by Auguste Rodin

Over the past few years, an international team of almost 200 psychologists has been trying to repeat a set of previously published experiments from its field, to see if it can get the same results. Despite its best efforts, the project, called Many Labs 2 , has only succeeded in 14 out of 28 cases. Six years ago, that might have been shocking. Now it comes as expected (if still somewhat disturbing) news.

In recent years, it has become painfully clear that psychology is facing a “ reproducibility crisis ,” in which even famous, long-established phenomena—the stuff of textbooks and TED Talks—might not be real. There’s social priming , where subliminal exposures can influence our behavior. And ego depletion , the idea that we have a limited supply of willpower that can be exhausted. And the facial-feedback hypothesis , which simply says that smiling makes us feel happier.

One by one, researchers have tried to repeat the classic experiments behind these well-known effects—and failed. And whenever psychologists undertake large projects , like Many Labs 2, in which they replicate past experiments en masse , they typically succeed, on average, half of the time.

Read: A worrying trend for psychology’s “simple little tricks”

Ironically enough, it seems that one of the most reliable findings in psychology is that only half of psychological studies can be successfully repeated.

That failure rate is especially galling, says Simine Vazire from the University of California at Davis, because the Many Labs 2 teams tried to replicate studies that had made a big splash and been highly cited. Psychologists “should admit we haven’t been producing results that are as robust as we’d hoped, or as we’d been advertising them to be in the media or to policy makers,” she says. “That might risk undermining our credibility in the short run, but denying this problem in the face of such strong evidence will do more damage in the long run.”

Many psychologists have blamed these replication failures on sloppy practices. Their peers, they say, are too willing to run small and statistically weak studies that throw up misleading fluke results, to futz around with the data until they get something interesting, or to only publish positive results while hiding negative ones in their file drawers.

But skeptics have argued that the misleadingly named “crisis” has more mundane explanations . First, the replication attempts themselves might be too small. Second, the researchers involved might be incompetent, or lack the know-how to properly pull off the original experiments. Third, people vary, and two groups of scientists might end up with very different results if they do the same experiment on two different groups of volunteers.

The Many Labs 2 project was specifically designed to address these criticisms. With 15,305 participants in total, the new experiments had, on average, 60 times as many volunteers as the studies they were attempting to replicate. The researchers involved worked with the scientists behind the original studies to vet and check every detail of the experiments beforehand. And they repeated those experiments many times over, with volunteers from 36 different countries, to see if the studies would replicate in some cultures and contexts but not others. “It’s been the biggest bear of a project,” says Brian Nosek from the Center for Open Science, who helped to coordinate it. “It’s 28 papers’ worth of stuff in one.”

Despite the large sample sizes and the blessings of the original teams, the team failed to replicate half of the studies it focused on. It couldn’t, for example, show that people subconsciously exposed to the concept of heat were more likely to believe in global warming , or that moral transgressions create a need for physical cleanliness in the style of Lady Macbeth , or that people who grow up with more siblings are more altruistic. And as in previous big projects , online bettors were surprisingly good at predicting beforehand which studies would ultimately replicate. Somehow, they could intuit which studies were reliable.

Read: Online bettors can sniff out weak psychology studies.

But other intuitions were less accurate. In 12 cases, the scientists behind the original studies suggested traits that the replicators should account for. They might, for example, only find the same results in women rather than men, or in people with certain personality traits. In almost every case, those suggested traits proved to be irrelevant. The results just weren’t that fickle.

Likewise, Many Labs 2 “was explicitly designed to examine how much effects varied from place to place, from culture to culture,” says Katie Corker from Grand Valley State University, who chairs the Society for the Improvement of Psychological Science. “And here’s the surprising result: The results do not show much variability at all.” If one of the participating teams successfully replicated a study, others did, too. If a study failed to replicate, it tended to fail everywhere.

It’s worth dwelling on this because it’s a serious blow to one of the most frequently cited criticisms of the “reproducibility crisis” rhetoric. Surely, skeptics argue, it’s a fantasy to expect studies to replicate everywhere. “There’s a massive deference to the sample,” Nosek says. “Your replication attempt failed? It must be because you did it in Ohio and I did it in Virginia, and people are different. But these results suggest that we can’t just wave those failures away very easily.”

This doesn’t mean that cultural differences in behavior are irrelevant. As Yuri Miyamoto from the University of Wisconsin at Madison notes in an accompanying commentary, “In the age of globalization, psychology has remained largely European [and] American.” Many researchers have noted that volunteers from Western, educated, industrialized, rich, and democratic countries— WEIRD nations —are an unusual slice of humanity who think differently than those from other parts of the world.

In the majority of the Many Labs 2 experiments, the team found very few differences between WEIRD volunteers and those from other countries. But Miyamoto notes that its analysis was a little crude—in considering "non-WEIRD countries" together, it's lumping together people from cultures as diverse as Mexico, Japan, and South Africa. "Cross-cultural research," she writes, "must be informed with thorough analyses of each and all of the cultural contexts involved."

Read: Psychology’s replication crisis has a silver lining.

Nosek agrees. He'd love to see big replication projects that include more volunteers from non-Western societies, or that try to check phenomena that you'd expect to vary considerably outside the WEIRD bubble. "Do we need to assume that WEIRDness matters as much as we think it does?" he asks. "We don't have a good evidence base for that."

Sanjay Srivastava from the University of Oregon says the lack of variation in Many Labs 2 is actually a positive thing. Sure, it suggests that the large number of failed replications really might be due to sloppy science. But it also hints that the fundamental business of psychology—creating careful lab experiments to study the tricky, slippery, complicated world of the human mind—works pretty well. “Outside the lab, real-world phenomena can and probably do vary by context,” he says. “But within our carefully designed studies and experiments, the results are not chaotic or unpredictable. That means we can do valid social-science research.”

The alternative would be much worse. If it turned out that people were so variable that even very close replications threw up entirely different results, “it would mean that we could not interpret our experiments, including the positive results, and could not count on them happening again,” Srivastava says. “That might allow us to dismiss failed replications, but it would require us to dismiss original studies, too. In the long run, Many Labs 2 is a much more hopeful and optimistic result.”

* A mention of the marshmallow test was removed from an early paragraph, since the circumstances there differ from those of other failed replications.


Scientists replicated 100 recent psychology experiments. More than half of them failed.

by Julia Belluz


Replication is one of the foundational ideas behind science. It's when researchers take older studies and reproduce them to see if the findings hold up. Testing, validating, retesting: It's all part of the slow and grinding process to arrive at some semblance of scientific truth.

Yet it seems that, all too often, when researchers try to replicate studies, the findings simply flop or flounder. Some have even called this a “crisis of irreproducibility.” Consider the newest evidence: a landmark study published today in the journal Science. More than 270 researchers from around the world came together to replicate 100 recent findings from top psychology journals. By one measure, only 36 percent showed results that were consistent with the original findings. In other words, well over half of the replications failed.

The results of this study may actually be too generous

“The results are more or less consistent with what we’ve seen in other fields,” said Ivan Oransky, one of the founders of the blog Retraction Watch, which tracks scientific retractions. Still, he applauded the effort: “Because the authors worked with the original researchers and repeated the experiments, the paper is an example of the gold standard of replication.”

But Stanford's John Ioannidis, who famously penned a paper arguing that most published research findings are wrong, explained that precisely because it's the gold standard, the results might be a little too generous; in reality, the replication failure rate might be even higher.

“I say this because the 100 assessed studies were all published in the best journals, so one would expect the quality of the research and the false rates to be higher if studies from all journals were assessed,” he said.

The 100 studies that were replicated were selected after excluding some 50 others for which replication was judged too difficult. “Among those that did get attempted, difficult, challenging replication was a strong predictor of replication failure, so the failure rates might have been even higher in the 50 or so papers that no one dared to replicate,” Ioannidis said.

Again, the scientists worked closely with the researchers of the original papers, to get their data and talk over the details of their methods. This is why this effort is considered top quality— they tried really hard to understand the original research and duplicate it — but that collaboration may have also biased the results, increasing the chances of a successful replication. “In a few cases [the original authors] affected the choice of which exact experiment among many should be attempted to replicate,” said Ioannidis.

Just listen to how difficult it was to repeat just one experiment

Even with all this buy-in and support, running a replication is an extremely difficult task, explained one of the people on the team, University of Virginia PhD candidate David Reinhard. In fact, after talking to Reinhard, I’ve come to view the chance of reproducing a study and arriving at the same result — especially in a field like psychology, where local culture and context are so important — as next to nil.

Reinhard had been hearing a lot about the problem of irreproducibility in science recently and wanted to get firsthand experience with replication. He had no idea what he was in for — and his journey tells a lot about how arduous science can be.

To begin with, the original study he wanted to replicate failed during the pretesting stage. That’s the first little-appreciated step of any replication (or study, for that matter) when researchers run preliminary tests to make sure their experiment is viable.


The study he finally settled on was originally run in Germany. It looked at how “global versus local processing influenced the way participants used priming information in their judgment of others.” In English, that means the researchers were studying how people use concepts they are currently thinking about (in this case, aggression) to make judgments about other people’s ambiguous behavior when they were in one of two mindsets: a big-picture (global) mindset versus a more detail-oriented (local) mindset. The original study had found that they were more suggestible when thinking big.

“Fortunately for me, the authors of the study were helpful in terms of getting the materials and communication,” Reinhard said. He spent hours on the phone with them — talking over the data, getting information about details that were missing or unclear in the methods section of the paper (where researchers spell out how an experiment was conducted). He also had to translate some of the data from German to English, which took more time and resources.

This cooperation was essential, he said, and it’s not necessarily always present. Even still, he added, “There were a lot of difficulties that arose.”

Reinhard had to figure out how to translate the social context, bringing a study that ran in Germany to students at the University of Virginia. For example, the original research used maps from Germany. “We decided to use maps of one of the states in the US, so it would be less weird for people in Virginia,” he said.

After all that, he couldn't reproduce the original findings

Another factor: Americans’ perceptions of aggressive behavior are different from Germans’, and the study hinged on participants scoring their perceptions of aggression. The German researchers who ran the original study based it on some previous research that was done in America, but they changed the ratings scale because the Germans’ threshold for aggressive behavior was much higher. Now Reinhard had to change them back — just one of a number of variables that had to be manipulated.

In the end, he couldn’t reproduce their findings, and he doesn’t know why his experiment failed. “When you change the materials, a lot of things can become part of the equation,” he said. Maybe the cultural context mattered, or using different stimuli (like the new maps) made a difference. Or it could just be that the original finding was wrong.

“I still think replication is an extremely important part of science, and I think that’s one of the really great things about this project,” Reinhard said. But he’s also come to a more nuanced view of replication, that sometimes the replications themselves can be wrong, too, for any number of reasons.

“The replication is just another sort of data point that there is when it comes to the effect but it’s not the definitive answer,” he added. “We need a better understanding of what a replication does and doesn’t say.”

Here’s how to make replication science easier

After reading the study and talking to Reinhard, I had a much better sense of how replication works. But I also felt pretty sorry about the state of replication science.

It seemed a little too random, unsystematic, and patchwork — not at all the panacea many have made it out to be.

I asked Brian Nosek , the University of Virginia psychologist who led the Science effort, what he learned in the process. He came to a conclusion very similar to Reinhard’s:

My main observation here is that reproducibility is hard. That’s for many reasons. Scientists are working on hard problems. They’re investigating things where we don’t know the answer. So the fact that things go wrong in the research process, meaning we don’t get to the right answer right away, is no surprise. That should be expected.

To make it easier, he suggested some fixes. For one thing, he said, scientists need to get better at sharing the details — and all the assumptions they may have made — in the methods sections of their papers. “It would be great to have stronger norms about being more detailed with the methods,” he said. He also suggested added supplements at the end of papers that get into the procedural nitty-gritty, to help anyone wanting to repeat an experiment. “If I can rapidly get up to speed, I have a much better chance of approximating the results,” he said. (Nosek has detailed other potential fixes in these guidelines for publishing scientific studies, which I wrote about here — all part of his work at the Center for Open Science .)

Ioannidis agreed and added that more transparency and better data sharing are also key. “It is better to do this in an organized fashion with buy-in from all leading investigators in a scientific discipline rather than have to try to find the investigator in each case and ask him or her in detective-work fashion about details, data, and methods that are otherwise unavailable,” he said. “Investigators move, quit science, die, lose their data, have their hard drives with all their files destroyed, and so forth.”

What both Ioannidis and Nosek are saying is that we need to have a better infrastructure for replication in place. For now, science is slowly lurching along in this direction. And that’s good news, because trying to do a replication — even with all the infrastructure of a world-famous experiment behind you, as Reinhard had — is challenging. Trying to do it alone is probably impossible.

Further reading:

  • Science is often flawed. It’s time we embraced that.
  • John Ioannidis has dedicated his life to quantifying how science is broken
  • This is why you shouldn’t believe that exciting new medical study
  • How the biggest fraud in political science nearly got missed
  • Science is broken. These academics think they have the answer.


Two More Classic Psychology Studies Just Failed The Reproducibility Test


For years now, researchers have been warning about a reproducibility crisis in science - the realisation that a lot of seminal papers, particularly in psychology, don't actually hold up when scientists take the time to try to reproduce the results.

Now, two more key papers in psychology have failed the reproducibility test, serving as an important reminder that many of the scientific 'facts' we've come to believe aren't necessarily as steadfast as we thought.

To be fair, just because findings can't be reproduced, it doesn't automatically mean they're wrong. Replication is an important part of the scientific method that helps us nut out what's really going on - it could be that the new researchers did something differently, or that the trend is more subtle than originally thought.

But the problem is that, for decades now, the importance of replicating results has been largely overlooked, with researchers usually choosing to chase a 'new' discovery rather than fact-checking an old one - thanks to the pressure to publish exciting and novel findings  in order to secure jobs.

As John Oliver said earlier this year : "There's no Nobel prize for fact-checking."

That's brought us to the 'crisis' we're in now, where most papers that are published can't be replicated. Last year, the University of Virginia led a new Reproducibility Project that repeated 100 experiments… with only one-third of them successfully being replicated - although this study has since been criticised for having its own replication errors.

The two latest examples are widely cited papers from 1988 and 1998.

The 1988 study  concluded that our facial expressions can influence our mood - so the more we smile, the happier we'll be, and vice versa.

The 1998 study, led by Roy Baumeister from Case Western Reserve University, provided evidence for something called ego depletion, which is the idea that our willpower can be worn down over time.

The latter assumption has been the basis of a huge number of follow-on psychological studies, but now Martin Hagger from Curtin University in Australia has led researchers from 24 labs in an attempt to recreate the seminal paper, and found no evidence that the effect exists.

His results have been accepted for publication in the journal Perspectives on Psychological Science and are due to appear in the coming weeks.

The facial expression replication attempt follows much the same trend.

In the original paper , researchers from Germany asked participants to read The Far Side comics by artist Gary Larson, with either a pen held between their teeth (forcing them into a smile) or between their lips (replicating a pout).

The team found that people who smiled found the comics funnier than those who were pouting, leading the researchers to conclude that changing our facial expression can change our moods, something known as the facial feedback hypothesis.

But when a team of researchers at the University of Amsterdam in the Netherlands conducted the same experiment - even using the same '80s comics - they failed to replicate the findings " in a statistically compelling fashion ".

"Overall, the results were inconsistent with the original result," the team conclude in  Perspectives in Psychological Science  - a separate paper to the ego depletion replication, but also due to be published in a few weeks.

Again, that doesn't necessarily mean that the original result wasn't accurate - nine out of the 17 Dutch labs that attempted to recreate the experiment actually reported a similar result to the 1988 study. But the remaining eight labs didn't, and when the results were combined, the effect disappeared.  
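For readers curious what "combining the results" across labs can look like, here is a simplified, hypothetical sketch of a fixed-effect, inverse-variance meta-analysis over invented per-lab estimates. The registered replication report used its own, more detailed analysis plan; the point here is only that labs pointing in different directions can pool to an overall effect near zero.

```python
from math import sqrt
from scipy.stats import norm

# Invented (effect estimate, standard error) pairs for each lab. Illustration only.
labs = [(0.20, 0.12), (-0.05, 0.10), (0.10, 0.15), (-0.12, 0.11),
        (0.04, 0.09), (0.15, 0.14), (-0.08, 0.10), (0.02, 0.13)]

# Fixed-effect meta-analysis: weight each lab by the inverse of its variance.
weights = [1 / se ** 2 for _, se in labs]
pooled = sum(w * est for (est, _), w in zip(labs, weights)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))

z = pooled / pooled_se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f}), p = {p_value:.2f}")
```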

"[T]his does not mean the entire facial feedback hypothesis is dead in the water," writes Christian Jarrett for the British Psychological Society's Research Digest .

"Many diverse studies have supported the hypothesis, including research involving participants who have undergone botox treatment, which affects their facial muscles."

The results could be due to a number of other variables - like, maybe people today don't find The Far Side funny anymore. And the Dutch study also used psychology students, many of whom would have been familiar with the 1988 paper, which could have skewed the results .

Only more investigation will help us know for sure.

But in the meantime, all this hype over the reproducibility crisis in the media lately can only be a good thing for the state of science. 

"It shows how much effort and attention has gone towards improving the accuracy of the knowledge produced," John Ioannidis, a Stanford University researcher who led a 2005 reproducibility study,  told Olivia Goldhill at Quartz.

"Psychology is a discipline that has always been very strong methodologically and was at the forefront at describing various biases and better methods. Now they are again taking the lead in improving their replication record."

One positive that's already emerged is a discussion about pre-registering trials, which would stop researchers from tweaking their analyses after the data have been collected in order to get more exciting results.

And hopefully, the more people talk and think about replicating results, the better the public will get at thinking critically about the science news they read.

"Science isn't about truth and falsity, it's about reducing uncertainty," Brian Nosek, the researcher behind the Reproducibility Project, told Quartz .

"Really, this whole project is science on science: researchers doing what science is supposed to do, which is be skeptical of our own process, procedure, methods, and look for ways to improve." 


Jeremy D. Safran Ph.D.

Replication Crisis

Replication Problems in Psychology: Are replication failures in psychology a crisis or a 'tempest in a teapot'?

Posted November 15, 2015


In late August 2015 an article appeared in the New York Times with a loaded headline: Many Psychology Findings Not as Strong as Claimed, Study Says. The article reported on a recent publication in the journal Science, which raised important questions about the extent to which findings in psychological research are replicable. This is an important issue, since replicability is considered a hallmark of the scientific method. If findings cannot be replicated, the trustworthiness of the research becomes suspect. The publication in Science outlined the findings of a major project in which a network of research teams (there were 270 contributing authors to the publication), led by Brian Nosek of the University of Virginia, coordinated efforts to replicate a total of 100 studies published in recent years in three top-tier psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. They found that just over one third of the studies could be replicated. As one could imagine, this study, designated as The Reproducibility Project, has caused quite a stir in both the professional world and the media.

After its publication, the September issue of American Psychologist, the flagship publication of the American Psychological Association, was devoted to what has been termed the “ replication crisis ” in psychology. Some of the articles in this issue attempted to dismiss or minimize the significance of The Reproducibility Project, while others attempted to identify factors endemic to the field of psychology which may contribute to the problem and suggested steps that can potentially be taken to address it moving forward.

The field of psychology has been aware of this problem for some time. In the mid-1970s Lee J. Cronbach, a prominent methodologist, noted the general tendency for effects in psychology to decay over time. At the same time, Michael Mahoney, a prominent cognitive behavioral theorist and researcher, drew attention to this problem in a more emphatic way. As he put it, “The average physical scientist would probably shudder at the number of social science ‘facts’ which rest on unreplicated research.” [1] More recently, cognitive psychologist Jonathan Schooler garnered considerable attention with a publication in the high-profile journal Nature in which he described the ubiquity of the replication problem in psychology, and outlined some of the factors potentially responsible for it.

It is important to bear in mind that conversations about whether there is indeed a replication crisis in psychology take place within the broader context of an ongoing conversation about the cultural boundaries of science. Efforts to establish the boundaries of demarcation between the “sciences” and nonsciences have been of longstanding interest to philosophers, and increasingly to sociologists, anthropologists, and historians as well.

Within the field of psychology, conversations of this sort can arouse intense controversy and heated discussion. This is not surprising, given the fact that whether or not a given field of inquiry is considered to be a science has significant implications for the designation of epistemic authority in our culture. This in turn has immense implications for social prestige, power, credibility, and the allocation of resources, all of which shape research agendas and ultimately influence shared cultural assumptions.

Whether or not psychology (and other social sciences such as anthropology, sociology and economics) should be viewed as belonging in the same category as the natural sciences such as physics, chemistry and biology, can and has been argued on various grounds including methodology, explanatory power, predictive power, and the ability to generate useful applications. But the bottom line is that these cultural boundaries shift over time, and at times become the focus of considerable controversy.

In the aftermath of World War II, the question of whether the social sciences should be included along with the natural sciences within the nascent National Science Foundation (NSF) was the topic of sustained debate. Proponents of including the social sciences argued that they share a common methodology with the natural sciences and that although they have less predictive power can nevertheless lead to trustworthy developments in knowledge with important applications. Opponents argued that the social sciences have no advantages over common sense, that their inclusion would lead to a tarnishing of the public’s view of the natural sciences, and that unlike the natural sciences, which at the time were portrayed as “objective,” the objectivity of the social sciences is compromised by their inevitable entanglement with human values. Ultimately a decision was made not to completely exclude social sciences from the NSF, but to include them under a miscellaneous “other sciences” category (in contrast to sciences such as physics and chemistry which were designated by name).

Then in the 1960s a number of social science proponents argued vigorously for the establishment of a National Social Science Foundation (NSSF) to house the social sciences. At this time they argued that the social sciences do indeed differ from the natural sciences in important ways, and warrant dedicated funding, so that they don’t have to vie with the natural sciences for resources. An important concern at the time among those arguing for the creation of an NSSF was that the social sciences would always be treated as “second class citizens” within the NSF. In contrast to the argument for inclusion upon the initial establishment of the NSF, defenders of social sciences now argued that the social and natural sciences use different methods and that the social sciences need to be evaluated by different criteria in order to fully develop. Ultimately Congress decided not to establish a National Social Science Foundation, but the social sciences did receive increased attention within the NSF for a period of time.[2]

Psychology has had a particularly strong investment in defining itself as a science that is similar in important respects to the natural sciences. Perhaps this is in part a function of its past success in attracting government funding on that basis. Because of this, psychology tends to be somewhat vulnerable to what one might call "epistemic insecurity" when articles such as "Many Psychology Findings Not as Strong as Claimed, Study Says" appear in the New York Times. I suspect that the media will lose interest in this topic fairly quickly, and that the recent special issue of American Psychologist will turn out to have been the peak of interest within the field itself. Be that as it may, I will devote some of this essay to discussing factors contributing to the "replication problem" in my own field of research (i.e., psychotherapy research), as well as the way in which this theme is intertwined with the politics of funding within the healthcare system. I will then discuss factors contributing to replication failures in psychology more generally, with a particular emphasis on important discrepancies between the way research is represented in the published literature and the realities of research as practiced on the ground. Finally, I will argue that rather than simply striving to bring the everyday practice of psychology research closer in line with the field's idealized portrayal of it, there is much to be learned from studying the way impactful research in psychology is actually practiced.


Psychotherapy Research

In the psychotherapy research field, the replication problem has been widely discussed for years, with some investigators ignoring it, others dismissing its significance, and others making serious efforts to grapple with its implications. One of the key ways in which the failure to replicate plays out in psychotherapy research is a phenomenon commonly referred to as the "therapeutic equivalence paradox" or the "dodo bird verdict," alluding to an episode in Lewis Carroll's Alice in Wonderland in which the Dodo decrees, "Everybody has won and all must have prizes." In psychotherapy research, the "dodo bird verdict" refers to the finding that despite ongoing claims by proponents of different therapy schools regarding the superiority of their respective approaches, systematic and rigorous syntheses of large numbers of clinical trials conducted by different investigators over time fail to find that any one therapeutic approach is consistently more effective than the others.

Those who ignore this phenomenon or dismiss its relevance tend to be proponents of the so-called evidence-based treatment approach. If one has a stake in claiming that a particular form of psychotherapy is more effective than others, the data consistent with the dodo bird verdict become a nuisance to be ignored, or worse, a threat to one's funding. On the other hand, the dodo bird verdict is taken more seriously by those who 1) have an investment in arguing for the value of a therapeutic approach that is supposedly not "evidence based" (e.g., psychodynamic therapies), 2) take the position that all therapies work through common factors, or 3) believe that the effects of therapy are a function of the interaction between each unique therapist-patient dyad. The situation is complicated further by the fact that many of those who recognize the validity of the dodo bird verdict (e.g., colleagues of mine who are psychodynamic researchers) will conduct clinical trials evaluating the effectiveness of psychodynamic treatment in an effort to have it recognized as effective by the field as a whole, where "evidence-based" is more or less a synonym for "funded." To further complicate the picture, a number of prominent psychotherapy researchers in the 1970s, who were funded at the time by the National Institute of Mental Health, recognized the existence of the "therapeutic equivalence paradox" and believed that clinical trials in psychotherapy research would be of limited value in advancing the field, arguing instead for the value of investigating how change takes place. Nevertheless, with the rise of biological psychiatry in the late 1970s, and the looming threat of the government and healthcare insurers deciding that psychotherapy was of no value, they decided it was worth putting what little influence they had left into launching the most expensive program of research that had ever taken place in the field, evaluating the relative effectiveness of two forms of short-term psychotherapy versus antidepressant medication for the treatment of depression. Not only did this study ultimately demonstrate that these therapies were as effective as medication, it established the clinical trials method derived from pharmaceutical research as the standard for all psychotherapy research that would be fundable moving forward. Thus the stage was set for mainstream psychotherapy research to treat randomized clinical trials as the methodological "gold standard," despite the fact that prominent psychotherapy researchers had been making the case for some time that such trials were of limited value for genuinely advancing knowledge in the field.

In the mid-1970s, Lester Luborsky, a prominent psychodynamic researcher, reanalyzed the aggregated results of a number of randomized clinical trials comparing the effectiveness of different forms of psychotherapy. In this reanalysis, he used a statistical procedure to estimate how much impact the theoretical allegiance of the investigator has on the outcome of a study. He found that the investigator's theoretical allegiance has a massive impact on treatment outcome, one that dwarfs the impact attributable to the brand of therapy. Since that time, Luborsky's findings have been replicated often enough that they are beyond dispute. What accounts for the researcher allegiance effect? Although outright misrepresentation of findings may occur in some instances, a number of other factors are likely to be more common.
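To make the kind of reanalysis described above more concrete, here is a minimal sketch of how one might estimate an allegiance effect from a collection of trials. The numbers, variable names, and the strength of the simulated effect are all hypothetical illustrations of my own, not Luborsky's data or procedure.

```python
# Toy illustration (hypothetical data, not Luborsky's) of estimating a
# researcher allegiance effect: regress each trial's between-therapy
# effect-size difference on a rating of the investigator's allegiance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_trials = 40

# Allegiance rating: -1 = strongly favors therapy B, +1 = strongly favors therapy A.
allegiance = rng.uniform(-1, 1, n_trials)

# Assume (for illustration only) that allegiance drives most of the between-trial
# variation in the reported advantage of therapy A over therapy B.
effect_diff = 0.5 * allegiance + rng.normal(0, 0.2, n_trials)

model = sm.OLS(effect_diff, sm.add_constant(allegiance)).fit()
print(f"Estimated allegiance slope: {model.params[1]:.2f}")
print(f"Share of variance explained: {model.rsquared:.2f}")
```

In a pattern like this, the investigator's allegiance accounts for most of the between-study variation, which is roughly what "dwarfing the brand of therapy" would look like in such a reanalysis.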

One factor is that psychotherapy researchers tend to select treatment outcome measures that reflect their understanding of what change should look like, and this understanding is shaped by different worldviews. Another is that most investigators understandably have an investment in demonstrating that their preferred approach (in some cases an approach that they have played a role in developing) is effective. This investment is likely to influence the outcome of the study in a variety of ways. For one thing, there is a phenomenon that can be called the ‘home team advantage.’ If an investigator believes in the value of the approach they are testing, this belief is likely to have an impact on the enthusiasm and confidence of the therapists who are implementing this approach in the study. In many cases the effectiveness of the ‘home team’ treatment is evaluated relative to the effectiveness of a treatment intentionally designed to be less effective. One consequence of the ‘home team advantage’ is that a replication study carried out by a different team with different theoretical commitments is likely to fail to replicate the findings of the first team, and may in fact come up with findings that are completely contradictory.

Psychology Research in General

Moving beyond the specifics of psychotherapy research to the field of psychology in general, there are a number of factors potentially contributing to the replication problem (no doubt some of these factors are relevant to other fields as well, but that is not the focus of this brief essay). The first can be referred to as the "originality bias." In practice, if not in theory, controlled replications are treated as one of the lowest priorities in the field. Straightforward replications are thus less likely to be accepted for publication in important journals, where there is a tendency for reviewers to dismiss them as "mere replications." Psychology researchers learn this early in graduate school and are therefore less likely to conduct replication studies. When they do conduct replications, they are apt to modify the design so that the study has the potential of adding a "new wrinkle" to the topic. Studies of this type are thus unlikely to be exact replications.

Another factor is that studies that do not yield statistically significant findings are less likely to be accepted for publication by reviewers, and are thus less likely even to be submitted for publication. When reviews of the literature include unpublished Ph.D. dissertations, conclusions that seemed well supported on the basis of the published research alone often evaporate. Michael Mahoney conducted a provocative study in which he sent out 75 manuscripts to reviewers from well-known psychology journals. All versions of the manuscript were anonymously authored and contained identical introductions, methodology sections, and bibliographies. The manuscripts varied, however, in whether the results were significant, mixed, or nonsignificant. His findings were striking: manuscripts with significant results were consistently accepted, while manuscripts with nonsignificant or mixed findings were consistently rejected.

A third factor is that there are important contrasts between the codified, formal prescriptions for how research should be practiced in psychology and the reality of research as it is conducted in the real world. In graduate school, psychology students are taught that the scientific method consists of spelling out hypotheses and then conducting research to test them. In the real world, researchers often modify their hypotheses in a fashion that is informed by the findings that are emerging. One of the articles in the recent issue of American Psychologist devoted to the "replication crisis" suggested that one remedy would be for funding agencies to require that all principal investigators register their hypotheses with the funding agency before they begin collecting data. In a recent conversation on this topic, a colleague of mine who dismisses the significance of the "replication crisis" remarked that the requirement to register hypotheses in advance would be problematic, since "we all know that some of the more creative aspects of research involve modifying and refining hypotheses in light of the data that emerges" (or something to that effect). I am in complete agreement with him. My concern is that this type of post hoc attempt to find meaning in the data when one's initial hypotheses are not supported is an implicit aspect of psychology research in the real world, rather than the formal position that is advocated. This is not as scandalous as it might seem. Abduction, the reasoning process through which theories are formulated to fit observed patterns, plays an important role in the natural sciences as well. If the field of psychology were to formally place more emphasis on the value of abduction, requiring investigators to register their hypotheses in advance might be less problematic. Post hoc efforts to make sense of unexpected findings could be considered as legitimate and interesting as findings that support one's a priori hypotheses, and there would thus be less incentive for investigators to report their research in a way that appears to support their initial hypotheses.

Another common practice in psychology research is referred to as "data mining." Data mining involves analyzing the data in every conceivable way until significant findings emerge. In some cases this means examining aspects of the data that one had not initially planned to examine. In other cases it involves trying a range of different statistical procedures until significant results appear. Textbooks on psychology research methodology teach that data mining, or "going on a fishing expedition," is unacceptable. There is, however, nothing inherently wrong with data mining. The reality of psychology research as practiced on the ground is that it does not unfold in the linear fashion in which it is often portrayed in the published literature. Data are collected and analyzed, and researchers try to make sense of the data as they analyze it; this process often helps them refine their thinking about the phenomenon of interest. The practice becomes problematic when researchers report only a selective account of everything that has taken place between data collection and the final publication. But of course there are understandable reasons for selective reporting: clean stories are more compelling and easier to publish.
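As a concrete illustration of why unreported data mining undermines replication, here is a minimal simulation sketch of my own, not drawn from any of the studies discussed here. It assumes a researcher who measures twenty outcomes per study when no true effect exists anywhere, and who would report whichever comparison happens to cross the conventional .05 threshold.

```python
# Minimal sketch: with 20 outcome measures and no true effect at all,
# most "studies" still yield at least one nominally significant result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_outcomes, n_per_group = 1000, 20, 30

studies_with_a_finding = 0
for _ in range(n_studies):
    significant = False
    for _ in range(n_outcomes):
        treatment = rng.normal(0, 1, n_per_group)  # null: both groups identical
        control = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < 0.05:
            significant = True
    if significant:
        studies_with_a_finding += 1

print(f"Null studies with something 'significant' to report: "
      f"{studies_with_a_finding / n_studies:.0%}")  # roughly 1 - 0.95**20, about 64%
```

A result selected this way is, by construction, unlikely to reappear when an independent team reruns only the single published comparison, which is one mechanism by which selective reporting feeds the replication problem.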

Another standard practice consists of conducting pilot research. Pilot research is the trial-and-error process through which the investigator develops important aspects of his or her methodological procedure. One important aspect of this pilot phase entails experimenting with different ways of implementing what is termed the experimental manipulation (i.e., the conditions that the subjects are exposed to) until one is able to consistently demonstrate the phenomenon of interest. This type of "stage management" is part of the art of psychology research. Is such "stage management" inherently problematic? Not necessarily. What is problematic is that publications do not, as a rule, describe the pilot work that led to the development of the effective experimental manipulation.

Vividness and the compelling demonstration of phenomena

An important aspect of the skill of psychological research consists of devising creative procedures for demonstrating phenomena. While psychology research is not performance art, a key element in whether or not a particular piece of research has an impact on the field is the vividness or memorability of the study. Two of the most influential and widely known experiments in the history of psychology exemplify this principle: Stanley Milgram's classic research on obedience to authority, and Harry Harlow's demonstration of the fundamental nature of the need for "contact comfort" in baby rhesus monkeys.

Milgram conducted his research in the context of the Eichmann trial and Hannah Arendt's Eichmann in Jerusalem: A Report on the Banality of Evil. He set about designing an experiment to demonstrate that, given the right context, the average American could be manipulated into acting in a cruel and inhumane way out of deference to an authority figure. Milgram worked with his research team to stage an elaborate deception in which subjects were recruited to participate in what they were told was a study on learning.

Upon arriving at the lab, subjects were assigned the role of "teacher" (supposedly on a random basis) and paired with another subject who they were led to believe had been randomly assigned the role of "student." In reality, the so-called students were research confederates working with Milgram to stage the deception. The real subjects, who were always assigned the role of teacher, were instructed by an "experimenter" to administer electric shocks when the student gave incorrect answers, ostensibly as a way of investigating whether punishment can facilitate learning. The "experimenter" wore a white lab coat, designed to give him an air of authority, and stood beside the subjects, instructing them when to administer the shocks and what voltage to use. The setup was rigged so that the "students" made ongoing mistakes. With each repeated mistake the "experimenter" instructed the "teacher" to increase the intensity of the shock, until subjects were administering potentially harmful levels of shock that left the "students" screaming in pain. The published results reported that over 60 percent of subjects cooperated with the "experimenter" to the point at which they were administering painful and potentially harmful levels of shock.

Archival research reveals that Milgram spent a tremendous amount of time piloting variations of the experimental procedure until he found one that produced the desired effect. [3] It turns out that Milgram also ran a range of variations on the experiment, and in some conditions the proportion of subjects complying with the experimenter was considerably lower than the rates reported in his publications. In interviews, Milgram's subjects revealed that they construed the experimental situation in a variety of different ways. Some believed that they really had inflicted painful shocks and were genuinely traumatized. Others were skeptical and simply "played along." None of these details were revealed in Milgram's published papers (or in the book he eventually authored).

Considerable controversy erupted following the publication of Milgram's dramatic results, much of it focusing on whether it was ethical to deceive subjects in this fashion. In the aftermath of this controversy, new ethics policies made precise replications of Milgram's research paradigm impossible. At the time there was also considerable controversy regarding the interpretation, meaning, and generalizability of his results. Despite this controversy, Milgram's experiment had a lasting effect on public perception and remains one of the most well-known experiments in the history of psychology. Milgram's notes reveal that he took great care in developing a compelling documentary of his research, carefully choosing footage of subjects and "experimenters" that he believed would be most impactful, and employing directorial and editing strategies to maximize the documentary's dramatic impact. And to great effect: the image of Milgram's "shock apparatus" and the compliant "teacher" inflicting painful levels of shock on a screaming "student" has become a fixed element in popular culture.

The compelling impact of vivid demonstrations can also be seen in Harry Harlow's influential research of the 1950s investigating the factors underlying the infant-caregiver bond. At the time, the dominant theory in American psychology was the behavioral notion that a baby's affection for his or her mother is learned because she provides food to satisfy the child's hunger. Harlow set out to demonstrate that the origins of love cannot be reduced to learning by association. To do so, he separated baby rhesus monkeys from their mothers and placed them in a room with two "surrogate mothers" constructed out of wire and wood. One surrogate held a bottle of milk that the baby monkey could drink from. The other held no milk but was covered with terry cloth. Harlow found that the baby monkeys consistently preferred the terry cloth mother, feeding only briefly from the bare wire mother before returning to cuddle for hours with the cloth one. He argued on this basis that the need for closeness and affection (what he termed "contact comfort") cannot be explained simply as learning to associate the caregiver with nourishment. Although Harlow's research can hardly be considered definitive on logical grounds, it had a massive impact on the field and remains one of the more well-known studies in psychology. Like Milgram's shock machine, Harlow's baby monkey clinging to the terry cloth mother is etched into public consciousness.

Psychology and Science

Prior to the 1960s, attempts to understand how science works were considered to lie primarily within the domain of the philosophy of science. Since Thomas Kuhn's classic The Structure of Scientific Revolutions, it has become increasingly apparent that in the natural sciences there is a substantial discrepancy between any purely philosophical reconstruction of how science takes place and the reality of scientific practice in the real world. There is thus a general recognition that the practice of science can and should be studied in the same way that science itself studies other areas. This naturalistic approach to understanding the nature of science has been advanced in significant ways by historians, anthropologists [4], sociologists, and philosophers [5]. Psychologists have been conspicuously absent from this field of study, with one important exception: Michael Mahoney (whose work I mentioned earlier in this essay) published a book on the topic in 1976 titled Scientist as Subject: The Psychological Imperative.

There is an emerging consensus in both contemporary philosophy and the field of science studies that the practice of science is best understood as an ongoing conversation among members of a scientific community who attempt to persuade one another of the validity of their positions. Evidence plays an important role in this conversation, but this evidence is always subject to interpretation. The data do not "speak" for themselves; they are viewed through a particular lens to begin with and are woven into narratives that stake out positions. [6][7]

Research methodology in mainstream psychology is shaped, broadly speaking, by a combination of neo-positivist and falsificationist philosophies that were developed prior to the naturalistic turn in science studies. [8] [9] In this essay I have been making the case that some of the more important aspects of research activity in psychology take place behind the scenes, and consequently are not part of the published record. Researchers in psychology mine their data to search for interesting patterns, they experiment with trial and error procedures in order to produce compelling demonstrations of phenomena, and they selectively ignore inconvenient findings in order to make their cases. The idea that formulating hypotheses and collecting data take place in a linear and sequential fashion is an idealized portrayal of what happens in psychology research, quite distinct from what happens on the ground. In practice, data analysis and theory formulation are more intertwined in nature.

I hope it is clear that I am not arguing that the recent evidence regarding the "replication problem" in psychology is a serious blow to the field, nor am I arguing against the "scientific method" in psychology. I am arguing, rather, that it is important for psychologists to adopt a reflexive stance that encourages them to study the way research really takes place, as well as those factors, often operating underground, that contribute to important developments in the field. Whether or not psychology is classified as a "science" has more to do with the politics of credibility claims than with the question of what is most likely to advance knowledge in the field. Unfortunately, some prevailing disciplinary standards actually obstruct the momentum of the field: if we produce literature that reinforces an outmoded, idealized notion of what science is, we prevent it from becoming what it can be. Educating psychology graduate students in post-Kuhnian developments in the philosophy of science, as well as in contemporary naturalistic science studies (both ethnographic and historical), is every bit as important as teaching them conventional research methodology, and I believe that future psychologists will need backgrounds in both if they are going to have a progressive impact on the development of mainstream psychology.

*This essay was initially published in Public Seminar.

[1] Mahoney, M.J. (1976). Scientist as subject: The psychological imperative. Cambridge, MA: Ballinger.

[2] Much of this discussion of the history of the social sciences in relation to the National Science Foundation is based on Gieryn, T.F. (1999). Cultural Boundaries of Science: Credibility on the Line. Chicago: University of Chicago Press.

[3] Perry, G. (2012). Behind the Shock Machine: The Untold Story of the Notorious Milgram Psychology Experiments. New York: Scribe Publications.

[4] Latour, B. (1987). Science in Action: How to Follow Scientists and Engineers Through Society. Cambridge, MA: Harvard University Press.

[5] Bernstein, R.J. (1983). Beyond objectivism and relativism: science, hermeneutics, and praxis. Philadelphia: University of Pennsylvania Press.

[6] Callebaut, W. (1993). Taking the naturalistic turn on how science is done. Chicago: University of Chicago Press.

[7] Shapin, S. (2008). The Scientific Life: A Moral History of a Late Modern Vocation. Chicago: University of Chicago Press.

[8] Godfrey-Smith, P. (2003). Theory and reality: An introduction to the philosophy of science. Chicago, IL: University of Chicago Press.

[9] Laudan, L. (1996). Beyond Positivism and Relativism: Theory, Method, and Evidence. Boulder, CO: Westview.

Jeremy D. Safran Ph.D.

Jeremy D. Safran, Ph.D. , was a professor of psychology at the New School for Social Research in New York, a clinical psychologist, psychoanalyst, psychotherapy researcher and author.



Shots - Health News


In Psychology And Other Social Sciences, Many Studies Fail The Reproducibility Test

Richard Harris


A researcher showed people a picture of The Thinker in an effort to study the link between analytical thinking and religious disbelief. In hindsight, the researcher called his study design "silly." The study could not be reproduced. Peter Barritt/Getty Images

The world of social science got a rude awakening a few years ago, when researchers concluded that many studies in this area appeared to be deeply flawed. Two-thirds could not be replicated in other labs.

Some of those same researchers now report those problems still frequently crop up, even in the most prestigious scientific journals.

But their study, published Monday in Nature Human Behaviour, also finds that social scientists can actually sniff out the dubious results with remarkable skill.

First, the findings. Brian Nosek, a psychology researcher at the University of Virginia and the executive director of the Center for Open Science, decided to focus on social science studies published in the most prominent journals, Science and Nature.

"Some people have hypothesized that, because they're the most prominent outlets they'd have the highest rigor," Nosek says. "Others have hypothesized that the most prestigious outlets are also the ones that are most likely to select for very 'sexy' findings, and so may be actually less reproducible."

To find out, he worked with scientists around the world to see if they could reproduce the results of key experiments from 21 studies in Science and Nature, typically psychology experiments involving students as subjects. The new studies on average recruited five times as many volunteers, in order to come up with results that were less likely due to chance.
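The logic behind recruiting larger samples can be illustrated with a small simulation; the effect size and group sizes below are hypothetical stand-ins chosen for illustration, not figures from the replication project.

```python
# Rough sketch: with a modest true effect, a sample five times larger detects
# it far more reliably, so a significant result is less likely to be a fluke.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3  # assumed standardized mean difference (illustrative)

def detection_rate(n_per_group, n_sims=2000, alpha=0.05):
    """Fraction of simulated studies that detect the assumed effect."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(true_effect, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(f"Detection rate, 30 per group:  {detection_rate(30):.0%}")   # roughly 20%
print(f"Detection rate, 150 per group: {detection_rate(150):.0%}")  # roughly 75%
```

The same arithmetic cuts the other way as well: when a much larger replication fails to find the effect, chance becomes a far less plausible explanation for the discrepancy.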


The results were better than the average of a previous review of the psychology literature, but still far from perfect. Of the 21 studies, the experimenters were able to reproduce 13. And the effects they saw were on average only about half as strong as had been trumpeted in the original studies.

The remaining eight were not reproduced.

"A substantial portion of the literature is reproducible," Nosek concludes. "We are getting evidence that someone can independently replicate [these findings]. And there is a surprising number [of studies] that fail to replicate."

One of the eight studies that failed this test came from the lab of Will Gervais, when he was getting his PhD at the University of British Columbia. He and a colleague had run a series of experiments to see whether people who are more analytical are less likely to hold religious beliefs. In one test, undergraduates looked at pictures of statues.

"Half of our participants looked at a picture of the sculpture, 'The Thinker,' where here's this guy engaged in deep reflective thought," Gervais says. "And in our control condition, they'd look at the famous stature of a guy throwing a discus."

People who saw The Thinker, a sculpture by Auguste Rodin, expressed more religious disbelief, Gervais reported in Science. And given all the evidence from his lab and others, he says there's still reasonable evidence that the underlying conclusion is true. But he recognizes the sculpture experiment was really quite weak.

"Our study, in hindsight, was outright silly," says Gervais, who is now an assistant professor at the University of Kentucky.

A previous study also failed to replicate his experimental findings, so the new analysis is hardly a surprise.

But what interests him the most in the new reproducibility study is that scientists had predicted that his study, along with the seven others that failed to replicate, was unlikely to stand up to the challenge.

As part of the reproducibility study, about 200 social scientists were surveyed and asked to predict which results would stand up to the re-test and which would not. They also took part in a "prediction market," where they could buy or sell tokens that represented their views.

"They're taking bets with each other, against us," says Anna Dreber , an economics professor at the Stockholm School of Economics, and coauthor of the new study.

It turns out, "these researchers were very good at predicting which studies would replicate," she says. "I think that's great news for science."

These forecasts could help accelerate the process of science. If you can get panels of experts to weigh in on exciting new results, the field might be able to spend less time chasing errant results known as false positives.


"A false positive result can make other researchers, and the original researcher, spend lots of time and energy and money on results that turn out not to hold," she says. "And that's kind of wasteful for resources and inefficient, so the sooner we find out that a result doesn't hold, the better."

But if social scientists were really good at identifying flawed studies, why did the editors and peer reviewers at Science and Nature let these eight questionable studies through their review process?

"The likelihood that a finding will replicate or not is one part of what a reviewer would consider," says Nosek. "But other things might influence the decision to publish. It may be that this finding isn't likely to be true, but if it is true, it is super important, so we do want to publish it because we want to get it into the conversation."

Nosek recognizes that, even though the new studies were more rigorous than the ones they attempted to replicate, that doesn't guarantee that the old studies are wrong and the new studies are right. No single scientific study gives a definitive answer.

Forecasting could be a powerful tool in accelerating that quest for the truth.

That may not work, however, in one area where the stakes are very high: medical research, where answers can have life-or-death consequences.

Jonathan Kimmelman at McGill University, who was not involved in the new study, says when he's asked medical researchers to make predictions about studies, the forecasts have generally flopped.

"That's probably not a skill that's widespread in medicine," he says. It's possible that the social scientists selected to make the forecasts in the latest study have deep skills in analyzing data and statistics, and their knowledge of the psychological subject matter is less important.

And forecasting is just one tool that could be used to improve the rigor of social science.

"The social-behavioral sciences are in the midst of a reformation," says Nosek. Scientists are increasingly taking steps to increase transparency, so that potential problems surface quickly. Scientists are increasingly announcing in advance the hypothesis they are testing; they are making their data and computer code available so their peers can evaluate and check their results.

Perhaps most important, some scientists are coming to realize that they are better off doing fewer studies, but with more experimental subjects, to reduce the possibility of a chance finding.

"The way to get ahead and get a job and get tenure is to publish lots and lots of papers," says Gervais. "And it's hard to do that if you are able run fewer studies, but in the end I think that's the way to go — to slow down our science and be more rigorous up front."

Gervais says when he started his first faculty job, at the University of Kentucky, he sat down with his department chair and said he was going to follow this path of publishing fewer, but higher quality studies. He says he got the nod to do that. He sees it as part of a broader cultural change in social science that's aiming to make the field more robust.

You can reach Richard Harris at [email protected] .
