A landmark 2015 report that cast doubt on the results of dozens of published psychology studies has exposed deep divisions within the field, serving as a reality check for many working researchers but as an affront to others who continue to insist the original research was sound.
On Thursday, a group of four researchers publicly challenged the report, arguing that it was statistically flawed and, as a result, wrong.
The 2015 report, called the Reproducibility Project, found that fewer than 40 studies in a sample of 100 psychology papers in leading journals held up when retested by an independent team. The new critique by the four researchers countered that when that team’s statistical methodology was adjusted, the rate was closer to 100 percent.
Neither the original analysis nor the critique found evidence of fraud or manipulation of data.
“That study got so much press, and the wrong conclusions were drawn from it,” said Timothy D. Wilson, a professor of psychology at the University of Virginia and a co-author of the new critique. “It’s a mistake to make generalizations from something that was done poorly, and this we think was done poorly.”
Brian A. Nosek, a colleague of Dr. Wilson’s at Virginia who coordinated the original, yearslong replication project, countered that the critique was highly biased: “They are making assumptions based on selectively interpreting data and ignoring data that’s antagonistic to their point of view.”
The challenge comes as the field of psychology is undergoing a generational change, with young researchers beginning to share their data and study designs before publication, to improve transparency. Still, the new critique is likely to feed an already lively debate about how best to conduct and evaluate so-called replication projects of prior studies. Such projects are underway in several fields, scientists on both sides of the debate said.
These are issues that experts have been debating since well before the original replication study appeared last August. “On some level I suppose it is appealing to think everything is fine and there is no reason to change the status quo,” said Sanjay Srivastava, a psychologist at the University of Oregon, who was not a member of either team. “But we know too much, from many other sources, to put too much credence in an analysis that supports that remarkable conclusion.”
One issue the critique raised was how faithfully the replication team had adhered to the original design of the 100 studies it retested. Small alterations in design can make the difference between whether a study replicates or not, scientists say. To address this, Dr. Nosek and his many collaborators consulted closely with the authors of the studies they were trying to reproduce. Afterward, independent researchers — that is, neither from the original study team nor the replication one — evaluated how closely the study designs matched.
But Dr. Wilson and co-authors of the critique — Daniel T. Gilbert, Gary King, and Stephen Pettigrew, all of Harvard — pointed out that authors of 31 of the original studies had not explicitly endorsed the design of the retest. They noted that, for example, one study on race initially run at Stanford was replicated in Amsterdam, a different cultural context.
The critique found that the explicitly endorsed studies were nearly four times more likely to replicate than the nonendorsed ones.
Dr. Nosek said he planned to rerun the replications of 11 studies whose authors had raised concerns, to determine whether design differences accounted for the differing results.
Another issue that the critique raised had to do with statistical methods. When Dr. Nosek began his study, there was no agreed-upon protocol for crunching the numbers. He and his team settled on five measures, including the strength of the effect and the result of combining the original and replication data, to evaluate the results together.
The co-authors of the critique argued that it would have been better to focus on one measure: How many of the retests would be expected to fail by chance, given the variations, like design differences, introduced by mounting the retests?
Uri Simonsohn, a researcher at the Wharton School of the University of Pennsylvania, has blogged about these issues, including this dispute. He said that both the original replication paper and the critique use statistical approaches that are “predictably imperfect” for this kind of analysis.
One way to think about the dispute, Dr. Simonsohn said, is that the original paper found that the glass was about 40 percent full, and the critique argues that it could be 100 percent full. In fact, he said in an email, “State-of-the-art techniques designed to evaluate replications say it is 40 percent full, 30 percent empty, and the remaining 30 percent could be full or empty, we can’t tell till we get more data.”