The journal Science just published a blockbuster report from the Reproducibility Project led by University of Virginia psychology professor Brian Nosek. The group examined 100 prominent psychology research papers and made an exhaustive effort to independently reproduce their findings. The results were eye-opening, not just for the field of psychology, but also in light of the recent fad of importing psychology and “cognitive science” into politics.
What Nosek’s team found was that almost two-thirds of the results they tested didn’t quite hold up. In a few cases, the reproduced experiments contradicted the original, showing either no effect or an effect in the opposite direction. More commonly, the reproduced results were simply smaller than those claimed in the original study, often so small as to not be statistically significant, that is, not clearly distinguishable from random noise in the data.
Here is the kind of study that didn’t replicate:
Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.
Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.
A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.
Here’s a handy graph of the results.
The slanted line shows where a study would fall if the reproduced version exactly matched the original. As you can see, the bulk of the results fall well below that line, showing effects about half as strong. More important is the difference between the blue dots and the red dots. The red dots are studies where the reproduced results ended up being not statistically significant. That doesn’t automatically mean the original result was wrong. It means that we now have two attempts to measure the effect in question and two different results, which indicates that more work is needed to sort it all out.
The big story here is that the Reproducibility Project was needed in the first place. As Nosek says at the beginning of his report, “reproducibility is a defining feature of science.” If your results are valid and useful, then I should be able to perform the same experiment and get the same results. This is necessary, not just to catch fraudulent claims, but to filter out subtle biases or errors that might not have been noticed by the original researchers.
Yet the incentives in academia heavily discourage any systematic attempt to reproduce and double-check other people’s findings. When psychologists and social scientists design an experiment, it’s not just the test subjects who are responding to stimuli. The researchers themselves also respond to incentives. They are rats in a maze, chasing after a piece of cheese called “tenure.” The way to advance your career is to publish your own original studies, especially ones with exciting results that produce splashy headlines. But there’s no real incentive to painstakingly reproduce other people’s results to check if they’re accurate. (And often this is very difficult. One of the goals of the Reproducibility Project’s parent organization, the Center for Open Science, is to encourage researchers to provide more information about how they conducted their studies so that it is easier for others to reproduce them.)
So the system encourages a lot of overstated results that never get checked and end up being accepted as established truth. And then some blowhard who tells you that he [expletive deleted] loves science repeats it to you as gospel. You get the idea.
I talked to Professor Nosek at the Center for Open Science offices in Charlottesville, and he emphasized that his goal is not to discount the results of psychological studies. His immediate goal was to discover the size and scope of the reproducibility problem, but the project’s next step is to figure out how to reduce that problem. (The main solution seems to be the need for a larger number of test subjects, which is not going to be popular because it’s more difficult and expensive.) Next, the Reproducibility Project is moving on to a review of biomedical research, where industry labs have been complaining that they can’t reproduce the results of academic research. This means that a lot of time and money is being wasted chasing after research that doesn’t pan out, rather than being used to turn research into therapies. So Nosek’s project could end up redefining and improving the standards for scientific research across the board.
One of the problems with the current system is what happens when overstated results make their way into the media and politics. In these arenas, conclusions are used differently than they are in the sciences. A new idea in the hard sciences will eventually be used to build something or do something. It will either work or it won’t, and the difference matters. A new idea in the humanities or in politics, by contrast, is used to tell people a narrative about how they should feel about themselves, and it doesn’t really matter if it’s true. And that’s what tends to happen to ideas from psychology and cognitive science as presented in the media.
Nosek struck me as an honest, conscientious scientist, and he carefully avoided treading into the arena of politics and public policy. But I can spot one big implication of his study. In the past five to ten years, there has been a “cognitive science” boomlet of academics telling us they have identified the “cognitive biases” that cause people to make bad decisions, by which they usually mean: decisions other than the ones we would like them to make. And they have assured us that regulators can use psychological science to “nudge” people into making the right decisions. The term “nudge” is from a book of the same name co-authored by Cass Sunstein, who argued for a kind of benevolent regulatory paternalism run by 27-year-old cognitive science graduates.
But this is a “physician, heal thyself” moment for the cognitive scientists. As Nosek concludes, these findings call for a little humility on the part of psychologists, a recognition of how much they still don’t know and can’t yet demonstrate. And I would add that the same applies to politicians and regulators, too.