Famous cognitive psychology experiments that failed to replicate (buttondown.com)
118 points by PaulHoule 4 hours ago | 73 comments
jbentley1 2 hours ago [-]
This is a great list for people who want to smugly say "Um, actually" a lot in conversation.

Based on my brief stint doing data work in psychology research, amongst many other problems they are AWFUL at stats. And it isn't a skill issue as much as a cultural one. They teach it wrong and have a "well, everybody else does it" attitude towards p-hacking and other statistical malpractice.
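
A minimal sketch in Python of the multiple-comparisons problem behind p-hacking, with purely synthetic noise data (all numbers illustrative): run enough tests on nothing and something will look "significant".

    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(0)
    # 20 "measures" of pure noise across two groups that are truly
    # identical. At alpha = 0.05, about one test in twenty comes out
    # "significant" by chance -- report only that one and you have p-hacked.
    n_tests, n_per_group = 20, 30
    hits = 0
    for i in range(n_tests):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
            print(f"measure {i}: 'publishable'")
    print(f"{hits} spurious hits out of {n_tests} tests on pure noise")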

wduquette 2 hours ago [-]
"they are AWFUL at stats."

SF author Michael Flynn was a process control engineer as his day job; he wrote about how designing statistically valid experiments is incredibly difficult, and the potential for fooling yourself is high, even when you really do know what you are doing and you have nearly perfect control over the measurement setup.

And on top of that, you're trying to measure the behavior of people, not widgets; and people change their behavior based on the context and what they think you're measuring.

There was a lab set up to do "experimental economics" at Caltech back in the late 80's/early 90's. Trouble is, people make different economic decisions when they are working with play money rather than real money.

dgfitz 20 minutes ago [-]
> Trouble is, people make different economic decisions when they are working with play money rather than real money.

Understated even. Ever play poker with just chips and no money behind them? Nobody cares, there is no value to the plastic coins.

sputr 2 hours ago [-]
As someone who's part of a startup (hrpotentials.com) trying to bring truly scientifically valid psychological testing into HR processes .... yeah. We've been at it for almost 7 years, and we're finally at a point where we can say we have something that actually makes scientific sense - and we're not inventing anything new, just commercializing the science! It only took an electrical engineer (not me) with a strong grasp of statistics working for years with a competent professor of psychology to separate the wheat from the chaff. There's some good science there, it's just ... not used much.
obviouslynotme 1 hour ago [-]
How are you going to get around Griggs v. Duke Power Co.? AFAIK, personality tests have not (yet) been given the regulatory eye, but testing cognitive ability has.
PaulHoule 2 hours ago [-]
Yeah, this is an era which is notorious for pseudoscience.
odyssey7 2 hours ago [-]
There’s surely irony here
Waterluvian 2 hours ago [-]
Um, actually I’d say it is the responsibility of all scientists, both professional and amateur, to point out falsehoods when they’re uttered, and not an act of smugness.
rolph 2 hours ago [-]
[um] has contexts, but is usually a cue that something unexpected, off the average, is about to be said.

[actually] is a neutral declaration that some cognitive structure was presented but is at odds with physically observable fact, which will now be laid out for you.

delichon 4 hours ago [-]
Approximate replication rates in psychology:

  social      37%
  cognitive   42%
  personality 55%
  clinical    44%
So a list of famous psychology experiments that do replicate may be shorter.

https://www.nature.com/articles/nature.2015.18248

t_mann 10 minutes ago [-]
Thanks for providing the reference, that's useful context. Those are awful replication rates, worse than a coin flip. Sounds like the OP can add their own introduction to their list. From the introduction:

> Most results in the field do actually replicate and are robust[citation needed]

NewJazz 2 hours ago [-]
I think one would wish the famous ones to be more often replicable.
tomjakubowski 2 hours ago [-]
Nonreplicable publications are cited more than replicable ones (2021)

> We use publicly available data to show that published papers in top psychology, economics, and general interest journals that fail to replicate are cited more than those that replicate. This difference in citation does not change after the publication of the failure to replicate. Only 12% of postreplication citations of nonreplicable findings acknowledge the replication failure.

https://www.science.org/doi/10.1126/sciadv.abd1705

Press release: https://rady.ucsd.edu/why/news/2021/05-21-a-new-replication-...

dlcarrier 24 minutes ago [-]
Isn't the unexpected more famous than the expected?
sunscream89 2 hours ago [-]
There may be minute details, like having a confident frame of reference for the confidence tests. Cultures, even psychologies, might swing certain ideas and their compulsions.
glial 3 hours ago [-]
The incentive of all psychology researchers is to do new work rather than replications. Because of this, publicly-funded psychology PhDs should be required to perform study replication as part of their training. Protocol + results should be put in a database.
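
A minimal sketch of what one entry in such a database might look like (Python; the field names and values are invented for illustration, not an existing standard):

    from dataclasses import dataclass
    # Hypothetical registry entry: enough to find the original study,
    # the preregistered protocol, and the replication outcome.
    @dataclass
    class ReplicationRecord:
        original_doi: str     # paper being replicated
        protocol_url: str     # preregistered protocol, e.g. an OSF page
        sample_size: int
        effect_size: float    # e.g. Cohen's d found in the replication
        replicated: bool      # met the preregistered criterion?
    record = ReplicationRecord(
        original_doi="10.1000/example",         # placeholder DOI
        protocol_url="https://osf.io/example",  # hypothetical URL
        sample_size=240,
        effect_size=0.05,
        replicated=False,
    )
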
analog31 25 minutes ago [-]
Sure, dump it on the lowest level employee, who has the least training and the most to lose. Punish them for someone else's bad research. Grad school already takes too long, pays too little, and involves too much risk of not finishing. And it doesn't solve the problem of people having to generate copious quantities of research in order to sustain their careers.

Disclosure: Physics PhD.

gwd 2 hours ago [-]
How interesting would it be if every PhD thesis had to have a "replication" section, where they tried to replicate some famous paper's results.
epolanski 2 hours ago [-]
> Claimed result: Women risk being judged by the negative stereotype that women have weaker math ability, and this apprehension disrupts their math performance on difficult tests.

I'll never understand stances trying to hide biological differences between different sexes or ethnic backgrounds.

We know for a fact that sex or ethnicity impacts the body, yet we seem unable to cope with the idea that there are also differences in how brains (and hormones) work.

Women have, on average, a higher emotional intelligence which is e.g. tied to higher linguistic proficiency. That helps in many different fields and, on average, women tend to learn languages easier than men.

At the same time, on average, they may perform slightly worse than men in highly computational fields (math or chess).

I want to reiterate what I'm getting at before the rest of the post:

Genetics matter when you look at very large samples, but they are irrelevant on smaller (or single) samples.

I feel the NBA provides a great example.

On average, African Americans are taller than white men and have a higher muscular density.

On large samples, they tend to outperform white men. But as soon as you make the samples smaller, even at elite levels, you find out that Larry Bird (30+ years ago) or Nikola Jokic (today) are the best players in the world.

The same applies to women: just because large samples explain some statistics, such as women on average performing worse at math, doesn't mean that women can't be the best chess players or cryptographers in the world.
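
A quick illustration of that statistical point with synthetic numbers (the distributions below are made up; only the logic matters): a small difference in group means leaves the distributions almost entirely overlapping, so it says little about any individual.

    import numpy as np
    rng = np.random.default_rng(1)
    # Two synthetic populations whose means differ by 0.2 standard
    # deviations -- a small effect size, as in many group comparisons.
    group_a = rng.normal(100.0, 15.0, 1_000_000)
    group_b = rng.normal(103.0, 15.0, 1_000_000)
    # The aggregate difference is unmistakable...
    print(group_b.mean() - group_a.mean())  # ~3.0
    # ...yet a random member of the "lower" group still outscores a
    # random member of the "higher" group about 44% of the time.
    wins = rng.choice(group_a, 100_000) > rng.choice(group_b, 100_000)
    print(wins.mean())  # ~0.44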

viewtransform 51 minutes ago [-]
> On average, African Americans are taller than white men and have a higher muscular density.

Are you comparing direct descendants of Yoruba versus descendants of Celts in America ? or mixed descendants of Bantu and Cherokee versus mixed descendants of Anglo-Saxons and Slavs ? In your study would Barack Obama be a person of color or a person of pallor ?

Or is this data you have gathered observing people at Costco? Just checking on your scientific methodology.

fny 38 minutes ago [-]
Differences are hidden because (1) differences, even small ones, are used to justify discrimination, (2) some feel the need to correct for stereotypes, and (3) these differences often don't really exist or amount to a small effect size.[0]

In the end, we're talking about distributions of people, and staring at these differences mischaracterizes all but those at the mean.

All that matters is who can pass the test.

[0]: I also encourage you to ask ChatGPT/Grok/Claude "men vs women math performance studies." You'll be shocked to find most studies point to no or small differences.

[1]: Malcolm Gladwell wrote a great piece about his experience as a runner that seems appropriate to share https://www.newyorker.com/magazine/1997/05/19/the-sports-tab...

runarberg 8 minutes ago [-]
Quite often those differences exist because of systemic or cultural bias that affects the test design. Tests are often validated based off of other tests that showed a difference, but those tests often had a severe sampling bias that showed a group difference where none existed. It then became an established theory that if you design a test that measures e.g. “emotional intelligence” (whatever that means) and it didn't show a group difference, it was invalid and had to be adjusted until it did.
myhf 2 hours ago [-]
Circular reasoning can be used to "prove" anything, so it's not helpful as a basis for policy making.
runarberg 14 minutes ago [-]
> Women have, on average, a higher emotional intelligence which is e.g. tied to higher linguistic proficiency. That helps in many different fields and, on average, women tend to learn languages easier than men.

Has this been experimentally shown to be the case with studies that don't fail to replicate?

Between studies that fail to replicate and pure conjecture and pseudo-science, I certainly favor the former; at least actual studies that fail to replicate can be disproven. Your conjectures are just race/sex science and nothing but pseudo-science. I can either take you at your word or choose not to believe you. I pick the latter.

us-merul 1 hour ago [-]
> We know for a fact that sex or ethnicity impacts body yet we seem unable to cope with the idea that there are also differences in how brains work.

Here is your error. You’re assuming that a physical difference in morphology is linked to behavioral or neural correlates. That’s not the case, since observed statistical- or group-level differences need not be driven by biology. You’re assuming biological determinism, and the evidence for direct genetic effects on behavior isn’t there.

Aurornis 1 hour ago [-]
> and the evidence for direct genetic effects on behavior isn’t there.

Yes it is. There's an entire field for studying this called Behavioral Genetics.

The easiest evidence comes from comparing monozygotic and dizygotic twins (identical vs. fraternal twins). The variance in behavior is higher among the dizygotic twins, who have different genomes.
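
For reference, the textbook estimate from such comparisons is Falconer's formula, which doubles the gap between identical-twin and fraternal-twin correlations; a sketch with made-up correlation values:

    # Falconer's formula: h^2 = 2 * (r_MZ - r_DZ), where r_MZ and r_DZ
    # are within-pair trait correlations for identical (monozygotic)
    # and fraternal (dizygotic) twins. Values below are illustrative.
    r_mz, r_dz = 0.70, 0.45
    h2 = 2 * (r_mz - r_dz)  # variance share attributed to genes: 0.50
    c2 = r_mz - h2          # shared-environment share: 0.20
    e2 = 1.0 - r_mz         # nonshared environment + error: 0.30
    print(f"h^2 = {h2:.2f}, c^2 = {c2:.2f}, e^2 = {e2:.2f}")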

epolanski 1 hour ago [-]
It's not an error unless you're able to demonstrate the opposite.

I have yet to see studies that demonstrate that different sexes, hormones or even ethnicities do not impact cognitive abilities or higher proficiency in different fields.

Whereas I've seen plenty that show that women, on average, demonstrate higher cognitive abilities linked to verbal proficiency, text comprehension or executive tasks. Women also tend to have better memory than men.

The fact is that there are genetic differences in how our brains work. And let's not ignore the huge importance of hormones, extremely potent regulators of how we function.

To ignore that we have differences that, in the aggregate, help explain statistics is asinine.

us-merul 1 hour ago [-]
And how are you able to rule out that societal or environmental effects are the primary driver? How is your argument not circular, that observed differences are therefore the result of biology?
epolanski 1 hour ago [-]
I've never stated that biology is the primary driver.

I merely stated that biology should not be ignored when judging very large samples.

There are cross-sex cognitive tests at which women and men tend to perform differently, such as spatial awareness or speed perception, and many others.

What's the environmental or cultural factor behind the fact that a female's brain, on average, is able to judge speed much more accurately than a male's?

us-merul 1 hour ago [-]
I see you edited your response after my reply. I’m not denying that you’ve read about those observed differences. I’m trying to say that those differences don’t need to be driven by biology, and evidence suggests otherwise. Behavior can’t be reduced to genetics, and the mechanistic link isn’t there. You are claiming that morphological differences explain the variation. Besides, by your reasoning, you could look at the NBA before Bill Russell and make very different claims.
3cKU 1 hour ago [-]
> And how are you able to rule out

It is not possible to rule out unfalsifiable hypotheses.

aeve890 3 hours ago [-]
>Source: Stern, Gerlach, & Penke (2020)

Wow, what are the odds?

https://en.wikipedia.org/wiki/Stern%E2%80%93Gerlach_experime...

dlcarrier 19 minutes ago [-]
I thought you were pointing out some bias by comparing the research to previous research from the same authors. It took me far too long to realize that the experiment was from 100 years ago, and you were pointing out that the names were coincidentally the same.
NooneAtAll3 2 hours ago [-]
I'm still amazed that Wikipedia doesn't have a redirect away from its mobile site
dang 2 hours ago [-]
(It's on my list to rewrite those URLs in HN comments at least)
dlcarrier 25 minutes ago [-]
Disturbing fact: The Stanford prison experiment, run by Philip Zimbardo, wasn't reproducible, but that didn't stop Zimbardo from using it to promote his ideologies about the impossibility of rehabilitating criminals, or from becoming the president of the American Psychological Association.

The APA has a really good style guide, but I don't trust them for actual psychology.

Terr_ 3 hours ago [-]
> Source: Hagger et (63!) al. 2016

I can't help chuckling at the idea that over 1.98 * 10^87 people were involved in the paper.
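
For the record, the arithmetic of the joke checks out:

    import math
    print(f"{math.factorial(63):.3e}")  # 1.983e+87 "co-authors"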

dlcarrier 16 minutes ago [-]
If you were to meet a "normal" person, would you interpret that as meaning "perpendicular" or as meaning "the kind of person that doesn't look at everything like it's a mathematical expression"?
Aurornis 1 hour ago [-]
> Claimed result: Adopting expansive body postures for 2 minutes (like standing with hands on hips or arms raised) increases testosterone, decreases cortisol, and makes people feel more powerful and take more risks.

A heuristic I use that is unreasonably good at identifying grifters and charlatans: Unnecessarily invoking cortisol or other hormones when discussing behavioral topics. Influencers, podcasters, and pseudoscience practitioners love to invoke cortisol, testosterone, inflammation, and other generic concepts to make their ideas sound more scientific. Instead of saying "stress levels" they say "cortisol". They also try to suggest that cortisol is bad and you always want it lower, which isn't true.

Dopamine is another favorite of the grifters. Whenever someone starts talking about raising dopamine or doing something to increase dopamine, they're almost always being misleading or just outright lying. Health and fitness podcasters are the worst at this right now.

quickthrowman 36 minutes ago [-]
> Dopamine is another favorite of the grifters. Whenever someone starts talking about raising dopamine or doing something to increase dopamine, they're almost always being misleading or just outright lying. Health and fitness podcasters are the worst at this right now.

Yeah it’s ridiculous. You know what raises dopamine very effectively? Cocaine, gambling, heroin, meth, etc. If they really believed their own advice they’d all be doing meth or cocaine all day every day. If you look at what happens to regular meth users, it doesn’t seem like raising dopamine all the time is a good idea.

fsckboy 3 hours ago [-]
famous cognitive psychology experiments that do replicate: IQ tests

http://www.psychpage.com/learning/library/intell/mainstream....

in fact, the foundational statistical models considered the gold standard for statistics today were developed for this testing.

dlcarrier 59 seconds ago [-]
I took an IQ test as a high school student, and one of the subtests involved placing a stack of shuffled pictures in chronological order. I had one series in the incorrect order, because I had no understanding of the typical behavior of snowfall. The test proctor said almost everyone she tested mixed that one up, because it doesn't snow in the area where I live.

I have no doubt that IQ tests reproducibly measure the test taker's ability to pass tests, as well as to perform in a society that the tests are based on.

I think it's disingenuous to attribute IQ to intelligence as a whole though, and it is better understood as an indicator of cultural intelligence.

I would expect that, for cultures whose members score below average on IQ tests from the US, an equivalent IQ test created within that culture would show average members of that culture scoring higher than average members of US culture.

alphazard 2 hours ago [-]
> in fact, the foundational statistical models considered the gold standard for statistics today were developed for this testing.

The normal distribution predates the general factor model of IQ by hundreds of years.[0]

You can try other distributions yourself, it's going to be hard to find one that better fits the existing IQ data than the normal (bell curve) distribution.

[0] https://en.wikipedia.org/wiki/Normal_distribution#History
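
A sketch of the suggested comparison on synthetic bell-shaped "IQ" scores (the normal wins here by construction, since the data is generated normal; real test data would be needed for the comparison to mean anything):

    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(2)
    scores = rng.normal(100, 15, 10_000)  # stand-in for real IQ data
    # Fit normal and Laplace by maximum likelihood, compare log-likelihoods.
    mu, sigma = stats.norm.fit(scores)
    loc, scale = stats.laplace.fit(scores)
    print(stats.norm.logpdf(scores, mu, sigma).sum())     # higher = better fit
    print(stats.laplace.logpdf(scores, loc, scale).sum())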

fsckboy 2 hours ago [-]
Darwin's cousin, Francis Galton, after whom the log-normal distribution is often called the Galton distribution, was among the first to investigate psychometrics.

Not realizing he was hundreds of years late to the game, he still went ahead and coined the term "median".

more tidbits here https://en.wikipedia.org/wiki/Francis_Galton#Statistical_inn...

astrange 25 minutes ago [-]
Survivorship bias. You can easily make someone's IQ test not replicate. (Hit them on the head really hard.)
gwd 2 hours ago [-]
> Smile to Feel Better Effect

> Claimed result: Holding a pen in your teeth (forcing a smile-like expression) makes you rate cartoons as funnier compared to holding a pen with your lips (preventing smiling). More broadly, facial expressions can influence emotional experiences: "fake it till you make it."

I read this about a decade ago, and started grimacing maniacally, like I had a pencil in my teeth, when going into a situation where I wanted to have a natural smile. The thing is, it's just so silly, it always makes me laugh at myself, at which point I have a genuine smile. I always doubted whether the claimed connection was real, but it's been a useful tool anyway.

sunscream89 2 hours ago [-]
Yeah, the marshmallow one taught me to have patience and look for the long returns on investments of personal effort.

I think there may be something to a few of these, and more consideration may be needed of how they are conducted.

Let’s leave open our credulities for the inquest of time.

tryauuum 28 minutes ago [-]

    claimed result: Women are more attracted to hot guys during high-fertility days of their cycles

wait why not? I hoped I was attractive at least some days of the month :(
sunrunner 1 hour ago [-]
No mention of the Stanford Prison Experiment I notice.
systemstops 2 hours ago [-]
Is anyone tracking how much damage to society bad social science has done? I imagine it's quite a bit.
roadside_picnic 43 minutes ago [-]
The most obvious one is the breakdown of trust in scientific research. A frequent discussion I would have with another statistics friend of mine was that the anti-vax crowd really isn't as far off base as they are popularly portrayed, and if anything the "trust the science!" rhetoric is more clearly incorrect.

Science should never be taught as dogmatic, but the reproducibility crisis has ultimately fostered a culture where one should not question "established" results (in his famous book, Kahneman proclaimed that one "must" accept the unbelievable priming results), especially if one is interested in a long academic career.

The trouble is that some trust is necessary in communicating scientific observations and hypotheses to the general public. It's easy to blame the public's failure to unify around Covid on cultural divides, but the truth is that skepticism around high-stakes, hastily done science is well warranted. The trouble is that even when you can step through the research and see the conclusions are sound, the skepticism remains.

However, as someone that has spent a long career using data to understand the world, I suspect the harm directly caused by the wrong conclusions being reached is more minimal than one would think. This is largely because, despite lip service to "data driven decision making", science and statistics very rarely are the prime driver of any policy decision.

BeetleB 32 minutes ago [-]
I imagine it's comparable to the damage done when policies are set that are not based on studies.

Let's be candid: Most policies have no backing in science whatsoever. The fact that some were backed by poor science is not an indictment of much.

feoren 2 hours ago [-]
We rack up quite a lot of awfulness with eugenics, phrenology, the "science" that influenced Stalin's disastrous agriculture policies in the early USSR, overpopulation scares leading to China's one-child policy, etc. Although one could argue these were back-justifications for the awfulness that people wanted to do anyway.
systemstops 2 hours ago [-]
Those things were not done by awful people though - they all thought they were serving the public good. We only judge it as awful now because of the results. Nearly all of these ideas (Lysenkoism I think was always fringe) were embraced by the educated elites of the time.
feoren 1 hour ago [-]
Lysenkoism! That's the one. Thank you for reminding me of the name (and for knowing what I was grasping at).

I think some "bad people" used eugenics and phrenology to justify prior hate, but they were also effective tools at convincing otherwise "good people" to join them.

izabera 2 hours ago [-]
i'm struggling to imagine many negative effects on society caused by the specific papers in this list
systemstops 2 hours ago [-]
Public policies were made (or justified) based on some of this research. People used this "settled science" to make consequential decisions.

Stereotype threat for example was widely used to explain test score gaps as purely environmental, which contributed to the public seeing gaps as a moral emergency that needed to be fixed, leading to affirmative action policies.

bogtog 2 hours ago [-]
Little of this is considered cognitive psychology. The vast majority would be viewed as "social psychology"

Setting that aside, among any scientific field I'm aware of, psychology has taken the replication crisis most seriously. Rigor across all areas of psychology is steadily increasing: https://journals.sagepub.com/doi/full/10.1177/25152459251323...

blindriver 2 hours ago [-]
Papers should not be accepted until an independent lab has replicated the results. It's pretty simple, but people are incentivized not to care whether it's replicable because they need the paper published to advance their careers.
picardo 2 hours ago [-]
Well, at least the growth mindset study is not fully debunked yet. It's basically a modern interpretation of what we've known to be true about self-fulfilling prophecies. If you tell children they can be smart and competent if they work hard, then they will work hard and become smart and competent. This should be a given.
lutusp 26 minutes ago [-]
A key factor behind psychology's low replication rate is the absence of theories that define the field. In most science fields, an initial finding can be compared to theory before publication, which may weed out unlikely results in advance. But psychology doesn't have this option -- no theories, so no litmus test.

It's important to say that a psychology study can be scientific in one sense -- say, rigorous and disciplined -- but at the same time be unscientific, in the sense that it doesn't test a falsifiable, defining psychological theory, because there aren't any of those.

Or, to put it more simply, scientific fields require falsifiable theories about some aspect of nature, and the mind is not part of nature.

Future neuroscience might fix this, but don't hold your breath for that outcome. I suspect we'll have AGI in artificial brains before we have testable, falsifiable neuroscience theories about our natural brains.

ausbah 3 hours ago [-]
i wonder what the replication rate is for ML papers
PaulHoule 2 hours ago [-]
From working in industry and rubbing shoulders with CS people who prioritize writing papers over writing working software I’m sure that in a high fraction of papers people didn’t implement the algorithm they thought they implemented.
avdelazeri 2 hours ago [-]
Don't get me started, I have seen repos that I'm fairly sure never ran in their presented form. A guy in our lab thinks authors purposefully mess up their code when publishing on GitHub to make it harder to replicate. I'm starting to come around on his theory.
WesolyKubeczek 2 hours ago [-]
> Claimed result: Listening to Mozart temporarily makes you smarter.

This belongs in a dungeon crawl game. You find an artifact that plays music to you. Depending on the music played (depends on the artifact's enchantment and blessed status), it can buff or debuff your intelligence by several points temporarily.

insane_dreamer 1 hour ago [-]
If the "failed replication" was a single study, as in many cases listed here, there is still an open question as to whether 1) the replication study was underpowered (the ones I looked at had pretty small n's), or 2) the re-implementation of the original study was flawed. So I'm not so sure we can quickly label the original studies as "debunked", any more than we can express a high level of confidence in the original studies.

(This isn't a comment on any of the individual studies listed.)
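
The underpowered-replication worry can be made concrete with a quick simulation (all numbers illustrative): with a genuinely small true effect and a small-n replication, a "failed" replication is the most likely outcome.

    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(3)
    d, n, sims = 0.2, 20, 10_000  # small true effect, n = 20 per group
    hits = sum(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(d, 1, n)).pvalue < 0.05
        for _ in range(sims)
    )
    print(f"power ~ {hits / sims:.2f}")  # ~0.09: the real effect is
                                         # usually "not replicated"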

Animats 2 hours ago [-]
> Most results in the field do actually replicate and are robust [citation needed], so it would be a pity to lose confidence in the whole field just because of a few bad apples.

Is there a good list of results that do consistently replicate?

hn_throw_250915 2 hours ago [-]
I thought we knew that these were vehicles by wannabe self-help authors to puff up their status for money. See for example “Grit” and “Deep Work” and other bullshit entries in a breathlessly hyped up genre of pseudoscience.
SpaceManNabs 3 hours ago [-]
One thing that confuses me is that some of these papers were successfully replicated, so juxtaposing them with the ones that have not been replicated at all, given the title of the page, feels a bit off. Not sure if that's fair.

The ego depletion effect seems intuitively surprising to me. Science is often unintuitive. I do know that it is easier to make forward-thinking decisions when I am not tired, so I don't know.

ceckGrad 2 hours ago [-]
>some of these papers were successfully replicated, so juxtaposing them to the ones that have not been replicated at all given the title of the page feels a bit off. Not sure if fair.

I don't like Giancotti's claims. He wrote:

> This post is a compact reference list of the most (in)famous cognitive science results that failed to replicate and should, for the time being, be considered false.

I don't agree with Giancotti's epistemological claims but today I will not bloviate at length about the epistemology of science. I will try to be brief.

If I understand Marco Giancotti correctly, one particular point is that he seems to be saying that Hagger et al. have impressively debunked Baumeister et al.

The ego depletion "debunking" is not really what I would call a refutation. It says, "Results from the current multilab registered replication of the ego-depletion effect provide evidence that, if there is any effect, it is close to zero. ... Although the current analysis provides robust evidence that questions the strength of the ego-depletion effect and its replicability, it may be premature to reject the ego-depletion effect altogether based on these data alone."

Maybe Baumeister's protocol was fundamentally flawed, but the counter-argument from Hagger et al. does not convince me. I wasn't thrilled with Baumeister's claims when they came out, but now I am somehow even less thrilled with the claims of Hagger et al., and I absolutely don't trust Giancotti's assessment. I could believe that Hagger executed Baumeister's protocol correctly, but I can't believe Giancotti has a grasp of what scientific claims "should" be "believed."

SpaceManNabs 12 minutes ago [-]
You make some good points based on your deeper read. I am a bit saddened that the rest of the comment section (the top 6 comments as of right now) devolved into "look at how silly psychology is with all its p-hacking"

That might be true, but this article's comment section isn't a good place for it because it doesn't seem like the article is entirely fair. I would not call it dishonest, but there is a lack of certainty and finality in being able to conclude that these papers have been successfully proven to not be replicable.

taeric 3 hours ago [-]
The idea isn't that it is easier to do things when not tired. It is that you specifically get tired exercising self control.

I think that can be subtly confused with people thinking you can't get better at self control with practice. That is, I would think a deliberate practice of doing more and more self control every day should build up your ability to exercise more self control. And it would be easy to think that means you have a stamina for self control that depletes in the same way that aerobic fitness works. But those don't necessarily follow from each other.

juujian 2 hours ago [-]
Now I want to know which cognitive psychology experiments were successfully replicated though.