The ego depletion saga demonstrates the importance of effect sizes


There has been a slew of systematic replication efforts and meta-analyses with rather provocative findings of late. The ego depletion saga is one of those stories. It is an important story because it demonstrates the clarity that comes with focusing on effect sizes rather than statistical significance.

I should confess that I’ve always liked the idea of ego depletion and even tried my hand at running a few ego depletion experiments.* And, I study conscientiousness, which is pretty much the same thing as self-control—at least as it is assessed using the Tangney et al. (2004) self-control scale, which was meant, in part, to be an individual-difference complement to the ego depletion experimental paradigms.

So, I was more than a disinterested observer as the “effect size drama” surrounding ego depletion played out over the last few years. First, you had the seemingly straightforward meta-analysis by Hagger et al. (2010), showing that the average effect size of the sequential task paradigm of ego depletion studies was a d of .62. Impressively large by most metrics that we use to judge effect sizes. That’s the same as a correlation of .3 according to the magical effect size converters. Despite prior mischaracterizations of correlations of that magnitude as small**, that’s nothing to sneeze at.
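For anyone who wants to check the conversion rather than trust the magic, the standard formula for two equal-sized groups is r = d / sqrt(d^2 + 4). A quick sketch in Python:

```python
from math import sqrt

def d_to_r(d: float) -> float:
    """Convert Cohen's d to r, assuming two equal-sized groups."""
    return d / sqrt(d ** 2 + 4)

print(round(d_to_r(0.62), 2))  # ~0.30, the correlation of .3 mentioned above
print(round(d_to_r(0.08), 2))  # ~0.04, relevant to the replication estimates below
```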

Quick on the heels of that meta-analysis came new meta-analyses and re-analyses of the meta-analytic data (e.g., Carter et al., 2015). These new meta-analyses and re-analyses concluded that there wasn’t any “there” there. Right after the Hagger et al. paper was published, the quant jocks came up with a slew of new ways of estimating bias in meta-analyses. What happens when you apply these bias estimators to the ego depletion data? There seemed to be a lot of bias in the research synthesized in these meta-analyses. So much so that the bias-corrected estimates included a zero effect size as a possibility (Carter et al., 2015). These re-analyses were then re-analyzed themselves, because the field of bias correction was moving faster than the basic science, and the initial corrections were called into question because, apparently, bias corrections are, well, biased… (Friese et al., 2018).
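For readers who haven’t met these bias estimators, one of the simpler flavors is a PET-style meta-regression: regress the observed effect sizes on their standard errors, weight by precision, and read the intercept as the bias-adjusted estimate. A minimal sketch in Python with made-up numbers (my illustration, not the actual ego depletion data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical study-level inputs: observed d-values and their standard errors.
# These numbers are invented for illustration; they are not the ego depletion data.
d = np.array([0.55, 0.70, 0.48, 0.62, 0.35, 0.80, 0.51, 0.66])
se = np.array([0.28, 0.33, 0.25, 0.30, 0.20, 0.36, 0.27, 0.31])

# PET-style meta-regression: regress d on SE with inverse-variance weights.
# The intercept is read as the bias-adjusted effect; the slope picks up
# small-study effects (bigger effects in noisier studies).
pet = sm.WLS(d, sm.add_constant(se), weights=1.0 / se ** 2).fit()
print(pet.params)  # [intercept, slope]
```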

Not to be undone by an inability to estimate truth from the prior publication record, another, overlapping group of researchers conducted their own registered replication report—the most defensible and unbiased method of estimating an effect size (Hagger et al., 2016). Much to everyone’s surprise, the effect across 23 labs was something close to zero (d = .04). Once again, this effort was criticized for being a non-optimal test of the ego depletion effect (Friese et al., 2018).

To address the prior limitations of all of these incredibly thorough analyses of ego depletion, yet a third team took it upon themselves to run a pre-registered replication project testing two additional approaches to ego depletion using optimal designs (Vohs, Schmeichel & others, 2018). Like a broken record, the estimates across 40 labs ranged from 0 (if you assumed zero as the prior) to about a d of .08 if you assumed otherwise***. If you bothered to compile the data across the labs and run a traditional frequentist analysis, this effect size, despite being minuscule, was statistically significant (trumpets sound in the distance).

So, it appears the best estimate of the effect of ego depletion is around a d of .08, if we are being generous.

Eyes wide shut

So, there were a fair number of folks who expressed some curiosity about the meaning of the results. They asked questions on social media, like, “The effect was statistically significant, right? That means there’s evidence for ego depletion.”

Setting aside effect sizes for a moment, there are many reasons to see the data as being consistent with the theory. Many of us were rooting for ego depletion theory. Countless researchers were invested in the idea either directly or indirectly. Many wanted a pillar of their theoretical and empirical foundational knowledge to hold up, even if the aggregate effect was more modest than originally depicted. For those individuals, a statistically significant finding seems like good news, even if it is really cold comfort.

Another reason for the prioritization of significant findings over the magnitude of the effect is, well, ignorance of effect sizes and their meaning. It was not too long ago that we tried in vain to convince colleagues that a Neyman-Pearson system was useful (balance power, alpha, effect size, and N). A number of my esteemed colleagues pushed back on the notion that they should pay heed to effect sizes. They argued that, as experimental theoreticians, they were, at best, testing directional hypotheses of no practical import. Since effect sizes were for “applied” psychologists (read: lower status), theoretical experimentalists had no need to sully themselves with the tools of applied researchers. They also argued that their work was “proof of concept” and the designs were not intended to reflect real world settings (see ego depletion), and therefore the effect sizes were uninterpretable. Setting aside the unnerving circularity of this thinking****, what it implies is that many people have not been trained on, or forced to think much about, effect sizes. Yes, they’ve often been forced to report them, but not to really think about them. I’ll go out on a limb and propose that the majority of our peers in the social sciences think about and make inferences based solely on p-values and some implicit attributes of the study design (e.g., experiment vs. observational study).

The reality, of course, is that every study of every stripe comes with an effect size, whether or not it is explicitly presented or interpreted. More importantly, a body of research in which the same study or paradigm is systematically investigated, as has been done with ego depletion, provides an excellent estimate of the true effect size for that paradigm. The reality of a true effect size in the range of d = .04 to d = .08 is a harsh reality, but one that brings great clarity.

Eyes wide open

So, let’s make an assumption. The evidence is pretty good that the effect size of sequential ego depletion tasks is, at best, d = .08.

With that assumption, the inevitable conclusion is that the traditional study of ego depletion using experimental approaches is dead in the water.

Why?

First, because studying a phenomenon with a true effect size of d = .08 is beyond the resources of almost all labs in psychology. To have 80% power to detect an effect size of d = .08 you would need to run more than 2500 participants through your lab. If you go with the d = .04 estimate, you’d need more than 9000 participants. More poignantly, none of the original studies used to support the existence of ego depletion were designed to detect the true effect size.
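If you want to check those figures, any power calculator will do. Here is a rough sketch using Python and statsmodels, assuming a simple two-group, two-tailed design at alpha = .05; the numbers it returns are per group, and the exact values shift a bit depending on the design you assume:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Participants needed per group for 80% power at alpha = .05, two-tailed,
# for the two candidate true effect sizes discussed above.
for d in (0.08, 0.04):
    n = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05,
                             alternative='two-sided')
    print(f"d = {d}: about {round(n)} participants per group")
# d = 0.08 needs roughly 2,450 per group; d = 0.04 needs roughly 9,800 per group
```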

These types of sample size demands violate most of our norms in psychological science. The average sample size in prior experimental ego depletion research appears to be about 50 to 60. With that kind of sample size, you have 6% power to detect the true effect.

What about our new rules of thumb, like do your best to reach an N of 50 per cell, or use 2.5 times the N of the original study, or crank the N up above 500 to test an interaction effect? Power is 8%, 11%, and 25% in each of those situations, respectively. If you ran your studies using these rules of thumb, you would be all thumbs.
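You can also run the calculation in reverse and ask how much power those rules of thumb buy you against a true d of .08. A sketch under the same assumptions as above (a simple two-group, two-tailed design, reading the rules of thumb as per-group ns), so the percentages come out close to, if not exactly, the ones just quoted:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Achieved power against a true d of .08 at alpha = .05, two-tailed,
# for a few per-group ns that roughly track the rules of thumb above.
for n in (50, 150, 500):
    p = analysis.power(effect_size=0.08, nobs1=n, alpha=0.05,
                       alternative='two-sided')
    print(f"n = {n} per group: power = {p:.0%}")
# prints roughly 7%, 11%, and 24%
```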

But, you say, I can get 2500 participants on mTurk. That’s not a bad option. But, you have to ask yourself: To what end? The import of ego depletion research, and much experimental work like it, is predicated on the notion that the situation is “powerful,” as in, it has a large effect. How important is ego depletion to our understanding of human nature if the effect is minuscule? Before you embark on the mega study of thousands of mTurkers, it might be prudent to answer this question.

But, you say, some have argued that small effects can cumulate and therefore be meaningful if studied with enough fidelity and across time. Great. Now all you need to do is run a massive longitudinal intervention study where you test how the minuscule effect of the manipulation cumulates over time and place. The power issue doesn’t disappear with this potential insight. You still have to deal with the true effect size of the manipulation being a d of .08. So, one option is to use a massive study. Good luck funding that study. The only way you could get the money necessary to conduct it would be to promise to do an fMRI of every participant. Wait. Oh, never mind.

The other option would be to do something radical like create a continuous intervention that builds on itself over time—something currently not part of ego depletion theory or traditional experimental approaches in psychology.

But, you say, there are hundreds of studies that have been published on ego depletion. Exactly. Hundreds of studies have been published that had an average d-value of .62. Hundreds of studies have been published showing effect sizes that cannot, by definition, be true given the true effect size is d = .08. That is the clarity that comes with the use of accurate effect sizes. It is incredibly difficult to get d-values of .62 when the true d is .08. Look at the distribution of observed d-values around a true effect that close to zero with sample sizes of 50. The likelihood of landing a d of .62 or higher is about 3%. This fact invites some uncomfortable questions. How did all of these people find this many large effects? If we assume they found these relatively huge, highly unlikely effects by chance alone, this would mean that there are thousands of studies lying about in file drawers somewhere. Or it means people used other means to dig these effects out of the data….
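If you want to see just how unlikely, a quick Monte Carlo sketch makes the point. The assumptions here are mine, chosen to match the numbers above: a total N of 50 (25 per cell) and a true d of .08:

```python
import numpy as np

rng = np.random.default_rng(1)
true_d, n_per_group, n_sims = 0.08, 25, 200_000  # 25 per cell = total N of 50

# Simulate many two-group studies at the assumed true effect and compute
# the observed standardized mean difference (Cohen's d) for each one.
a = rng.normal(true_d, 1.0, size=(n_sims, n_per_group))
b = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
pooled_sd = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
observed_d = (a.mean(axis=1) - b.mean(axis=1)) / pooled_sd

print((observed_d >= 0.62).mean())  # roughly .03, i.e., about a 3% chance
```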

Setting aside the motivations, strategies, and incentives that would net this many findings that are significantly unlikely to be correct (p < .03), the import of this discrepancy is huge. The fact that hundreds of studies with such unlikely results were published using the standard paradigms should be troubling to the scientific community. It shows that psychologists, as a group using the standard incentive systems and review processes of the day, can produce grossly inflated findings that lend themselves to the appearance of an accumulated body of evidence for an idea when, by definition, it shouldn’t exist. That should be more than troubling. It should be a wakeup call. Our system is more than broken. It is spewing pollution into the scientific environment at an alarming rate.

This is why effect sizes are important. Knowing that the true effect size of sequential ego depletion studies is a d of .08 leads you to conclude that:

1. Most prior research on the sequential task approach to ego depletion is so problematic that it cannot and should not be used to inform future research. Are you interested in those moderators or boundary mechanisms of ego depletion? Great, you are now proposing to see whether your new condition moves a d of .08 to something smaller. Good luck with that.

2. New research on ego depletion is out of reach for most psychological scientists unless they participate in huge multi-lab projects like the Psychological Science Accelerator.

3. Our field is capable of producing huge numbers of published reports in support of an idea that are grossly inaccurate.

4. If someone fails to replicate one of my studies, I can no longer point to dozens, if not hundreds of supporting studies and confidently state that there is a lot of backing for my work.

5. As has been noted by others, meta-analysis is fucked.

And don’t take this situation as anything particular to ego depletion. We now have reams of literatures in which registered replication reports or meta-analyses have shown that the original effect sizes were inflated and that the “truer” effect sizes are much smaller. In numerous cases, ranging from GxE studies to ovulatory cycle effects, the meta-analytic estimates, while statistically significant, are conspicuously smaller than most if not all of the original studies were capable of detecting. These updated effect sizes need to be weighed heavily in research going forward.

In closing, let me point out that I say these things with no prejudice against the idea of ego depletion. I still like the idea and still hold out a sliver of hope that the idea may be viable. It is possible that the idea is sound and the way prior research was executed is the problem.

But, extrapolating from the cumulative meta-analytic work and the registered replication projects, I can’t avoid the conclusion that the effect size for the standard sequential paradigms is small. Really, really small. So small that it would be almost impossible to realistically study the idea in almost any traditional lab.

Maybe the fact that these paradigms no longer work will spur some creative individuals on to come up with newer, more viable, and more reliable ways of testing the idea. Until then, the implication of the effect size is clear: Steer clear of the classic experimental approaches to ego depletion. And, if you nonetheless continue to find value in the basic idea, come up with new ways to study it; the old ways are not robust.

Brent W. Roberts

 

* p < .05: They failed.  At the time, I chalked it up to my lack of expertise.  And that was before it was popular to argue that people who failed to replicate a study lacked expertise.

** p < .01: See the “personality coefficient” in Mischel, W. (2013). Personality and Assessment. Psychology Press.

*** p < .005: that’s a correlation of .04, but who’s comparing effect sizes??

**** p < .001: “I’m special, so I can ignore effect sizes—look, small effect sizes—I can ignore these because I’m a theoretician. I’m still special”

 


