A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, May 11, 2017

How a power analysis implicitly reveals the smallest effect size you care about

When designing a study, you need to justify the sample size you aim to collect. If one of your goals is to observe a p-value lower than the alpha level you decided upon (e.g., 0.05), one justification for the sample size can be a power analysis. A power analysis tells you the probability of observing a statistically significant effect, based on a specific sample size, alpha level, and true effect size. At our department, people who use power as a sample size justification need to aim for 90% power if they want to get money from the department to collect data.

A power analysis is performed based on the effect size you expect to observe. When you expect an effect with a Cohen’s d of 0.5 in a two-tailed independent t-test, and you use an alpha level of 0.05, you will have 90% power with 86 participants in each group. What this means is that only 10% of the distribution of effect sizes you can expect when d = 0.5 and n = 86 falls below the critical value required to get p < 0.05 in an independent t-test.
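A quick way to reproduce this number is base R’s power.t.test; the sketch below assumes equal group sizes and SD = 1, so that delta corresponds to Cohen’s d:

# sample size for 90% power, d = 0.5, alpha = .05, two-tailed independent t-test
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.90,
             type = "two.sample", alternative = "two.sided")
# n = 85.03 per group, which rounds up to 86 participants per group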

In the figure below, the power analysis is visualized by plotting the distribution of Cohen’s d given 86 participants per group when the true effect size is 0 (or the null-hypothesis is true), and when d = 0.5. The blue area is the Type 2 error rate (the probability of not finding p < α, when there is a true effect).


You’ve probably seen such graphs before (indeed, G*Power, the widely used power analysis software, provides these graphs as output). The only thing I have done is transform the t-value distribution that is commonly used in these graphs into the distribution of Cohen’s d. This is a straightforward transformation, but instead of presenting the critical t-value, the figure provides the critical d-value. I think people find it easier to interpret d than t. Only t-tests that yield a t ≥ 1.974, or a d ≥ 0.30, will be statistically significant. Effects smaller than d = 0.30 will never be statistically significant with 86 participants in each condition.
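The conversion only requires the group sizes: for an independent t-test, t = d / sqrt(1/n1 + 1/n2), so the critical d is the critical t multiplied by sqrt(1/n1 + 1/n2). A short sketch in R, assuming equal group sizes, reproduces the values above:

n1 <- n2 <- 86
df <- n1 + n2 - 2                         # 170 degrees of freedom
t_crit <- qt(1 - 0.05 / 2, df)            # critical t = 1.974
d_crit <- t_crit * sqrt(1 / n1 + 1 / n2)  # critical d = 0.30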
 
If you design a study where the results will be analyzed with a two-tailed independent t-test at α = 0.05, the smallest effect size you can statistically detect is determined exclusively by the sample size. The (unknown) true effect size only determines how far to the right the distribution of d-values lies, and thus which percentage of effect sizes will be larger than the smallest effect size of interest and will be statistically significant; that percentage is the statistical power.
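Because the critical d-value depends only on the per-group sample size and the alpha level, you can tabulate it directly. A small sketch, again assuming equal group sizes and a two-tailed test at α = 0.05:

crit_d <- function(n, alpha = 0.05) qt(1 - alpha / 2, df = 2 * n - 2) * sqrt(2 / n)
sapply(c(20, 50, 86, 200), crit_d)
# approximately 0.64, 0.40, 0.30, 0.20: larger samples push the critical d-value down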

I think it is reasonable to assume that if you decide to collect data for a study where you plan to perform a null-hypothesis significance test, you are not interested in effect sizes that will never be statistically significant. If you design a study that has 90% power for a medium effect of d = 0.5, the sample size you decide to use means that effects smaller than d = 0.3 will never be statistically significant. We can use this fact to infer what your smallest effect size of interest, or SESOI (Lakens, 2014), will be. Unless you state otherwise, we can assume your SESOI is d = 0.3, and that any effects smaller than this are considered too small to be interesting. Obviously, you are free to explicitly state that any effect smaller than d = 0.5 or d = 0.4 is already too small to matter for theoretical or practical purposes. But without such an explicit statement about what your SESOI is, we can infer it from your power analysis.

This is useful. Researchers who use null-hypothesis significance testing often only specify the effect they expect when the null is true (d = 0), but not the smallest effect size that should still be considered support for their theory when there is a true effect. This leads to a psychological science that is unfalsifiable (Morey & Lakens, under review). Alternative approaches to determining the smallest effect size of interest have recently been suggested. For example, Simonsohn (2015) suggested setting the smallest effect size of interest to the effect size the original study had 33% power to detect. If an original study used 20 participants per group, the smallest effect size of interest would be d = 0.49 (the effect size that could be detected with 33% power when n = 20).

Let’s assume the original study used a sample size of n = 20 per group. The figure below shows that an observed effect size of d = 0.8 would be statistically significant (d = 0.8 lies to the right of the critical d-value), but that the critical d-value is d = 0.64. That means that effects smaller than d = 0.64 would never be statistically significant in a study with 20 participants per group in a between-subjects design. I think it makes more sense to assume the smallest effect size of interest for researchers who design a study with n = 20 is d = 0.64, rather than d = 0.49. 
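Both numbers can be verified with the same kind of calculation; the sketch below assumes equal group sizes and a two-tailed α of 0.05:

n <- 20
t_crit <- qt(0.975, df = 2 * n - 2)   # critical t = 2.024
t_crit * sqrt(2 / n)                  # critical d = 0.64

# power of the original n = 20 study to detect d = 0.49 (Simonsohn's benchmark)
power.t.test(n = 20, delta = 0.49, sd = 1, sig.level = 0.05)$power  # roughly 0.33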


The figures can be produced by a new Shiny app I created (the Shiny app also plots power curves and the p-value distribution; they are not all visible on Shinyapps.org, but you can try it HERE as long as the bandwidth lasts, or just grab the code and app from GitHub). I might discuss these figures in a future blog post. If you have designed your next study, check the critical d-value to make sure that the smallest effect size you care about isn’t smaller than the critical effect size you can actually detect. If you think smaller effects are interesting but you don’t have the resources to detect them, specify your SESOI explicitly in your article. You can also use this specified smallest effect size of interest in an equivalence test to statistically reject effects large enough to be deemed worthwhile (Lakens, 2017), which helps when interpreting t-tests where p > α. In short, we really need to start specifying the effects we expect under the alternative model, and if you don’t know where to start, your power analysis might have been implicitly telling you what your smallest effect size of interest is.
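To see how such an equivalence test works, here is a minimal sketch of the two one-sided tests (TOST) procedure from summary statistics, assuming equal variances and a SESOI expressed in Cohen’s d; the helper function tost_d and the example numbers are hypothetical, and the TOSTER R package described in Lakens (2017) provides a complete implementation:

# two one-sided tests (TOST) against a SESOI expressed in Cohen's d
tost_d <- function(m1, m2, sd1, sd2, n1, n2, sesoi_d, alpha = 0.05) {
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  se <- sd_pooled * sqrt(1 / n1 + 1 / n2)
  df <- n1 + n2 - 2
  t_lower <- ((m1 - m2) + sesoi_d * sd_pooled) / se  # test against -SESOI
  t_upper <- ((m1 - m2) - sesoi_d * sd_pooled) / se  # test against +SESOI
  # equivalence is declared when both one-sided p-values are below alpha
  c(p_lower = 1 - pt(t_lower, df), p_upper = pt(t_upper, df))
}
# hypothetical example: two groups of 86, observed difference of 0.1 SD, SESOI of d = 0.3
tost_d(m1 = 0.1, m2 = 0, sd1 = 1, sd2 = 1, n1 = 86, n2 = 86, sesoi_d = 0.3)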


References
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. https://doi.org/10.1002/ejsp.2023

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. https://doi.org/10.1177/1948550617697177

Morey, R. D., & Lakens, D. (under review). Why most of psychology is statistically unfalsifiable.

Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341

11 comments:

  1. If experimental psychologists are basing their sample size requirements on the effect size they expect to observe, then they are making a mistake, because, for one thing, their experiment will be underpowered to detect a smaller effect size that they would still consider scientifically of interest. Sample size planning should always be based on the smallest effect size of interest.

    1. That's what I was thinking - the d = 0.3, i.e. what you call the smallest effect size of interest, is just the sample effect size, whereas the d = 0.5 you based the power calculation on is a postulated population effect size. If you have a smallest effect size of interest, wouldn't you want to treat it as a postulated population effect size and base your power computation on that?

      Incidentally, your departmental policy sounds interesting (swing for 90% power). Do you have any worked out examples, i.e., of your colleagues identifying the effect size that they're investigating etc.?

    2. Hi - current practice in power analysis in psychology is not very state of the art, and the problem is that smallest effect sizes of interest are almost never specified or used. That's why I wrote this post - to bootstrap a SESOI, building from a practice people already use (power analysis).

      We've had this practice of 90% power for about 2 years. I could give examples - very often people in our group specify a SESOI, or they look at a pilot study, and then use a more conservative estimate in a power analysis. If there is large uncertainty, we recommend sequential analyses.

    3. Jan, although your last paragraph was presumably aimed at Daniel, rather than me, when applying for funding, I always base my proposed sample size on having 90% power to detect the smallest effect size of interest. This usually winds up giving me 99% power or more to detect the hypothesized true effect size.

    4. JT - it's good you don't work at our department! Our ethics department would not be easily convinced by designing studies with 99% power - it's wasteful, and our resources can be spent more efficiently! You should really do sequential analyses (see Lakens, 2014, for an introduction; you are anonymous, so I don't know what you know about stats, but if you've never learned about sequential analyses, you should!).

    5. Daniel, I'm a biostatistician, but I occasionally consult for social scientists. I disagree that 99% power is inefficient. We are interested not just in detecting an effect, but in obtaining a reasonably precise estimate of the effect size. The flip side of high power is narrow confidence intervals.

      As to sequential analysis, I agree that in many psych experiments the approach is useful. However, sequential analysis would not be practical in most studies I've been involved with. For example, if we need patients to be under treatment for three months, then, for many logistical reasons as well as financial ones, we really need the study to terminate after three months.

    6. Should you ever run out of ideas for blog posts, I think one where you detail how you or your collaborators arrived at an effect size or SESOI would make for interesting reading. My sense is that power analyses are often based on canned effect sizes with little regard to the specifics of the study (theory and design), so it would be useful to see some more sophisticated approaches to specifying ESs.

  2. Hi Daniël! Interesting post. Just a detail: I think Simonsohn (2015) did not suggest setting the smallest effect size of interest to 33% of the effect size in the original study, as you write. He suggested setting the smallest effect size of interest so that the original experiment had 33% power to reject the null if this ES were true. This smallest ES of interest thus does not depend on the observed effect size of the original study: it only depends on the sample size. For instance, for n = 20 per cell in a two-cell design, the effect size would be d = 0.5, because this gives 33% power. Your approach is that the smallest ES is the effect size that gives 50% power in the original study. It makes a difference, but I think your approach is, in the end, quite close to Simonsohn's approach.

    1. Thanks! Changed (and I knew that - last minute addition I didn't think through! Thanks for correcting me!).

  3. I find this equivocates between the observed ES and the population ES. This is very common in psych, and it would really help if you labelled which one you have in mind whenever an effect size is used. Cohen had a subscript s for the observed ES. (I use difference for the observed, and discrepancy for the parametric effect size.)
    To take the simple one-sample test of a Normal mean, H0: mu ≤ 0 vs. H1: mu > 0, the cut-off for rejection at the 0.025 level is a sample mean M of 1.96 SE. Are you saying the pop effect size of interest is this cut-off, 1.96 SE? That would be to take, as the pop ES of interest, one against which the test has 50% power. I'm not saying that would be bad; I'm just trying to figure out your equivocal use of effect size.

  4. Since you asked where it's equivocal, between pop ES and sample ES, here's one: you say "true effect size is 0 (or the null-hypothesis is true), and when d = 0.5."
    Here d = .5 appears to speak of the pop ES; on your graph it's the observed ES.
    Another: your first figure shows d = .5 and also d = .3; the first, I take it, is a population ES, the second a sample ES.

    A separate issue I have with using these standardized pop d's is that it seems you're allowed to do the analysis without knowing the standard deviation. Is that so?
