With all the manuscripts I see as editor-in-chief of the Journal of Experimental Social Psychology, it’s clear that authors follow a wide variety of standards for statistical power analysis. In particular, the standards for power analysis of interaction effects are unclear. Most authors simply open up the GPower software and plug in the numerator degrees of freedom of the interaction effect, which gives a very generous estimate.

I often hear that power analysis is impossible to carry out for a novel effect, because you don’t know the effect size ahead of time. But for novel effects that are built upon existing ones, a little reasoning can let you guess the likely size of the new one. That’s the good news. The bad news is: you’re usually going to need a much bigger sample to get decent power than GPower alone will suggest.

**Heather’s Trilemma**

Meet our example social psychologist, Heather. In Experiment 1, she has participants listen either to a speeded-up or a normal-tempo piece of mildly pleasant music. Then they fill out a mood questionnaire. She wants to give her experiment 80% power to detect a medium effect, d = .5. Using GPower, for a between-subjects t-test, this requires n = 64 in each condition, or N = 128 total.
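If you’d rather script this than point and click, the same n = 64 can be reproduced from the noncentral t distribution (a sketch using SciPy; the d = .5, alpha = .05, two-tailed inputs are Heather’s):

```python
from math import sqrt
from scipy.stats import t, nct

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test with equal n per group."""
    df = 2 * n_per_group - 2
    ncp = d * sqrt(n_per_group / 2)        # noncentrality parameter
    t_crit = t.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

# Walk n upward until power first reaches .80, as an a-priori analysis does
n = 2
while power_two_sample_t(0.5, n) < 0.80:
    n += 1
print(n)  # -> 64 per group, N = 128 total
```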

The result actually shows a slightly larger effect size, d = .63. People are significantly happier after listening to the speeded-up music. Heather now wants to expand into a 2 x 2 between-subjects design that tests the effect’s moderation by the intensity of the music. So, she crosses the tempo manipulation with whether the music is mildly pleasant or intensely pleasant.

But there are three different authorities out there for this power analysis, and each is telling Heather something different.

1. Heather assumes that the interaction effect’s size is not known, even if the main effect’s is, so she goes with a medium effect again, d = .5. Using GPower, she finds the N needed to achieve 80% power for the interaction in this 2 x 2 ANOVA design, with 1 numerator degree of freedom and f = .25 (that is, d = .5). GPower tells her that, again, she needs a total of only 128, now divided among 4 cells, for 32 people per cell. *Because this power test of the interaction uses the same numerator df and other inputs as the test of the main effect, it gives the same result for N.*
2. Heather, however, always heard in graduate school that you should base factorial designs on a certain n per cell: your overall N should grow, not stay the same, as the design gets more complex. In days of old, the conventional wisdom said n = 20 per cell was just fine. Now, the power analysis of the first experiment shows that you need over three times that number. Still, n = 64 per cell should mean that you need N = 256 people in the 2 x 2 experiment, not 128, right? What’s the real story?
3. Then, Heather runs across a Data Colada blog post from a few years ago, “No-Way Interactions” by Uri Simonsohn. It said you actually need to *quadruple*, not just double, your overall N in order to move from a main-effect design to a between-subjects interaction design. So Heather’s Experiment 2 would require 512 participants, not 256!
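For the curious, answer #1’s figure can be reproduced outside GPower with the noncentral F distribution (a sketch with SciPy; f = .25, one numerator df, and four cells, as in Heather’s analysis):

```python
from scipy.stats import f as f_dist, ncf

def interaction_power(f_effect, N, n_cells=4, df1=1, alpha=0.05):
    """Power of a 1-df interaction F-test in a between-subjects factorial."""
    df2 = N - n_cells                  # denominator (error) df
    lam = (f_effect ** 2) * N          # noncentrality parameter, lambda = f^2 * N
    f_crit = f_dist.ppf(1 - alpha, df1, df2)
    return 1 - ncf.cdf(f_crit, df1, df2, lam)

print(interaction_power(0.25, 128))  # about .80, the same N as the main-effect t-test
```

Because the numerator df and f are the same as in the main-effect analysis, the total N comes out the same, which is exactly the trap answer #1 falls into.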

Most researchers at the time just shrugged at the Colada revelations (#3), said something like “We’re doomed” or “I tried to follow the math but it’s hard” or “These statistical maniacs are making it impossible to do psychology,” and went about their business. I can say with confidence that not a single one of the over 800 manuscripts I have triaged at JESP has cited the “No-Way” reason for using a higher cell n when testing an interaction than when testing a main effect!

But guess what? For most analyses in social psychology, answer #3 is the correct one. The “No-Way” argument deserves to be understood better. First, you need to know that the test of the interaction, by itself, is usually not adequate as a full model of the hypotheses you have. Second, you need to know how to estimate the expected effect size of an interaction when all you have is a prior main effect.

**Power of a Factorial Analysis: Not so Simple**

Answer #1, above (from basic power analysis), forgets that we are usually not happy just to see a significant interaction. No, we want to show more: that the interaction corresponds to a pattern of means that supports our conclusion. Here is an example showing that the shape of the interaction does matter once you add in different configurations of main effects.

Result A shows the interaction you would get if Heather’s new experiment replicates the effect of tempo found in the old experiment when the music is mildly pleasant. For intensely pleasant music, there is no effect of tempo — a boundary condition limiting the original effect! The interaction coexists with a main effect of tempo, where faster music leads to better mood overall. Let’s say this was Heather’s hypothesis all along: the effect will replicate, but is “knocked out” if the music is too intense.

Result B also shows an interaction with the same difference between slopes, and thus the same significance level and effect size: the tempo effect is “less positive” for intense than for mild music. But now the interaction coexists with two different main effects: faster music worsens mood, and intense music improves it. Here, the mild version shows no effect of tempo on mood, and the intense version actually shows worse mood with the fast than with the normal tempo. This doesn’t seem like such a good fit to the hypothesis!

Because these outcomes of the same interaction effect look so different, you need simple effects tests in order to fully test the stated hypotheses. And yes, most authors know this, and duly present simple tests breaking down their interaction. But this means that answer #2 (n calculated per-cell) is more correct than answer #1 (N calculated by the interaction test). The more cells, the more overall N you need to adequately power the smallest-scale simple test.

**Extending the Effect Size in a Factorial Analysis**

That’s one issue. The other issue is how you can guess at the size of the interaction when you add another condition. **This is not impossible.** It can be estimated from the size of the original effect that you’re building the interaction upon, if you have an idea what the shape of the interaction is going to be.

Let’s start by asking when the interaction’s effect size is likely to be about the same as the original main effect size. Here I’m running a couple of analyses with example data and symmetrical means. If the new condition’s effect (orange dots) is the same size as the existing effect but in the other direction — a “reversal” effect — that’s when the interaction effect size, converted to d, is approximately the same as the original effect size. A power analysis of the interaction alone suggests you need about half as many participants per cell (n = 19!). That’s Heather’s answer #1. But, a power analysis of each of the simple effects — each as big as the original effect — suggests that you need about the same number per cell (n = 38), or twice as many in total. We’ve already established that the simple effects are the power analyses to focus on, and so it looks like Heather’s answer #2 is right in this case.
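Under the assumptions of these examples (cell SD = 1, equal cell sizes), the interaction’s Cohen’s f follows directly from the cell means. Here is a sketch (NumPy; the d = .63 and the symmetric means are hypothetical) showing that a complete reversal yields an interaction f of d/2, which converts to an interaction “d” (= 2f) equal to the original effect size:

```python
import numpy as np

def interaction_f(cell_means, sd=1.0):
    """Cohen's f for the interaction in a 2x2 design with equal cell n.

    cell_means: 2x2 array; rows = levels of factor A, cols = levels of factor B.
    """
    m = np.asarray(cell_means, dtype=float)
    grand = m.mean()
    row_eff = m.mean(axis=1, keepdims=True) - grand
    col_eff = m.mean(axis=0, keepdims=True) - grand
    inter = m - grand - row_eff - col_eff      # interaction residuals
    return np.sqrt((inter ** 2).mean()) / sd

d = 0.63                        # hypothetical original simple effect
a = d / 2                       # half-difference between the two cells
reversal = [[ a, -a],           # simple effect of +d in row 1 ...
            [-a,  a]]           # ... and -d in row 2: a full reversal
print(round(interaction_f(reversal), 3))  # 0.315 = d/2, so interaction "d" = 2f = d
```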

Unfortunately, very few times in psychology do we add a factor and expect a complete reversal of the effect. For example, maybe you found that competent fellow team members in a competition are seen in a better light than incompetent ones. You add a condition where you evaluate members of the *opposing* team. Then you would expect that the incompetent ones would be more welcome, and the competent ones less so. That is the kind of situation you need in order to see a complete reversal.

However, it’s more usual that we expect the new condition to alter the size, rather than the direction, of the existing effect. In the strongest such case, a new condition “knocks out” the old effect. Perhaps you found that men get more angry in a frustrating situation than women. Then you add a baseline condition (orange dots), a calm situation where you’d expect *no* gender difference in anger. The example shown below, to the left, reveals that the interaction effect size here will be about half that of the original, strong gender effect. So you need more power: roughly four times the number suggested by the mere GPower analysis of the interaction effect. This is Simonsohn’s recommendation for the “knockout” case, too. But you can’t derive it from GPower without realizing that your estimate of the interaction’s effect size has to be *smaller* than that of its constituent simple effect.

Even more typically, we might expect the new condition to attenuate but not completely knock out the effect. What if, even in a resting state, men express more anger than women? Let’s say that resting-state gender differences are about half the size of that shown when in a frustrated state. This is the example on the right, above.

It’s an interaction pattern that is not uncommon to see in published research. But it also has a very small interaction effect size, about 1/4 that of the simple effect in the “frustrated” state. As a result, the full design takes over 1,000 participants to achieve adequate power for the interaction.
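The same kind of calculation shows why the required N balloons. In these idealized patterns (cell SD = 1), the knockout gives interaction f = d/4 and the 50% attenuation gives f = d/8, and every halving of f quadruples the N needed. A sketch with SciPy, assuming an original d of .63 as in Heather’s Experiment 1:

```python
from scipy.stats import f as f_dist, ncf

def required_N(f_effect, n_cells=4, df1=1, alpha=0.05, power=0.80):
    """Smallest total N giving the target power for a 1-df interaction F-test."""
    N = n_cells + 1
    while True:
        df2 = N - n_cells
        f_crit = f_dist.ppf(1 - alpha, df1, df2)
        if 1 - ncf.cdf(f_crit, df1, df2, (f_effect ** 2) * N) >= power:
            return N
        N += 1

d = 0.63                   # hypothetical original simple effect
print(required_N(d / 2))   # reversal:    interaction f = d/2
print(required_N(d / 4))   # knockout:    f = d/4 -> roughly 4x as many people
print(required_N(d / 8))   # attenuation: f = d/8 -> over 1,000 people
```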

I often see reports where a predicted simple-effects test is significant but the overall interaction is not. The examples show why: in all but reversal patterns, the simple-effects tests require fewer people for decent power than the interaction test does. But the interaction is important, too: it is the test of differences among simple effects. If your hypothesis is satisfied merely by one simple effect being significant and another not, you are committing the error of mistaking a “difference in significance” for a “significant difference.”

To sum up, the dire pronouncements in “No-Way Interactions” are true, but applying them correctly requires understanding the shape of the expected interaction.

- If you expect the new condition to show a reversal, use a cell n equal to your original study, total N = 2x the original.
- If you expect the new condition to knock out the effect, use a cell n twice that of your original study, for a total N = 4x the original.
- If you expect only a 50% attenuation in the new condition, you really ought to use a cell n seven times that of your original study, for a total N = 14x the original! Yes, moderators are harder to show than you might think.
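These multipliers can be sanity-checked with a back-of-envelope calculation (my own approximation, not from the post: a 1-df test needs noncentrality of about 7.85 for 80% power at alpha = .05). It reproduces the 2x and 4x figures; the attenuation case comes out nearer 16x under these idealized assumptions than the simulation-based 14x:

```python
LAMBDA = 7.85  # approx. noncentrality needed for 80% power, alpha = .05, 1-df test

def total_N_multiplier(f_ratio):
    """Rough total-N multiplier vs. the original two-cell study.

    f_ratio is the interaction's Cohen's f as a fraction of the original
    effect d. The 2 x 2 must power both the interaction and the full-size
    simple effect, so take whichever demands more people. All N's are in
    units of 1/d^2, so d cancels out of the ratio.
    """
    N_original = 4 * LAMBDA                # original study: 2 cells of 2*LAMBDA each
    N_interaction = LAMBDA / f_ratio ** 2  # N to power the interaction alone
    N_simple = 2 * N_original              # 4 cells, each as big as the original's
    return max(N_interaction, N_simple) / N_original

print(round(total_N_multiplier(1 / 2), 1))  # reversal:    2.0
print(round(total_N_multiplier(1 / 4), 1))  # knockout:    4.0
print(round(total_N_multiplier(1 / 8), 1))  # attenuation: 16.0 (the post's sims: ~14x)
```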

**Take-home message:** It is not impossible to estimate the effect size of a novel effect, if it builds on a known effect. But you may not like what the estimate has to say about the power of your design.

**Comments**

Dear Professor Giner-Sorolla, I’ve read several blog posts about the required sample size for the analysis of interaction effects. However, I am having a hard time finding published scientific articles or books on this topic that I could cite in my paper (to justify why I have not tested for an interaction effect). Do you know of any? Thank you in advance!

Hi Amber! I’m not sure of any citations that would let you justify not testing for an interaction effect. Cohen’s book Statistical Power Analysis for the Behavioral Sciences (revised, 1988) doesn’t contradict what I’ve said here, which is about choosing the tests and effect sizes that go into the power analysis itself, and it gives methods for the analysis.

I posted a follow-up to this, if you are interested in reading: http://markhw.com/blog/power-twoway. I provide a tool for doing power analyses for those three examples you provide.