Justify Your Alpha … For Its Audience

In this light, it seems pertinent to ask, what is the fundamental purpose of a scientific publication? Although it is often argued that journals are a “repository of the accumulated knowledge of a field” (APA Publication Manual, 2010, p. 9), equally important are their communicative functions. It seems to us that the former purpose emphasizes perfectionism (and especially perfectionistic concerns) whereas the latter purpose prioritizes sharing information so that others may build on a body of work and ideas (flawed as they may eventually turn out to be).

(Reis & Yee, 2016)

There has been pushback against the call for more rigorous standards of evidence in psychology research publishing. Preregistration, replication, statistical power requirements, and lower alpha levels for statistical significance have all been criticized as leading to less interesting and creative research (e.g., Baumeister, 2016; Fiedler, Kutzner, & Krueger, 2012; and many, many personal communications). After all, present-day standards completely exclude from the scientific record any individual studies that do not meet significance. So, wouldn’t stricter standards expand the dark reach of the file-drawer even further? I want to show a different way out of this dilemma.

First things first. If creativity conflicts with rigor, it’s clear to me which side I chose 30 years ago. By deciding to go into social science rather than humanities, I chose a world where creativity is constrained by evidence. If you want to maximize your creativity, you’re in the wrong business. In a simulation of publishing incentives, Campbell and Gustafson (2018; pdf) show that adopting a lower threshold for significance is most helpful in avoiding false positives when studying novel, unlikely, “interesting” effects. It is precisely because counterintuitive ideas are likely to be untrue that we need more rigor around them.
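To make that last point concrete, here is a minimal sketch. It is not Campbell and Gustafson's simulation; it is just the textbook positive-predictive-value formula, with priors and a fixed 80% power that are my own illustrative assumptions. It asks: of the results that reach significance, what share reflect a true effect?

```python
def positive_predictive_value(prior, alpha, power):
    """Share of significant results that reflect a true effect."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: the priors and the 80% power are illustration only.
# Holding power fixed while alpha drops implies larger samples at the lower alpha.
for prior in (0.5, 0.1):
    for alpha in (0.05, 0.005):
        ppv = positive_predictive_value(prior, alpha, power=0.8)
        print(f"prior={prior:.1f}  alpha={alpha:.3f}  PPV={ppv:.3f}")
```

Under these assumptions, moving from .05 to .005 barely changes the picture for a coin-flip hypothesis, but makes a large difference for a long-shot one.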

Maximum creativity

One call for increased rigor has involved lowering significance thresholds, specifically to alpha = .005 (old school, for the hipsters: Greenwald et al., 1996 (pdf); new school: Benjamin et al., 2017). The best-known response, “Justify your alpha” (Lakens et al., 2017), calls instead for a flexible and openly justified standard of significance. It is not crystal clear what criteria should govern this flexibility, but we can take the “Justify” authors’ three downsides of adopting an alpha = .005 as potential reasons for using a higher alpha. There are problems with each of them.

Risk of fewer replication studies: “High-powered original research means fewer resources for replication.” Implicitly, this promotes a higher alpha for lower-powered exploratory studies, using the savings to fund high-powered replications that can afford a lower alpha (Sakaluk, 2016; pdf). Problem is, this model has been compared in simulations to just running well-powered research in the first place, and found lacking — ironically, by the JYA lead author. If evidence doesn’t care about the order in which you run studies, it makes more sense to run a medium-sized initial study and a medium-sized replication, than to run a small initial study and a large replication, because large numbers have statistically diminishing returns. The small study may be economical, but it increases the risk of running a large follow-up for nothing.
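The diminishing-returns point can be sketched with a back-of-envelope power calculation. This is not the simulation mentioned above. It is a normal-approximation sketch under my own assumptions: a hypothetical true effect of d = 0.4, two-group designs, a budget of 400 participants, and "both studies significant at .05" as the yardstick.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two group means (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    noncentrality = d * sqrt(n_per_group / 2.0)
    return STD_NORMAL.cdf(noncentrality - z_crit) + STD_NORMAL.cdf(-noncentrality - z_crit)


TRUE_D = 0.4  # hypothetical true effect size, chosen only for illustration

# Both plans spend 400 participants in total across an original study and a replication.
small_then_large = power_two_groups(TRUE_D, 50) * power_two_groups(TRUE_D, 150)
medium_then_medium = power_two_groups(TRUE_D, 100) ** 2

print(f"small + large:   P(both significant) = {small_then_large:.2f}")
print(f"medium + medium: P(both significant) = {medium_then_medium:.2f}")
```

With these inputs the even split comes out ahead: the extra participants in the large replication buy less power than the small original study gives up. Other effect sizes and budgets will shift the numbers, so treat this as an illustration of the logic rather than a general result.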

Risk of reduced generalizability and breadth: “Requiring that published research meet a higher threshold will push the field even more to convenience samples and convenient methods.” Problem is, statistical evidence does not care whether your sample is Mechanical Turk or mechanics in Turkey. We need to find another way to make sure people get credit for doing difficult research. There’s a transactional fallacy here, the same one you may have heard from students: “I pay so much tuition, worked so hard on this, I deserve a top grade!” Like grading, research is a truth-finding enterprise, not one where you pay your money and get the best results. To be clear (spoiler ahoy!) I would rather accommodate difficult research by questioning what we consider to be a publishable result than by changing standards of evidence for a positive proposition.

Risk of exaggerating the focus on single p-values: “The emphasis should instead be on meta-analytic aggregations.” Problem is, this issue is wholly independent of what alpha you choose. It’s possible to take p = .005 as a threshold for meta-analytic aggregation of similar studies in a paper or even a literature (Giner-Sorolla, 2016; pdf), just as p = .05 is now the threshold for such analyses. And even with alpha = .05, we often see the statistical stupidity of requiring all individual results in a multi-study paper to meet the threshold in order to get published.

See, I do think that alpha should be flexible: not according to subjective standards based on traits of your research, but based on traits of your research audience.

Reis and Yee, in their quote at the top, are absolutely right that publishing is as much a communicative as a truth-establishing institution. Currently, there is a context-free, one-size-fits-all publishing model where peer-reviewed papers are translated directly to press releases, news stories, and textbook examples. But I believe that the need to establish a hypothesis as “true” is most important when communicating with non-scientific audiences: lay readers, undergraduate students, policymakers reading our reports. Specifically, I propose that in an ideal world there should be:

  • no alpha threshold for communications read by specialists in the field
  • a .05 threshold for reporting positive results aimed at research psychologists across specialties
  • a .005 threshold (across multiple studies and papers) for positive results that are characterized as “true” to general audiences

(Yes, firmly negative results should have thresholds too — else how would we declare a treatment to be quackery, or conclude that there’s essentially no difference between two groups on an outcome? But on those thresholds, there’s much less consensus than on the well-developed significance testing for positive results. Another time.)

No threshold for specialists. If you’re a researcher, think about a topic that’s right on the money for what you do, a finding that could crucially modify or validate some of your own findings. Probably, if the study was well-conducted, you would want to read it no matter what it found. Even if the result is null, you’d want to know that in order to avoid going down that path in your own research. This realization is one of the motivators towards increasing acceptance of Registered Reports (pdf), where the methods are peer-reviewed, and a good study can be published regardless of what it actually finds.

.05 threshold for non-specialist scientific audiences. Around researchers’ special field, there is an area of findings that could potentially inspire and inform our work. Social psychologists might benefit from knowing how cognitive psychologists approach categorization, for example. There may not be enough interest to follow null findings in these areas, but there might be some interest in running with tentative positive results that can enrich our own efforts. Ideally, researchers would be well-trained enough to know that p = .05 is fairly weak evidence, and treat such findings as speculative even as they plan further tests of them. Having access to this layer of findings would maintain the “creative” feel of the research enterprise as a whole, and also give wider scientific recognition to research that has to be low-powered.

.005 threshold for general audiences. The real need for caution comes when we communicate with lay people who want to know how the mind works, including our own undergraduate students. Quite simply, the lower the p-value in the original finding, the more likely it is to replicate. If a finding is breathlessly reported, and then nullified or reversed by further work, this undermines confidence in the whole of science. Trying to communicate the shakiness of an intriguing finding, frankly, just doesn’t work. As a story is repeated through the media, the layers of nuance fall away and we are left with bold, attention-grabbing pronouncements. Researchers and university press offices are as much to blame for this as reporters are. Ultimately, we also have to account for the fact that most people think about things they’re not directly concerned with in terms of “true”/“not true” rather than shades of probability. (I mean, as a non-paleontologist I have picked up the idea that “oh, so T-Rex has feathers now” when it’s a more complicated story.)

We’ve seen the bunny, now for the chickie.

I make these recommendations fully aware that our current communication system will have a hard time carrying them out. Let’s take the first distinction, between no threshold and alpha = .05. To be a “specialist” journal, judging from the usual wording of editors’ rejection letters, is to sit in a lower league to which unworthy findings are relegated. However, currently I don’t see specialist journals as any more or less likely to publish null findings than more general-purpose psychology journals. If anything, the difference in standards is more about what is considered novel, interesting, and methodologically sound. Ultimately, the difference between my first two categories may be driven by researchers’ attention more than by journal standards. That is, reports with null findings will just be more interesting to specialists than to people working in a different specialty.

On to the public-facing threshold. Here I see hope that textbook authors and classroom teachers are starting to adopt a stricter evidence standard for what they teach, in the face of concerns about replicability. But it’s more difficult to stop the hype engine that serves the short-term interests of researchers and universities. Even if somehow we put a brake on public reporting of results until they reach a more stringent threshold, nothing will stop reporters reading our journals or attending our conferences. We can only hope, and insist, that journalists who go to those lengths to be informed also know the difference between results that the field is still working on, and results that are considered solid enough to go public. I hope that at least, when we psychologists talk research among ourselves, we can keep this distinction in mind, so that we can have both interesting communication and rigorous evaluation.

The ACE Article and Those Confounded Effect Sizes

There is a kind of psychology research article I have seen many times over the past 20 years. My pet name for it is the “ACE” because it goes like this:

  • Archival
  • Correlational
  • Experimental

These three phases, in that order, each with one or more studies. To invent an example, imagine we’re testing whether swearing makes you more likely to throw things … via a minimal ACE article.

(To my knowledge, there is no published work on this topic, but mea culpa if this idea is your secret baby.)

First, give me an “A”: Archival! We choose to dig up the archives of the US Tennis Association, covering all major matches from 1999 to 2007. Lucky for us, they have kept records of unsporting incidents for each player at each match, including swearing and throwing the racket. Using multilevel statistics, we find that indeed, the more a player curses in a match, the more they throw the racket. The effect size is a healthy r = .50, p < .001.

Now, give me a “C”: Correlational! We give 300 online workers a questionnaire, where we ask how often they have cursed in the last 7 days, and how often they have thrown something in anger in the same time period. The two measures are correlated r = .17, p = .003.

Finally, give me an “E”: Experimental! We bring 100 undergraduates into the lab. We get half of them to swear profusely for one full minute, the other half to rattle off as many names of fish as they can think of. We then let them throw squash balls against a target electronically emblazoned with the face of their most hated person, a high-tech update of the dart task in Rozin, Millman and Nemeroff (1986). And behold, the balls fly faster and harder in the post-swearing condition than in the post-fish condition, t(98) = 2.20, d = 0.44, p = .03.

Aesthetically, this model is pleasing. It embraces a variety of settings and populations, from the most naturalistic (tennis) to the most constrained (the lab). It progresses by eliminating confounds. We start with two settings, the Archival and Correlational, where throwing and cursing stand in an uncertain relation to each other. Their association, after all, could just be a matter of causation going either way, or third variables causing both. The Experiment makes it clearer. The whole package presents a compelling, full-spectrum accumulation of proof for the hypothesis. After all, the results are all significant, right?

But in the new statistical era, we care about more than just the significance of individual studies. We care about effect size across the studies. Effect size helps us integrate the evidence from non-significant studies and significant ones, under standards of full reporting. It goes into meta-analyses, both within the article and on a larger scale. It lets us calculate the power of the studies, informing analyses of robustness such as p-curve and R-index.

Effect sizes, however, are critically determined by the methods used to generate them. I would bet that many psychologists think about effect size as some kind of Platonic form, a relationship between variables in the abstract state. But effect size cannot be had except through research methods. And methods can:

  • give the appearance of a stronger effect than warranted, through confounding variables;
  • or obscure an effect that actually exists, through imprecision and low power.

So, it’s complicated to extract a summary effect size from the parts of an ACE. The archival part, and to some extent the correlational, will have lots of confounding. More confounds may remain even after controlling for the obvious covariates. Experimental data may suffer from measurement noise, especially if subtle or field methods are used.

And, indeed, all parts might yield an inflated estimate if results and analyses are chosen selectively. The archival part is particularly prone to suspicions of post-hoc peeking. Why those years and not other ones? Why thrown tennis rackets and not thrown hockey sticks? Why Larry the lawyer and Denise the dentist, but not Donna the doctor or Robbie the rabbi (questions asked of research into the name-letter effect; Gallucci, 2003, pdf)? Pre-registration of these decisions, far from being useless for giving confidence in archival studies, almost seems like a requirement, provided it happens before the data are looked at in depth.

The ACE’s appeal requires studies with quite different methods. When trying to subject it to p-curves, mini-meta-analyses (Goh, Hall & Rosenthal, 2016; pdf), or other aggregate “new statistics,” we throw apples and oranges in the blender together. If evidence from the experiment is free from confounds but statistically weak, and the evidence from the archival study is statistically strong and big but full of confounds, does the set of studies really show both strong and unconfounded evidence?
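To see what the blender does to the invented swearing example, here is a rough sketch of a fixed-effect mini-meta-analysis using Fisher's z. Several things in it are assumptions of mine rather than part of the example: the archival sample size (the example gives none), the d-to-r conversion (which presumes equal group sizes), and the treatment of the multilevel archival r as a plain correlation.

```python
import math


def d_to_r(d):
    """Convert Cohen's d to r, assuming two equal-sized groups."""
    return d / math.sqrt(d * d + 4.0)


def pooled_r(studies):
    """Fixed-effect pooled correlation via Fisher's z, weighting each study by n - 3."""
    weighted_sum = sum((n - 3) * math.atanh(r) for r, n in studies)
    total_weight = sum(n - 3 for _, n in studies)
    return math.tanh(weighted_sum / total_weight)


# Effect sizes come from the invented example above.
# ARCHIVAL_N is hypothetical: the example gives no sample size for that study.
ARCHIVAL_N = 200
archival = (0.50, ARCHIVAL_N)
correlational = (0.17, 300)
experimental = (d_to_r(0.44), 100)

one_of_each = pooled_r([archival, correlational, experimental])
archival_heavy = pooled_r([archival] * 3 + [experimental])
experiment_heavy = pooled_r([archival] + [experimental] * 3)

print(f"one of each:      r = {one_of_each:.2f}")
print(f"archival-heavy:   r = {archival_heavy:.2f}")
print(f"experiment-heavy: r = {experiment_heavy:.2f}")
```

The pooled estimate moves with the mix of study types: an article padded with archival-style studies looks stronger than one padded with experiments, even though nothing about the underlying effect has changed.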



The 18th-century showman Wolfgang von Kempelen had a way to convince spectators there was no human agent inside his chess-playing “automaton”. He would open first one cabinet, then the other, while the human player inside moved his torso into the concealed parts. At no time was the automaton (yes, the famous “Mechanical Turk”) fully open for view. Likewise, the ACE often has no study that combines rigorous method with robust finding. So, it’s hard to know what conclusions we should draw about effect size and significance on an article-wise level.

Leif Nelson at the Data Colada blog has recently taken a pessimistic view on finding the true effect size behind any research question. As he argues, the effect size has to take into account all possible studies that could have been run. So, evidence about it will always be incomplete.

Still, I think a resolution is possible, if we understand the importance of method. Yes, the best effect size takes into account both underlying effect and method used. But not all methods are equal. To have the clearest view of an underlying effect, we should focus on those methods that best meet the two criteria above: the least confounded with other factors, while at the same time being the most precise, sensitive, and free from error.

I said “possible,” not “easy.” For the first criterion, we need some agreement about which factors are confounds, and which are part of the effect. For example, suppose we show that science knowledge correlates with a “liberal”-“conservative” political self-report item. Then, trying to eliminate confounds, we covary out religious fundamentalism, right-wing authoritarianism, and support for big government. The residual of the lib-con item now represents a strange “liberal” who by statistical decree is just as likely as an equally strange “conservative” to be an anti-government authoritarian fundamentalist. In trying to purify a concept, you can end up washing it clean away.

Experimental methods make it somewhat easier to isolate an effect, but even then controversy might swirl around whether the effect has been whittled down to something too trivial or ecologically invalid to matter. A clear definition of the effect is also necessary in experiments. For example, whether you see implicit manipulations as the ultimate test of an effect depends on how much you see participant awareness and demand as part of the effects, or part of the problem. And are we content to look at self-reports of mental phenomena, or do we demand a demonstration of (messier? noisier?) behavioral effects as well? Finally, the effect size is going to be larger if you are only interested in the strength of an intervention versus no intervention — in which case, bring on the placebo effect, if it helps! It will usually be smaller, though, if you are interested in theoretically nailing down the “active ingredient” in that intervention.

The second criterion of precision is better studied. Psychometricians already know a lot about reducing noise in studies through psychometric validation. Indeed, there is a growing awareness that the low reliability and validity of many measures and manipulations in psychological research is a problem (Flake, Pek & Hehman, 2017, pdf). Even if the process of testing methods for maximum reliability is likely to be tedious, it is, theoretically, within our grasp.

But in a final twist, these two criteria often work against each other in research. For researchers trying to reach good statistical power, it is harder to run large numbers of participants in controlled lab experiments than in questionnaire or archival data collections. And implicit measures, adopted to avoid participant-awareness confounds, often increase measurement “noise” (Gawronski & De Houwer, 2014; Krause et al., 2011). This means it will be hard to get agreement about what method simultaneously maximizes the clarity of the effect and its construct validity. But the alternative is pessimism about the meaning of the effect size, and a return to direction-only statistics.

I’ll conclude, boldly. It is meaningless to talk about the aggregate effect size in an ACE-model article, or to apply any kind of aggregate test to it that depends on effect sizes, such as p-curve. The results will depend, arbitrarily, on how many studies are included using each kind of method. A litmus test: would these studies all be eligible for inclusion in a single quantitative meta-analysis? Best practice in meta-analysis demands that we define carefully the design of studies for inclusion, so that they are comparable with each other. Knowing what we know about methodology and effect size, the article, like the meta-analysis, is only a valid unit of aggregation if its studies’ methods are comparable with each other. The ACE article presents a compelling variety of methods and approaches, but that very quality is its Achilles’ heel when it comes to the “new statistics.”