The ACE Article and Those Confounded Effect Sizes

There is a kind of psychology research article I have seen many times over the past 20 years. My pet name for it is the “ACE” because it goes like this:

  • Archival
  • Correlational
  • Experimental

These three phases, in that order, each with one or more studies. To invent an example, imagine we’re testing whether swearing makes you more likely to throw things … via a minimal ACE article.

(To my knowledge, there is no published work on this topic, but mea culpa if this idea is your secret baby.)

First, give me an “A” — Archival! We choose to dig up the archives of the US Tennis Association, covering all major matches from 1999 to 2007. Lucky for us, they have kept records of unsporting incidents for each player at each match, including swearing and throwing the racket. Using multilevel statistics, we find that indeed, the more a player curses in a match, the more they throw the racket. The effect size is a healthy r = .50, p  < .001.

Now, give me a “C” — Correlational! We give 300 online workers a questionnaire, where we ask how often they have cursed in the last 7 days, and how often they have thrown something in anger in the same time period. The two measures are correlated r = .17, p = .003.

Finally, give me an “E” — Experimental! We bring 100 undergraduates into the lab. We get half of them to swear profusely for one full minute, the other half to rattle off as many names of fish as they can think of. We then let them throw squash balls against a target electronically emblazoned with the face of their most hated person, a high-tech update of the dart task in Rozin, Millman and Nemeroff (1986). And behold, the balls fly faster and harder in the post-swearing condition than in the post-fish condition, t(98) = 2.20, d=0.44, p = .03.

Aesthetically, this model is pleasing. It embraces a variety of settings and populations, from the most naturalistic (tennis) to the most constrained (the lab). It progresses by eliminating confounds. We start with two settings, the Archival and Correlational, where throwing and cursing stand in an uncertain relation to each other. Their association, after all, could just be a matter of causation going either way, or third variables causing both. The Experiment, by manipulating swearing directly, resolves that ambiguity. The whole package presents a compelling, full-spectrum accumulation of proof for the hypothesis. After all, the results are all significant, right?

But in the new statistical era, we care about more than just the significance of individual studies. We care about effect size across the studies. Effect size helps us integrate the evidence from non-significant studies and significant ones, under standards of full reporting. It goes into meta-analyses, both within the article and on a larger scale. It lets us calculate the power of the studies, informing analyses of robustness such as p-curve and R-index.

Effect sizes, however, are critically determined by the methods used to generate them. I would bet that many psychologists think about effect size as some kind of Platonic form, a relationship between variables in the abstract. But effect size cannot be had except through research methods. And methods can:

  • give the appearance of a stronger effect than warranted, through confounding variables;
  • or obscure an effect that actually exists, through imprecision and low power.

So, it’s complicated to extract a summary effect size from the parts of an ACE. The archival part, and to some extent the correlational, will have lots of confounding. More confounds may remain even after controlling for the obvious covariates. Experimental data may suffer from measurement noise, especially if subtle or field methods are used.

And, indeed, all parts might yield an inflated estimate if results and analyses are chosen selectively. The archival part is particularly prone to suspicions of post-hoc peeking. Why those years and not other ones? Why thrown tennis rackets and not thrown hockey sticks? Why Larry the lawyer and Denise the dentist, but not Donna the doctor or Robbie the rabbi (questions asked of research into the name-letter effect; Gallucci, 2003, pdf)? Pre-registration of these decisions, far from being useless for giving confidence in archival studies, almost seems like a requirement, provided it happens before the data are looked at in depth.

The ACE’s appeal requires studies with quite different methods. When trying to subject it to p-curves, mini-meta-analyses (Goh, Hall & Rosenthal, 2016; pdf), or other aggregate “new statistics,” we throw apples and oranges in the blender together. If evidence from the experiment is free from confounds but statistically weak, and the evidence from the archival study is statistically strong and big but full of confounds, does the set of studies really show both strong and unconfounded evidence?
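To see the problem concretely, here is a sketch of what a fixed-effect mini-meta-analysis of the three invented ACE results would look like. The archival sample size is never specified above, so the 1,500 player-matches below is a purely hypothetical figure, and the d-to-r conversion assumes equal group sizes:

```python
# Fixed-effect mini-meta-analysis of the invented ACE results via Fisher's z.
# The archival N of 1,500 is hypothetical; the other figures come from the
# running example above.
import numpy as np

def d_to_r(d):
    return d / np.sqrt(d ** 2 + 4)             # equal-group-size approximation

studies = [
    ("archival",      0.50,        1500),      # r = .50, hypothetical N
    ("correlational", 0.17,         300),      # r = .17, N = 300
    ("experimental",  d_to_r(0.44), 100),      # d = 0.44 converted to r (~.21)
]

z = np.arctanh([r for _, r, _ in studies])     # Fisher's z transform
w = np.array([n - 3 for _, _, n in studies])   # inverse-variance weights, var(z) = 1/(N - 3)
combined_r = np.tanh((w * z).sum() / w.sum())
print(round(combined_r, 2))                    # about .44
```

The combined estimate lands almost on top of the heavily weighted, and most confounded, archival correlation, while the clean but small experiment barely moves it.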

[Image: the Mechanical Turk]

The 18th-century showman Wolfgang von Kempelen had a way to convince spectators there was no human agent inside his chess-playing “automaton”. He would open first one cabinet, then the other, while the human player inside moved his torso into the concealed parts. At no time was the automaton (yes, the famous “Mechanical Turk”) fully open for view. Likewise, the ACE often has no study that combines rigorous method with robust finding. So, it’s hard to know what conclusions we should draw about effect size and significance on an article-wise level.

Leif Nelson at the Data Colada blog has recently taken a pessimistic view on finding the true effect size behind any research question. As he argues, the effect size has to take into account all possible studies that could have been run. So, evidence about it will always be incomplete.

Still, I think a resolution is possible, if we understand the importance of method. Yes, the best effect size takes into account both underlying effect and method used. But not all methods are equal. To have the clearest view of an underlying effect, we should focus on those methods that best meet the two criteria above: the least confounded with other factors, while at the same time being the most precise, sensitive, and free from error.

I said “possible,” not “easy.” For the first criterion, we need some agreement about which factors are confounds, and which are part of the effect.  For example, we show that science knowledge correlates with a “liberal”-“conservative” political self-report item. Then, trying to eliminate confounds, we covary out religious fundamentalism, right-wing authoritarianism, and support for big government. The residual of the lib-con item now represents a strange “liberal” who by statistical decree is just as likely as an equally strange “conservative” to be an anti-government authoritarian fundamentalist. In trying to purify a concept, you can end up washing it clean away.

Experimental methods make it somewhat easier to isolate an effect, but even then controversy might swirl around whether the effect has been whittled down to something too trivial or ecologically invalid to matter. A clear definition of the effect is also necessary in experiments. For example, whether you see implicit manipulations as the ultimate test of an effect depends on how much you see participant awareness and demand as part of the effects, or part of the problem. And are we content to look at self-reports of mental phenomena, or do we demand a demonstration of (messier? noisier?) behavioral effects as well? Finally, the effect size is going to be larger if you are only interested in the strength of an intervention versus no intervention — in which case, bring on the placebo effect, if it helps! It will usually be smaller, though, if you are interested in theoretically nailing down the “active ingredient” in that intervention.

The second criterion of precision is better studied. Psychometricians already know a lot about reducing noise in studies through psychometric validation. Indeed, there is a growing awareness that the low reliability and validity of many measures and manipulations in psychological research is a problem (Flake, Pek & Hehman, 2017, pdf). Even if the process of testing methods for maximum reliability is likely to be tedious, it is, theoretically, within our grasp.
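To make the “noise” point concrete, the classic attenuation formula (Spearman, 1904) shows how unreliability in either measure shrinks the correlation you can observe. The reliabilities below are made up for illustration, not values from any particular study:

```python
# Attenuation of a correlation by measurement unreliability (Spearman, 1904).
# The true correlation and the two reliabilities here are illustrative only.
def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the reliabilities of the two measures."""
    return true_r * (rel_x * rel_y) ** 0.5

print(round(attenuated_r(0.40, 0.70, 0.60), 2))   # a "true" r of .40 observed as about .26
```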

But in a final twist, these two criteria often work against each other in research. It is harder to reach good statistical power by running large numbers of participants in controlled lab experiments than it is with questionnaire or archival data collections. And implicit measures, which sidestep participant-awareness confounds, often increase measurement “noise” (Gawronski & De Houwer, 2014; Krause et al., 2011). This means it will be hard to get agreement about what method simultaneously maximizes the clarity of the effect and its construct validity. But the alternative is pessimism about the meaning of the effect size, and a return to direction-only statistics.

I’ll conclude, boldly. It is meaningless to talk about the aggregate effect size in an ACE-model article, or to apply any kind of aggregate test to it that depends on effect sizes, such as p-curve. The results will depend, arbitrarily, on how many studies are included using each kind of method. A litmus test: would these studies all be eligible for inclusion in a single quantitative meta-analysis? Best practice in meta-analysis demands that we define carefully the design of studies for inclusion, so that they are comparable with each other. Knowing what we know about methodology and effect size, the article, like the meta-analysis, is only a valid unit of aggregation if its studies’ methods are comparable with each other. The ACE article presents a compelling variety of methods and approaches, but that very quality is its Achilles’ heel when it comes to the “new statistics.”


The Lab-Wise File Drawer

The file drawer problem refers to a researcher failing to publish studies that do not get significant results in support of the main hypothesis. There is little doubt that file-drawering has been endemic in psychology. A number of tools have been proposed to analyze excessive rates of significant publication and suspicious patterns of significance (summarized and referenced in this online calculator). All the same, people disagree about how seriously the file-drawer distorts conclusions in psychology (e.g. Fabrigar & Wegener in JESP, 2016, with comment by Francis and reply). To my mind the greater threat of file-drawering noncompliant studies is ethical. It simply looks bad, has little justification in an age of supplemental materials that shatter the page count barrier, and violates the spirit of the APA Ethics Code.

But I’ll bet that all of us who have been doing research for any time have a far larger file drawer composed of whole lines of research that simply did not deliver significant or consistent results. The ethics here are less clear. Is it OK to bury our dead? Where would we publish whole failed lines of research?

Of course, some research topics are interesting no matter how they come out. As I’ve argued elsewhere, focusing on questions like these would remove a lot of uncertainty from careers and undercut the careerist motivation for academic fraud. But would even the most extreme advocate of open science reform back a hard requirement for a completely transparent lab?

The lab-wise file drawer rate — if you will, the circular file drawer — could explain part or all of publication bias. Statistical power could lag behind the near-unanimous significance of reported results due to whole lines of research being file-drawered. The surviving lines of research, even if they openly report all studies run, would then look better than chance would have it. I ran some admittedly simplistic simulations (Google Drive spreadsheet) to check out how serious the circular file drawer can get.

[Image: a circular file drawer]
File, circular, quantity 1

Our simulated lab has eight lines of research going on, testing independent hypotheses with multiple studies. Each study has 50% power to find a significant result, given the match between its methods and the actual population effect size out there. You may think this is low, but keep in mind that even if you build in 80 or 90% power to find, say, a medium-sized effect, the actual effect may be small or practically nil, reducing your power post-hoc.
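To put numbers on that point, here is a rough sketch using a normal approximation to a two-sample t-test; 64 per group is roughly the sample size that gives 80% power to detect d = 0.5, and the d = 0.2 line shows what happens when the true effect is small:

```python
# Planned vs. post-hoc power under a normal approximation to a two-sided,
# two-sample t-test with equal group sizes. The sample size of 64 per group is
# the usual figure for 80% power at d = 0.5.
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    z = NormalDist()
    ncp = d * (n_per_group / 2) ** 0.5        # noncentrality of the test statistic
    z_crit = z.inv_cdf(1 - alpha / 2)
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

print(round(approx_power(0.5, 64), 2))   # about .81 here: the ~80% you planned for
print(round(approx_power(0.2, 64), 2))   # about .20: the power you actually have if d = 0.2
```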

The lab also follows rules for conserving resources while conforming to the standard of evidence that many journals follow. For each idea, they run up to three studies. They stop a line of research short if a promising first study fails to replicate, or if the first two studies both fail; and they don’t publish any line in which two studies fail to get a significant result. In this 50% power lab, where each study’s success is a coin flip, this means that out of eight lines of research, they will only end up trying to publish three. Sound familiar?

Remember, all these examples assume that the lab reports even non-significant studies from lines of research that “succeed” by this standard. There is no topic-wise file drawer — only a lab-wise one.

In our example lab that runs at a consistent 50% power, at the spreadsheet’s top left, the results of this practice look pretty eye-opening. Even though not all reported results are significant, the 77.8% that are significant still exceed the 50% power of the studies. This leads to an R-index of 22, which has been described as a typical result when reporting bias is applied to a nonexistent effect (Schimmack, 2016).
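For readers who prefer code to spreadsheets, here is a minimal sketch of that calculation. It works from exact probabilities rather than simulated draws, assumes the stopping and publishing rules exactly as I described them above, and computes the R-index from the known true power rather than from estimated observed power, so treat it as an approximation of the spreadsheet rather than a copy of it:

```python
# Exact expected outcomes for one line of research under the lab's rules:
# run up to three studies; abandon the line if a promising first study fails
# to replicate or if the first two studies both fail; publish only if no more
# than one of the three studies is non-significant.
from itertools import product

def lab_outcomes(power):
    """Return (probability the line is published, significance rate among published studies)."""
    publish_prob = 0.0
    published_studies = 0.0
    published_sig = 0.0
    for seq in product([True, False], repeat=3):      # each study significant or not
        prob = 1.0
        for sig in seq:
            prob *= power if sig else (1 - power)
        s1, s2 = seq[0], seq[1]
        if (s1 and not s2) or (not s1 and not s2):    # abandoned after two studies
            continue
        if sum(seq) < 2:                              # three studies run, two of them failed
            continue
        publish_prob += prob
        published_studies += prob * 3
        published_sig += prob * sum(seq)
    return publish_prob, published_sig / published_studies

for power in (0.05, 0.50, 0.80):
    publish_prob, sig_rate = lab_outcomes(power)
    r_index = power - (sig_rate - power)              # observed power minus inflation
    print(f"power={power:.2f}: {publish_prob * 8:.2f} of 8 lines published, "
          f"{sig_rate:.1%} of published studies significant, R-index={r_index:.2f}")
# At 50% power this reproduces the figures above: three of eight lines published
# and 77.8% of published studies significant, for an R-index of about .22. The
# 80% and 5% power columns discussed below come from the same function (the 5%
# case gives a negative R-index, consistent with it breaking the index's assumptions).
```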

[Image: spreadsheet results for the uniform 50% power lab]

Following the spreadsheet down, we see minimal effects of adopting slightly different rules that are more or less conservative about abandoning a research line after failure. The alternative rules change the expected number of studies run by only about one study per four topics, and the R-indices from these analyses are still problematic.

Following the spreadsheet to the right, we see stronger benefits of carrying out more strongly powered research — which includes studying effects that are more strongly represented in the population to begin with. At 80% power, most research lines yield three significant studies, and the R-index becomes a healthy 71.

[Image: spreadsheet results for the 80% power lab]

The next block to the right assumes only 5% power – a figure that breaks the assumptions of the R index. This represents a lab that is going after an effect that doesn’t exist, so tests will only be significant at the 5% type I error rate. Each of the research rules is very effective in limiting exposure to completely false conclusions, with only one in hundreds of false hypotheses making it to publication.

Before drawing too many conclusions about the 50% power example, however, it is important to question one of its assumptions. If all studies run in a lab have a uniform 50% power, and use similar methods, then all the hypotheses are true, with the same population effect size. Thus, variability in the significance of studies cannot reflect (nonexistent) variability in the truth value of hypotheses.

To reflect reality more closely, we need a model like the one I present at the very far right. A lab uses similar methods across the board to study a variety of hypotheses: 1/3 hypotheses with strong population support (so their standard method yields 80% power), 1/3 with weaker population support (so, 50% power), and 1/3 hypotheses that are just not true at all (so, 5% power). This gives the lab’s publication choices a chance to represent meaningful differences in reality, not just random variance in sampling.

What happens here?

[Image: spreadsheet results for the mixed-power lab]

This lab, as expected, flushes out almost all of its tests of nonexistent effects, and finds relatively more success with research lines that have high power based on a strong effect, versus low power based on a weaker effect. As a result, inflation of published findings is still appreciable, but less problematic than if each study is done at a uniform 50% power.
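Using the lab_outcomes() function sketched above, the mixed lab can be checked the same way. I take the average true power of the published studies as the R-index baseline here, which may not match how the spreadsheet computes it, so the exact value is only illustrative:

```python
# Mixed lab: one third of research lines at 80%, 50%, and 5% power respectively,
# pooled across the published studies. Assumes lab_outcomes() from the sketch above.
powers = (0.80, 0.50, 0.05)
tot_studies = tot_sig = tot_power = 0.0
for power in powers:
    publish_prob, line_sig_rate = lab_outcomes(power)
    studies = 3 * publish_prob            # expected published studies per line
    tot_studies += studies
    tot_sig += studies * line_sig_rate
    tot_power += studies * power
sig_rate = tot_sig / tot_studies
mean_power = tot_power / tot_studies      # average true power of published studies
print(f"{sig_rate:.1%} significant vs. mean true power {mean_power:.1%}; "
      f"R-index={mean_power - (sig_rate - mean_power):.2f}")
# Inflation is still there, but smaller than in the uniform 50% power lab.
```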

To sum up:

  1. A typical lab might expect to have its published results show inflation above their power levels even if it commits to reporting all studies relevant to each separate topic it publishes on.
  2. This is because of sampling randomness in which topics are “survivors.” The more a lab can reduce this random factor — by running high-powered research, for example — the less lab-wise selection creates inflation in reported results.
  3. A lab’s file drawer creates the most inflation when it tries to create uniform conditions of low power — for example, trying to economize by studying strong effects using weak methods and weak effects using strong methods, so that a uniform post-hoc power of 50% is reached (as in the first example). It may be better to let weakly supported hypotheses wither away (as in the hybrid lab example).

And three additional observations:

  1. These problems vanish in a world where all results are publishable because research is done and evaluated in a way that reflects confidence in even null results (for example, the world of reviewed pre-registration). Psychology is a long way from that world, though.
  2. The lab-wise file drawer adds another degree of uncertainty when trying to use post-hoc credibility analyses to assess the existence and extent of publication bias. Some of that publication bias may come from the published topics simply being luckier than other topics in the lab.
  3. If people are going to disagree on the implications, a lot of it will hinge on whether it is ethical to not report failed lines of research. Those who think it’s OK will see a further reason not to condemn existing research, because part of the inflation used as evidence for publication bias could be due to this OK practice. Those who think it’s not OK will press for even more complete reporting requirements, bringing lab psychology more in line with practices in other lab sciences (see illustration).

    [Image: a page from a lab notebook]
    Being trusted is a privilege.

My JESP Inaugural Editorial

A short break from the editors’ forum questions: I’m not sure I’ve given proper publicity to my editorial, which according to the cellulose-based publication schedules is “coming out” in the “July issue” but is now accessible – openly, as far as I can tell. So here: publicity.

If you want the take-home summary about “the crisis” here it is, in bold, with a little commentary:

1. I would rather not set out hard standards for the number of participants in a study or condition. I don’t think our field has done the theory and method development to support that kind of standard.

I would rather indirectly reward the choice to study both higher N and higher effect sizes, by putting more weight on strong rather than just-significant p-values, while recognizing that sometimes a true, and even strong, effect is represented in a multi-study sequence by a non-significant p-value. So…

2. I do like smarter use of p-values. I want people to look at research with a full awareness of just how much p-values vary even when testing a true effect with good power.

This means not playing the game of getting each study individually significant, and being able to present a set of studies that represent overall good evidence for an effect regardless of what their individual p-values are.

If I could be sure that a multi-study paper contained all studies run on that topic, including real evidence on why some of them were excluded because they were not good tests of the hypothesis, then I could express a standard of evidence in terms of some combined p-value using meta-analytic methods. I suggest p < .01, but my original impulse (and an opinion I received from someone I respect a lot soon after the editorial came out) was for p < .005. I suggest this very lightly because I don’t want it to crystallize into the new p < .05, N = 200, p-rep > .87 or whatever. It’s just based on the approximate joint probability of two results p < .10 in the same direction.
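For what it’s worth, here is the arithmetic I have in mind behind those two thresholds, i.e., the chance under the null of two independent results both reaching p < .10, and of their also agreeing in direction:

```python
# Null-hypothesis chance of two independent two-tailed results both at p < .10,
# and of their also pointing in the same direction (the signs agree half the time).
p_each = 0.10
both_below = p_each ** 2          # = .01, the suggested threshold
same_direction = both_below / 2   # = .005, the stricter original impulse
print(both_below, same_direction)
```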

If, if, if. There’s also the question of how many nonsignificant tests go into the file drawer behind any multi-study paper. Given our power versus our positive reporting rate and other indicators there’s no doubt that many do. The reactions that I have seen range from “hey, business as usual” to “not ideal but probably doesn’t hurt our conclusions much” to “not cool, let’s do better” to “selective reporting is the equivalent of fraud.” If I could snap my fingers and get the whole field together on one issue it would be this, naturally, ending up at my personally preferred position of “not cool.”

But until we have a much better worked-out comparison of all the different methods out there, it’s hard to know how effectively gatekeepers can deduce when a research report has less evidence than it claims, because of that file drawer. I’m not the only one who would rather have people feel free to be more complete a priori research reporters than have to write back with “Your article with four studies at p = .035, p = .04, p = .045 and p = .0501 (significant (sic)) looks, um, you know …” Erika Salomon has taken the first step but, clearly, the last word is far from written.

Anyway, one of the goals of this blog is to open up a debate and lay out both the ethical and the pragmatic issues about this most divisive of litmus tests in the field today – should we worry about the file drawer? So stay tuned.