The Lab-Wise File Drawer

The file drawer problem refers to a researcher failing to publish studies that do not get significant results in support of the main hypothesis. There is little doubt that file-drawering has been endemic in psychology. A number of tools have been proposed to analyze excessive rates of significant publication and suspicious patterns of significance (summarized and referenced in this online calculator). All the same, people disagree about how seriously the file-drawer distorts conclusions in psychology (e.g. Fabrigar & Wegener in JESP, 2016, with comment by Francis and reply). To my mind the greater threat of file-drawering noncompliant studies is ethical. It simply looks bad, has little justification in an age of supplemental materials that shatter the page count barrier, and violates the spirit of the APA Ethics Code.

But I’ll bet that all of us who have been doing research for any time have a far larger file drawer composed of whole lines of research that simply did not deliver significant or consistent results. The ethics here are less clear. Is it OK to bury our dead? Where would we publish whole failed lines of research?

Of course, some research topics are interesting no matter how they come out. As I’ve argued elsewhere, focusing on questions like these would remove a lot of uncertainty from careers and undercut the careerist motivation for academic fraud. But would even the most extreme advocate of open science reform back a hard requirement for a completely transparent lab?

The lab-wise file drawer rate — if you will, the circular file drawer — could explain part or all of publication bias. Statistical power could lag behind the near-unanimous significance of reported results due to whole lines of research being file-drawered. The surviving lines of research, even if they openly report all studies run, would then look better than chance would have it. I ran some admittedly simplistic simulations (Google Drive spreadsheet) to check out how serious the circular file drawer can get.

circular-file-drawer
File, circular, quantity 1

Our simulated lab has eight lines of research going on, testing independent hypotheses with multiple studies. Each study has 50% power to find a significant result given the match between its methods and the actual, population effect size out there. You may think this is low, but keep in mind that even if you build in 80 or 90% power to find, say, a medium sized effect, the actual effect may be small or practically nil, reducing your power post-hoc.

The lab also follows rules for conserving resources while conforming to the standard of evidence that many journals follow. For each idea, they run up to three studies, and they don’t publish a line of research in which two studies fail to get a significant result. They also stop a line short after two studies if they either fail to replicate a promising first study or get two failures in a row. In this 50% power lab where each study’s success is a coin flip, this means that out of eight lines of research, they will only end up trying to publish three. Sound familiar?

Remember, all these examples assume that the lab reports even non-significant studies from lines of research that “succeed” by this standard. There is no topic-wise file drawer, only a lab-wise one.

In our example lab that runs at a consistent 50% power, at the spreadsheet’s top left, the results of this practice look pretty eye-opening. Even though not all reported results are significant, the 77.8% that are significant still exceed the 50% power of the studies. This leads to an R-index of 22, which has been described as a typical result when reporting bias is applied to a nonexistent effect (Schimmack, 2016).
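For readers without the spreadsheet, here is a minimal Python sketch of the same arithmetic. It enumerates every possible sequence of study outcomes under the stopping rule described above. The function names are mine, and I am assuming the spreadsheet computes the R-index by plugging in the true power where Schimmack's formula uses median observed power; that assumption reproduces the 22 reported here, but treat it as a reconstruction rather than the spreadsheet itself.

```python
from itertools import product


def run_line(outcomes):
    """Apply the lab's stopping rule to three potential study outcomes.

    Returns (studies actually run, whether the line is published).
    True means a significant result, False a nonsignificant one.
    """
    first, second, third = outcomes
    # The line stops after two studies if the second one fails: either a
    # failed replication (S, F) or two failures in a row (F, F).
    if not second:
        return [first, second], False
    run = [first, second, third]
    # A three-study line is published only with fewer than two failures.
    return run, run.count(False) < 2


def lab_summary(power):
    """Exact publication rate, reported success rate, and R-index."""
    publish_prob = 0.0
    sig_studies = 0.0
    all_studies = 0.0
    for outcomes in product([True, False], repeat=3):
        prob = 1.0
        for hit in outcomes:
            prob *= power if hit else 1 - power
        run, published = run_line(outcomes)
        if published:
            publish_prob += prob
            sig_studies += prob * run.count(True)
            all_studies += prob * len(run)
    success_rate = sig_studies / all_studies
    # Assumption: R-index = power - (success rate - power), using true power.
    r_index = 100 * (2 * power - success_rate)
    return publish_prob, success_rate, r_index


for power in (0.5, 0.8):
    pub, rate, r = lab_summary(power)
    print(f"power={power:.0%}: publish {pub:.1%} of lines, "
          f"{rate:.1%} of reported studies significant, R-index {r:.0f}")
```

Only three sequences survive to publication (hit-hit-hit, hit-hit-miss, and miss-hit-hit), which is where the three-in-eight figure for the 50% power lab comes from.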

labwise1

Following the spreadsheet down, we see minimal effects of adopting slightly different rules that are more or less conservative in abandoning a research line after failure. These rules change the workload by only about one study, more or less, for every four topics, and the R-indices from these analyses are still problematic.

Following the spreadsheet to the right, we see stronger benefits of carrying out more strongly powered research — which includes studying effects that are more strongly represented in the population to begin with. At 80% power, most research lines yield three significant studies, and the R-index becomes a healthy 71.

labwise2

The next block to the right assumes only 5% power – a figure that breaks the assumptions of the R index. This represents a lab that is going after an effect that doesn’t exist, so tests will only be significant at the 5% type I error rate. Each of the research rules is very effective in limiting exposure to completely false conclusions, with only one in hundreds of false hypotheses making it to publication.

Before drawing too many conclusions about the 50% power example, however, it is important to question one of its assumptions. If all studies run in a lab have a uniform 50% power, and use similar methods, then all the hypotheses are true, with the same population effect size. Thus, variability in the significance of studies cannot reflect (nonexistent) variability in the truth value of hypotheses.

To reflect reality more closely, we need a model like the one I present at the very far right. A lab uses similar methods across the board to study a variety of hypotheses: one third have strong population support (so the lab’s standard method yields 80% power), one third have weaker population support (so, 50% power), and one third are just not true at all (so, 5% power). This gives the lab’s publication choices a chance to represent meaningful differences in reality, not just random variance in sampling.

What happens here?

labwise3

This lab, as expected, flushes out almost all of its tests of nonexistent effects, and finds relatively more success with research lines that have high power based on a strong effect, versus low power based on a weaker effect. As a result, inflation of published findings is still appreciable, but less problematic than if each study is done at a uniform 50% power.
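The hybrid lab can be sketched the same way. The code below is my own reconstruction, not the spreadsheet: it assumes the main stopping rule from earlier in the post (publish only hit-hit-hit, hit-hit-miss, or miss-hit-hit), equal numbers of lines at each power level, and "inflation" defined as the reported success rate minus the average true power of the studies that get reported. The function names are hypothetical.

```python
def published_line_stats(power):
    """Per line started: (probability of publishing, expected significant
    studies that end up in print)."""
    three_hits = power ** 3
    two_hits = 2 * power ** 2 * (1 - power)  # hit-hit-miss and miss-hit-hit
    publish = three_hits + two_hits
    significant = 3 * three_hits + 2 * two_hits
    return publish, significant


def lab_report(powers):
    """Summarize published output for a lab with equally many lines at
    each of the given power levels."""
    publish = {p: published_line_stats(p)[0] for p in powers}
    significant = sum(published_line_stats(p)[1] for p in powers)
    total_published = sum(publish.values())
    reported = 3 * total_published  # every published line reports 3 studies
    success_rate = significant / reported
    # Average true power across the studies that actually get reported.
    mean_power = sum(p * 3 * publish[p] for p in powers) / reported
    shares = {p: publish[p] / total_published for p in powers}
    return success_rate, mean_power, shares


for label, powers in [("hybrid", (0.8, 0.5, 0.05)), ("uniform 50%", (0.5,))]:
    rate, mean_power, shares = lab_report(powers)
    print(f"{label}: success rate {rate:.1%}, mean power of reported "
          f"studies {mean_power:.1%}, inflation {rate - mean_power:.1%}")
    for p, share in shares.items():
        print(f"  lines at {p:.0%} power make up {share:.1%} of publications")
```

The comparison of interest is the inflation figure for the hybrid lab against the uniform 50% lab, along with how small a share of the published lines comes from the hypotheses that are not true at all.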

To sum up:

  1. A typical lab might expect to have its published results show inflation above their power levels even if it commits to reporting all studies relevant to each separate topic it publishes on.
  2. This is because of sampling randomness in which topics are “survivors.” The more a lab can reduce this random factor — by running high-powered research, for example — the less lab-wise selection creates inflation in reported results.
  3. A lab’s file drawer creates the most inflation when it tries to create uniform conditions of low power — for example, trying to economize by studying strong effects using weak methods and weak effects using strong methods, so that a uniform post-hoc power of 50% is reached (as in the first example). It may be better to let weakly supported hypotheses wither away (as in the hybrid lab example).

And three additional observations:

  1. These problems vanish in a world where all results are publishable because research is done and evaluated in a way that reflects confidence in even null results (for example, the world of reviewed pre-registration). Psychology is a long way from that world, though.
  2. The labwise file drawer adds another degree of uncertainty when trying to use post-hoc credibility analyses to assess the existence and extent of publication bias. Some of that publication bias may come from the published topic being simply more lucky than other topics in the lab.
  3. If people are going to disagree on the implications, a lot of it will hinge on whether it is ethical to not report failed lines of research. Those who think it’s OK will see a further reason not to condemn existing research, because part of the inflation used as evidence for publication bias could be due to this OK practice. Those who think it’s not OK will press for even more complete reporting requirements, bringing lab psychology more in line with practices in other lab sciences (see illustration).

    lab_notebook_example2
    Being trusted is a privilege.

APA Ethics vs. the File Drawer

These days, authors and editors often complain about a lack of clear, top-down guidance on the ethics of the file-drawer. For many years in psychology, it was considered OK to refrain from reporting studies in a line of research with nonsignificant key results. This may sound bad to your third-grader, to Aunt Millie, or to Representative Sanchez. But almost everyone did it.

The rationales have looked a lot like Bandura’s inventory of moral disengagement strategies (pdf): “this was just a pilot study” (euphemistic labeling), “there must be something wrong with the methods, unlike these studies that worked” (distortion of consequences — unless you can point to evidence the methods failed, independently of the results),  “at least it’s not fabrication” (advantageous comparison), and of course, “we are doing everyone a favor, nobody wants to read boring nonsignificant results” (moral justification).

Bandura would probably classify “journals won’t accept anything with a nonsignificant result” as displacement of responsibility, too. But I see journals as just the right and responsible place to set standards for authors to follow. So, as Editor-in-Chief, I’ve let it be known that JESP is open to nonsignificant study results, either as part of sampling variation in a larger pattern of evidence, or telling a convincing null story thanks to solid methods.

That’s the positive task, but the negative task is harder. How do we judge how much a body of past research, or a manuscript submitted today, suffers from publication bias? What is the effect of publication bias on conclusions to be drawn from the literature? These are pragmatic questions. There’s also ethics: whether, going forward, we should treat selective publication based only on results as wrong.

Uli Schimmack likens selective publication and analysis to doping. But if so, we’re in the 50-year period in the middle of the 20th century when, slowly and piecemeal, various athletic authorities were taking first steps to regulate performance-enhancing drugs. A British soccer player buzzed on benzedrine in 1962 was not acting unethically by the regulations of his professional body. Imagine referees being left to decide at each match whether a player’s performance is “too good to be true” without clear regulations from the professional body. This is the position of the journal editor today.

Or is it? I haven’t seen much awareness that the American Psychological Association’s Publication Manual, in its 5th (2001) and 6th (2010) editions, quietly puts forward an ethical standard relevant to selective publication. Here’s the 6th edition, p. 12. The 5th edition’s language is very similar.

apa4

Note that this is an interpretation of a section in the Ethics Code that does not directly mention omission of results. You could search the Ethics Code without finding any mention of selective publication, which is probably why this section is little known. Here’s 5.01a below.

apa6

Also getting in the way of a clear message is the Publication Manual’s terse language. “Observations” could, I suppose, be narrowly interpreted to mean dropping participants ad hoc from a single study just to improve the outcome. If you interpret “observations” more broadly (and reasonably) to mean “studies,” there is still the question of what studies a given report should contain, in a lab where multiple lines of research are going on in parallel. There is room to hide failed studies, perhaps, in the gap between lines.

But I don’t think we should be trying to reverse-engineer a policy out of such a short description. See it for what it is: a statement of the spirit of the law, rather than the letter. Even if you don’t think you’re being “deceptive or fraudulent,” just trying to clarify the message out of kindness to your reader, the Publication Manual warns against the impulse “to present a more convincing story.” There can be good reasons for modifying and omitting evidence in order to present the truth faithfully. But these need to be considered independent of the study’s failure or success in supporting the hypothesis.

One last numbered paragraph. This is the relevant section of the Ethical Principles (not the Publication Manual) that authors have to sign off on when they submit a manuscript to an APA journal.

apa7

What would be the implications if the APA’s submission form instead used the exact language and interpretation of 5.01a from its own most recent Publication Manual? Explosive, I think. Using the APA’s own official language, it would lay down an ethical standard for research reporting far beyond any of the within-study reporting measures I know about in any journal of psychology. It would go beyond p-curving, R-indexing and “robustness” talk after the fact, and say out loud that file-drawering studies only because they’ve failed to support a hypothesis is unethical. Working out reasonable ways to define that standard would then be an urgent next step for the APA and all journals that subscribe to its publication ethics.

My JESP Inaugural Editorial

A short break from the editors’ forum questions: I’m not sure I’ve given proper publicity to my editorial, which according to the cellulose-based publication schedules is “coming out” in the “July issue” but is now accessible – openly, as far as I can tell. So here: publicity.

If you want the take-home summary about “the crisis” here it is, in bold, with a little commentary:

1. I would rather not set out hard standards for the number of participants in a study or condition. I don’t think our field has done the theory and method development to support those kinds of standards.

I would rather indirectly reward the choice to study both higher N and higher effect sizes, by putting more weight on strong rather than just-significant p-values, while recognizing that sometimes a true, and even strong, effect is represented in a multi-study sequence by a non-significant p-value. So…

2. I do like smarter use of p-values. I want people to look at research with a full awareness of just how much p-values vary even when testing a true effect with good power.

This means not playing the game of getting each study individually significant, and being able to present a set of studies that represent overall good evidence for an effect regardless of what their individual p-values are.

If I could be sure that a multi-study paper contained all studies run on that topic, including real evidence on why some of them were excluded because they were not good tests of the hypothesis, then I could express a standard of evidence in terms of some combined p-value using meta-analytic methods. I suggest p < .01, but my original impulse (and an opinion I received from someone I respect a lot soon after the editorial came out) was for p < .005. I suggest this very lightly because I don’t want it to crystallize into the new p < .05, N = 200, p-rep > .87 or whatever. It’s just based on the approximate joint probability of two results p < .10 in the same direction.
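To make "some combined p-value" concrete, here is a small sketch using Stouffer's unweighted z method. That choice is my assumption for illustration; the editorial does not commit to a particular meta-analytic method, and the p-values in the second example are invented. Inputs are one-tailed p-values in the predicted direction, so a two-tailed p of .10 in the right direction enters as .05.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(one_tailed_p_values):
    """Combine one-tailed p-values (all tested in the predicted direction)
    into a single one-tailed p-value with Stouffer's unweighted z method."""
    norm = NormalDist()
    # Convert each p-value to a z-score; a smaller p gives a larger z.
    z_scores = [norm.inv_cdf(1 - p) for p in one_tailed_p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1 - norm.cdf(combined_z)


# Two studies, each just at two-tailed p = .10 in the predicted direction.
print(f"{stouffer_combined_p([0.05, 0.05]):.4f}")

# A hypothetical three-study paper in which one study misses significance.
print(f"{stouffer_combined_p([0.03, 0.20, 0.04]):.4f}")
```

The point of the second call is the one made above: a set of studies can add up to good overall evidence even when one of them, taken alone, is not significant. This only holds if the set really contains every study run on the topic.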

If, if, if. There’s also the question of how many nonsignificant tests go into the file drawer behind any multi-study paper. Given our power versus our positive reporting rate and other indicators there’s no doubt that many do. The reactions that I have seen range from “hey, business as usual” to “not ideal but probably doesn’t hurt our conclusions much” to “not cool, let’s do better” to “selective reporting is the equivalent of fraud.” If I could snap my fingers and get the whole field together on one issue it would be this, naturally, ending up at my personally preferred position of “not cool.”

But until we have a much better worked out comparison of all the different methods out there, it’s hard to know how effectively gatekeepers can deduce when a research report has less evidence than it claims, because of that file drawer. I’m not the only one who would rather have people feel free to be more complete a priori research reporters, than have to write back with “Your article with four studies at p = .035, p = .04, p = .045 and p = .0501 (significant (sic)) looks, um, you know …”  Erika Salomon has taken the first step but, clearly, the last word is far from written.

Anyway, one of the goals of this blog is to open up a debate and lay out both the ethical and the pragmatic issues about this most divisive of litmus tests in the field today – should we worry about the file drawer? So stay tuned.