A short break from the editors’ forum questions: I’m not sure I’ve given proper publicity to my editorial, which according to the cellulose-based publication schedules is “coming out” in the “July issue” but is now accessible – openly, as far as I can tell. So here: publicity.
If you want the take-home summary about “the crisis” here it is, in bold, with a little commentary:
1. I would rather not set out hard standards for the number of participants in a study or condition. I don’t think our field has done the theory and method development to support that kind of standard.
I would rather indirectly reward the choice to study both higher N and higher effect sizes, by putting more weight on strong rather than just-significant p-values, while recognizing that sometimes a true, and even strong, effect is represented in a multi-study sequence by a non-significant p-value. So…
2. I do like smarter use of p-values. I want people to look at research with a full awareness of just how much p-values vary even when testing a true effect with good power.
This means not playing the game of getting each study individually significant, and instead being able to present a set of studies that together represent good evidence for an effect, regardless of their individual p-values.
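To make concrete just how much p-values bounce around even under a true effect with good power, here is a minimal simulation sketch. The specifics (a two-group z-test with known SD, a true effect of d = 0.5, and n = 64 per group, which works out to roughly 80% power) are my illustrative assumptions, not anything from the editorial.

```python
import math
import random

def two_group_pvalue(n, delta, rng):
    """Two-sided p from a z-test: n per group, true mean difference delta, SD = 1."""
    diff = (sum(rng.gauss(delta, 1) for _ in range(n)) / n
            - sum(rng.gauss(0, 1) for _ in range(n)) / n)
    z = diff / math.sqrt(2 / n)          # SE of the difference is sqrt(2/n)
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(42)
n, delta, reps = 64, 0.5, 10_000        # n = 64/group gives ~80% power for d = 0.5
ps = sorted(two_group_pvalue(n, delta, rng) for _ in range(reps))

power = sum(p < .05 for p in ps) / reps
print(f"share significant at .05: {power:.2f}")        # close to the nominal 80%
print(f"median p-value: {ps[reps // 2]:.4f}")
print(f"share of p-values above .05: {1 - power:.2f}") # ~1 in 5 "failed" studies
```

Even at a respectable 80% power, about one study in five testing a perfectly real effect comes out nonsignificant, and the significant ones are scattered across several orders of magnitude.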
If I could be sure that a multi-study paper contained all studies run on that topic, including real evidence on why some of them were excluded because they were not good tests of the hypothesis, then I could express a standard of evidence in terms of some combined p-value using meta-analytic methods. I suggest p < .01, but my original impulse (and an opinion I received from someone I respect a lot soon after the editorial came out) was for p < .005. I suggest this very lightly because I don’t want it to crystallize into the new p < .05, N = 200, p-rep > .87 or whatever. It’s just based on the approximate joint probability of two results p < .10 in the same direction.
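The arithmetic behind that .005, and one standard way of combining p-values meta-analytically (Fisher’s method), can be sketched in a few lines. The example p-values fed to the combiner below are hypothetical, purely for illustration.

```python
import math

# Two independent two-tailed results with p < .10 that agree in direction:
# each is p < .05 one-tailed in that direction, and the shared direction
# itself could go either way, so under the null the joint chance is about
joint = 2 * (0.10 / 2) ** 2
print(joint)  # → 0.005

def fisher_combined(pvals):
    """Fisher's method: -2 * sum(ln p) ~ chi-square with 2k degrees of freedom."""
    stat = -2 * sum(math.log(p) for p in pvals)
    k = len(pvals)
    # chi-square survival function has a closed form for even df = 2k
    half = stat / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Two hypothetical near-miss studies combine to comfortably beat .05:
print(round(fisher_combined([0.08, 0.06]), 4))  # → 0.0304
```

The point of the combined number is exactly the one above: two individually “marginal” results in the same direction are jointly much stronger evidence than either looks alone.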
If, if, if. There’s also the question of how many nonsignificant tests go into the file drawer behind any multi-study paper. Given our power versus our positive reporting rate, and other indicators, there’s no doubt that many do. The reactions that I have seen range from “hey, business as usual” to “not ideal but probably doesn’t hurt our conclusions much” to “not cool, let’s do better” to “selective reporting is the equivalent of fraud.” If I could snap my fingers and get the whole field together on one issue it would be this, naturally, ending up at my personally preferred position of “not cool.”
But until we have a much better worked-out comparison of all the different methods out there, it’s hard to know how effectively gatekeepers can deduce when a research report has less evidence than it claims, because of that file drawer. I’m not the only one who would rather have people feel free to be more complete a priori research reporters than have to write back with “Your article with four studies at p = .035, p = .04, p = .045 and p = .0501 (significant (sic)) looks, um, you know …” Erika Salomon has taken the first step, but clearly the last word is far from written.
Anyway, one of the goals of this blog is to open up a debate and lay out both the ethical and the pragmatic issues about this most divisive of litmus tests in the field today – should we worry about the file drawer? So stay tuned.