New Policies at JESP for 2018: The Why and How

As we found out starting in 2016, improving the scientific quality of articles in JESP, by requesting things such as disclosure statements or effect size reporting, means more work. More work for triaging editors, to check that the required elements are there; and also for authors, if they get something wrong and have to revise their submission. To keep this work manageable, changes in publication practices at JESP proceed in increments. And with a new year, the next increment is here (announcement).

Briefly, from January 1, 2018 we will:

  • No longer accept new FlashReports as an article format.
  • Begin accepting Registered Reports, or pre-registrations of Introduction sections and research plans prior to data collection which are sent to peer review, and are accepted, revised or rejected prior to knowing the study’s results. These may be replication or original studies. This format also includes articles where one or more studies are reported normally, and only the final study is in Registered Report format.
  • In all articles, require any hypothesis-critical inferential analyses to be accompanied by a sensitivity power analysis at 80%, that is, an estimate of the minimum effect size that the method can detect with 80% power. Sensitivity analysis is one of the options in the freely available software G*Power (manual, pdf). It should be reported together with any assumptions necessary to derive power; for example, the assumed correlation between repeated measures.
  • Require that any mediation analyses either: explain the logic of the causal model that mediation assumes (for example, why the mediator is assumed to be causally prior to the outcome), or present the mediation cautiously, as a test of only one possible causal model.
  • Begin crediting the handling editor in published articles.

Ring out the Flash, ring in the Registered

It should come as no surprise that the journal is adopting Registered Reports, given my previous support of pre-registration (editorial, article). It may require more explanation why FlashReports are being discontinued.

As of January 2016 we emphasized that these short, 2,500-word reports of research should not be seen as an easy way to get tentative evidence published, but rather as an appropriate outlet for briefly reported papers that meet the same standards for evidence and theory development as the rest of the journal. Our reviewing and handling of FlashReports in the past two years has followed this model, and we also managed to handle FlashReports more speedily (by 3-4 weeks on average) than other article types.

With these changes, though, came doubts about the purpose of having a super-short format at all. It’s hard to escape the suspicion that short reports became popular in the 2000s out of embarrassment at the lag in publishing psychology articles on timely events. For example, after 9/11, some of the immediate follow-up research was still coming out in 2006 and later. Indeed, previous guidelines for FlashReports did refer to research on significant events.

But social psychology, unlike other disciplines (such as political science, which faces its own debates about timely versus accurate publishing), does not study historical events per se. Instead, they are examples that can illustrate deeper truths about psychological processes. Indeed, many FlashReports received in the past year dealt with the 2016 election or the Trump phenomenon, but didn’t go deep enough into those timely topics to engage with psychological theory in a generalizable way.

Few problems can be solved by a social psychologist racing to the scene. Our discipline has its impact on a longer, larger scale. We inform expert testimony, public understanding, and intervention in future cases. With this view, it becomes less important to get topical research out quickly, and more important to make sure the conclusions from it are correct, fully reported, replicable, and generalizable. While we still strive for timely handling of all articles, it makes less sense to promote a special “express lane” for short articles, and more sense to encourage all authors to report their reasoning, method, and analysis fully.

Sense and sensitivity

Over the past year, the Editors and I have noticed more articles that include a justification of sample size, even though it is not formally requested by our guidelines. This prompted us to reconsider our own requirements.

Many of us saw these sample size justifications as unsatisfactory on a number of counts. Sometimes, they were simply based on “traditional” numbers of participants (or worse yet, on the optimistic n=20 in Simmons et al., 2011, which the authors have decisively recanted). When based on a priori power analyses, the target effect size was often arrived at ad hoc, or through review of literatures known to be affected by publication bias. In research that tests a novel hypothesis, the effect size is simply not known. Attempts to benchmark using field-wide effect size estimates are unsatisfactory because methodology and strength of effects make effect sizes vary greatly from one research question to another (e.g. in Richard et al. 2003, the effect sizes of meta-analyses, themselves aggregating dozens if not hundreds of studies and methods, vary from near-zero to r = .5 and up). Authors also sometimes base their sample estimates on power analysis but fall short of those numbers in data collection, which shows the futility of reporting good intentions rather than good results. In 2016, we saw these weaknesses as reasons not to make a requirement out of power analyses.

And yet it helps to have some idea of the adequacy of statistical power. Power affects the likelihood that any given p-value represents a true vs. false positive (as demonstrated interactively here). More generally, caring about adequate power is part of the greater emphasis on method, rather than results, that I promoted in the 2016 editorial. A post-hoc power analysis, however, gives no new information above and beyond the exact p-value, which we now require; that is, a result with p = .05 always had about 50% power to detect the effect size actually found, 80% power always corresponds to p = .005 post-hoc, and so on.
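The one-to-one mapping between p-values and post-hoc power can be checked in a few lines. This is a minimal sketch that assumes a two-sided z-test (a normal approximation, not the exact test any particular study used); the function name `post_hoc_power` is made up for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def post_hoc_power(p_value: float, alpha: float = 0.05) -> float:
    """Observed ("post-hoc") power of a two-sided z-test, given only its p-value.

    Assumption: the observed effect is treated as if it were the true effect,
    which is exactly why post-hoc power adds nothing beyond the p-value.
    """
    z_observed = _STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| implied by p
    z_critical = _STD_NORMAL.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = .05
    # Chance of landing beyond either critical value if the true
    # standardized effect equals the observed one.
    upper_tail = 1 - _STD_NORMAL.cdf(z_critical - z_observed)
    lower_tail = _STD_NORMAL.cdf(-z_critical - z_observed)
    return upper_tail + lower_tail


for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<5}  post-hoc power = {post_hoc_power(p):.2f}")
```

Under this approximation, p = .05 comes out at roughly .50 and p = .005 at roughly .80, whatever the sample size or design.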

As suggested by incoming Associate Editor Dan Molden, and further developed in conversations with current editors Ursula Hess and Nick Rule, the most informative kind of power analysis in a report is the sensitivity analysis, in which you report the minimum effect size your experiment had 80% power to detect. Bluntly put, if you are an author, we don’t want to make decisions based on your good intentions (though you’re still welcome to report them), but rather, on the sensitivity of your actual experiment.
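As a rough illustration of what a sensitivity analysis reports, the sketch below computes the smallest standardized mean difference (Cohen's d) detectable with 80% power. It uses a normal approximation, so the numbers will come out slightly smaller than the exact noncentral-t values a program such as G*Power gives; the function names are hypothetical, and the paired version assumes equal variances across the two measures.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _required_z(power: float, alpha: float) -> float:
    """Sum of the critical z (two-sided) and the z for the target power."""
    return _STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(power)


def minimum_detectable_d(n_per_group: int, power: float = 0.80,
                         alpha: float = 0.05) -> float:
    """Smallest d detectable in a two-group between-subjects comparison."""
    return _required_z(power, alpha) * sqrt(2 / n_per_group)


def minimum_detectable_d_paired(n_pairs: int, correlation: float,
                                power: float = 0.80, alpha: float = 0.05) -> float:
    """Smallest d detectable in a repeated-measures comparison.

    The assumed correlation between the two measures has to be reported,
    because it changes the answer.
    """
    return _required_z(power, alpha) * sqrt(2 * (1 - correlation) / n_pairs)


for n in (20, 64, 100):
    print(f"n = {n:>3} per group: minimum detectable d = {minimum_detectable_d(n):.2f}")
print(f"64 pairs, r = .5: minimum detectable d = "
      f"{minimum_detectable_d_paired(64, 0.5):.2f}")
```

The point of reporting this number is that it depends only on the sample actually collected and the design, not on what the authors hoped to find.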

As before, there are no hard-and-fast guidelines on how powerful an experiment must be. The sensitivity of an experiment to reasonable effect sizes for social psychology will be taken into account, together with other indicators of methodological quality, and with some consideration of the difficulty of data collection. Our hope is that other journals will see the merit of shifting from reporting based on explaining intentions, to reporting based on the statistical facts of the experiment.

Mediation requires a causal model

The new policy on mediation is just a determination to start enforcing the warnings already leveled at the use of this statistical practice in my 2016 editorial. To quote:

As before, we see little value in mediation models in which the mediator is conceptually very similar to either the predictor or outcome. Additionally, good mediation models should have a methodological basis for the causal assumptions in each step; for example, when the predictor is a manipulation, the mediator a self-reported mental state, and the outcome a subsequent decision or observed behavior. Designs that do not meet these assumptions can still give valuable information about potential processes through correlation, partial correlation, and regression, but should not use causal language and should interpret indirect paths with caution. We reiterate that mediation is not the only way to address issues of process in an experimental design.

In line with this, mediation analyses for JESP now have to either justify the causal direction of each step in the indirect path(s) explicitly using theory and method arguments, or include a disclaimer that only one of several possible models is tested. This standard also follows the recommendations made in a forthcoming JESP methods paper by Fiedler, Harris & Schott, which shows that mediation analyses published in our field almost never mention the causal assumptions that, statistically, these methods require. Again, it is our hope that this policy and its explanation will inspire other editors to take a second look at the use of mediation, especially in studies where the sophistication it lends is more apparent than real.


Scientific Criticism: Personal by Nature, Civil by Choice

Scientific criticism of psychology research is good. Bullying is bad. Between these two truisms lie a world of cases and a sea of disagreement. Look at the different perspectives described and shown in Susan Dominus’ article on Amy Cuddy, and Daniel Engber’s take, and Simine Vazire’s, and Andrew Gelman’s.

Talking about a “tone problem,” as in previous years, hasn’t worked. Worrying vaguely about overall tone seems petty. Everyone justifies their own approach. Few are repenting. The question needs to be sharpened: why do critics say things that seem reasonable to them, and unacceptably personal to some audiences?

In principle, it’s easy to insist on avoiding the ad hominem: “tackle the ball, not the player.” In practice, scientific criticism and testing have many reasons to veer into gray areas that leave the referees confused about where the ball ends and the foot begins. Here are some of them.

1. Who to critique? I see triage, you see resentment.

Asking for data, replicating studies–these are time-consuming activities. Checking everyone’s research is impossible. There are far more people producing than critiquing research. Since the start of the movement there has been a quest, backstage and quiet, to come up with rules governing whom to train the telescope on. Until this quest returns an answer, choices are going to be subjective, ill-defined, seemingly unfair.

Criticism is going to focus on results that are highly publicized. One, because they come to mind easily. Two, because the critic sees a special duty to make sure high-profile results are justified. But to the criticized, this focus can seem unfair, maybe even motivated by resentment of their success (or Nietzsche’s ressentiment). Naturally, it’s unlikely that a critic who investigates only high-profile results will have the same level of recognition as the results’ authors.

I endorse this authentic Mustache Fred quote.

Criticism is also going to focus on results that are surprising and implausible. Indeed, 5-15 years ago, there was considerable overlap between being high-profile and being surprising. “Big behavioral results from small input tweaks” was a formula for success in journals that prized novelty. Whether or not you think it’s fair to go after those examples might relate to what you think of the prior probability that a hypothesis is true. That probability, by definition, is lower the more implausible the effect is.

  • People who think prior probabilities are necessary in order to interpret positive evidence will agree that extraordinary claims require extraordinary evidence. This includes Bayesians, but also frequentists who understand that a significant p-value by itself is not the chance that you have made a type I error; you need priors to interpret the p-value as a “truth statistic” (see Ioannidis, 2005 and this widget).
  • If you don’t get this point of statistics, you will probably lean toward seeing an equal-tolerance policy as more fair. Surely, all findings should be treated alike, no matter how supported they are by precedent and theory. By this standard, focusing on surprising findings looks like at least a scientific-theoretical bias, if not a personal one. You may wonder why the boring, obvious results don’t get the same level of scrutiny.
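The role of the prior can be made concrete with the standard positive-predictive-value calculation: the share of significant results that reflect true effects, given the prior probability of the hypothesis, the power, and alpha. This is the basic form of the argument in Ioannidis (2005), leaving out his bias term; the function name and the example priors are chosen only for illustration.

```python
def positive_predictive_value(prior: float, power: float,
                              alpha: float = 0.05) -> float:
    """Probability that a significant result reflects a true effect."""
    true_positives = prior * power          # true hypotheses that reach significance
    false_positives = (1 - prior) * alpha   # false hypotheses that reach significance
    return true_positives / (true_positives + false_positives)


# Illustrative priors: a plausible hypothesis versus a surprising one.
for label, prior in (("plausible (prior = .50)", 0.50),
                     ("surprising (prior = .10)", 0.10)):
    for power in (0.80, 0.50):
        ppv = positive_predictive_value(prior, power)
        print(f"{label}, power = {power:.2f}: PPV = {ppv:.2f}")
```

With the same p < .05, the surprising hypothesis tested at 50% power is true only about half the time, which is the statistical case for scrutinizing implausible findings more closely.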

2. Are your results your career? Maybe they shouldn’t be, but they are.

In addressing bad incentives in science, I sometimes hear that the success of a scientist’s career should not depend on whether their research gives hypothesis-confirming results. But this has never been the case in science and is unlikely ever to be. If a scientist has a wrong idea that is later shot down by more evidence, their stature is diminished. After all, if being right is not important to the personal reputation of a scientist, what incentive is there to be right?

Let’s not be naive. Science careers have been increasingly precarious and competitive, making it hard for a scientist to “merely live” with null or nullified results. Tenure, grant and hiring standards at the highest levels make it necessary for a scientist to be personally identified with a substantial, positive, and influential line of research. Criticism that leads to doubt about published work is a career threat, like it or not. Under these conditions, the scientific is inherently personal.

3. What to do in the face of impunity?

When impersonal, procedural justice fails to deliver results, there is a strong pull to deliver personal, frontier justice. Now when it comes to scientific error in psychology, or worse yet misconduct, who exactly is the law? Universities are often vested in maintaining the reputations of their good grant-getters (but, as with free speech issues, happy to throw temporary faculty under the bus). Not all researchers belong to professional organizations that enforce an ethics code. Funders take some interest in research misconduct, but not all research is funded. Journal editors have leverage with retractions, corrections, and expressions of concern, but this is itself a form of sheriff law, with very different results depending on the editor’s opinion, and very few investigative powers of its own.

Having to rely on sheriff law, mob law, or vigilante law, frankly sucks. There’s very little stopping this kind of justice from being capricious, unaccountable, unjust. Batman, Dirty Harry and the Bad Lieutenant are all in the same game. Historically, sociologically, what makes this kind of justice go away is the establishment of reliable, accountable institutions that police integrity. Institutions that are empowered not just to find smoking guns pointing to misconduct, but to close a raftload of loopholes that allow evidence to be concealed or destroyed with impunity.

Until that happens, justice is stuck in the Middle Ages, with judicial duels involving vested persons rather than impartial institutions. With nobody responsible for bringing clear charges, accusations confuse sloppiness, overstatement, and outright fraud, just as Batman punishes serial killers and truant kids alike. Even posse members get the blues about this, sometimes.

Snyder & Capullo, Batman 51

Criticize, with civility

If criticism in today’s science, for all these reasons, is going to have personal resonance, the rule “avoid anything that seems like a personal attack” is not going to cut it.

(I know some of you think it is wrong to put any community standards on what can be said. This discussion is not for you, and we are all now free to call you someone who really gets off on being an asshole. Shh! Don’t complain.)

For the rest of us, I propose that civil discourse means we treat other scientists as people of good character who are gravely mistaken. Even when we suspect that they are not people of good character. Especially when we suspect this; otherwise we are merely being nice to people we like.

Assuming good character means, concretely, this:

  • Confine your public observations to judging their words and behaviors.
  • Don’t try to get inside their head; avoid claims like “He must really want to tear down science” or “She just loves to pull off these tricks.”
  • Don’t try to paint their character with a label. People are uncannily good at making inferences from actions. “Con artist,” “shady,” “zealot,” all go over the top. They inflame, when your evidence should be enough.
  • Do all this even when you are itching to land them a zinger, even when you want everyone to rise up in arms behind your call. You have great freedom in highlighting and arranging the facts; that freedom can be challenged, and answered rationally. Let the case speak for itself.

The Lab-Wise File Drawer

The file drawer problem refers to a researcher failing to publish studies that do not get significant results in support of the main hypothesis. There is little doubt that file-drawering has been endemic in psychology. A number of tools have been proposed to analyze excessive rates of significant publication and suspicious patterns of significance (summarized and referenced in this online calculator). All the same, people disagree about how seriously the file-drawer distorts conclusions in psychology (e.g. Fabrigar & Wegener in JESP, 2016, with comment by Francis and reply). To my mind the greater threat of file-drawering noncompliant studies is ethical. It simply looks bad, has little justification in an age of supplemental materials that shatter the page count barrier, and violates the spirit of the APA Ethics Code.

But I’ll bet that all of us who have been doing research for any time have a far larger file drawer composed of whole lines of research that simply did not deliver significant or consistent results. The ethics here are less clear. Is it OK to bury our dead? Where would we publish whole failed lines of research?

Of course, some research topics are interesting no matter how they come out. As I’ve argued elsewhere, focusing on questions like these would remove a lot of uncertainty from careers and undercut the careerist motivation for academic fraud. But would even the most extreme advocate of open science reform back a hard requirement for a completely transparent lab?

The lab-wise file drawer rate — if you will, the circular file drawer — could explain part or all of publication bias. Statistical power could lag behind the near-unanimous significance of reported results due to whole lines of research being file-drawered. The surviving lines of research, even if they openly report all studies run, would then look better than chance would have it. I ran some admittedly simplistic simulations (Google Drive spreadsheet) to check out how serious the circular file drawer can get.

File, circular, quantity 1


Our simulated lab has eight lines of research going on, testing independent hypotheses with multiple studies. Each study has 50% power to find a significant result given the match between its methods and the actual, population effect size out there. You may think this is low, but keep in mind that even if you build in 80 or 90% power to find, say, a medium sized effect, the actual effect may be small or practically nil, reducing your power post-hoc.

The lab also follows rules for conserving resources while conforming to the standard of evidence that many journals follow. For each idea, they run up to three studies. If two studies fail to get a significant result, they don’t publish the research. They also stop the research short if either they fail to replicate a promising first study, or they try two studies, both of which fail. In this 50% power lab where each study’s success is a coin flip, this means that out of eight lines of research, they will only end up trying to publish three. Sounds familiar?

Remember, all these examples  assume that the lab reports even non-significant studies from lines of research that “succeed” by this standard. There is no topic-wise file drawer — only a lab-wise one.

In our example lab that runs at a consistent 50% power, at the spreadsheet’s top left, the results of this practice look pretty eye-opening. Even though not all reported results are significant, the 77.8% that are significant still exceed the 50% power of the studies. This leads to an R-index of 22, which has been described as a typical result when reporting bias is applied to a nonexistent effect (Schimmack, 2016).
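The numbers for this lab can be reproduced by enumerating every possible sequence of study outcomes under the stopping rule described above. This is a sketch of that logic rather than the spreadsheet itself. It assumes the R-index is computed as power minus inflation, with the true power standing in for median observed power; the function names are invented for the example.

```python
from itertools import product


def line_outcome(results):
    """Apply the lab's stopping rule to three potential study results.

    `results` holds booleans (True = significant). Returns the studies
    actually run and whether the research line gets published.
    """
    first, second, third = results
    if not second:
        # Failed replication of a promising first study, or two failures in a row.
        return (first, second), False
    run = (first, second, third)
    return run, sum(run) >= 2  # publish unless two of the three studies failed


def lab_summary(power):
    """Exact publication rate, significance rate in print, and R-index."""
    p_publish = published_studies = published_significant = 0.0
    for results in product((True, False), repeat=3):
        prob = 1.0
        for significant in results:
            prob *= power if significant else 1 - power
        run, published = line_outcome(results)
        if published:
            p_publish += prob
            published_studies += prob * len(run)
            published_significant += prob * sum(run)
    success_rate = published_significant / published_studies
    inflation = success_rate - power
    r_index = 100 * (power - inflation)
    return p_publish, success_rate, r_index


for power in (0.50, 0.80, 0.05):
    p_publish, success_rate, r_index = lab_summary(power)
    print(f"power {power:.2f}: publishes {p_publish:.4f} of lines, "
          f"{100 * success_rate:.1f}% of published studies significant, "
          f"R-index {r_index:.1f}")
```

At 50% power this gives 3 of 8 lines published, 77.8% of published studies significant, and an R-index of about 22. At 80% power the R-index is about 71, and at 5% power roughly one line in 200 reaches publication.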


Following the spreadsheet down, we see minimal effects of adopting slightly different rules that are more or less conservative in abandoning a research line after failure. They add or save only about one study for every four topics, and the R-indices from these analyses are still problematic.

Following the spreadsheet to the right, we see stronger benefits of carrying out more strongly powered research — which includes studying effects that are more strongly represented in the population to begin with. At 80% power, most research lines yield three significant studies, and the R-index becomes a healthy 71.


The next block to the right assumes only 5% power – a figure that breaks the assumptions of the R-index. This represents a lab that is going after an effect that doesn’t exist, so tests will only be significant at the 5% type I error rate. Each of the research rules is very effective in limiting exposure to completely false conclusions, with only one in hundreds of false hypotheses making it to publication.

Before drawing too many conclusions about the 50% power example, however, it is important to question one of its assumptions. If all studies run in a lab have a uniform 50% power, and use similar methods, then all the hypotheses are true, with the same population effect size. Thus, variability in the significance of studies cannot reflect (nonexistent) variability in the truth value of hypotheses.

To reflect reality more closely, we need a model like the one I present at the very far right. A lab uses similar methods across the board to study a variety of hypotheses: 1/3 hypotheses with strong population support (so their standard method yields 80% power), 1/3 with weaker population support (so, 50% power), and 1/3 hypotheses that are just not true at all (so, 5% power). This gives the lab’s publication choices a chance to represent meaningful differences in reality, not just random variance in sampling.

What happens here?


This lab, as expected, flushes out almost all of its tests of nonexistent effects, and finds relatively more success with research lines that have high power based on a strong effect, versus low power based on a weaker effect. As a result, inflation of published findings is still appreciable, but less problematic than if each study is done at a uniform 50% power.
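A rough sketch of the hybrid lab follows, using closed-form probabilities for the same three-study rule (the publishable sequences are significant-significant-anything and fail-significant-significant). It assumes equal thirds of hypotheses at each power level, and it measures inflation against the average true power of the studies that end up in print, which is one reasonable reading and not necessarily how the spreadsheet computes it.

```python
def publish_probability(power):
    """Chance a research line is published: sequences SSS, SSF, or FSS."""
    return power ** 2 * (2 - power)


def expected_significant(power):
    """Expected significant studies in print per line started.

    SSS contributes 3 significant studies; SSF and FSS contribute 2 each.
    """
    return 3 * power ** 3 + 4 * power ** 2 * (1 - power)


def hybrid_lab(powers=(0.80, 0.50, 0.05)):
    """Equal shares of hypotheses at each power level (an assumption)."""
    publish = {p: publish_probability(p) for p in powers}
    studies = sum(3 * publish[p] for p in powers)       # published lines report 3 studies
    significant = sum(expected_significant(p) for p in powers)
    success_rate = significant / studies
    mean_power = sum(p * 3 * publish[p] for p in powers) / studies
    return publish, success_rate, mean_power


publish, success_rate, mean_power = hybrid_lab()
for power, prob in publish.items():
    print(f"true power {power:.2f}: {100 * prob:.1f}% of lines published")
print(f"significant among published studies: {100 * success_rate:.1f}%")
print(f"mean true power of published studies: {100 * mean_power:.1f}%")
print(f"inflation: {100 * (success_rate - mean_power):.1f} points")
```

Under these assumptions the inflation comes to about 15 percentage points, compared with about 28 in the lab running everything at a uniform 50% power.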

To sum up:

  1. A typical lab might expect to have its published results show inflation above their power levels even if it commits to reporting all studies relevant to each separate topic it publishes on.
  2. This is because of sampling randomness in which topics are “survivors.” The more a lab can reduce this random factor — by running high-powered research, for example — the less lab-wise selection creates inflation in reported results.
  3. A lab’s file drawer creates the most inflation when it tries to create uniform conditions of low power — for example, trying to economize by studying strong effects using weak methods and weak effects using strong methods, so that a uniform post-hoc power of 50% is reached (as in the first example). It may be better to let weakly supported hypotheses wither away (as in the hybrid lab example).

And three additional observations:

  1. These problems vanish in a world where all results are publishable because research is done and evaluated in a way that reflects confidence in even null results (for example, the world of reviewed pre-registration). Psychology is a long way from that world, though.
  2. The lab-wise file drawer adds another degree of uncertainty when trying to use post-hoc credibility analyses to assess the existence and extent of publication bias. Some of that publication bias may come from the published topic being simply more lucky than other topics in the lab.
  3. If people are going to disagree on the implications, a lot of it will hinge on whether it is ethical to not report failed lines of research. Those who think it’s OK will see a further reason not to condemn existing research, because part of the inflation used as evidence for publication bias could be due to this OK practice. Those who think it’s not OK will press for even more complete reporting requirements, bringing lab psychology more in line with practices in other lab sciences (see illustration).

    Being trusted is a privilege.



APA Ethics vs. the File Drawer

These days, authors and editors often complain about a lack of clear, top-down guidance on the ethics of the file-drawer. For many years in psychology, it was considered OK to refrain from reporting studies in a line of research with nonsignificant key results. This may sound bad to your third-grader, to Aunt Millie, or to Representative Sanchez. But almost everyone did it.

The rationales have looked a lot like Bandura’s inventory of moral disengagement strategies (pdf): “this was just a pilot study” (euphemistic labeling), “there must be something wrong with the methods, unlike these studies that worked” (distortion of consequences — unless you can point to evidence the methods failed, independently of the results),  “at least it’s not fabrication” (advantageous comparison), and of course, “we are doing everyone a favor, nobody wants to read boring nonsignificant results” (moral justification).

Bandura would probably classify “journals won’t accept anything with a nonsignificant result” as displacement of responsibility, too. But I see journals as just the right and responsible place to set standards for authors to follow. So, as Editor-in-Chief, I’ve let it be known that JESP is open to nonsignificant study results, either as part of sampling variation in a larger pattern of evidence, or telling a convincing null story thanks to solid methods.

That’s the positive task, but the negative task is harder. How do we judge how much a body of past research, or a manuscript submitted today, suffers from publication bias? What is the effect of publication bias on conclusions to be drawn from the literature? These are pragmatic questions. There’s also ethics: whether, going forward, we should treat selective publication based only on results as wrong.

Uli Schimmack likens selective publication and analysis to doping. But if so, we’re in the 50-year period in the middle of the 20th century when, slowly and piecemeal, various athletic authorities were taking first steps to regulate performance-enhancing drugs. A British soccer player buzzed on benzedrine in 1962 was not acting unethically by the regulations of his professional body. Imagine referees being left to decide at each match whether a player’s performance is “too good to be true” without clear regulations from the professional body. This is the position of the journal editor today.

Or is it? I haven’t seen much awareness that the American Psychological Association’s publication manuals, 5th (2003) and 6th (2010) edition, quietly put forward an ethical standard relevant to selective publication. Here’s the 6th edition, p. 12. The 5th edition’s language is very similar.


Note that this is an interpretation of a section in the Ethics Code that does not directly mention omission of results. You could search the Ethics Code without finding any mention of selective publication, which is probably why this section is little known. Here’s 5.01a below.


Also getting in the way of a clear message is the Publication Manual’s terse language. “Observations” could, I suppose, be narrowly interpreted to mean dropping participants ad hoc from a single study just to improve the outcome. If you interpret “observations” more broadly (and reasonably) to mean “studies,” there is still the question of what studies a given report should contain, in a lab where multiple lines of research are going on in parallel. There is room to hide failed studies, perhaps, in the gap between lines.

But I don’t think we should be trying to reverse-engineer a policy out of such a short description. See it for what it is: a statement of the spirit of the law, rather than the letter. Even if you don’t think you’re being “deceptive or fraudulent,” just trying to clarify the message out of kindness to your reader, the Publication Manual warns against the impulse “to present a more convincing story.” There can be good reasons for modifying and omitting evidence in order to present the truth faithfully. But these need to be considered independent of the study’s failure or success in supporting the hypothesis.

One last numbered paragraph. This is the relevant section of the Ethical Principles (not the Publication Manual) that authors have to sign off on when they submit a manuscript to an APA journal.


What would be the implications if the APA’s submission form instead used the exact language and interpretation of 5.01a from its own most recent Publication Manual? Explosive, I think. Using the APA’s own official language, it would lay down an ethical standard for research reporting far beyond any of the within-study reporting measures I know about in any journal of psychology. It would go beyond p-curving, R-indexing and “robustness” talk after the fact, and say out loud that file-drawering studies only because they’ve failed to support a hypothesis is unethical. Working out reasonable ways to define that standard would then be an urgent next step for the APA and all journals that subscribe to its publication ethics.

Quadratic Confounds, Linear Controls: A Doubt About That “Air Rage” Study

This past week it was hard to miss reading about the “air rage” study by Katherine DeCelles and Michael Norton, published in PNAS. They argued that economy-class passengers exposed to status inequalities when the flight had a first-class section were more likely to become belligerent or drunk — especially when the airplane boarded from the front and they had to walk right by those first-class 1%-ers in their sprawling leather seats. (article; sample news report).

Sounds compelling. But Andrew Gelman had doubts, based on some anomalies in the numbers and the general idea that it is easy to pick and choose analyses in an archival study. Indeed, for much research, one can level the critique that researchers are taking advantage of the “garden of forking paths”, but it’s harder to assess how much the analytic choices made could actually have influenced the conclusions.

This is where I found something interesting. In research that depends on multiple regression analysis, it’s important to compare zero-order and regression results. First, to see whether the basic effect is strong enough, so that the choices made for which variables to control for wouldn’t impact the results too much. Second, to see whether the conclusions being drawn from the regression line up with the zero-order interpretation of the effect that someone not familiar with regression would likely draw.

It’s already been remarked that the zero-order stats for the walk-by effect actually show a negative correlation with air rage; the positive relationship comes about only when you control for a host of factors. But the zero-order stats for the basic first-class air rage effect on economy are harder to get clear in the report. In the text, they are reported in raw, absolute numbers: over ten times as many air rage incidents occur on a yes/no basis in economy class when first class is present compared to when it is not. However, these raw numbers are naturally confounded by two factors that are highly correlated with the presence of a first class (see table S1 in the supplementary materials): length of flight and number of seats in economy. So you have to control for that in some way.

There was no data on actual number of passengers, but number of seats was used as a proxy for number of passengers, given that over 90% of flights are typically fully sold on that airline. In the article, number of seats and flight time are controlled for together, entered in the paper’s regression analysis on raw incident numbers per-flight. Not surprisingly, seats and time are correlated highly, nearly .80 (bigger planes fly longer routes), so including one after the other will not improve prediction much.

But most importantly, this approach doesn’t reflect that the effect of time and passenger numbers on behavioral base rates is multiplicative. That is, the raw amount of any kind of incident on a plane is a function of the number of people times the length of the flight, before any other factors like first class presence come into play. So what you need to do to model the rate of occurrence is introduce an interaction term multiplying those two numbers – not just enter the two as separate predictors.
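To make the multiplicative point concrete, here is one way it could be written down. This is only a sketch under my own assumptions, not the model in the paper: $Y_i$ is the number of incidents on flight $i$, $n_i$ the number of economy seats, $t_i$ the flight time in hours, $F_i$ an indicator for the presence of first class, and $\lambda$ a constant baseline rate per person per hour.

```latex
\[
\mathbb{E}[Y_i] \;=\; \lambda \, n_i \, t_i \, e^{\beta F_i}
\quad\Longleftrightarrow\quad
\log \mathbb{E}[Y_i] \;=\; \log \lambda \;+\; \log(n_i t_i) \;+\; \beta F_i
\]
```

In this form the product $n_i t_i$ acts as the exposure, and $\beta$ is the effect of first class on the rate per person per hour. An additive specification such as $b_1 n_i + b_2 t_i$ has no term for that product, which is the gap described above.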

Or to put it another way – the analysis they reported might give a much lower effect of first class presence if they had taken as their outcome variable the proportion of air rage incidents per seat, per flight hour, because the outcome is still confounded with the size/length of the flight if you just control for the two of them in parallel.

Still confused? OK, one more time, with bunnies and square fields.

[Figure: Rabbit fields]

Although the data are proprietary and confidential, I would really appreciate seeing an analysis either controlling for the interaction of time and seats, or dividing the incidents by the product of time and number of seats before they go in as DVs, to arrive at the reasonable outcome of incidents per person per hour.
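Here is a toy illustration of why the choice of outcome matters. Every number below is invented for the sake of the example (the real data are not public), and the names `FlightGroup`, `incidents_per_flight` and `incidents_per_seat_hour` are my own. The two groups are built to have exactly the same rate of incidents per seat per hour, yet the raw per-flight counts differ by a factor of ten.

```python
from dataclasses import dataclass


@dataclass
class FlightGroup:
    """A group of similar flights. All values used below are hypothetical."""
    label: str
    flights: int     # number of flights in the group
    seats: int       # economy seats per flight (proxy for passengers)
    hours: float     # flight duration in hours
    incidents: int   # total air rage incidents across the group


def incidents_per_flight(group: FlightGroup) -> float:
    """Raw outcome: incidents divided by the number of flights."""
    return group.incidents / group.flights


def incidents_per_seat_hour(group: FlightGroup) -> float:
    """Rate outcome: incidents divided by total exposure (flights x seats x hours)."""
    exposure = group.flights * group.seats * group.hours
    return group.incidents / exposure


# Invented groups: small short-haul planes without first class,
# large long-haul planes with first class.
no_first = FlightGroup("no first class", flights=1000, seats=100, hours=2.0, incidents=20)
with_first = FlightGroup("first class", flights=1000, seats=250, hours=8.0, incidents=200)

raw_ratio = incidents_per_flight(with_first) / incidents_per_flight(no_first)
rate_ratio = incidents_per_seat_hour(with_first) / incidents_per_seat_hour(no_first)

print(f"ratio of incidents per flight:    {raw_ratio:.2f}")
print(f"ratio of incidents per seat-hour: {rate_ratio:.2f}")
```

With these made-up inputs the entire raw difference comes from exposure. Whether anything like that holds in the actual data is exactly what the requested analysis would show.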

One other thing. Beyond their mere effects on base rates, long hours and many passengers may each affect raw numbers of air rage incidents in a faster-than-linear way (each with a quadratic function) by also leading to higher rates of incidents, even per person per hour, through higher numbers of potential interactions and flight fatigue. So ideally, if you’re keeping raw numbers as the DV, the quadratic as well as linear functions of passenger numbers and flight time need to be modeled.

So the bottom line here for me is not that the garden of forking paths is in play, but that the wrong path appears to have been taken. I look forward to any feedback on this – especially from the authors involved.

My JESP Inaugural Editorial

A short break from the editors’ forum questions: I’m not sure I’ve given proper publicity to my editorial, which according to the cellulose-based publication schedules is “coming out” in the “July issue” but is now accessible – openly, as far as I can tell. So here: publicity.

If you want the take-home summary about “the crisis” here it is, in bold, with a little commentary:

1. I would rather not set out hard standards for the number of participants in a study or condition. I don’t think our field has done the theory and method development to support those kinds of standards.

I would rather indirectly reward the choice to study both higher N and higher effect sizes, by putting more weight on strong rather than just-significant p-values, while recognizing that sometimes a true, and even strong, effect is represented in a multi-study sequence by a non-significant p-value. So…

2. I do like smarter use of p-values. I want people to look at research with a full awareness of just how much p-values vary even when testing a true effect with good power.

This means not playing the game of getting each study individually significant, and being able to present a set of studies that represent overall good evidence for an effect regardless of what their individual p-values are.

If I could be sure that a multi-study paper contained all studies run on that topic, including real evidence on why some of them were excluded because they were not good tests of the hypothesis, then I could express a standard of evidence in terms of some combined p-value using meta-analytic methods. I suggest a p < .01 but my original impulse (and an opinion I received from someone I respect a lot soon after the editorial came out) was for p < .005. I suggest this very lightly because I don’t want it to crystallize into the new p < .05, N = 200, p-rep > .87 or whatever. It’s just based on the approximate joint probability of two results p < .10 in the same direction.
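For anyone who wants to play with the idea of a combined p-value, here is a minimal sketch using Stouffer's unweighted Z method. That method is my choice for illustration only; the editorial does not prescribe one. It assumes the studies are independent, that they all test the same directional hypothesis, and that every study run is included. The function name `stouffer_combined_p` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(one_tailed_ps):
    """Combine independent one-tailed p-values with Stouffer's unweighted Z.

    Each p-value must lie strictly between 0 and 1 and be computed
    in the same predicted direction.
    """
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    # Convert each p-value to the z-score with that upper-tail area.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in one_tailed_ps]
    # The sum of k independent standard normals has sd sqrt(k).
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - std_normal.cdf(combined_z)


# Two studies at two-tailed p = .10 in the same direction
# correspond to one-tailed p = .05 each.
two_marginal = stouffer_combined_p([0.05, 0.05])

# A mixed set in which one study is individually non-significant.
mixed_set = stouffer_combined_p([0.01, 0.20, 0.04])

print(f"two marginal studies combined: {two_marginal:.4f}")
print(f"mixed set combined:            {mixed_set:.4f}")
```

Under this method, two results at two-tailed p = .10 in the same direction combine to a one-tailed p of roughly .01, and a set with an individually non-significant study can still clear that bar.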

If, if, if. There’s also the question of how many nonsignificant tests go into the file drawer behind any multi-study paper. Given our power versus our positive reporting rate and other indicators there’s no doubt that many do. The reactions that I have seen range from “hey, business as usual” to “not ideal but probably doesn’t hurt our conclusions much” to “not cool, let’s do better” to “selective reporting is the equivalent of fraud.” If I could snap my fingers and get the whole field together on one issue it would be this, naturally, ending up at my personally preferred position of “not cool.”

But until we have a much better worked out comparison of all the different methods out there, it’s hard to know how effectively gatekeepers can deduce when a research report has less evidence than it claims, because of that file drawer. I’m not the only one who would rather have people feel free to be more complete a priori research reporters, than have to write back with “Your article with four studies at p = .035, p = .04, p = .045 and p = .0501 (significant (sic)) looks, um, you know …”  Erika Salomon has taken the first step but, clearly, the last word is far from written.

Anyway, one of the goals of this blog is to open up a debate and lay out both the ethical and the pragmatic issues about this most divisive of litmus tests in the field today – should we worry about the file drawer? So stay tuned.

Editors’ Forum Supplemental: Improving Journal Performance

Here’s my individual answer to another question from the SPSP forum we never got around to answering.


This is referring to Tetlock’s work on identifying “superpredictors” and more generally improving performance within geopolitical prediction markets. In those studies, the target outcome is clear and binary: Will the Republicans or Democrats control the Senate after the next election? Will there be at least 10 deaths in warfare in the South China Sea in the next year? Here, Brent suggests that editorial decisions can be treated like predictions of a paper’s future citation count, which in turn feeds into most metrics that look at a journal’s impact or importance.

Indeed, prediction markets have been used as academic quality judgments in a number of areas: for example, the ponderous research quality exercises that we in the UK are subject to, or the Reproducibility Project: Psychology (I was one of the predictors in that one, though apparently not a superpredictor, because I only won 60 of the 100 available bucks). But the more relevant aspect of Tetlock’s research is the identification of what makes a superpredictor super. In a 2015 Perspectives article, the group lists a number of factors identified through research. Some of them are obvious at least in hindsight, like high cognitive ability and motivation. Others seem quite specific to the task of predicting geopolitical events, like unbiased counterfactual thinking.

There’s a reason, though, to be skeptical of maximizing the citation count of articles in a journal. [Edit: example not valid any more, thanks Mickey Inzlicht for pointing this out on Facebook!] If I had to guess, subjective journal prestige would probably be predicted best by a function that positively weights citation count and negatively weights topic generality. That is, more general outlets like Psych Science have more potential people who would cite them, independently of prestige within a field.

More fundamentally, trying to game citation metrics directly might be bad overall for scientific reporting. Admittedly, there is very little systematic research into what makes an article highly cited, especially within the kind of articles that any one journal might publish (for example, I’d expect theory/review papers to have a higher count than original research papers). But in trying to second-guess what kind of papers might drive up impact ratings, there is the danger of:

  • Overrating papers that strive for novelty in defining a paradigm, as opposed to doing important work to validate or extend a theory, including replication.
  • Overrating bold statements that are likely to be cited negatively (“What? They say that social cognition doesn’t exist? Ridiculous!”)
  • Even more cynically, trying to get authors to cite internally within a journal or institution to drive up metrics.  From what I have seen in a few different contexts, moves like this tend to be made with embarrassment and met with resistance.
  • Ignoring other measures of relevance beyond academic citations, like media coverage (and how to tell quality from quantity here? That’s a whole other post I’ve got in me.)

So really, any attempt to systematically improve the editorial process would have to grapple with a very complicated success metric whose full outcome may not be clear for years or decades. Given this, I’d rather focus on standards, and trust that they will be rewarded in the metrics over the long term.

But one last thing: It’s hard to ignore that methods papers, if directly relevant to research, seem to have a distinct advantage in citations. For example, in the top 10 JESP citations, three have to do with methods, a rate far higher than the overall percentage of methods papers in the journal. In Nature’s top 100 cited papers across all sciences, the six psychology/psychiatry articles that make the cut all have to do with methods – either statistics, or measurement development for commonly understood constructs such as handedness or depression. (Eagle eyes should notice that a lot of the rest are methods development in biology.) So, although I had other reasons for calling for more researcher-ready methods papers in my JESP editorial, I have to say that such useful content in a journal isn’t so bad for the citation count, either.