Quadratic Confounds, Linear Controls: A Doubt About That “Air Rage” Study

This past week it was hard to miss reading about the “air rage” study by Katherine DeCelles and Michael Norton, published in PNAS. They argued that economy-class passengers exposed to status inequalities when the flight had a first-class section were more likely to become belligerent or drunk — especially when the airplane boarded from the front and they had to walk right by those first-class 1%-ers in their sprawling leather seats. (article; sample news report).

Sounds compelling. But Andrew Gelman had doubts, based on some anomalies in the numbers and the general idea that it is easy to pick and choose analyses in an archival study. Indeed, for much research, one can level the critique that researchers are taking advantage of the “garden of forking paths”, but it’s harder to assess how much the analytic choices made could actually have influenced the conclusions.

This is where I found something interesting. In research that depends on multiple regression analysis, it’s important to compare zero-order and regression results. First, to see whether the basic effect is strong enough that the choice of which variables to control for couldn’t sway the results too much. Second, to see whether the conclusions being drawn from the regression line up with the zero-order interpretation of the effect that someone not familiar with regression would likely draw.

It’s already been remarked that the zero-order stats for the walk-by effect actually show a negative correlation with air rage; the positive relationship comes about only when you control for a host of factors. But the zero-order stats for the basic effect of first class on economy-class air rage are harder to extract from the report. In the text, they are given only as raw, absolute numbers: on a yes/no, per-flight basis, over ten times as many air rage incidents occur in economy class when a first-class cabin is present as when it is not. However, these raw numbers are naturally confounded by two factors that are highly correlated with the presence of a first class (see table S1 in the supplementary materials): length of flight and number of seats in economy. So you have to control for that in some way.

There was no data on the actual number of passengers, but number of seats was used as a proxy, given that over 90% of flights on that airline are typically fully sold. In the article, number of seats and flight time are controlled for together, entered side by side in the regression analysis on raw incident numbers per flight. Not surprisingly, seats and time are highly correlated, nearly .80 (bigger planes fly longer routes), so entering one after the other will not improve prediction much.

But most importantly, this approach doesn’t reflect that the effect of time and passenger numbers on behavioral base rates is multiplicative. That is, the raw count of any kind of incident on a plane is a function of the number of people times the length of the flight, before any other factors like first-class presence come into play. So to model the rate of occurrence you need to introduce an interaction term multiplying those two numbers – not just enter the two as separate predictors.

Or to put it another way – the analysis they reported might give a much lower effect of first class presence if they had taken as their outcome variable the proportion of air rage incidents per seat, per flight hour, because the outcome is still confounded with the size/length of the flight if you just control for the two of them in parallel.
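For concreteness, here is a minimal sketch (in Python, on invented flight-level data, since the real data are proprietary) of the two specifications side by side: controls entered in parallel on raw counts, versus a rate model that treats seats × hours as the exposure. The simulated flights have no built-in first-class effect at all, only the multiplicative exposure.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented flight-level data; purely a mechanical illustration of the two models.
rng = np.random.default_rng(2016)
n = 5000
seats = rng.integers(80, 300, n)                 # economy seats (proxy for passengers)
hours = 0.03 * seats + rng.uniform(0.5, 4.0, n)  # flight time, strongly correlated with seats
first_class = (seats > 180).astype(int)          # bigger planes carry a first-class cabin
# Incidents scale with people x time only; no first-class effect is built in.
incidents = rng.poisson(1e-4 * seats * hours)

flights = pd.DataFrame(dict(incidents=incidents, seats=seats,
                            hours=hours, first_class=first_class))

# Zero-order association, for comparison with the adjusted estimates
print(flights["incidents"].corr(flights["first_class"]))

# (a) Controls entered in parallel, on raw per-flight counts
additive = smf.ols("incidents ~ first_class + seats + hours", data=flights).fit()
print(additive.params["first_class"])

# (b) Exposure treated multiplicatively: Poisson rate model with a
#     log(seats * hours) offset, i.e. incidents per person per flight hour
rate = smf.glm("incidents ~ first_class", data=flights,
               family=sm.families.Poisson(),
               offset=np.log(flights["seats"] * flights["hours"])).fit()
print(rate.params["first_class"])
```

The point is the structure, not the particular numbers: the second model estimates the first-class effect on incidents per person per flight hour, which is the comparison the argument above calls for.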

Still confused? OK, one more time, with bunnies and square fields.

[Figure: Rabbit fields]

Although the data are proprietary and confidential, I would really appreciate seeing an analysis either controlling for the interaction of time and seats, or dividing the incidents by the product of time and number of seats before they go in as DVs, to arrive at the reasonable outcome of incidents per person per hour.

One other thing. Beyond their mere effects on base rates, long hours and many passengers may each affect raw numbers of air rage in a faster-than-linear way – each following a quadratic function – because they can also raise the rate of incidents per person per hour, through the greater number of potential interactions and through flight fatigue. So ideally, if you’re keeping raw numbers as the DV, the quadratic as well as linear functions of passenger numbers and flight time need to be modeled, as sketched below.
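If raw per-flight counts stay as the DV, the toy sketch above extends in the obvious way; these terms are my own illustration of the specification, not anything fitted in the paper.

```python
# Continuing the sketch above (reuses the simulated `flights` data frame):
# multiplicative exposure plus linear and quadratic terms, then the first-class term.
quad = smf.ols("incidents ~ first_class + seats + hours + seats:hours"
               " + I(seats**2) + I(hours**2)", data=flights).fit()
print(quad.params["first_class"])
```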

So the bottom line here for me is not that the garden of forking paths is in play, but that the wrong path appears to have been taken. I look forward to any feedback on this – especially from the authors involved.


My JESP Inaugural Editorial

A short break from the editors’ forum questions: I’m not sure I’ve given proper publicity to my editorial, which according to the cellulose-based publication schedules is “coming out” in the “July issue” but is now accessible – openly, as far as I can tell. So here: publicity.

If you want the take-home summary about “the crisis” here it is, in bold, with a little commentary:

1. I would rather not set out hard standards for the number of participants in a study or condition. I don’t think our field has done the theory and method development to support those kinds of standards.

I would rather indirectly reward the choice to study both higher N and higher effect sizes, by putting more weight on strong rather than just-significant p-values, while recognizing that sometimes a true, and even strong, effect is represented in a multi-study sequence by a non-significant p-value. So…

2. I do like smarter use of p-values. I want people to look at research with a full awareness of just how much p-values vary even when testing a true effect with good power.

This means not playing the game of getting each study individually significant, and instead being able to present a set of studies that represent overall good evidence for an effect regardless of what their individual p-values are.
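To see that variability for yourself, here is a quick simulation sketch: two-group t-tests on a true effect of d = .5 with 64 per cell, which works out to roughly 80% power (the numbers are only illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d, n = 0.5, 64                      # true effect and per-group n: roughly 80% power
pvals = np.array([
    stats.ttest_ind(rng.normal(0.0, 1.0, n), rng.normal(d, 1.0, n)).pvalue
    for _ in range(10_000)
])

print(np.quantile(pvals, [0.25, 0.5, 0.75]))  # p-values sprawl widely across replications
print((pvals >= 0.05).mean())                 # about a fifth of runs miss .05 entirely
```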

If I could be sure that a multi-study paper contained all studies run on that topic, including real evidence on why some of them were excluded because they were not good tests of the hypothesis, then I could express a standard of evidence in terms of some combined p-value using meta-analytic methods. I suggest a p < .01 but my original impulse (and an opinion I received from someone I respect a lot soon after the editorial came out) was for p < .005. I suggest this very lightly because I don’t want it to crystallize into the new p < .05, N =200, p-rep > .87 or whatever. It’s just based on the approximate joint probability of two results p < .10 in the same direction.
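For the curious, the arithmetic behind that impulse, plus one way to compute a combined p (Fisher’s method, on made-up example p-values), looks like this:

```python
from scipy import stats

# Under the null, two independent two-tailed results at p < .10 land in the
# same direction with probability 2 x (.05 x .05) = .005
print(2 * 0.05 * 0.05)

# A combined p for a set of studies via Fisher's method (example values only)
example_ps = [0.04, 0.09, 0.03]
statistic, combined_p = stats.combine_pvalues(example_ps, method="fisher")
print(combined_p)
```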

If, if, if. There’s also the question of how many nonsignificant tests go into the file drawer behind any multi-study paper. Given our power versus our positive reporting rate and other indicators there’s no doubt that many do. The reactions that I have seen range from “hey, business as usual” to “not ideal but probably doesn’t hurt our conclusions much” to “not cool, let’s do better” to “selective reporting is the equivalent of fraud.” If I could snap my fingers and get the whole field together on one issue it would be this, naturally, ending up at my personally preferred position of “not cool.”

But until we have a much better worked-out comparison of all the different methods out there, it’s hard to know how effectively gatekeepers can deduce when a research report has less evidence than it claims because of that file drawer. I’m not the only one who would rather have people feel free to report their research completely from the start than have to write back with “Your article with four studies at p = .035, p = .04, p = .045 and p = .0501 (significant (sic)) looks, um, you know …” Erika Salomon has taken the first step but, clearly, the last word is far from written.

Anyway, one of the goals of this blog is to open up a debate and lay out both the ethical and the pragmatic issues about this most divisive of litmus tests in the field today – should we worry about the file drawer? So stay tuned.

Editors’ Forum Supplemental: Improving Journal Performance

Here’s my individual answer to another question from the SPSP forum we never got around to answering.

[Question 11 screenshot]

This is referring to Tetlock’s work on identifying “superpredictors” and, more generally, improving performance within geopolitical prediction markets. In those studies, the target outcome is clear and binary: Will the Republicans or Democrats control the Senate after the next election? Will there be at least 10 deaths in warfare in the South China Sea in the next year? Here, Brent suggests that editorial decisions can be treated like predictions of a paper’s future citation count, which in turn feeds into most metrics that look at a journal’s impact or importance.

Indeed, prediction markets have been used for academic quality judgments in a number of areas: for example, the ponderous research quality exercises that we in the UK are subject to, or the Reproducibility Project: Psychology (I was one of the predictors in that one, though apparently not a superpredictor, because I only won 60 of the 100 available bucks). But the more relevant aspect of Tetlock’s research is the identification of what makes a superpredictor super. In a 2015 Perspectives article, the group lists a number of factors identified through research. Some of them are obvious, at least in hindsight, like high cognitive ability and motivation. Others seem quite specific to the task of predicting geopolitical events, like unbiased counterfactual thinking.

There’s a reason, though, to be skeptical of maximizing the citation count of articles in a journal. [Edit: example not valid any more, thanks Mickey Inzlicht for pointing this out on Facebook!] If I had to guess, subjective journal prestige would probably be predicted best by a function that positively weights citation count and negatively weights topic generality. That is, more general outlets like Psych Science have more potential people who would cite them, independently of prestige within a field.

More fundamentally, trying to game citation metrics directly might be bad overall for scientific reporting. Admittedly, there is very little systematic research into what makes an article highly cited, especially within the kind of articles that any one journal might publish (for example, I’d expect theory/review papers to have a higher count than original research papers). But in trying to second-guess what kind of papers might drive up impact ratings, there is the danger of:

  • Overrating papers that strive for novelty in defining a paradigm, as opposed to doing important work to validate or extend a theory, including replication.
  • Overrating bold statements that are likely to be cited negatively (“What? They say that social cognition doesn’t exist? Ridiculous!”)
  • Even more cynically, trying to get authors to cite internally within a journal or institution to drive up metrics.  From what I have seen in a few different contexts, moves like this tend to be made with embarrassment and met with resistance.
  • Ignoring other measures of relevance beyond academic citations, like media coverage (and how to tell quality from quantity here? That’s a whole other post I’ve got in me.)

So any attempt to systematically improve the editorial process would have to grapple with a very complicated success metric whose full outcome may not be clear for years or decades. Given this, I’d rather focus on standards, and trust that they will be rewarded in the metrics over the long term.

But one last thing: It’s hard to ignore that methods papers, if directly relevant to research, seem to have a distinct advantage in citations. For example, in the top 10 JESP citations, three have to do with methods, a rate far higher than the overall percentage of methods papers in the journal. In Nature’s top 100 cited papers across all sciences, the six psychology/psychiatry articles that make the cut all have to do with methods – either statistics, or measurement development for commonly understood constructs such as handedness or depression. (Eagle-eyed readers will notice that a lot of the rest are methods development in biology.) So, although I had other reasons for calling for more researcher-ready methods papers in my JESP editorial, I have to say that such useful content in a journal isn’t so bad for the citation count, either.

Editors’ Forum Supplemental: Manuscript Hand-offs

In the next few of these posts, I aim to go solo and reflect on some of the questions posed on Facebook by the various communities that we didn’t get around to answering in the journal editors’ forum.

[Question 10 screenshot]

Thanks, Brian!

Suggestion 1 is one I have seen made before. It may even be policy at some journals, though I can’t recall which ones. Taking another journal’s reviews as-is, however, requires some coordination. The accepting journal doesn’t want papers with fatal flaws. It doesn’t want papers that had to be transformed in their theory and analysis so much that they’re different papers, because now the old reviews may not apply.

What it wants is papers where the initial reviews found the method and results to be basically sound or at least fixable; but rejection came mainly because they fell short of the originating journal’s standards for innovation, theoretical development, depth of evidence, or amount of evidence. And there really has to be a step difference in those standards from one journal to another. Probably this is best handled by the lower-impact journal editor-in-chief approaching the higher-impact one and setting up the hand-off agreement. Yes, there is a little awkward social hierarchy reinforcement here, but it’s all for the best. (Hey, in the grand scheme of things, all E-i-Cs are cool kids – or at least, alpha nerds.)

There would also be some finesse in the hand-off. Reviews are the intellectual property of their authors, so consent to pass them on would have to be given, most feasibly at the point when the review is solicited. The authors would have to agree, too, but I don’t see anyone not jumping at the chance to shave six or more months off the publication process. Editors at the originating journal would have to apply some discretion to make sure that only papers with a reasonable chance are handed off. None of this, though, is unworkable.

(Brian’s second suggestion, a “flipped” process, is creative. But imagine (for example) JESP, JRP, PSPB, and SPPS all getting in the pool together. I would have to look not only at the 600+ manuscripts we receive yearly, but also at the X number of relevant experimental social manuscripts that PSPB and SPPS receive, potentially doubling or tripling the load. A lot less fun than an actual pool party with those guys.)

At JESP, the most obvious journal to solicit hand-offs from would be Journal of Personality and Social Psychology – at least the first two sections, which are most likely to have experimental social content. I think we can all agree that for a long time now, JPSP has expected more in terms of number of studies, theoretical innovation, and empirical development than any other social psychology journal. One more thing on my to-do list, then, is to approach Eliot Smith and Kerry Kawakami, the relevant section editors, with that proposal.

I’m also interested in hearing from other journals who think they could accept hand-offs from JESP. In case it wasn’t clear, I do think that every journal in the hierarchy plays its part. Work that makes smaller contributions deserves to be seen, not least for showing converging evidence on an established effect. Since 2011 that has only gotten more important. And the quantity of research, the pressure in the pipeline, means that even lower-impact journals can pick and choose and enforce standards effectively.

Finally, many people don’t see the obvious drawback of the hand-off innovation as being a drawback at all. Under hand-offs, on average there will be less reviewing, fewer different opinions bearing on any given paper (the same goes for my editorial decision to avoid reviewing revisions, which has also been pretty popular as far as I can tell). If more reviewing and editing is not seen as a good thing, this tells me that the long and onerous process still doesn’t seem to be delivering consistent quality control, and that adding more voices to the review process is seen to add noise rather than precision.

I think I know what the reason is, too. A review consists of a number of facts embedded in a set of opinions. Or at least, the more facts a review contains, the better it is. The same facts can elicit many different opinions. For some, a given limitation is but a foible, while for others, that same limitation is a fatal flaw.

The more the editorial process consists of just adding up opinions – or worse yet, a disjunctive, “Anna Karenina” rule in which one negative opinion is enough to sink the ship – the more it is seen as fundamentally capricious and chaotic, so that extending it only increases the pain of this arbitrary gantlet.

But if the review process consists of adding up facts into a big picture and evaluating those facts, regardless of opinion, then more facts are better, and the only question is when the saturation point of fact-finding is reached. I surely have more to say on this matter, but I hope it’s obvious that in my own editorial work and in guiding my associate editors, I encourage the higher-effort, fact-based approach wherever possible.

SPSP Editors’ Forum: Summary

This past week I have been putting together notes taken by a couple of very helpful audience members (Michael Wohl and Jessie Sun, thank you!) from our Editors’ Panel on stats and reporting controversies last weekend at the SPSP conference. The panelists also helped to reconstruct the dialogue. It’s not a 100% accurate transcript but it gets the point across.

Replies follow the original questions, taken from their social media context (SPSP forum and Facebook). Post-hoc comments from myself and the other editors are in italics.

Roger Giner-Sorolla (RGS), Richard E. Lucas (REL), Simine Vazire (SV), Duane T. Wegener (DTW) (respectively Editors-in-Chief of JESP, JRP, SPPS, PSPB)

[Question 1 screenshot]

SV – It is the way people think about their results that matters. The way bootstrapping for mediation is often currently used in S/P psychology shows that people do use dichotomous thinking. We need to think more flexibly. [I think I mentioned this in response to another question, but CIs also help a lot when interpreting null results – are they really conclusive that there is a close-to-zero effect, or does the CI also include practically meaningful effects? -SV]

DTW – Using p values appropriately in a continuous and more nuanced manner is helpful.

REL: One goal of encouraging CIs is changing attitudes and how we think about and write up our results. Even if CIs and p values are related, CIs encourage an appreciation of uncertainty.

RGS – Agree with the above, though CIs can be hard to interpret in basic research. CIs become more meaningful when the number has more meaning (for example, effects on GPA, earnings in money terms, weight loss, etc.).

[Question 2b screenshot]

REL – We need to keep the spirit rather than the letter of these changes in mind and build in flexibility.

SV – The N issue needs to be noted and the discussion of the results more tempered. I don’t know an editor who has imposed new standards and not provided exceptions. For example, considering how difficult it would be to collect a large sample from a specific population, we have to be flexible. But if people studying hard-to-collect populations can still make the effort to collect adequate numbers, then we should expect more from those with easier-to-collect data.

RGS – N is not the only way to increase power; increasing the reliability or precision of methods are other ways. About the second part of the question, too: there is a common misperception that pre-registration ties your hands. It just allows the researcher to draw a distinction between confirmatory and exploratory analyses. Exploratory analyses (those not in the pre-registration) can be reported, and those evaluating the manuscript should not just dismiss them, but look at the strength of evidence in the context of how many tests could have been tried.

DTW – There will always be nuanced boundaries in what is exploratory vs. confirmatory. Often, exploratory ideas can become confirmatory tests in subsequent studies.

[Question 3 screenshot]

REL – Typically those studies are very well documented. You can point people to a website in the manuscript.

SV – The authors should report the other variables that they did consider. Disclose in a footnote or short description in the main text. Authors could also show how the results look according to a set of specifications you did consider as well as other sets you could have considered. Show representative results, not the most beautiful results you could have gotten from the data set. A good practice might be to present your most damning result, as well as your most beautiful result. [For more info on specification curves, see this paper by Simonsohn et al., under review: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998 -SV]

RGS – There is a difference between the letter and the spirit of the rule. In this case you can make your own call about what variables are relevant to the hypotheses to disclose in the article, but as mentioned, being able to verify these with a posting of the larger data specification might be important for people evaluating this.

[Question 4 screenshot]

SV – Starting at the end of 2016, we will ask people after publication whether they will post their data, and give them a badge if they do. We don’t know much about the effects going forward (e.g., on submission rates), but I don’t think it will negatively impact submissions. We will just give a badge to those who do share. Whether you’re willing to share your data won’t affect decisions at SPPS.

DTW  – We do not have badges, but we ask that people abide by guidelines. Also, how people deal with the data that’s been shared with them can have a chilling effect or an engaging effect.

REL – The more people do this, the more evidence will accumulate. It is good to share, but we understand there are times this is not possible. We should allow people an opportunity to explain why they are not sharing. One of the nice things about trying a lot of different things is that it’ll start to show us what works.

RGS – We don’t have badges, but it is on the to-do list. We have checkboxes affirming that people will keep raw data for 5 years and that APA guidelines have been and will be followed. Also, as a field, if we want open sharing of data, we need to communicate this to IRBs and make sure the consent forms are compliant with open data. Otherwise there’s no way you can legitimately put up data.

SV – We need to have open discourse about change. If there’s a clear sense of what members would like that’s a big part of decision-making when policies are set. Express your preferences, vote with your submissions.

[Questions 5 and 6 screenshot]

Context note: The bias analysis and rankings refer to the R-index, which is a comparison of studies’ post-hoc power with the proportion of significant results; low rankings mean there is more reporting of significant vs. nonsignificant results than chance would indicate.
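For readers who haven’t met it, here is a rough sketch of that comparison in code; the p-values are invented, and this is my reading of the R-index arithmetic rather than its official implementation.

```python
import numpy as np
from scipy import stats

def observed_power(p, alpha=0.05):
    """Post-hoc power implied by a two-tailed p-value (z-test approximation,
    ignoring the negligible opposite tail)."""
    z_obs = stats.norm.ppf(1 - p / 2)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return 1 - stats.norm.cdf(z_crit - z_obs)

# Invented p-values from a hypothetical multi-study paper
ps = np.array([0.03, 0.04, 0.045, 0.02])

success_rate = (ps < 0.05).mean()                   # proportion of significant results
median_power = np.median([observed_power(p) for p in ps])
inflation = success_rate - median_power             # "too many" successes for the power
r_index = median_power - inflation                  # my reading of the R-index arithmetic
print(success_rate, median_power, r_index)
```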

RGS – I would like to see more testing and generation of methods (like Erika Salomon’s poster at SPSP) to determine whether bias analyses are accurate at small numbers of studies and which ones work best. The file drawer issue is an important one. Currently the above usage of the R-index is based on all statistical tests and not just focal ones, so one caveat is that it may be picking up reporting styles not related to selective reporting. Journals with more personality research also seem to be doing better in the above ranking subset. That said, and not to be too defensive, my new guidelines at JESP are aiming at reducing selective reporting by encouraging a “big picture” approach to p values.

REL – We likely would not do formal analysis for this. I would tell the authors that you have poorly powered studies [RGS: this is based on the assumption that most bias tests, directly or indirectly, reveal an unusual preponderance of significant p-values that are high and close to .05 REL: Yes]. We should not have to rely on assumptions about these things. We can deal with this without casting aspersions. [REL: I do wish I had said that much of the debate about these methods seems to be just how certain we can be when the techniques are used. However, I think there is less debate that much of the logic behind them is sound, and that in the situation described in the question, this is evidence that we should be cautious in our interpretation. And I think an editor can use that to ask for more evidence, even if in post-publication peer-review, we might be cautious about the accusations we make about the author.]

DTW – We have to also look at the claims being made in the paper. Are you arguing that you have a large effect? If the effect size estimate is the main claim, you want to make sure it’s as accurate as possible. Are you making a directional hypothesis? We should expect and value variability across studies.

SV – If I see a series of underpowered studies, I will ask for a high-powered study or one with pre-registration. The vast majority of the desk rejections at SPPS are due to underpowered studies.

RGS – It looks bad on our field if we are not reporting studies that don’t work. It’s an ethical imperative that has been supported by the APA [RGS: expect a blog post on this little-known fact but I also reference it in my editorial]. I would say that it’s not a robust finding when all ps are just below .05, and we might call for a new pre-registered study. Why does marginally significant only go one way (e.g., when p = .06, but not when p = .04)?

DTW – We expect variability across studies. We need to be comfortable with that and interpret that variability. We’ve been trained to analyse single studies, but not a set of studies.

[Question 7 screenshot]

DTW – It depends on what you attribute reproducibility (or the lack thereof) to.

SV – I think this should be an important goal for our journals – most of our studies should replicate.  The goal should not be 100%, but if we say we don’t care about our findings being replicable, we’re in trouble.  Of course we need to be sensitive to whether the replication differed in important ways from the original, but we should try to specify the necessary conditions a priori.  We allow authors to provide supplementary material. We should be able to publish things in such a way that the study could be replicated. Authors should be noting if their manipulation doesn’t work in a given context and why, so that if there is a methodological feature that is key for replication, it should be stated a priori.

RGS – Reporting more aspects of the study is of import. Doing so should result in improved reproducibility. Pre-testing and calibration are important in original research as well as in replication research.

REL – What’s important is getting more data on systematic attempts. [REL: I also think that things may change slowly, and we shouldn’t be discouraged if a 2015 report was not much different than a 2013 report.]

[Question 8 screenshot]

REL – Things have been messy for a while at JRP [due to the nature of the research we deal with]. I haven’t noticed any changes in reaction, but I think we are perhaps more open to messy data and will continue to be so.

DTW – Perhaps a small increase.  We need to be comfortable with messiness. Although each study may not be clear individually, alongside other studies the results may be clearer.

SV – We need to appreciate variability and messy results more. But sometimes there’s messiness from when studies are not well-done (e.g., low power). In that case,  I wouldn’t be convinced by a meta-analysis of the studies.  Sometimes it’s easy to collect more data to get a conclusive result in either direction. This is where thinking about CIs helps a lot. The question is: Do we know more now than we did before we read the paper? If the answer is no, and it would be relatively easy to collect more data to get more certainty, then it’s still reasonable for editors to want a clear, overall conclusion.

RGS – You can do an internal meta-analysis or a Fisher’s test to aggregate ps when summarizing “messy” data. So far, I haven’t seen the amount of “mess”  I’m looking for in the 45 or so manuscripts this year, but, with time, hopefully. [RGS: I should also add that among the articles I have seen so far, people seem to be addressing the “mess” issue the other way, with big samples, and yes, most of them are crowdsourced. See Patty Linville’s question below.]

Q and A from the audience at SPSP

Alex Rothman: Let’s specify some hypotheses right now. What do you think the outcomes on scientific behaviour will be in five years’ time? We can treat the situation like a natural experiment (journals are doing different things). It would be great to have some evidence for what happens.

SV: The goal for the next reproducibility project would not be 100% reproducibility, but to set a standard for what we would like it to be. In the end, something like that has to be one of the main goals of science—to produce a body of knowledge that has some amount of certainty. It’s great to think of it as an empirical question to try to make predictions. Hopefully the reproducibility will increase over time with what we are doing.

DTW: It depends on what was causing problems in the first place. Right now we still don’t know for sure what that was.

SV: It’s true that these new policies are changing the rewards. But, the acceptance rates are staying the same. There are always some people getting rewarded for some practices.

Chris Crandall had a comment that we found difficult to remember but he graciously provided a version of it for us here.

Many people in the “reform” movement are not using the phrases “context of confirmation” and “context of discovery,” or their statistical counterparts “exploratory” and “confirmatory” data analyses, in a correct fashion.

There is no hard distinction between the two in any philosophy of science that I know (that philosophers still endorse). What people mean when they say these words is this:

“The truth value of the data is dependent upon the state of mind of the researcher prior to conducting data analysis.”

That is, what you thought you were looking for, before you saw the data, determines what you can conclude from the data (which, of course, exists independently before you analyze anything).

I suggested that what people really care about is “how long were you digging into the data before you found the results you’re reporting?” This is really something quite different, and I proposed that we need a metric, and a way to report, how deep into the analysis we got before we developed a particular statistical model. This will not be easy, because a well-fitting statistical model is like a ring of keys–we always find it at the end of the hunt, because we stop the search at success.

Patricia Linville: Do you feel that we will be a field that only does research on MTurk to get our Ns up to meet the new standards? I would have more faith if we saw papers whose studies get their samples from a few different places.

REL – JRP is not finding an increase in large student [or MTurk? RGS] samples. Those that just report correlations among [self-report measures] get desk-rejected. [We do have more studies that use internet samples, but this is inclusive; it is not just MTurk, but also studies where people use a variety of methods to recruit participants and have them participate online (including on-line panels).] It’s also important to track these side-effects over time. At least at JRP, we haven’t had these negative side-effects but we should pay attention to & track it. We are tracking sample size over the years as well as the use of method over time.

DTW – I share the concern. There will be intended and unintended consequences. On MTurk we now have “professional” participants and there could be unintended consequences there.

SV – This is where flexible standards come into play—I think we should have affirmative action for multi-method, intensive methods.

In future blog posts I will try to address some of the questions from Facebook that didn’t get answered in the panel. The serious ones, anyway.

[Question 0 screenshot]

First Post

Now that the circle is complete and I’m Editor-in-Chief of JESP, the first psychology journal I published in (Giner-Sorolla & Chaiken, 1994), I thought I would start a blog to share all the thoughts on statistics, methodology, and more broadly evidence in publishing that don’t make it into my peer-reviewed output.

I held a contest on Facebook last year to name this hypothetical blog. My friends came up with a great selection of groaners, witticisms, and deeply appropriate titles. In the end I chose to grow my own.

I like the tentativeness of Simine Vazire’s blog title. (Of late it’s much more tentative than the blog posts themselves, which is a good mix!)

I think statistical significance is still the most important aspect of analysis in spite of all the criticisms of null hypothesis testing. I think most of its critics can be answered in the spirit of Ronald Fisher (the luxuriantly-bearded hipster on my title banner – yes, briar pipes are going to replace vaping in about 5 years, but you saw it here first).

Sir Ronald in his later years emphasized reasonable interpretation of the p-value continuum in context rather than the wholesale acceptance or rejection of hypotheses, and I advocate this too, given the limitations of theory and method in social psychology in particular. More on that later.

And, the title ironically refers to one of the euphemisms that has sprung up in the world of significance testing to legitimize results just a wee bit over p = .05. Hundreds more can be found here.

Anyway, some of the most interesting commentary and methods development in today’s world of research can be found on blogs. Distribution is lightning-quick. At the same time, accurate critical responses (or any responses at all) don’t come guaranteed. Often, discussion dead-ends. We who seek change risk hypocrisy if we criticize researchers for leaping to publicize their exciting ideas based on less-than-robust results, but then turn around and do the same with our own critical methods. So, I think a lot of this self-published material needs to make it to peer review, as slow as that process can be.

I am encouraging submissions of methods articles to JESP for peer review. I want to see pieces that reach across and communicate with researchers about better ways to create and analyze data. We are not talking Psych Methods here. Explanations up, equations down. Web-based or downloadable resources are particularly to be encouraged, as is code usable with a range of statistical platforms.

Some of the posts to follow on Approaching Significance will talk about techniques I find to be missing and underdeveloped in the field. You can take that as a cue for the kind of topics we might be interested in at JESP. Or, you might surprise me. I’m always available to discuss your chances via email (rsg {snail} kent [snail eye] ac [other snail eye] uk) before you submit.

In the meantime, bookmark this if you like, add it to your feed and stay tuned.

 

References

Giner-Sorolla, R., & Chaiken, S. (1994). The causes of hostile media judgments. Journal of Experimental Social Psychology, 30(2), 165–180.