I made a PDF of one of my presentations; it is freely available online at: https://psychsci.blogspot.com/2016/11/a-practical-guide-to-psych-stats.html It’s meant for psychology undergraduates as a Psych Stats course wrap-up, i.e. to answer the question “I’ve learned the statistical tests, now what do I do?”

The general thrust of my reporting guidelines on slides 12-15 is that most psychologists don’t want to move away from NHST procedures. So the best advice I can offer is to report your test statistic & p-value, PLUS the appropriate effect size measure, PLUS a confidence interval, PLUS means & SDs so the reader can see for him/herself how much variability is in the data and the magnitude of the difference in means. Such an approach presents a more complete picture of what’s going on in the data, rather than just “p < .05”. Please feel free to check it out and correct me on any misapprehensions that may exist!
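That reporting style can be sketched in a few lines of Python; the data below are simulated, and `scipy` is assumed to be available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(100, 15, 40)  # simulated scores, group A
b = rng.normal(108, 15, 40)  # simulated scores, group B

t, p = stats.ttest_ind(a, b)

# Cohen's d using the pooled SD
na, nb = len(a), len(b)
sp = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
d = (b.mean() - a.mean()) / sp

# 95% CI for the difference in means
se = sp * np.sqrt(1 / na + 1 / nb)
df = na + nb - 2
lo, hi = stats.t.interval(0.95, df, loc=b.mean() - a.mean(), scale=se)

print(f"t({df}) = {t:.2f}, p = {p:.4f}, d = {d:.2f}, 95% CI [{lo:.1f}, {hi:.1f}]")
print(f"A: M = {a.mean():.1f}, SD = {a.std(ddof=1):.1f}; "
      f"B: M = {b.mean():.1f}, SD = {b.std(ddof=1):.1f}")
```

All four pieces of the recommended report (test statistic & p, effect size, interval, descriptives) come out of the same few intermediate quantities.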

Agree, but that can be informed by Bayesian computation, as in the link below.

Maybe require p less than 0.00000001 just to be safe. But in the meantime we still have to make decisions! I think the problem is in the desire for certainty where none exists. Rather than setting thresholds, I prefer to summarize what knowledge we have and accept our uncertainty. That is, we should be Nate Silvers, and not let our desire for certainty make us into Sam Wangs.

The only Bayesian way to incorporate information about effect sizes and measurements is through priors but that’s of little help when a non-expert reader tries to evaluate a research paper. What is required here is critical thinking about the given information on measurement and effect sizes, not Bayesian computation. Critical thinking is by no means an exclusively Bayesian domain.

(I guess I won’t use greater than and less than symbols anymore)

How the data was collected and how much effort was put into developing the *research hypothesis* is much more important.

If you

1) collect crap data (unblinded, questionable proxy for what you care about, etc)

or

2) fail to get your model to make a prediction more precise than “A should be positively correlated with B”

Then the paper will not be useful.

Now, it would be fine if people skipped number 2 and just collected good data and published descriptions of it. Then the models/theories can be developed by people with those skills (this is much closer to how, e.g., particle physics functions).

However, that is not really allowed under the current culture. So people attempt to test these vague speculations (by instead testing a precise null hypothesis known to be false…) and come to all sorts of incorrect conclusions. Why? Because such vague hypotheses are impossible to distinguish from a whole slew of alternative trivial explanations for the results. For biomed, the reality is that 99.99% of claims being published are not worth paying attention to; you are better off never having heard of them.

In classical statistics, prior information on effect sizes and measurement is used in designing the data collection and in choosing what class of model to use and what variables to include in the model, but that’s it. Bayesian statistics encourages the use of prior information in the model fitting stage as well.

And, for sure, “Bayesian statistics” is not just what’s in Bayes’s paper. He used a uniform prior, at least that’s what I remember. But Bayesian statistics as we define it today does not require uniform priors.
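As a toy illustration of prior information entering the fitting stage (all numbers invented; a known-variance normal-normal model):

```python
import numpy as np

# A prior on an effect size (say, from earlier literature) combined with a
# noisy new estimate, via the conjugate normal-normal update.
prior_mean, prior_sd = 0.2, 0.3   # invented prior: small effects are typical
est, se = 0.9, 0.4                # invented new estimate and its standard error

post_prec = 1 / prior_sd**2 + 1 / se**2               # precisions add
post_sd = np.sqrt(1 / post_prec)
post_mean = (prior_mean / prior_sd**2 + est / se**2) / post_prec

print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}")  # pulled toward the prior
```

With a flat (uniform) prior the posterior mean would equal the raw estimate; an informative prior shrinks it, which is exactly the difference being discussed.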

That’s smart, but how’s that Bayesian? Every frequentist, EDA fan, or whatever can do that, too.

Certainly Bayes himself didn’t mention a thing about these things in his famous paper.

In fact this is not entirely tongue in cheek: I’ve used the same procedure as a lead on software development teams, evaluating the marketing claims of software products; most of the time you just don’t even bother.

I guess the point is that I think the process of science implicitly has some measure of utility in it, as there are many more questions we choose *not* to study than we choose to study, simply because the answers are largely irrelevant either way and resources are constrained. So perhaps one ought to just do that explicitly even in one’s own evaluation of research.

(The notes were developed for a “continuing education” course I taught for several years. I’ve decided to retire from it, but someone else will be teaching it in May 2017 — see https://stat.utexas.edu/training/ssi for information.)

I bet against studies like that in the various prediction markets and always made good money while having to do little actual reading of bad studies.

But replicability is, in a way, boring: many studies are of course invalid even if they replicate, and detecting that requires real reading and thinking.

The others all mention journal clubs, and that has taught me a lot too (we always have a positive and a negative round, though it often leaves us feeling dejected).

But you don’t always have access to a journal club of bright minds with the specific expertise who want to discuss the paper. I like the Altmetric bookmarklet https://www.altmetric.com/products/free-tools/bookmarklet/ and the Pubpeer browser extension. Altmetric includes post-publication peer review on Pubpeer, Publons, Twitter and blogs. If there is some criticism there, it can be helpful. This has helped me with stuff outside my immediate discipline. In some disciplines (e.g. genetics) this helps a lot. Unfortunately blogs aren’t well-tagged in Altmetric.

Also, the absence of robustness and sensitivity analyses is a red flag to me. Unfortunately such analyses are infrequent in psychology. I mean they all did these analyses to see whether they could squeeze out a bigger effect, but they don’t report them. Especially in correlational work, I want to see what the model assumptions do; I won’t be convinced that the model you happened to choose was best.

+1

I chose one semester because that was the basic undergrad requirement for my major (neuroscience); I’ve personally done more than that.

1. Small N (too many papers in social psych and behavioral sciences with N=25 per cell, rather than 50 or 100)

2. Too many rejections of the null with p close to 0.05 (p-hacking)

3. Diana’s point 1: very specific manipulations to prove a general claim.

If these things are present, start looking for issues and you’ll find them.

There are relatively clear tools for examining whether a paper says anything worthwhile or not: noise, sign of effect relative to its size, etc. But the clear tools are often really hard to apply, particularly because papers are written to obscure the weak points in their reasoning or to gloss over the weak points in the data or in the analysis. And the tools are in no way complete enough that you can rely on a testing algorithm that lets you pump in a study for evaluation and get an answer quickly without a ton of fiddling. So you have to rely on fairly simple measures like: did this paper make me think about diabetes in a different way? That way doesn’t have to be correct – the paper might be complete crap – but if it changed your mind in some way, then your mind is now capable of figuring out whether that answer is half decent or really good or utter garbage. This, I think, explains our fascination with shitty work: you have to recognize crap where it is, and it’s often not as obvious as a giant horse apple pile in the path; sometimes it’s as slippery as goose or turkey shit whose color nearly matches your deck or the grass or the leaves (because some shit wants to blend in, doesn’t it? Kind of as Wednesday says at the end of Addams Family when asked why she isn’t in costume for Halloween: “I’m going as a serial killer. They look like everyone.” Now that’s shit blending in).

The cool thing about the question is that it’s asked because that means you’re open to challenging your conceptions, which means you are able to change your priors and perhaps to understand how these priors affect your perception and comprehension of your posteriors, meaning your results. That’s where Bayes connects to life in general: your life is a series of posterior results, achieved in a generalized probability model in which priors – to some degree, depending on the circumstance – influence not only the posteriors but how the posteriors are noticed, categorized and acted on. That’s a fun thing in modeling game-theoretical behaviors and coin flips: you can quickly and obviously define circumstances in which there are clear negative and positive outcomes for the entire process of priors affecting posteriors. Now if only people could understand that …

1. An obvious discrepancy between the study’s actual findings and its conclusions (in the abstract or press release). For example, if a study purports to show that students learn math better when they gesture, but really only found that students who were told to gesture while solving a math problem did somewhat better on a subsequent math quiz than students told explicitly *not* to gesture, I know something’s up. The issue may be with the reporting, but often it’s in the language of the study itself.

2. Unclear or inconsistent measurement, or measurement of the wrong thing. If a study purports to show that students who read stories of scientists’ *struggles* make more progress in science than those who read of scientists’ *successes,* and the progress is measured in terms of homework grades (across assignments, with no consistent criteria), then the findings are iffy at best.

3. Failure to think a problem through. The other day I read about a study of the emotional “shapes” of stories. From Drake Baer’s article: “With that corpus [of 1,327 stories from Project Gutenberg] in place, the researchers analyzed the happiness levels of the words themselves, ratings that were found via crowdsourcing the hedonic value of individual words (“love,” “laughter,” and “happiness” are at the top).” The researchers do not seem to have considered the many problems with this approach, problems that become clear if you look closely at a work of literature.

Those are just a few examples.

“Our study reveals that there are significant gender differences in how people react to different ways this information is presented to them: Women reach better/more accurate decisions when information is presented multimodally, i.e. using natural language text and graphics, whereas men are happy with just the graphics.”

From the linked paper:

“We found that females score significantly higher at the decision task when exposed to either of the NLG output presentations, when compared to the graphics-only presentation (p < 0.05, effect = +53.03).

We found that males obtained similar game scores with all the types of representation. This suggests that the overall improved scores (for All Adults) presented above, are largely due to the beneficial effects of NLG for women.”

Separate table given for females but not males…

Second, I look for p values anywhere near 0.05 as evidence, combined with forking paths. We all know where that 0.0492 came from, and it’s never the straight path.

Third, I look for *simple* tests of the result, with increasing skepticism as more advanced (and convoluted) techniques yield very different results. Sure, IV methods are needed when there is endogeneity. But if the IV results and the OLS results don’t point in the same direction, then you’d better convince yourself that the instrument wasn’t selected to get you there. This isn’t an indictment of IV methods… just a suggestion to look at the magnitude of the change from OLS before accepting the result credulously.

Fourth, look for (where applicable) predictions of the implications of the central result in something not directly related and, where possible, actual tests of those predictions. But at least respect the researcher who pins herself down as to how applicable she thinks these results are outside of this particular estimation procedure.

Fifth, look for modesty. I like papers which aren’t all about “look what I found” but contain a healthy dose of “but this is why this might not be right.” (This is partly related to the first criterion.) Most critically, if there’s some possible mechanism that’s not tested for or even mentioned in the paper itself, and if it’s an objection that I came up with without really being immersed in the particular area myself, that’s a big red flag.

Seems most groups try to look at studies one by one and give them a fair/full evaluation… which is nice but seems time consuming. I was thinking it might be a lot fairer/faster if a group tries to only estimate the Type S, Type M errors for typical p-value studies, then publishes a big reference… that way, when considering whether to bother understanding a new paper in an unfamiliar discipline, one could just look up the reference and not bother with obvious noise.
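A quick Monte Carlo sketch of that idea, along the lines of Gelman and Carlin’s type S/type M calculations (the numbers are invented: a true effect of 0.1 measured with standard error 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.1, 0.5
est = rng.normal(true_effect, se, 1_000_000)  # many replications of a noisy study
sig = np.abs(est) > 1.96 * se                 # which ones reach p < .05

type_s = np.mean(est[sig] < 0)                    # wrong sign, given significance
type_m = np.mean(np.abs(est[sig])) / true_effect  # expected exaggeration factor

print(f"P(wrong sign | significant) = {type_s:.2f}")
print(f"exaggeration factor = {type_m:.1f}")
```

For a design this noisy, a significant result has a sizable chance of pointing the wrong way, and its magnitude overstates the true effect several-fold; tabulating those two numbers per study is exactly the reference the comment imagines.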

I agree Raghu’s suggestion of working in groups is helpful, and I’d add that we try to engage in an exercise that starts with “this result could be nothing if [X] is true.” And then we look to the materials we have to see if the author addressed that possibility. We don’t do it exhaustively, but we want to arrive at a bunch of alternative explanations for the observed effect and see if the author thought about them as well. It sort of reminds me of those lame graphs you see on Facebook: “this one graph explains why [X] is [Y].” There is no “one graph” that completely explains anything in my mind.

The group I belonged to, which evaluated randomized trials for meta-analysis (1980s), randomly assigned one group member to be a paper’s advocate and another a critic. So there was usually a “they did this right” aspect.

We also naively thought that once the research community learned about the problems they would step up and resolve them http://andrewgelman.com/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/

Actually, I do think the conditions are better now for stepping up and resolving replication issues.

I still keep in touch with members of that group – repeatedly confronting others’ (hard to avoid) failures does seem to build character.

What if the above doesn’t tell you to move on? Next you need to figure out what they actually measured: where did these numbers come from? That may take a bit more time. During this process you will likely need to follow some refs; use that as a spot check to make sure the papers say what is claimed. Anyway, think about whether a mouse sniffing a corner of a plus maze really has much to do with memory rather than other factors, or whatever it is they measured. You should *always* be able to come up with a few alternative explanations. Now check the paper: did they ignore alternative explanations, just handwave them, or take that issue seriously?

Using the above, within a day or so someone reasonably familiar with the field should be able to tell if a paper would be a waste of their time to study further (but not if it is good). That makes me wonder how so much crap keeps getting published, though…

Yes, to me as a statistics researcher and textbook writer, the challenge is to formalize ideas of good practice and fold them into statistical theory and education.

There can be a disconnect, where researchers know that ideas such as model checking, exploratory data analysis, careful measurement, etc., are important—but then they don’t know how to integrate these practices into their work. And in that case the logic of the statistical methods can, unfortunately, drive the research into the ground.

Consider topics such as power pose. Of course the Amy Cuddys and Andy Yaps and Susan Fiskes of the world know that good measurement is important and that p-hacking is bad. But with the primitive statistical methods they are using, their research and publication schedule becomes determined by what is or is not statistically significant, which pushes issues of bias and variance of measurement to the side because they do not appear in any of the formulas they’re using.

Much of my research during the past few decades is an attempt to (a) formulate the “good practice” intuitions so they can fit into statistical analyses, and (b) stretch the boundaries of statistical analyses to account for good practice.

Examples include:

– Hierarchical modeling as a way of partially pooling information

– Exploratory data analysis as posterior predictive checking

– Weakly informative priors and regularization

– Strongly informative priors on effect sizes and joint modeling of interactions

– Studying frequency properties of standard practices; type M and type S errors

– Comparison of models as part of model building.
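The first item – partial pooling via hierarchical modeling – can be sketched with the classic eight-schools estimates; for simplicity the between-group SD is fixed here rather than estimated, so this is only the flavor of the idea:

```python
import numpy as np

# Eight-schools data: per-group effect estimates and their standard errors.
est = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
se = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

tau = 5.0  # assumed between-group SD (in a full analysis this is estimated)
mu = np.average(est, weights=1 / (se**2 + tau**2))  # precision-weighted grand mean

w = tau**2 / (tau**2 + se**2)     # weight on each group's own estimate
pooled = w * est + (1 - w) * mu   # each estimate shrunk toward mu

print(np.round(pooled, 1))        # noisy groups are pulled in the most
```

Groups with large standard errors get small `w` and are shrunk hard toward the overall mean; precisely measured groups keep more of their own estimate.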

Then there are people who want to know whether a claim made in a paper is likely to be accurate. For example, they have a practical question – how infectious is disease X? Or what drug targets exist for …? A paper makes a relevant claim (observation, deduction) and the reader needs to know how likely it is to be correct. This is the reproducibility crisis and generalizability crisis problem.

For this, my approach does involve priors – how surprising is the finding? Combine with how reliable the field (and specific authors) have been… And then the smell of the paper (like code smells). Were the statistical methods modern and appropriate? What of silly errors and outrageous claims? etc etc.

Your comment about the measurements being analyzed being tied to the questions under study would seem to tie in nicely with your earlier blog post about exploratory studies. Certainly I’ve seen many published studies done on measurements taken incidentally with regard to the primary study, especially when the results for the main outcomes didn’t pan out.

I suggest being very skeptical, even if “the [estimated] effect size is very large.” Reported estimated effect sizes are overestimates because of the statistical significance filter. If N is small and data are noisy, estimated effect sizes will be huge no matter what.

Also when you talk about controlling for bias: yes, I agree, but I’d point you more strongly toward issues of measurement. I agree with you that statistical issues such as missing data and selection bias are important, and these can be enablers of all sorts of other problems—but more fundamental than all of this, I think, is that the measurements being analyzed are tied closely to the questions under study.

http://www.healthnewsreview.org/toolkit/tips-for-understanding-studies/

and click on any of the items listed.

http://www.healthnewsreview.org/toolkit/tips-for-understanding-studies/

where you will find the following:

Absolute vs. relative risk

Animal & lab studies

Biohype bibliography

Be careful with composite endpoints/outcomes

Phases of drug trials

Medical devices

FDA approval not guaranteed

How much will it cost?

Intention-to-treat analysis

NNT: number needed to treat

Non-inferiority trials

Observational studies: association vs causation

Odds ratios

“Off-label” drug use and marketing

News from scientific meetings

Mixed messages about statistical significance

Surrogate markers may not tell the whole story

In addition, there is always the famous, “follow the money” which supposedly appeared first in the film (but not in the book), “All the President’s Men.” This admonition is especially true for the incoming Trump Administration.
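The first item on that list – absolute vs. relative risk, together with NNT – comes down to simple arithmetic; an invented example:

```python
# Invented trial: event rate 2% in the control arm, 1% in the treated arm.
control_risk, treated_risk = 0.02, 0.01

arr = control_risk - treated_risk  # absolute risk reduction: 1 percentage point
rrr = arr / control_risk           # relative risk reduction: "cuts risk in half!"
nnt = 1 / arr                      # patients treated to prevent one event

print(f"ARR = {arr:.1%}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
```

A headline will usually quote the 50% relative reduction; the NNT of 100 is the sobering restatement of the same numbers.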

Why only a semester of undergrad stats? Why not a year, or two years? I understand that people don’t have time. But there is no short cut to understanding. If significant amounts of statistics were a requirement to using it, would that be so bad? If you don’t have time to learn enough about statistics to use it, maybe don’t use it?

Be very skeptical of any study with small Ns, unless the effect size is very large. If the effect is clear from a scatterplot, I’m more likely to take it seriously. I’ve become a big fan of overlaying model predictions on scatterplots.
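Overlaying a fit on the raw scatter takes only a few lines in most plotting libraries; a minimal matplotlib sketch with made-up data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0, 2, 30)  # made-up noisy linear relationship

slope, intercept = np.polyfit(x, y, 1)    # least-squares line
grid = np.linspace(x.min(), x.max(), 100)

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6, label="data")
ax.plot(grid, intercept + slope * grid, color="black",
        label=f"fit: y = {intercept:.1f} + {slope:.1f}x")
ax.legend()
fig.savefig("fit.png")
```

The point of the overlay is that the reader can judge the scatter around the line directly, instead of trusting a p-value about the slope.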

I find skimming the statistical methods section informative. Are they appropriate? Do they seem to have been written by someone who knew what they were doing? Was a professional statistician involved?

Most importantly, did they correctly control for biases? I’ve come to believe that controlling bias is just about the most important thing we statisticians do. And bias comes in many forms, including multiple comparisons, differences between groups in prognostic demographics or other variables, and evolving standard of care (beware of studies comparing time intervals before and after some event) – I have seen considerable influence from evolving standard of care in clinical studies spanning more than a few years. Missing data is an often-ignored bias – patients may drop out of a trial for reasons associated with the treatment or the severity of their condition (or lack of it), or age (we may see good follow-up through adolescence, followed by a drop-off as patients enter adulthood and have jobs and families). Then of course there are biases associated with which patients are consented for or excluded from a trial.

I take results from a randomized trial more seriously than observational/retrospective studies, assuming the Ns are reasonable.

Over time I’ve come to associate certain authors’ names with junk papers. This is a matter of experience.

A lot of clinical research is done by over-worked medical students, residents, or Fellows, with little oversight of the data collection. Take it with a grain of salt.
