And the problem in research on humans is that, in many realistic cases, increasing N decreases measurement quality. If I want a survey of 5000 people, every survey item really counts in terms of cost. When I have a student sample, I can make them take my 40-question battery. When I first started grad school, I proposed a 30-item scale for a population survey, and I thought they were going to laugh me out of the room.

Better measurement helps us get more out of our small samples and might help us devise methods to use larger samples more economically without too great a loss of measurement quality. But if you think people hate trying to publish shaky statistical results, consider that, with some exceptions, it’s a huge pain to publish measurement studies.

Take the start of your #1: “A p_value is just one view of what an experiment suggests/supports”

– I would keep P-values as a refutational tool only (which is how I think Fisher viewed them), and that means they suggest and support nothing. They only measure something: a P-value takes a “standardized” measure of the distance D between the data and a model (e.g., a model which includes “no effect” among its assumptions, along with “no uncontrolled bias” and so on) and maps that distance into the unit interval (0,1] using the inverse of the model’s implied sampling distribution for D. This probability transform supposedly makes the observed distance D more intelligible, although experience shows it doesn’t really do so in any practical sense. Hence I’ve been trying to resurrect Good’s 1957 suggestion to take one more step and use the surprisal S = -log_2(P) as the bits of information in D (or P) against the model. This measure never supports the model; it just transforms the distance D to an information scale instead of a probability scale.
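The mapping from test statistic to P-value to surprisal can be sketched in a few lines of Python (the function names and the example z of 1.9 are my own illustrative choices, not from the comment):

```python
import math

def z_to_p(z):
    """Two-sided P-value for a standard-normal test statistic z,
    computed via the complementary error function (no SciPy needed)."""
    return math.erfc(abs(z) / math.sqrt(2))

def s_value(p):
    """Good's surprisal: bits of information against the model."""
    return -math.log2(p)

# A 'non-significant' z of 1.9 still carries about 4.1 bits of information
# against the model, barely less than the 4.32 bits at the p = 0.05 cutoff.
p = z_to_p(1.9)
print(round(p, 4))           # 0.0574
print(round(s_value(p), 2))  # 4.12
print(round(s_value(0.05), 2))  # 4.32
```

The information scale makes the near-equivalence of p = 0.057 and p = 0.05 obvious in a way the probability scale does not.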

Then you say “there are many others that may well be better in various situations”: there are always other measures that capture aspects of the data that D, and hence P and S, don’t. Estimates are the chief example, and they are always needed in addition to these model-discrepancy measures if one is seriously trying to extract all the useful, relevant information in the data about a scientific question. The severe model dependence of estimates, however, leads us back to checking the estimation models, and hence to P or S, to make sure we aren’t basing estimates on models that our data scream are wrong. I think this was a core message in Box (1976-1983) and in Cox as well. So your #1 should become

1) A p_value is just one of many tools for checking the compatibility between our data and a model or hypothesis of interest. Such checks are important to avoid estimation based on misleading models, but this task should not distract us from estimation as an essential step in answering scientific questions.

The rest I found more congenial – I especially liked 4, 6, and 7, and I think the warning “all of them can be gamed for publication and career advantage” needs special emphasis in basic teaching as well as in specialty articles and blogs.

1. A p_value is just one view of what an experiment suggests/supports – there are many others that may well be better in various situations.

2. Consider p_values as continuous assessments and be wary of any thresholds they may or may not fall under (or of targeted alpha error levels).

3. Keep in mind that p_value assessments are based on the possibly questionable assumptions of zero effect and zero systematic error, as well as on additional ancillary assumptions.

4. Realize that the real or penultimate inference considers the ensemble of studies (completed, ongoing, and future); individual studies are just pieces of that ensemble, and only jointly do they allow the assessment of real uncertainty.

5. Be aware that informative prior information (beyond the ensemble of studies), even if brought in informally as categorical qualifications (e.g., in large, well-done RCTs with large effects, the assumption of zero systematic error is not problematic), may be unavoidable – learning how to peer review priors so that they are not seen as just personal opinion may also be unavoidable.

6. The above considerations must be kept tightly focused on discerning what experiments suggest/support, and on quantifying the uncertainties in that, as all of them can be gamed for publication and career advantage.

7. All of this simply cannot be entrusted to single individuals or groups, no matter how well-meaning they attempt to be – bias and error are unavoidable, and random audits may be the only way to overcome them.

8. ???

9. ???

Yes, and there’s also the following reasoning, which I’ve not seen explicitly stated but which is, I think, how many people think. It goes like this:

– Researcher does a study which he or she thinks is well designed.

– Researcher obtains statistical significance. (Forking paths are involved, but the researcher is not aware of this.)

– Therefore, the researcher concludes that the sample size and measurement quality were sufficient. After all, the purpose of a large sample size and good measurements is to get your standard error down. If you achieved statistical significance, the standard error was by definition low enough. Thus, in retrospect, the study was just fine.

So part of this is self-interest: it takes less work to do a sloppy study, and it can still get published. But part of it is, I think, genuine misunderstanding: an attitude that statistical significance retroactively solves all potential problems of design and data collection.
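That retroactive logic can be checked by simulation. A minimal sketch, assuming a hypothetical underpowered study (true effect 0.1, noise sd 1, n = 20 — all numbers are my own illustrative choices): among the runs that reach significance, the estimates badly exaggerate the true effect, so significance is no certificate that the standard error was small enough.

```python
import random
import statistics

random.seed(1)

true_effect, sd, n = 0.1, 1.0, 20  # hypothetical underpowered design
sig_estimates = []
for _ in range(20000):
    sample = [random.gauss(true_effect, sd) for _ in range(n)]
    est = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if abs(est / se) > 1.96:  # declared 'statistically significant'
        sig_estimates.append(abs(est))

# Power is low (under 10%), yet the significant estimates overshoot
# the true effect of 0.1 by several-fold on average (a 'Type M' error).
print(len(sig_estimates) / 20000)
print(statistics.mean(sig_estimates) / true_effect)
```

The significance filter guarantees the surviving estimates clear roughly two standard errors, and with a standard error of about 0.22 that floor alone is four times the true effect.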

with GUI software; also helpful are decision trees in some texts. Experimental design, deriving model design from theory, and deriving measurement from theory are very hard to teach. To my knowledge (I’m now retired) there is no easy-to-use software that will do this in the background. So we do what is easy, and the numbers are reflected in journal reviews and publication, providing the incentive to keep up the good work.

We need more and better resources to help us teach (and learn) measurement, design, etc.

+1

I point out that p-values address one specific type of potential error: the error of falsely extrapolating to a population a result obtained in a sample. By comparing the magnitude of test statistics or parameter estimates to the amount of variation around them, we get a sense of how plausible this extrapolation is. Even as a measure of this one sort of error, however, p-values are imperfect (for reasons that have been discussed in detail on this site).
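As a toy illustration of that comparison (the sample size of 25, the observed ratio of 3.1, and the simulation itself are my own hypothetical choices): under repeated sampling from a population where the null holds, the ratio of the sample mean to its standard error rarely gets large, so a large observed ratio makes "it's just sampling noise" hard to sustain.

```python
import random
import statistics

random.seed(7)

def t_ratio(sample):
    """Ratio of the sample mean to its estimated standard error."""
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return statistics.mean(sample) / se

# Sampling distribution of |t| when the population mean really is 0:
null_ratios = [abs(t_ratio([random.gauss(0.0, 1.0) for _ in range(25)]))
               for _ in range(5000)]

observed = 3.1  # hypothetical t-ratio from one study
p_sim = sum(r >= observed for r in null_ratios) / len(null_ratios)
print(p_sim)  # small: sampling noise alone rarely produces a ratio this large
```

Of course, this only quantifies the sample-to-population error; it says nothing about mismeasurement, sampling bias, or a wrong model.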

But there are other possible sources of error, often much more salient. The data could be mismeasured (or proxy measurements might not be performing their proxy function very well), or the sample could be biased, or the model could be wrong. A meaningful assessment of the plausibility of a given result has to take into account all these types of potential error, guided of course by what has been learned beyond the bounds of this one study.

The message to downgrade the role of p-values is partly an attempt to draw attention to the limitations of this one statistic in sizing up sample-to-population error, but above all the goal is to increase attention to all the other dimensions of assessment.

I realize the error framework has some issues, but it’s familiar to students and it works to get the message across, I think.

“Junk science” should be “Good science”?
