For example, this paper[1]. They wanted to know which pneumonia patients would die if **not** admitted to the hospital, but used data on pneumonia patients **who were** admitted to the hospital, and then were surprised by counterintuitive results and came up with all sorts of solutions to the wrong problem…

[1] Caruana et al., “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission,” KDD 2015. http://dx.doi.org/10.1145/2783258.2788613; http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

Another obvious issue is this: why would you declaw a cat (a relatively expensive procedure) if the cat didn’t have furniture-clawing problems? Yes, some owners might do it routinely, but I would think most would have a “wait and see” attitude.

If I were doing this analysis, with about 96,000 intact and 4,000 declawed cats, I’d take half the data and just assume I’m going to burn through it doing various exploratory analyses to see what the heck is going on here. Once I’d written up what I thought I knew, I’d use the other half as validation.
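The split described above might look like this minimal sketch. The column names and counts are stand-ins for the shelter data, not its actual schema:

```python
import pandas as pd

# Hypothetical stand-in for the shelter data: ~96,000 intact and
# ~4,000 declawed cats (column name is an assumption).
cats = pd.DataFrame({"declawed": [False] * 96000 + [True] * 4000})

# Burn half the data on exploration; hold the other half out for validation.
explore = cats.sample(frac=0.5, random_state=42)
validate = cats.drop(explore.index)
```

Fixing the random seed makes the exploration/validation partition reproducible, so you can’t quietly re-deal yourself a new validation set later.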

So the suggested approach here is NHST… why no outcry from the usual suspects?

As usual, the problems begin with *defining the question for NHST* rather than for scientific purposes:

> declawed cats are more likely to have behavioral problems

I wouldn’t bother testing this “hypothesis”; to me it is a total waste of time. NHST doesn’t even do that, though: it tests whether the behavioral data on declawed cats is sampled from the same distribution as the data on non-declawed cats. That hypothesis is false; we can reject it without any data.

Why not instead just build a classifier that predicts whether a cat has behavioral problems, using declawing as one of many features? If successful, you could get some idea of whether a cat will be a problem before adopting or buying it. That seems useful to me… what does the result of the NHST give you? I see nothing of value there.
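A minimal sketch of that classifier, on entirely synthetic data (the features, coefficients, and prevalences below are invented for illustration, not estimated from the shelter dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical features: declawed status plus age and weight (all made up).
declawed = rng.binomial(1, 0.04, n)
age = rng.uniform(0, 15, n)
weight = rng.normal(10, 2, n)
X = np.column_stack([declawed, age, weight])

# Synthetic outcome loosely tied to the features, purely for illustration.
logits = -2.0 + 0.8 * declawed + 0.05 * age
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # P(behavioral problem) per cat
```

The point is the output: a per-cat predicted probability you could use before adopting, rather than a single yes/no verdict about whether two distributions are identical.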

If D1 and D2 are uniform random samples from a single population, then the probability that T(D1) − T(D2) would be as far from 0 as was actually observed, or farther, is p = 0.01

So, logically: either we had a uniform random sampling algorithm and got really unlucky, or D1 and D2 are not uniform random samples from a single population… But we already knew they weren’t uniform random samples from a single population, so we know absolutely nothing more than we started with.
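The p-value logic being described is exactly what a permutation test computes. A sketch, using a toy population where D1 and D2 really *are* uniform random halves of one population:

```python
import numpy as np

rng = np.random.default_rng(1)

# A single population, split uniformly at random into D1 and D2.
population = rng.normal(0, 1, 10000)
d1, d2 = population[:5000], population[5000:]
observed = abs(d1.mean() - d2.mean())

# Permutation p-value: reshuffle the pooled data many times and count how
# often the mean difference is at least as extreme as the one observed.
n_perm = 1000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(population)
    count += abs(perm[:5000].mean() - perm[5000:].mean()) >= observed
p = count / n_perm
```

The whole calculation is conditional on the split being uniformly random; when the samples were never randomly drawn in the first place, the number it produces answers a question nobody asked.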

In the early childhood intervention studies, I don’t think an effect of exactly zero is of interest. I think effects are unlikely to be large on average but they can be highly variable, with big effects in some cases and negative effects in others. I do agree with economists that the average effect is of interest, so I’m happy with the use of 95% intervals (and, more generally, hierarchical models) to summarize information from the data, and I’m happy with Bayesian inference to incorporate prior information. I think that various classical tools are helpful but I don’t see the NHST framework as adding anything here; rather, I see it as getting in the way, as discussed in the post linked in my above comment.

Take my rule “Foo”: it’s a sampling rule, but it’s not a *random* sampling rule. Random rules are a subset of all rules, and a powerful subset with very special properties. Random rules very rarely produce grossly unrepresentative samples (for example, there’s only about a 1-in-320-million chance that a uniform random draw from a complete census of all US incomes would pick Bill Gates). Whereas “Foo” might be *intentionally* producing unrepresentative samples.
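The back-of-the-envelope version of that 1-in-320-million claim, using the comment’s round census figure:

```python
# Chance that a single uniform random draw from a census of N people
# picks one specific person (e.g. Bill Gates):
N = 320_000_000
p_single_draw = 1 / N

# Even a uniform random sample of n = 1000 drawn without replacement
# contains that specific person with probability only n / N.
n = 1000
p_in_sample = n / N
```

This is why random rules are “powerful”: the probability of landing on any particular extreme individual is known and tiny, whereas a deliberately chosen rule can hit that individual every single time.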

So what does a p value mean in the context of a sample from a finite population (which, of course, the ACS data is)?

The frequency with which TESTSTAT(SampleData) would be as far from some relevant critical value (like 0) as the observed value, or farther, if SampleData were a random sample from a super-population drawn using a specific, known random sampling distribution (usually uniform, but we can handle alternative random distributions, provided we know what the rule is).

So, is it possible to calculate such a p value in the context of rule “Foo”? The thing is, the calculation of a p value relies on the behavior of *random* sequences. Random sequences are a very specific subset of all possible methods of selection: the ones with very specific mathematical properties. http://models.street-artists.org/2013/03/13/per-martin-lof-and-a-definition-of-random-sequences/

Per Martin-Löf gives a *definition* of a mathematical random sequence there… and now suppose I tell you that rule “Foo” will very definitely fail that test, or even much less stringent ones.

So, logically, it’s *not possible* to calculate a p value. We simply don’t have a random sequence, and even if we did, we wouldn’t know its probability law. For example, if I repeat rule Foo, it might give the SAME EXACT sample every time.

NHST is appropriate as a means of assessing how *randomness* might have caused an observed difference between two samples even if they were actually chosen from the same pool of stuff. But seriously, that’s a HUGE mathematical burden that needs to be discharged before a p value is appropriate, and, by the way, it’s NEVER assessed in standard stats, which is why all this seems confusing. Why not just do what everyone else does and assume a random sample, or some selection process plus randomness, or whatever? Well, that’s just not how cats wind up getting taken to the shelter. There’s absolutely no *random sequence* involved, and yet the math relies heavily on the mathematical properties of such sequences.

I wish that were true, but I still think there’s an unstated rule that if the 95% interval doesn’t exclude zero, it’s not a finding. This leads to a push to get statistically significant results, which leads to overestimates and overstatements of certainty, as here.

A p-value is one piece of evidence. There are some problems where the possibility of explanation by the null hypothesis is a live option, in which case the p-value can be relevant, but in most of the problems I’ve seen in economics and elsewhere this is not the case.

Filtering out strays (which we would expect to be non-representative for various outcomes) and computing ages (from date of birth to intake), it turns out the majority (82%) of the non-declawed cats are actually kittens (aged 1 or under), which seem to have been surrendered as part of unwanted litters (‘too many pets’). They’re probably not good comparison points for behavioural assessment of adult cats.

I figured this out because I was looking at the bodyweight as an outcome, and the non-declawed cats were suspiciously light.

The declawed cats _do_ seem to be overweight on average (11.78 lbs), which could be attributed to pain-restricted activity, but I’m not sure we can say much from this data: these could just be normally overfed housecats, and the non-declawed cats which aren’t kittens are too few to use as a comparison.

I take the American Community Survey (ACS) and write a computer program which sub-samples the survey according to rule “Foo”: 3000 lines of computer code that I won’t show you. As far as you know, I may be an adversary intentionally trying to make you draw a wrong inference. You are allowed to see exactly 1 sample of 1000 families, are not allowed to look at the full ACS, and must do some inference.

In a second scenario, I subsample the ACS uniformly at random using a high-quality RNG, and you’re asked to do the same inference.

Now, what’s the sampling distribution of the difference in means of incomes between black and white families in scenario 1?

How about in scenario 2?
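The contrast between the two scenarios can be made concrete. In this sketch the “ACS” is a purely synthetic toy dataset, and `foo` is one hypothetical adversarial rule among the many that 3000 lines of hidden code could implement:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for the ACS: synthetic incomes for two groups.
acs = {"white": rng.lognormal(11, 0.5, 50000),
       "black": rng.lognormal(11, 0.5, 50000)}

def foo(data):
    # A hypothetical adversarial, deterministic "sampling" rule: it returns
    # the SAME unrepresentative sample every time (the richest of one group,
    # the poorest of the other).
    return {"white": np.sort(data["white"])[-500:],
            "black": np.sort(data["black"])[:500]}

def uniform_sample(data, rng):
    return {g: rng.choice(v, 500, replace=False) for g, v in data.items()}

# Scenario 1: Foo's "sampling distribution" is degenerate -- repeating the
# rule gives an identical mean difference every time, so there is nothing
# from which to compute a p value.
diff1 = foo(acs)["white"].mean() - foo(acs)["black"].mean()
diff1_again = foo(acs)["white"].mean() - foo(acs)["black"].mean()

# Scenario 2: uniform random sampling has a known, well-behaved
# sampling distribution we can actually characterize.
diffs = [uniform_sample(acs, rng)["white"].mean()
         - uniform_sample(acs, rng)["black"].mean() for _ in range(200)]
```

In scenario 2 the spread of `diffs` *is* the sampling distribution; in scenario 1 the “distribution” is a single point, and a biased one at that.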

Suppose a third party reads the code and is allowed to give you a 30-word summary of what they think my rule Foo does. A Bayesian model can encode that understanding of what happens in Foo without needing to describe it in terms of a correct frequency of occurrence, and still give understandable posterior inference. It’s just that you have to interpret the posterior as something other than “the frequency with which we would get result X”.

NHST assumes a random sample from a particular known distribution, but all we have is a haphazard sample with an unknown distribution. The NHST test quantifies the frequency with which certain measurements would occur if the frequencies of outcomes were actually as you assumed them to be.

Bayesian methods assume a structural description of a process (causal or not) plus a plausible set for the unknown quantities describing the process, plus a plausibility of seeing some measurement in a region “around” the structural prediction given the unknown quantity. There need not be any randomness or a “sampling” process, nor does there need to be a known frequency of occurrence. The posterior distribution quantifies the implications of the assumptions you made about what is and what isn’t plausible based on your knowledge, and is not in general a quantification of the frequency of anything.
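The recipe just described (structural model + prior plausibility + likelihood of a measurement near the prediction) can be shown in a minimal grid-approximation sketch. The numbers are invented for illustration:

```python
import numpy as np

# Unknown quantity: theta, e.g. the propensity of some process to produce
# a "problem" outcome. A grid stands in for its continuous range.
theta = np.linspace(0, 1, 1001)

# Prior: a flat plausibility over theta (no randomness assumed anywhere).
prior = np.ones_like(theta)

# Measurement: 7 "problem" outcomes out of 30 observations (made-up data).
k, n = 7, 30
likelihood = theta**k * (1 - theta)**(n - k)

# Posterior: renormalized product of prior plausibility and likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()
post_mean = float((theta * posterior).sum())
```

Nothing here required a random sampling mechanism: `posterior` is a distribution of *plausibility* over theta given the assumptions, not the long-run frequency of anything.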

This also seems to be cross-sectional in the sense that it simply records the pound’s intake, right? So you don’t know the outcome of each adoption, nor how many declawed cats are in the general population these cats come from. That makes the data not very informative. I suppose you could look at the proportion of each surrender reason within the 3 levels of declawing. One possible tack: I see in ‘Intake.Type’ there is a “Return” level, 10 of the 200 entries; if this means the cat was adopted and then subsequently returned to the pound, then maybe it is possible to treat this as longitudinal and see whether declawing predicts returns and behavioral problems. 5% of n=62k gives ~3k returns, so that might be enough to be useful.
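The extrapolation at the end of that comment, spelled out (the 10-of-200 and 62k figures come from the comment itself):

```python
# "Return" intakes observed in the sampled records: 10 of 200, i.e. 5%.
return_rate = 10 / 200

# Extrapolated to the full dataset of ~62,000 intakes:
n_total = 62_000
expected_returns = return_rate * n_total  # ~3,100 returns
```

A few thousand returns is a workable sample size for asking whether declawing predicts being returned, assuming the “Return” label really does mean a failed adoption.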

As it happens, I’m interested in this topic myself, along with my own little research project investigating catnip response (I’m currently running some large-scale international surveys trying to nail down how frequent immunity to catnip is and whether it differs between countries: https://www.gwern.net/Catnip#google-surveys-probability-population-sample ). So I will take a look at the sample of data.

Given the current IRB regime, I doubt the owners were always satisfactorily informed that their data would be used for research, informed of their rights as human subjects, or asked to sign consent forms.

Are you kidding? Cats are very private people.
