Alex Chernavsky writes:

I discovered your blog through a mutual friend – the late Seth Roberts. I’m not a statistician. I’m a cat-loving IT guy who works for an animal shelter in Upstate New York.

I have a dataset that consists of 17-years’-worth of animal admissions data. When an owner surrenders an animal to us, we capture the main reason for the admission (as told to us by the owner). Some of the reasons are generic: owner moving to no-pets apartment, can’t afford the pet, no time, family members have allergies, etc. But some of the reasons are related to the specific characteristics of the animal: animal is aggressive, pees outside the litter box, hides and acts skittish, etc.

Much of the sheltering community has a long-standing belief that declawed cats are more likely to have behavioral problems, as compared to intact cats. I’d like to test this hypothesis by analyzing the dataset (which also contains the declawed status of all the cats admitted to us). The dataset contains around 100,000 cat admissions, and approximately 4% of those cats were declawed.

I’ve been reading your blog enough to know that you’re not fond of null hypothesis significance testing. What approach would you recommend in this situation? Can you point me in the right direction? I might be collaborating with a veterinary epidemiologist at Cornell Vet School, and possibly also with a data scientist from Booz Allen Hamilton.

My reply: I’d go with the standard approaches for causal inference from observational (that is, non-experimental) data, as discussed for example in chapters 9 and 10 of my book with Jennifer Hill. You’d compare the “treatment group” (declawed cats) to the “control group” (all others), controlling for pre-treatment variables (age, type, size of cat; characteristics of the family the cat is living with; whether the cat lives indoors or outdoors; etc.). There are selection bias concerns if the cats were declawed because they were scratching people too much.
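A minimal sketch of that comparison, using invented records (the shelter's actual schema and values will differ): tabulate the proportion of behavior-related surrender reasons for declawed vs. clawed cats within strata of a pre-treatment variable such as age.

```python
from collections import defaultdict

# Hypothetical admission records: (declaw_status, age_group, behavioral_reason).
# Field names and values are illustrative, not the shelter's actual schema.
admissions = [
    ("declawed", "adult",  True),  ("declawed", "adult",  False),
    ("declawed", "senior", True),  ("declawed", "adult",  False),
    ("clawed",   "adult",  True),  ("clawed",   "kitten", False),
    ("clawed",   "adult",  False), ("clawed",   "senior", True),
    ("clawed",   "kitten", False), ("clawed",   "adult",  False),
]

# Proportion of behavioral-reason surrenders by declaw status within each
# age stratum -- a crude stand-in for "controlling for" a pre-treatment variable.
counts = defaultdict(lambda: [0, 0])  # (status, age) -> [behavioral, total]
for status, age, behavioral in admissions:
    counts[(status, age)][1] += 1
    if behavioral:
        counts[(status, age)][0] += 1

for (status, age), (b, n) in sorted(counts.items()):
    print(f"{status:8s} {age:6s}: {b}/{n} = {b / n:.2f}")
```

With real data one would replace the toy stratification with regression adjustment over many pre-treatment variables, as in the book chapters mentioned above.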

The statement, “Declawing *causes* cats to be more likely to have behavioral problems,” is not the same as the statement, “Declawed cats are more likely to have behavioral problems, as compared to intact cats.” The first of these statements is implicitly a comparison of what would happen for an individual cat if he or she were declawed, while the second statement is a comparison between two different sets of cats, who might differ in all sorts of other ways.

So your analysis might be tentative. But the starting point would be to see how the comparison looks in your data.

**P.S.** Speedy cat image from Guy Srinivasan.

**P.P.S.** Chernavsky adds this link and says that if any readers out there are interested in collaborating on this project, he has all the data and is looking for someone to help analyze it properly.

How about releasing the dataset? Since the subjects are cats, there shouldn’t be any legal impediments or privacy concerns.

Gwern:

Are you kidding? Cats are very private people.

Bah. If they really cared about privacy, they wouldn’t lick their crotches right in front of us constantly!

Clearly you’ve never taken a tour of the NSA.

You can’t generalize about cats — every cat makes their own rules. Each is private about whatever they choose to be private about, and reserves the right to claw (or bite) any human who goes against their rules.

While the research is about cats, the unit of observation in the proposed analysis is the cat as described by its surrendering owner, who answered the questions that were recorded to provide the data. So there were human subjects who provided the data.

Given the current IRB regime, I doubt the owners were always satisfactorily informed that their data would be used for research, informed of their rights as human subjects, and asked to sign consent forms.

IRBs have been loosened recently so that anonymized surveys no longer require full IRB approval or consenting, and this is primarily anonymous data about non-humans in the first place: http://www.chronicle.com/article/Long-Sought-Research/239459/ Which is as it should always have been, because how could anyone possibly be meaningfully harmed by this data?

Actually, a warning: although the government has changed the rules, many local IRBs have not yet changed their processes.

The article is not accessible to non-Chronicle subscribers. Can you (or someone else) give another publicly accessible link to the information? (In particular, what constitutes acceptable “anonymizing”?).

Gwern: I’d rather take a crack at doing the analysis myself (or with collaborators) before making the dataset public. Also, despite the lack of formal legal impediments, I’d have to jump through some hoops on my end before I could get internal approval to release the data.

My suggestion would be to start getting approval now, for two reasons: 1. you might get bored or distracted and never finish it up properly, so the data should at least be released for others; 2. it’ll probably take longer than you expect to get approval because it always does.

As it happens, I’m interested in this topic myself, along with my own little research project investigating catnip response (I’m currently running some large-scale international surveys trying to nail down how frequent immunity to catnip is and whether it differs between countries: https://www.gwern.net/Catnip#google-surveys-probability-population-sample ). So I will take a look at the sample of data.

Looking at the sample, I see some data cleaning issues that would need to be addressed (‘Purebred’ is logical, but does -1 mean True or False? what do “I am a sidekick”/”I’m a Personal Assistant”/… mean inside ‘Distinguishing.Markings’? are “Schedule”/”Scheduled” different things in ‘Intake.Subtype’?).

This also seems to be cross-sectional in the sense that it simply records the pound’s intake, right? So you don’t know the outcome of each adoption, nor how many declawed cats are in the general population these cats are coming from; that makes the data not very informative. I suppose you could look at the proportion of reasons within the 3 levels of declawing. One possible tack: I see in ‘Intake.Type’ there is a “Return” level, 10 of the 200 entries; if this means that the cat was adopted and then subsequently returned to the pound, then maybe it is possible to treat this as longitudinal and see whether declawing predicts returns and behavioral problems. 5% over n=62k gives ~3k returns, so that might be enough to be useful.

“Sidekick”, “Personal Assistant”, etc. were part of a marketing program that involved categorizing cats based on their alleged personality. The whole thing was silly and was eventually discontinued. Schedule / scheduled mean the same thing — that the person who surrendered the cat made an appointment in advance, rather than just showing up at our door. The outcome (adoption, transfer to another group, or euthanasia) isn’t included with this dataset, but I could add it without too much difficulty. “I suppose you could look at proportion of reason within the 3 levels of declawing.” Yes, that was my thought.
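If the outcome field does get added, the “Return” tack could start as simply as a return-rate comparison across the three declaw levels. A sketch with invented rows (the level and type names are stand-ins for the sample’s actual coding):

```python
from collections import Counter

# Illustrative intake rows: (declaw_status, intake_type).
# These particular rows are made up for demonstration.
rows = [
    ("No", "Surrender"), ("No", "Return"), ("No", "Surrender"),
    ("No", "Surrender"), ("No", "Surrender"),
    ("Yes", "Return"), ("Yes", "Surrender"), ("Yes", "Return"),
    ("Unknown", "Surrender"), ("Unknown", "Surrender"),
]

totals = Counter(status for status, _ in rows)
returns = Counter(status for status, kind in rows if kind == "Return")

for status in ("No", "Yes", "Unknown"):
    rate = returns[status] / totals[status]
    print(f"declawed={status:7s}: {returns[status]}/{totals[status]} returned ({rate:.0%})")
```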

I’d also add that siloing of data very obviously causes harm in human health issues, so release of this data is the best way to get good outcomes for the cats.

To get actual causal inference, and not just correlation, I’d suggest trying to find an instrumental variable. Maybe whether the home has young children or not? Finding a valid one might not be easy.

Another issue is that shelter cats are a nonrandom sample of cats. One should think carefully about how to extrapolate the treatment effect to cats outside the sample.

Indeed, I would think they are an especially unusual sample of cats, and perhaps in a way that would bias the data against your hypothesis. I’m thinking of a Berkson’s bias effect here. If a cat has behavioral problems but still has its claws, it may exhibit its problems through destructive scratching of people or people’s things, leading the owner to get rid of the cat. By contrast, a cat with a similar behavioral problem that is declawed will do no harm to people or things with its non-existent claws and is less likely to be cast away by its owner. So by selecting specifically a sample of cats that have been cast away, behavioral problem cats with their claws intact are likely to be *over*-represented compared to their actual prevalence in the general domesticated cat population.
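That selection story can be made concrete with a toy simulation in which declawing has zero true effect on behavior, yet clawed problem cats are surrendered more often; every rate below is invented for illustration:

```python
import random

random.seed(42)

# Toy simulation of the selection effect: in this fake population,
# declawing has NO effect on behavioral problems, but clawed problem
# cats are surrendered more often (they damage people and things).
N = 100_000
P_PROBLEM = 0.20          # same for declawed and clawed: zero true effect
P_DECLAWED = 0.04
P_SURRENDER = {           # (has_problem, declawed) -> surrender probability
    (True, False): 0.50,  # clawed problem cats get cast away most often
    (True, True): 0.20,
    (False, False): 0.05,
    (False, True): 0.05,
}

shelter = []  # (declawed, has_problem) for surrendered cats only
for _ in range(N):
    declawed = random.random() < P_DECLAWED
    problem = random.random() < P_PROBLEM
    if random.random() < P_SURRENDER[(problem, declawed)]:
        shelter.append((declawed, problem))

def problem_rate(declawed_flag):
    group = [p for d, p in shelter if d == declawed_flag]
    return sum(group) / len(group)

# Despite the zero true effect, the shelter sample over-represents
# clawed problem cats: analytically 0.1/0.14 ~= 0.71 vs 0.04/0.08 = 0.50.
print(f"shelter problem rate, clawed:   {problem_rate(False):.2f}")
print(f"shelter problem rate, declawed: {problem_rate(True):.2f}")
```

So a naive shelter-only comparison here would “find” that declawing reduces behavioral problems, purely from who gets surrendered.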

It’s curious that the vast majority of sheltered cats are intact. Makes me wonder what the prevalence of declawing is. If behavioral issues are a minor reason for bringing cats to a shelter, I’d be worried about drawing conclusions from this dataset.

So the suggested approach here is NHST… why no outcry from the usual suspects?

I was pretty sure that chapters 9 and 10 of Gelman and Hill don’t suggest using NHST, but just to be super extra sure I double-checked right now. Nope, no NHST.

Ok, but what’s the practical difference? There are people in the comments who deride any statistical exercise that basically reduces to a comparison of means (control/treatment). This is a natural fit for NHST; the null of control being the same as treatment is a very relevant one.

? The opposition to NHST doesn’t say you can’t compare means or adjusted means (or in this case proportions). And no, a nonrandom sample of cats, selected on the basis of a variable potentially related to the dependent variable of interest, is not a “natural fit” for anything based on hypothetical sampling distributions. Even someone committed to NHST would say that it wouldn’t apply in this case.

Right, the difference between NHST and Bayesian analysis isn’t that in NHST you compare two group means, and Bayesian analysis you don’t… or the like. The difference is in how you model the process.

NHST assumes a random sample from a particular known distribution, but all we have is a haphazard sample with an unknown distribution. The NHST test quantifies the frequency with which certain measurements would occur if the true frequency of outcomes matched the assumptions you made about that frequency.

Bayesian methods assume a structural description of a process (causal or not) plus a plausible set for the unknown quantities describing the process, plus a plausibility of seeing some measurement in a region “around” the structural prediction given the unknown quantity. There need not be any randomness or a “sampling” process, nor does there need to be a known frequency of occurrence. The posterior distribution quantifies the implications of the assumptions you made about what is and what isn’t plausible based on your knowledge, and is not in general a quantification of the frequency of anything.
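As a minimal illustration of that contrast, here is a Bayesian comparison of two surrender-reason proportions with invented counts, using uniform Beta(1, 1) priors; the posterior quantifies plausibility given these assumptions, not the frequency of anything:

```python
import random

random.seed(1)

# Invented counts: behavioral-reason surrenders out of total surrenders
# in each group. Real counts would come from the shelter data.
behav_declawed, n_declawed = 120, 400
behav_clawed, n_clawed = 2100, 9600

def posterior_draws(k, n, size=20_000):
    # Beta(1 + k, 1 + n - k) is the posterior for a proportion
    # under a uniform prior, with k "successes" in n trials.
    return [random.betavariate(1 + k, 1 + n - k) for _ in range(size)]

d = posterior_draws(behav_declawed, n_declawed)
c = posterior_draws(behav_clawed, n_clawed)

# Monte Carlo estimate of the posterior probability that the
# declawed proportion exceeds the clawed one:
p_higher = sum(di > ci for di, ci in zip(d, c)) / len(d)
print(f"posterior mean, declawed: {sum(d) / len(d):.3f}")
print(f"posterior mean, clawed:   {sum(c) / len(c):.3f}")
print(f"P(theta_declawed > theta_clawed | data) = {p_higher:.3f}")
```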

? So if it’s a non-experimental sample of data we can’t apply NHST? Control for confounders, find an instrument, etc., then do NHST. ^Daniel: suppose we adopt the potential-outcomes framework, as in chapters 9 and 10 of Gelman and Hill. I don’t see the practical difference here between a Bayesian approach and a frequentist one. I understand that fundamentally, or philosophically, they are different, but I don’t think a frequentist and a Bayesian would come to very different conclusions here.

If it’s a haphazard sample you have no idea what the relationship between the sampling distribution of the mean and the population parameter is. I’ll give you an example…

I take the American Community Survey (ACS) and I write a computer program which sub-samples the survey according to rule “Foo”, which is 3000 lines of computer code that I won’t show you; for all you know, I may be an adversary who is intentionally attempting to make you draw a wrong inference. You are allowed to see exactly 1 sample of 1000 families, are not allowed to look at the full ACS, and must do some inference.

In a second scenario, I subsample the ACS uniformly at random using a high-quality RNG, and you’re asked to do the same inference.

Now, what’s the sampling distribution of the difference in means of incomes between black and white families in scenario 1?

How about in scenario 2?

Suppose a third party reads the code and is allowed to give you a 30-word summary of what they think my rule Foo does. A Bayesian model can describe that understanding of what happens in Foo without needing to describe it in terms of a correct frequency of occurrence, and still give understandable posterior inference. It’s just that you have to interpret the posterior inference as something other than “the frequency with which we would get result X”.

Okay, but it’s not like a frequentist framework doesn’t have methods to deal with non-random sampling. To me, you are just describing a scenario where there is some sort of selection into the sample; this can be dealt with in many ways in a frequentist framework. Maybe I’m missing the point. Unfortunately, with nearly all of your comments, Daniel, I have a hard time understanding what exactly you are getting at; they seem nearly cryptic to me. Obviously this is due to my lack of knowledge, but I do have a fair amount of statistical training; I think you should be able to make your comments more comprehensible for a beginner Bayesian.

Matt, I’m sorry if my explanations are cryptic. I honestly think having a bunch of statistical training is a detriment here, because almost all of it is likely to be “standard” and “standard” statistical stuff is honestly pretty confused. But let’s continue.

Take my rule “Foo”: it’s a sampling rule, but it’s not a *random* sampling rule. Random rules are a subset of all rules, and they’re a powerful subset with very special properties. Random rules very, very rarely produce grossly unrepresentative samples (for example, there’s only about a 1-in-320-million chance that a uniform random draw would choose Bill Gates out of a complete census of all US incomes). Whereas “Foo” might be *intentionally* producing unrepresentative samples.

So, what does it mean to be a p value in the context of a sample from a finite population (which, of course, the ACS data is)?

The frequency with which TESTSTAT(SampleData) would be as far or farther than the observed TESTSTAT(SampleData) from some relevant critical value (like 0), if the SampleData were a random sample from a super-population using a specific, known random sampling distribution (usually uniformly at random, but we can handle alternative random distributions, provided we know what the rule is).

So, is it possible to calculate such a p value in the context of rule “Foo”? The thing is, the calculation of a p value relies on the behavior of *random* sequences. Random sequences are very specific subsets of all possible methods of selection, they’re the ones that have very specific mathematical properties. http://models.street-artists.org/2013/03/13/per-martin-lof-and-a-definition-of-random-sequences/

Per Martin-Löf gives a *definition* of a mathematical random sequence there… and now let me tell you that rule “Foo” will very definitely fail that test, or even much less stringent ones.

So, logically, it’s *not possible* to calculate a p value. We simply don’t have a random sequence, and even if we did, we don’t know what its probability law is. For example, if I repeat the rule Foo, it might in fact give the SAME EXACT sample every time.

NHST is appropriate as a means of assessing how *randomness* might have caused an observed difference between two samples even if they were actually chosen from the same pool of stuff. But seriously, that’s a HUGE mathematical burden that really needs to be assessed before a p value is appropriate, and, by the way, it’s NEVER assessed in standard stats; that’s why it seems confusing. Why not just go ahead and do what everyone else does, which is assume a random sample, or some selection process plus randomness, or whatever? Well, that’s just not how cats wind up getting taken to the shelter. There’s absolutely no *random sequence* involved, and yet the math relies heavily on the mathematical properties of such sequences.

Thank you. That is pretty interesting, I’ll ponder it.

Well, I should say, it’s logically possible to calculate a p value, but it won’t tell you what you think it will. The p value says this kind of thing:

If D1 and D2 are uniform random samples from a single population, then the probability that T(D1) − T(D2) would be as far away from 0 as was actually observed, or farther, is p = 0.01.

So, logically: either we had a uniform random sampling algorithm and got really unlucky, or D1 and D2 are not uniform random samples from a single population. But we already knew that they weren’t uniform random samples from a single population, so we know absolutely nothing more than we started with.
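That conditional statement can be computed directly as a permutation test, sketched here on made-up measurements; note that nothing in the computation can tell you whether the random-sampling premise actually held:

```python
import random

random.seed(0)

# Made-up measurements for two groups; the p-value below is only
# meaningful if the groups were produced by a random mechanism,
# an assumption the calculation itself cannot check.
d1 = [2.1, 1.8, 2.5, 2.2, 1.9, 2.4, 2.0, 2.3]
d2 = [1.6, 1.9, 1.5, 1.8, 1.7, 1.4, 1.6, 1.8]

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(d1) - mean(d2))

# Frequency with which random relabeling of the pooled data gives a
# difference in means at least as extreme as the one observed:
pooled = d1 + d2
n_extreme, n_perm = 0, 10_000
for _ in range(n_perm):
    random.shuffle(pooled)
    if abs(mean(pooled[:8]) - mean(pooled[8:])) >= observed:
        n_extreme += 1

print(f"observed difference: {observed:.4f}")
print(f"permutation p-value: {n_extreme / n_perm:.4f}")
```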

Where did I say that? I said you have to have a random sample. Most of the time, experiments may have random assignment, but they don’t have random samples for purposes of external validity.

As I see it, the practical difference is in the avoidance of dichotomization of one’s conclusions.

Thank you. I agree completely. Sometimes I wonder if this blog goes a little too hard on NHST; with the exception of some terrible research in social psychology (why it gets so much attention here I still don’t quite understand, as it is very low-hanging fruit as far as criticism goes), I think p-values just get dragged along for the ride with the estimates and standard errors. For example, I think it’s pretty rare in an economics seminar to hear the words “null hypothesis” these days; it’s more about setting up a model and estimating it. However, almost all estimates will have p-values with them, as this is just standard output from the software most applied economists use.

Matt:

A p-value is one piece of evidence. There are some problems where the possibility of explanation by the null hypothesis is a live option, in which case the p-value can be relevant, but in most of the problems I’ve seen in economics and elsewhere this is not the case.

Andrew – Yes, I agree. The point I was trying to make was that often in economics p-values are reported, but I don’t think that the researchers actually view the null as relevant. The emphasis is placed on the estimate and its standard error, as opposed to a dichotomous decision to reject/not reject.

Matt:

I wish that were true but I still think there’s an unstated rule that if the 95% interval doesn’t exclude zero, it’s not a finding. This leads to a push to get statistically significant results, which leads to overestimates and overstatements of certainty, as here.

Fair enough. Certainly it depends on the area; anything involving randomized trials is probably going to care about significance. Do you really think that the null is not relevant in most randomized-trial settings? The null of control and treatment groups having identical average outcomes seems relevant to me. For context, take some of Heckman’s early childhood intervention papers.

Matt:

In the early childhood intervention studies, I don’t think an effect of exactly zero is of interest. I think effects are unlikely to be large on average but they can be highly variable, with big effects in some cases and negative effects in others. I do agree with economists that the average effect is of interest, so I’m happy with the use of 95% intervals (and, more generally, hierarchical models) to summarize information from the data, and I’m happy with Bayesian inference to incorporate prior information. I think that various classical tools are helpful but I don’t see the NHST framework as adding anything here; rather, I see it as getting in the way, as discussed in the post linked in my above comment.

Ya I agree it adds nothing. Just don’t think it is necessarily doing a lot of harm in this case.

As usual, the problems begin with defining the question for NHST rather than scientific purposes: I wouldn’t bother testing this “hypothesis”; to me it is a total waste of time. NHST doesn’t even do that, though; it will test whether the behavioral data on declawed cats is sampled from the same distribution as un-declawed cats. That hypothesis is false; we can reject it without any data.

Why not instead just build a classifier that predicts whether a cat has behavioral problems or not, using declawing as (one of many) features? If successful, then you could get some idea of whether a cat will be a problem or not before adopting or buying. That seems useful to me… what do you think the result of the NHST test gives? I see nothing of value there.
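One way to see what the classifier framing buys you, sketched on synthetic cats with invented rates: compare predictive error (Brier score) with and without the declaw feature. A real attempt would use many more features and a proper model (logistic regression, trees, …):

```python
import random
from collections import defaultdict

random.seed(7)

# Synthetic cats with two binary features (declawed, indoor-only);
# "problem" risk depends on both. All rates here are invented.
def make_cat():
    declawed = random.random() < 0.2
    indoor = random.random() < 0.6
    p = 0.15 + 0.20 * declawed + 0.10 * indoor
    return declawed, indoor, random.random() < p

train = [make_cat() for _ in range(20_000)]
test = [make_cat() for _ in range(10_000)]

def fit(features):
    """Empirical P(problem | feature combination) lookup table."""
    stats = defaultdict(lambda: [0, 0])  # combo -> [problems, total]
    for cat in train:
        key = tuple(cat[j] for j in features)
        stats[key][1] += 1
        stats[key][0] += cat[2]
    return {k: b / n for k, (b, n) in stats.items()}

def brier(model, features):
    """Mean squared error of predicted probabilities on held-out cats."""
    return sum((model[tuple(cat[j] for j in features)] - cat[2]) ** 2
               for cat in test) / len(test)

with_declaw = brier(fit((0, 1)), (0, 1))  # declawed + indoor features
without_declaw = brier(fit((1,)), (1,))   # indoor feature only
print(f"Brier score with declaw feature:    {with_declaw:.4f}")
print(f"Brier score without declaw feature: {without_declaw:.4f}")
```

The output is a usable predicted risk per cat, rather than a yes/no verdict about a null nobody believes.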

I had a look at the sample provided, and wanted to flag an issue with trying to draw inference from this data.

Filtering out strays (which we would expect to be non-representative for various outcomes) and looking at the ages (from date of birth to intake), it looks like the majority (82%) of the non-declawed cats are actually kittens (aged 1 or under), which seem to have been surrendered as part of unwanted litters (‘too many pets’). They’re probably not good comparison points for a behavioural assessment of adult cats.

I figured this out because I was looking at the bodyweight as an outcome, and the non-declawed cats were suspiciously light.

The declawed cats _do_ seem to be overweight on average (11.78 lbs), which could be attributed to pain-restricted activity, but I’m not sure we can say much from this data: these could just be normally overfed housecats, and the non-declawed cats (which aren’t kittens) are too few to use as a comparison.

That’s a good point.

Another obvious issue is this: why would you declaw a cat (a relatively expensive procedure) if the cat didn’t have furniture-clawing problems? Yes, some owners might do it routinely, but I would think most would have a “wait and see” attitude.

If I were doing this analysis, with about 96000 intact and 4000 declawed cats, I’d take half the data and just assume I’m going to burn through it doing various exploratory analysis and seeing what the heck seems to be going on here. Once I wrote up what I thought I knew, I’d use the other half as validation.
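The split itself is trivial; the discipline is in freezing it once and never peeking at the validation half during exploration. A sketch (`admission_ids` is a stand-in for whatever keys the real dataset uses):

```python
import random

random.seed(123)

# Shuffle the admission IDs once, freeze the halves, explore on one,
# and keep the other untouched for validating the write-up.
admission_ids = list(range(100_000))
random.shuffle(admission_ids)

explore = set(admission_ids[:50_000])
validate = set(admission_ids[50_000:])

assert explore.isdisjoint(validate)  # no cat appears in both halves
print(len(explore), len(validate))
```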

I think it goes farther than that. If you want an answer to the question “will this cat have behavioral problems if declawed?”, you need to collect data on the cat before a declawing (or not) occurs. Many times the question of interest really cannot be answered from the available data. Rather than admitting this, people answer some similar sounding question instead, which leads to all sorts of confusion.

For example, this [1] paper. They wanted to know which pneumonia patients would die if *not* admitted to the hospital, but used data for pneumonia patients *who were* admitted to the hospital, and then were surprised by counterintuitive results and came up with all sorts of solutions to the wrong problem.

[1] Caruana et al. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. http://dx.doi.org/10.1145/2783258.2788613; http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf

zbicyclist: “why would you declaw a cat… if the cat didn’t have furniture-clawing problems?” Yeah, but no-one declaws a cat for house-soiling. And house-soiling is one of the adverse effects that is suspected to occur as a consequence of declawing (and is also a fairly common reason why cats are admitted to shelters).