## Better to just not see the sausage get made

Mike Carniello writes:

This article in the NYT leads to the full text, in which these statements are buried (no pun intended):

What is the probability that two given texts were written by the same author? This was achieved by posing an alternative null hypothesis H0 (“both texts were written by the same author”) and attempting to reject it by conducting a relevant experiment. If its outcome was unlikely (P ≤ 0.2), we rejected the H0 and concluded that the documents were written by two individuals. Alternatively, if the occurrence of H0 was probable (P > 0.2), we remained agnostic.

See the footnote to this table:

Ahhh, so horrible. The larger research claims might be correct, I have no idea. But I hate to see such crude statistical ideas being used, it’s like using a pickaxe to dig for ancient pottery.

1. Anonymous says:

Besides the various confusions found here, this is a great example of how the threshold changes with the difficulty of collecting data in a given field. If you are getting a new data point every second, you use 0.0000003 as the significance cutoff. Once every couple thousand years apparently yields a cutoff of 0.2. In other words, alpha tracks the expected size of p: it is chosen so that something like 10-50% of the time p will be below the cutoff. Not so rare, but not too common either.
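[A quick editorial illustration of the point about cutoffs, not part of the comment: when the null hypothesis is true, the p-value is uniform on [0, 1], so the fraction of p-values landing below any cutoff alpha is just alpha itself. A minimal simulation, assuming a simple two-sided z-test with a true null:]

```python
import math
import random

random.seed(0)

def one_p_value(n=30):
    """Two-sided z-test p-value when the null (mean = 0) is actually true."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = sum(xs) / math.sqrt(n)               # z ~ N(0, 1) under the null
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

ps = [one_p_value() for _ in range(10_000)]
for alpha in (0.05, 0.2):
    frac = sum(p <= alpha for p in ps) / len(ps)
    print(alpha, round(frac, 2))  # frac comes out close to alpha
```

[So with a 0.2 cutoff, a true "same author" null gets rejected about one time in five, which is the commenter's point about how permissive that threshold is.]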

• In some ways, this is really the *correct* use of p values, as filters to distinguish “the usual stuff” from “not so common stuff” so that you can focus your energy on explaining the stuff of interest (whether it’s the “usual” or the “unusual” depends on the purpose).

2. Lauren says:

What is the correct way to do this?

• Andrew says:

Lauren:

I’d fit a multilevel model with interactions and model everything, rather than trying to isolate statistically significant comparisons.

• Bourne says:

What would be the levels? I’m confused what multilevel model exactly you’d fit. I see the issue with many comparisons, but then I’d just do a Bonferroni type correction. Some detail on what exactly you would use in this situation would be helpful.

• Andrew says:

Bourne:

For examples, check out my book with Jennifer or my many applied research articles. Short answer is that all those little unadjusted comparisons are noisy, and multiple comparisons corrections don’t fix that. My goal is not to reject a null hypothesis; it’s to learn about real differences.

3. jrc says:

“What is the probability that two given texts were written by the same author?” – it is 0 or 1.

“This was achieved by posing an alternative null hypothesis H0” – and all this time I thought the “alternative” and “null” hypotheses were different things!

(“both texts were written by the same author”) – I’m fairly sure this is not a statistical null hypothesis.

“and attempting to reject it by conducting a relevant experiment.” – What is being experimentally manipulated here? What change is experimentally assigned in the world? Sounds like buying epistemological weight by comparing the author’s statistical idea to another one that people already like.

“If its outcome was unlikely (P ≤ 0.2) we rejected the H0” – I actually don’t hate that they developed their own ad hoc cutoff for their application… .2 or .05, whatever. Points for having the chutzpah.

“and concluded that the documents were written by two individuals.” – So if our fake substantive hypothesis (null hypothesis) about single-authorship, as translated into some unspecified statistical hypothesis about the equality/values of some parameters in a statistical model, is calculated to be unlikely, we conclude that our actual substantive hypothesis (alternative hypothesis) is right…

“Alternatively, if the occurrence of H0 was probable (P > 0.2), we remained agnostic.” – …but it is impossible for us to convince ourselves we are wrong.

*Note: I may be mocking the authors (that is a thing we do here) but I am only mocking the writing. I have no idea what statistical tests they actually conducted. I thus have no idea if they are right about literacy rates at this particular military outpost. Nor do I have any knowledge about whether this is a good representation of other outposts. Nor do I have any idea about the distribution of literacy across military/non-military people at the time. Nor do I have any idea whether broad literacy is required for the Bible to be written. But I am pretty sure this is terrible writing.

• Carlos Ungil says:

> (“both texts were written by the same author”) – I’m fairly sure this is not a statistical null hypothesis

Why not? It seems they are somehow measuring the similarity for each pair of texts, and they have a model that provides a distribution under the null hypothesis (i.e. when comparing two texts from the same author).

• Carlos Ungil says:

They detail the procedure here http://www.pnas.org/content/suppl/2016/04/05/1522200113.DCSupplemental/pnas.201522200SI.pdf

I looked at it very superficially, but it seems they are doing a randomization test.

• jrc says:

The statistical null hypothesis would be a statement about parameters in their model (a common one being Beta=0). The statement about the world above is them interpreting that null statistical hypothesis in terms of a feature of the world that would make it true (the same person being the author of both texts). But surely other features of the world could make a null statistical hypothesis of this sort true (e.g. “everyone who wrote at that time wrote similarly” – though I admit it is hard to know if this example works without actually understanding the statistical null hypothesis they are testing).

But setting aside the question of alternative interpretations of what would generate results consistent with the statistical null hypothesis, the null itself has nothing to do with actual authorship of the texts. That’s not how statistical null hypotheses work. They relate only to parameters in the model.

• Carlos Ungil says:

How does the null hypothesis in a non-parametric statistical test relate to the parameters in the model?

• You could consider “author” as a parameter with discrete values.

• jrc says:

Not gonna lie, clever burn. I guess you are right, a statistical null hypothesis could be about features of the empirical distribution function that are not directly parameterized (like a K-S test).

But that said, it still doesn’t in any way save them from my argument that they are conflating a statistical null hypothesis and a substantive scientific hypothesis, right? The null hypothesis “the two authors are the same” cannot be tested in a statistical sense. The null hypothesis “the distribution of differences in character spacing between these texts is X” is testable statistically (I mean, if you believe their counterfactual distribution X) and I think is closer to their actual null hypothesis*, which in any case must be a statement about the similarity of two empirical distribution functions.

People in the field can argue whether the evidence from their statistical test should be interpreted as evidence for their theoretical hypothesis. But I still don’t see how they could legitimately claim to be “testing” the “null hypothesis” that “both texts were written by the same author”.

But good point about the limitations of my first formulation and my over-reliance on thinking about testing regression coefficients, when I should’ve thought about statistical testing more broadly.

*In a sense, I feel like they are parameterizing “before” the test, by generating a parameter in an implicit model of people’s writing that is “distance between characters” and making an assumption about its stability, yeah? In general, I think this is true, where most “non-parametric” tests rely on an underlying theoretical structure that is in some way at least implicitly “parameterized”, but maybe that is just me confusing “parameterization” and “measurement”.
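[An editorial illustration of the kind of null jrc concedes is testable: a statement about two empirical distribution functions, here compared with the two-sample Kolmogorov-Smirnov statistic. The "character spacing" numbers are invented for the example, not taken from the paper.]

```python
import bisect

def ks_statistic(a, b):
    """Maximum gap between the two samples' empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, v):
        # Fraction of xs that are <= v (xs must be sorted).
        return bisect.bisect_right(xs, v) / len(xs)

    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

# Toy "character spacing" measurements from two hypothetical texts.
spacings1 = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8]
spacings2 = [1.6, 1.8, 1.5, 1.7, 1.9, 1.4, 2.0]
print(ks_statistic(spacings1, spacings2))  # 1.0: the two ECDFs never overlap
```

[The statistic only says the two empirical distributions differ; whether that difference means "different authors" is exactly the substantive leap being argued about above.]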

• If it really is spacing between characters they’re looking at, then they’re probably making some enormously strong assumptions that they aren’t even acknowledging. Like

“Distribution of Spacing between characters is a stable property of a given author”

• e f d p says:

Yeah, what immediately came to mind when reading that phrase was that they’re modeling the distributions of the variables they measure (properties of particular characters?) as conditional on “author,” and they have (nested) models where “author” is free to vary vs. fixed.

I’d have assumed some kind of likelihood-ratio test, but from skimming the methods it looks like they’re not actually modeling the distributions of the variables they’re measuring or calculating likelihoods. Instead they’re kind of… modeling the partitioning of the characters into two “authors”? And then calculating goodness-of-fit by how well a given partitioning matches the original? It seems to depend on the assumption that for characters from a single “author”, all k-means partitions of characters based on the measured feature vectors should be equally likely (this part seems dangerous to me; there could be other sources of structure).

The part where they combine p-values across characters makes me squirm a little, too. They do at least have some kind of simulation; with no domain expertise I can’t really evaluate whether it’s realistic.

But at least from a super cursory look — while I think their approach looks kind of ad hoc and overcomplicated, I don’t think the way they described the null is totally misleading, especially because the purpose of that statement is probably just to give an intuition about how the test works to a general audience.
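[For the "combining p-values across characters" step that e f d p mentions: one standard way to do this, though not necessarily what the authors did, is Fisher's method, which assumes the individual p-values are independent (dependence between characters is precisely the sort of thing that would make one squirm). A minimal sketch:]

```python
import math

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p_i) is chi-squared with 2k degrees
    of freedom when all k individual nulls are true and the p-values are
    independent."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return stat, 2 * len(pvals)

# Hypothetical per-character p-values.
stat, df = fisher_combine([0.1, 0.3, 0.05, 0.2])
print(round(stat, 2), df)  # compare stat against a chi-squared(df) table
```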

• ‘“What is the probability that two given texts were written by the same author?” – it is 0 or 1.’

The statement “The two texts were written by the same author” has an objective truth value which is either 0 or 1, but we aren’t privy to it. Since we aren’t privy to it, we can discuss the probability (degree of credence to be given to the statement), which is necessarily a quantity that is related to the knowledge we do have, and so has no one single “correct” value (i.e. it doesn’t make sense to say “the true probability of these two texts being written by the same author is 0.22”).

There is no sense in which “The two texts were written by the same author” has any “frequency of occurrence” (at least once you’ve specified exactly which two texts you’re talking about).
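[To make the credence point concrete, an editorial sketch: a posterior probability of "same author" depends entirely on the prior and likelihoods one brings to the problem. Every number below is an assumption invented for the illustration, not an estimate from the paper.]

```python
# Assumed inputs -- different analysts would plug in different numbers.
prior_same = 0.5             # prior credence in "same author"
like_given_same = 0.10       # P(observed similarity | same author)
like_given_diff = 0.40       # P(observed similarity | different authors)

# Bayes' rule: credence given THESE assumptions, not a "true" probability.
posterior_same = (prior_same * like_given_same) / (
    prior_same * like_given_same + (1 - prior_same) * like_given_diff
)
print(round(posterior_same, 2))  # 0.2
```

[Change any of the three inputs and the answer changes, which is the commenter's point: the probability is relative to the knowledge we have.]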

So, right from the start, this whole thing sounds confused (I have no access to the full text).