This might be due to most of us learning statistics with reference to a single study – unlike Fisher, who in his early writings often discussed issues in the context of multiple studies (even if just hypothetically).

Also, I was very lucky that in my first stats course, taught by Don Fraser, the sections in his text on combining systems/studies/estimates by multiplying likelihoods versus combining unbiased estimates caught my interest (mostly by confusing me beyond measure at first).

* A quote from Peirce might suffice: “I [Peirce] do not call the solitary studies of a single man a science. It is only when a group of men, more or less in intercommunication, are aiding and stimulating one another by their understanding of a particular group of studies as outsiders cannot understand them, that I call their life a science.” Understanding of a particular group of studies is key – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

Now, let us forget about heuristics for a moment: it doesn’t really matter what sort of strategy the participants are concretely using. Whatever heuristic they use leads them to some degree of certainty about which option is the correct one: if they are using the correct heuristic, they’ll end up with some positive amount of certainty and will give – on average – more than 50 percent correct answers.

This can be quantified with the following formula:

curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
      xlab = "Certainty that the correct answer is correct")

Indeed, when the “certainty” is zero, the participant doesn’t really have any idea what to do: they are equally certain about both options and will respond randomly. If the certainty is _negative_, they’ll be more certain about the _incorrect_ option being correct and will end up answering correctly less than 50 percent of the time. Conversely for positive values.
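A quick numeric check of that formula (a stdlib Python translation of R’s pnorm, just for illustration):

```python
from math import erf, sqrt

def pnorm(x):
    # Standard normal CDF, the equivalent of R's pnorm()
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_correct(certainty):
    # P(correct) = pnorm(certainty / sqrt(2)), as in the curve() call above
    return pnorm(certainty / sqrt(2.0))

print(p_correct(0.0))   # 0.5: zero certainty means random responding
print(p_correct(0.5))   # above 0.5: certainty about the correct option helps
print(p_correct(-0.5))  # below 0.5: certainty about the wrong option hurts
```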

This holds when we assume that the participants are unbiased. Let us instead drop that assumption: participants can be biased towards selecting one of the options, even against their “internal certainty”. The next figure plots the behaviour of biased participants:

curve(pnorm(x / sqrt(2)), -1, 1, ylab = "P(correct)",
      xlab = "Certainty that the correct answer is correct", ylim = c(0, 1))
curve(pnorm((x - 2) / sqrt(2)), -1, 1, add = TRUE, col = "red")
curve(pnorm((x + 2) / sqrt(2)), -1, 1, add = TRUE, col = "blue")
abline(v = 0.7, lty = 2)

In this figure the black line plots the behaviour of an unbiased participant, as before, while the red and blue lines plot the behaviour of biased participants. It is important to note that the “internal certainty” for the participants represented by the red and blue lines is the same as for the participant represented by the black line: their probabilities of responding correctly differ only because of their decisional bias.

The dashed vertical line marks a fixed point on the “certainty” scale, here 0.7: even when the certainty stays the same, a participant biased towards one of the options will have a higher or lower probability of selecting the correct answer, depending on the sign of the bias.
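Reading the three curves off at the dashed line makes the point concrete (a stdlib Python sketch, reusing the ±2 bias values from the plotting code above):

```python
from math import erf, sqrt

def pnorm(x):
    # Standard normal CDF, the equivalent of R's pnorm()
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

certainty = 0.7  # the dashed vertical line
unbiased      = pnorm(certainty / sqrt(2.0))         # black line, ~0.69
biased_away   = pnorm((certainty - 2) / sqrt(2.0))   # red line, ~0.18
biased_toward = pnorm((certainty + 2) / sqrt(2.0))   # blue line, ~0.97
print(unbiased, biased_away, biased_toward)
```

Same internal certainty in all three cases; only the decisional bias moves P(correct).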

In this way it is not necessarily numeracy that is affected – which would be causally linked to the level of internal certainty – but the decisional process, i.e. the participant’s bias.

Now, I’m not suggesting that this is the case; it’s just something that popped into my mind.

With regard to apparent replication between two studies, one can:

1: Compare intervals of parameter values that are compatible with the observations in each (Sander Greenland argues such intervals should be called compatibility intervals, as they are actually overconfidence intervals).

2: Compare intervals of parameter values that are most supported by the observations in each, using a specific data-generating model appropriate to each study – that is, possibly differing data-generating models or likelihoods for each – perhaps averaged over the same prior (as I believe that for assessing apparent replication the prior should be the same; that is, background information should be taken to be common).

3: Do both 1 and 2 and worry a lot about all the assumptions involved, especially those about what was assumed common versus different between the two studies.

They show a table like (testing out a new formatting strategy here):

                   Rash got worse   Rash got better
Did use cream           223               75
Didn't use cream        107               21

Most participants use a heuristic form of analysis. First, they compare the number of “successes” to the number of “failures” in the treatment group. They then compare the number of successes in the treatment group to the number of successes in the control group. If the number of successes in the treatment group exceeds both the number of failures in the treatment group and the number of successes in the control group, people tend to classify the experiment as proof of the efficacy of the treatment. If not, they characterize the evidence as supporting the inference that the treatment was ineffective.

To put this information in a less confusing format, I did (using a computer, not sure if this was available to the participants):

a = 223/(223+75) ~ .75

b = 107/(107+21) ~ .84

Then I compared a > b, which is false. The heuristic is apparently to check 223 > 75 & 223 > 107, which is true. I suppose the former is the correct method and the latter the incorrect one.
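Spelling both comparisons out (plain Python, numbers taken straight from the table above):

```python
# Proportion whose rash got worse in each group
worse_with_cream = 223 / (223 + 75)     # a ~ 0.75
worse_without    = 107 / (107 + 21)     # b ~ 0.84

# The "correct" comparison: a > b is False, i.e. the cream group
# had a lower rate of worsening than the no-cream group
print(worse_with_cream > worse_without)  # False

# The heuristic most participants reportedly apply instead
print(223 > 75 and 223 > 107)            # True
```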

Either way, I can't conclude whether the cream is helping or not from this info. To start with: Were the researchers blinded, how was rash got better/worse determined, what does it mean to "use" the cream, did the cream make the rash start going away but cause a breakout of zits instead so people stopped using it?

So if they phrased the question like "does this prove the cream makes the rash get worse/better?" I would answer "no" regardless of the numbers. What exactly did they ask the subjects? I'm sure the answers to these questions would lead to more questions...

It could be that critical thinking is triggered more often when the data appears to be "identity threatening". If there is nothing threatening about the conclusion people may fall back on the "numerate heuristic" of saying "a > b, the treatment works, the end". It is interesting that the supposed "correct answer" here seems to amount to statistical significance thinking.

par(mfrow = c(1, 2))
curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
      main = "High numeracy")
curve(dnorm(x, qnorm(0.8) * sqrt(2)), -3, 6, add = TRUE)
abline(v = 0, lty = 2)
curve(dnorm(x), -3, 6, xlab = "Decisional axis", ylab = "Density",
      main = "Low numeracy")
curve(dnorm(x, qnorm(0.5) * sqrt(2)), -3, 6, add = TRUE)
abline(v = 0, lty = 2)

(The means are based on estimating the average number of correct answers from one of the figures; I forget which one.)

Here the distributions on the left represent incorrect answers and the distributions on the right represent correct answers. From a statistical viewpoint, the subject gets a “sample” from both distributions, observes the difference and then responds based on an internal criterion – the dashed vertical lines.

The “decisional scale” represents the subject’s, uh, internal feel about the correctness of the answer: higher numeracy skills will result in a larger difference between the modes of the distributions, indicating clearer distinction between correct and incorrect answers.

In the figure presented here, the distributions in the low numeracy group overlap completely, indicating that those subjects have no feel whatsoever for the correct answer.

Anyway, without going into further details, it should be clear that the proportion of correct answers could depend on two things: the “internal feel” (glah, why can’t I come up with a better term, damn flu) for the correct answer – i.e. what it would be theoretically meaningful to call “numeracy skill” – and the decisional criterion. Based on a quick skim of the paper I don’t see this possibility ruled out.
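The two pictures are consistent: with the criterion at zero, the subject answers correctly whenever the sampled difference is positive, which recovers P(correct) = pnorm(d′/√2) from the earlier formula. A stdlib Python check (NormalDist stands in for R’s pnorm/qnorm here):

```python
from math import sqrt
from statistics import NormalDist

std = NormalDist()  # cdf() plays the role of R's pnorm, inv_cdf() of qnorm

# Separation between the "incorrect" and "correct" distributions (d-prime),
# chosen as in the figure code: qnorm(0.8) * sqrt(2)
d_prime = std.inv_cdf(0.8) * sqrt(2.0)

# The subject samples one value from each distribution and looks at the
# difference, D ~ Normal(d_prime, sd = sqrt(2)). With the criterion at 0
# (the dashed line), they answer correctly whenever D > 0.
p_correct = 1.0 - NormalDist(mu=d_prime, sigma=sqrt(2.0)).cdf(0.0)
print(round(p_correct, 3))  # 0.8, matching the "high numeracy" panel
```

With qnorm(0.5) * sqrt(2) = 0, the same calculation gives 0.5, i.e. chance performance for the “low numeracy” panel.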

To publish as a study an online survey of people making ~$40,000–$49,000 that, once analyzed, shows a 20% increase in numeracy but only among those in the top 90% of numerates (numeracy comes with a number from 1–10, who knew?), when the correct answer (~1/5 > ~1/6) correlates with their presumed (but unmeasured) political biases, is to invite a replication attempt. And even if the N in the replication attempt was just a fraction of Kahan’s 1,111, I think it’s pretty fair evidence that the “motivated numeracy” effect is at best small-ish and variable – which, in fact, would be entirely consistent with the original findings. Recall that for most people, even when their presumably cherished beliefs were at risk, their numeracy score was a better predictor of their accuracy than their biases.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3026941

Anyway, I think the word “replication” has become a meaningless buzzword in some corners of research, so I wouldn’t worry too much about its misuse anymore. Even as part of that psych *replication project*, there were some people who changed the methods for whatever reason…

For now, “direct replication” still has meaning. However, someday I expect you will need: “Real actual direct replication wherein we attempted to follow the previous methods as faithfully as possible”.
