Joshua Pritkin writes:

There is a Stan case study by Daniel Furr on a hierarchical two-parameter logistic item response model.

My question is whether to model the covariance between log alpha and beta parameters. I asked Daniel Furr about this and he said, “The argument I would make for modelling the covariance is that it results in a more correct prior, assuming the true covariance is not near zero.”

I countered that, “I agree that the true covariance is unlikely to be zero, but I’m not sure that modeling it results in a more correct prior. The prior is supposed to represent our prior beliefs and, given that the parameter is uninterpretable, I don’t think I have a prior belief about it. The covariance is not important to identify the model, either.”

Daniel counter-countered with, “I’ll just mention that *hierarchical* priors are determined mainly from the data, not our beliefs, and if the data support a correlation between alpha and beta then I think it is more correct to include the correlation in the hierarchical prior. It may not be important to include it for most purposes, but still it matches more closely what the data generating process appears to be, which is what I mean by ‘more correct.’”

What do you think?

My reply: My short answer is that the covariance model is more general and it should be better; if there’s some idea that the correlation is likely to be near zero, or if there’s some desire to regularize because the correlation is difficult to estimate, then one could put a fairly strong zero-centered prior on that correlation. Also, nonzero correlations can make sense: for example, more difficult items might have higher discrimination, on average. That said, models with covariances run slower and can be more difficult to interpret, so I don’t think it’s so horrible to just fit independent priors, at least as a starting point.
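To make the covariance question concrete, here is a toy sketch (not the case study’s actual model; the correlation and scales below are made-up numbers) of what a correlated hierarchical prior on (log alpha, beta) generates:

```python
# Toy sketch: item parameters (log alpha, beta) drawn from a correlated
# hierarchical prior. rho and the scales are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

rho = 0.4                       # hypothetical true correlation
sd_log_alpha, sd_beta = 0.3, 1.0
cov = np.array([
    [sd_log_alpha**2,              rho * sd_log_alpha * sd_beta],
    [rho * sd_log_alpha * sd_beta, sd_beta**2],
])

# 500 items drawn from the joint hierarchical prior
log_alpha, beta = rng.multivariate_normal([0.0, 0.0], cov, size=500).T

# With enough items the sample correlation recovers rho; an independent
# prior amounts to fixing rho = 0 in this generating distribution.
print(np.corrcoef(log_alpha, beta)[0, 1])
```

In Stan this is typically written with a multivariate normal (or Cholesky-factor) hierarchical prior, and a zero-centered prior such as `lkj_corr` on the correlation matrix provides exactly the kind of regularization toward zero mentioned above.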

One thing, though: I don’t like the log transform on alpha. This seems wrong to me: in real life, items can have zero or negative discrimination, right? I’d prefer to put the normal model on alpha, and then if the data isn’t consistent with negative-discrimination items, that should show up in the inferences. I’d rather have the data reveal this than impose it from the outside.
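To illustrate what the sign of alpha does, here is a minimal 2PL sketch (the parameter values are arbitrary illustrations): with alpha < 0 the item response curve decreases in ability, which a log transform on alpha rules out by construction.

```python
# Two-parameter logistic (2PL) item response function:
# P(correct) = logistic(alpha * (theta - beta))
import math

def p_correct(theta, alpha, beta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# Positive discrimination: higher ability -> higher success probability
assert p_correct(2.0, alpha=1.5, beta=0.0) > p_correct(-2.0, alpha=1.5, beta=0.0)

# Negative discrimination (e.g., a reverse-keyed item): the ordering flips
assert p_correct(2.0, alpha=-1.5, beta=0.0) < p_correct(-2.0, alpha=-1.5, beta=0.0)
```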

Hi Andrew –

My paper at StanCon in January gets into why discriminations are constrained to be positive in traditional IRT. It depends on the relationship between the data and the unobserved latent variable – if we believe that the relationship is monotonically increasing, then it makes sense to constrain alpha to be positive. For example, on a test if you get a question right it *always* signals higher ability as a student. But in other situations, such as in the ideal point model, that may not be the case – voting “yes” on a bill could signal you are higher or lower on the latent scale (conservative/liberal) because it depends on how the bill loads in the ideal point space. For that reason, in the ideal point model the discrimination parameters should be left unconstrained.

In some situations, it may be possible to fit either model. For example, in the paper I look at an Amazon food ratings database (1 to 5 scale) of coffee products. If I constrained alpha, the model would be interpreted as which coffee product receives the best ratings/has the highest ability. But if I don’t, which I don’t in the paper, then the model becomes about which coffee products tend to be most polarizing between raters.

It took me a long time to figure this out, so I wanted to share. The paper should be up on the Stan website along with the other conference papers soon.

What is the model about if you don’t constrain alpha and most of the posterior probability mass is on alpha > 0? Is it about both questions?

In my mind, that would provide evidence that the model could be constrained without changing the answer. It is much harder to identify unconstrained discrimination parameters, so if the constrained model produces identical parameters, then it should be preferred as it’s easier to use and manage.

I wanted to make the same point. As I generally work in educational testing models, I’m used to constraining the discriminations to be positive. This avoids an identification issue with the model where “high ability” means that the candidate is unlikely to provide correct responses. (There are still scale and location identification issues in the model, so it doesn’t completely get rid of them.)

Also, my experience at ETS led me to believe that well-written cognitive items have discriminations between .5 and 2, so I frequently use a N(0,sqrt(2)/2) prior for discrimination. This assumes that the items have been reviewed by competent reviewers to make sure they are in fact discriminating.
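A quick check of that prior choice, assuming the N(0, sqrt(2)/2) prior is placed on log alpha (i.e., alpha is lognormal, as in the case study’s parameterization; that placement is my assumption here): how much prior mass falls in the .5-to-2 range described for well-written items?

```python
# How much lognormal(0, sqrt(2)/2) prior mass lies in (0.5, 2)?
import math

sigma = math.sqrt(2) / 2  # prior sd on log alpha

def normal_cdf(x, sd):
    """CDF of a zero-mean normal with standard deviation sd."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2))))

# P(0.5 < alpha < 2) = P(log 0.5 < log alpha < log 2)
mass = normal_cdf(math.log(2), sigma) - normal_cdf(math.log(0.5), sigma)
print(round(mass, 3))  # roughly two thirds of the prior mass
```

So this prior keeps most, but not all, of its mass in the stated range, leaving room for the data to pull individual discriminations outside it.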

This is entirely application dependent. Attitude scales and other psychological constructs can have reverse keyed items which have negative discriminations.

“if we believe that the relationship [between ability and probability of correct answer] is monotonically increasing…”

This belief is empirically unfounded for real-life educational tests. Just look at “Figura 3” (never mind the Portuguese text) here https://arxiv.org/pdf/1802.09880.pdf which is the characteristic curve for one question of a national Brazilian higher education entrance exam.

I was going to comment and then saw you made all of my points, Bob. Thank you! I remember meeting you and being impressed by your presentation at StanCon.

I could not get a model with unconstrained discriminations to work back when I started on edstan because the chains would settle in different local maxima. Because I mainly work in an education context, this didn’t seem like the most important problem to solve, and so I let it go. I look forward to seeing how Bob coded up a solution.

Hi Daniel – Thanks, I had a great time meeting you too. Essentially the way I deal with the unconstrained parameters is to either allow the user to pre-specify the constraints (generally possible with legislatures) or to first fit an un-identified model, pick a mode in the posterior, and then constrain parameters to that mode. Also, I constrain person parameters; constraining the discrimination parameters seems like the logical choice, but it doesn’t work as well for whatever reason.

You can see the coding deets in the package vignettes, which are now on CRAN: https://cran.r-project.org/web/packages/idealstan/index.html.

As an alternative to constraining some parameters, I’ve been able to resolve the IRT identifiability problem in Stan by specifying positive start values for alpha. I tried this with my own hierarchical model, and then after this post retried it with Daniel’s (better) case study parameterization by setting positive start values for mu[,1] and xi[,1]. It works fine for a simple model with alphas that stay above +0.1. It even seems to function okay when I model how alpha changes for different ethnic groups where some groups are estimated as having alphas drifting into negative territory for certain items. However, I’m getting quite a few divergent transitions when I try to model this variation, so I’m probably doing something silly.

I was having lunch with Keith O’Rourke the other day and a similar point came up in conversation. The question was one of constraining an intercept to be zero in a simple linear regression on theoretical grounds. The issue is that if the data are such that the constrained fit is very far from the unconstrained fit then it’s likely that the constraint is false and all of the supposed gain from imposing the constraint (in either improved parameter estimates or predictions) will actually be a distortion of the information in the data. By setting the intercept on theoretical grounds we are robbing ourselves of one way to check our assumptions and potentially degrading the output of the inference to boot.

I think it’s difficult to say based on the data alone which model is preferred. These IRT models produce a ton of parameters; you’ll probably get good fit with either model. In my mind it comes down to what do we want the model to say–positive discrimination parameters is a constrained version, but it also gives us a precise answer to a specific question. If we want to ask a more general question, then perhaps the unconstrained model is better.

Thanks for sharing your thoughts, very interesting points!

The situation with IRT models seems different since it appears (I think? I haven’t gone into the math) that there are issues with parameter identification.

If you don’t model it, then you’re assuming it is zero, which is a much stronger assumption than most conceivable priors one might put on it…

I don’t understand. If you don’t model (the covariance of item difficulty and discrimination) the estimated posteriors can still be correlated. It’s just that one has not put a prior on the correlation. Or am I misunderstanding something?

Sorry, you’re right, my language was too strong. If you don’t model it, then you’re assuming the two *distributions* are not correlated. Individual parameters sampled from the distributions can still be correlated, but won’t be as correlated as they would be if they were drawn from the correct (or a more correct) distribution.

Maybe it depends on how one understands the “not correlated” part in “assuming the two *distributions* are not correlated”? (English is not my mother tongue.)

My understanding is that if one does not put a prior on the co-variation of two parameters, one is making “no assumptions” about their correlation, but one is not making the assumption that “their correlation is zero.”

For example, if one models regression weights without also modeling their covariance, one is not necessarily assuming that they are not correlated.

Guido’s correct in pointing out that just because two parameters are not correlated in the prior does not mean they will not be correlated after conditioning on data. Lack of correlation in the prior just means you don’t know there is correlation. One prior might concentrate mass on positively correlated posterior values, whereas another would concentrate mass on negatively correlated posterior values. A Beta distribution scaled to the domain (-1, 1) can even represent the prior information that there’s likely a high negative or positive correlation, but very unlikely to be no correlation.
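A quick simulation of that last idea (the Beta(0.5, 0.5) shape here is an illustrative choice of mine, not anything from the thread): rescaling a U-shaped Beta from (0, 1) to (-1, 1) gives a prior on a correlation that says “probably strongly correlated, sign unknown.”

```python
# A Beta(0.5, 0.5) rescaled to (-1, 1) puts most of its mass near
# rho = -1 and rho = +1 and little near zero.
import numpy as np

rng = np.random.default_rng(1)
rho = 2.0 * rng.beta(0.5, 0.5, size=100_000) - 1.0  # scaled to (-1, 1)

near_zero = np.mean(np.abs(rho) < 0.2)    # prior mass near "no correlation"
near_edges = np.mean(np.abs(rho) > 0.8)   # prior mass near strong correlation

# Far more mass sits near the edges than near zero.
print(round(near_zero, 2), round(near_edges, 2))
```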

I think Charles is imagining a hierarchical model in which there is a group of pairs of parameters drawn from a hierarchical prior. In that case, we can look at the sample correlation among the draws. This isn’t the same as looking at posterior correlation. What Charles says is right in the hierarchical setting in that a hierarchical prior that favors positive correlation will also cause the estimates of the lower-level parameters to be more correlated.

Isn’t the post about exactly that hierarchical situation? confused…

It was more me who was confused, because I did not think carefully enough about the distinction between sample correlation and posterior correlation. I agree that if you do not model the correlation of two hierarchical priors, this implies the assumption that the correlation of the sampling distributions is zero.

Suppose you have a slope and an intercept for the effect in each of fifty states of income on the log odds of voting Republican (let’s ignore DC for now).

Now suppose you have an independent prior for all 100 coefficients, let’s say independent normal(0, 2) for each coefficient for concreteness.

Any pair of those coefficients may be correlated in the posterior. The slope for two different states may be correlated, the slope and intercept for a single state may be correlated, etc.

If you take the posterior means for the 50 intercepts and the 50 slopes, it’s possible those two vectors have non-zero sample correlation in the posterior (in fact it’s almost certain given how it’s calculated).
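A minimal conjugate-Gaussian example of this (my own toy numbers, not from the post): two parameters with independent priors, observed only through their sum, end up strongly negatively correlated in the posterior. The closed form below is the standard Bayesian linear-regression update, posterior precision = prior precision + X'X / sigma^2.

```python
# Independent N(0, 1) priors on (a, b); we observe y = a + b + noise.
import numpy as np

prior_prec = np.eye(2)          # independent standard normal priors
x = np.array([[1.0, 1.0]])      # design row: y depends only on a + b
noise_sd = 0.5

# Gaussian conjugacy: posterior precision = prior precision + X'X / sigma^2
post_prec = prior_prec + x.T @ x / noise_sd**2
post_cov = np.linalg.inv(post_prec)

post_corr = post_cov[0, 1] / np.sqrt(post_cov[0, 0] * post_cov[1, 1])
print(round(post_corr, 2))  # -0.8: strongly negatively correlated a posteriori
```

The data can tell us about the sum of a and b but not how to split it, so the posterior trades one off against the other, despite the fully independent prior.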

Just to check we’re on the same page… the posterior correlation between samples is an orthogonal issue to that of the original post, right? I could use two subject level regression parameters in place of one, generating large negative correlation between each pair of two parameters (of which there are nsubjects pairs) with respect to the posterior samples, but this is independent of any possible correlation between the parameter means across subjects.

Regarding “all positive alphas”, Pritikin also has a related Stan discussion here:

http://discourse.mc-stan.org/t/latent-factor-loadings/1483

The bottom includes a Stan solution (“test1.stan”) that would allow for both positive and negative discriminations, while also maintaining parameter identification. I personally prefer that solution because it allows discrimination parameters’ posteriors to overlap with 0, which may be needed during scale development (when some useless items might be considered).