Frank de Libero writes:

I read your Chance article (disproving that no one reads Chance!) re communicating about flawed psychological research. And I know from your other writings of your continuing good fight against misleading quantitative work. I think you and your students might be interested in my recent critique of a 2011 paper published in Psychological Science, “Income Inequality and Happiness” by Shigehiro Oishi et al. The critique is here.

The blog post demonstrates that treating ordinal numbers with respect, along with an eye to robustness, leads to contrary conclusions – and a more interesting conjecture. If nothing else, for your students the post is an example of how an applied statistician thinks.

I have emailed the three authors and editor. I don’t expect to see a retraction. But maybe someone will pick up on the recommendations. We’ll see.

De Libero’s critique is worth reading. Lots of interesting points; it could be a good example for a statistics class, if the instructor is looking for something that, unlike typical textbook analyses, does not have a simple clean story. Also, there seem to be problems with this paper published in Psychological Science, but that’s hardly news. . . .

P.S. In the old days I would’ve crossposted this on the sister blog. But now they don’t like running duplicate material, and so I thought it better to post in this space, since here we get good discussions in the comments.

If I am reading this correctly, respondents were given three options for the happiness item: “Very Happy” (coded 1), “Pretty Happy” (coded 2), and “Not Too Happy” (coded 3). With that coding, the p-value was < 0.05. The blog post’s author wrote, “It didn’t seem reasonable that the ‘interval’ between Pretty Happy and Not Too Happy was the same as between the happiest responses,” so he recoded “Not Too Happy” as 4, and the model returned a p-value of 0.057. Recoding “Not Too Happy” as 5 gave a p-value of 0.12.
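As a rough sketch of how much rides on that coding choice, here is how the apparent year-to-year change in “mean happiness” moves as the code for “Not Too Happy” is stretched. The response proportions below are invented for illustration; they are not the GSS data.

```python
# Hypothetical response proportions for two survey years, ordered as
# (Very Happy, Pretty Happy, Not Too Happy). Invented for illustration.
year_a = (0.35, 0.53, 0.12)
year_b = (0.30, 0.56, 0.14)

def mean_happiness(proportions, codes):
    """Weighted mean of the category codes, i.e. a 'mean happiness' score."""
    return sum(p * c for p, c in zip(proportions, codes))

# The apparent change between the two years depends on the coding chosen.
for codes in [(1, 2, 3), (1, 2, 4), (1, 2, 5)]:
    change = mean_happiness(year_b, codes) - mean_happiness(year_a, codes)
    print(codes, round(change, 3))
```

The measured change grows as the bottom category is pushed further out, which is exactly the kind of sensitivity the recoding exercise above probes.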

It seems like a lot of assumptions are baked into that recoding. To avoid that, is there a multilevel model estimation technique that could handle dependent variables suspected to be ordinal but not interval? Something equivalent to ordered logit? Maybe meologit in Stata?
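For intuition about what an ordered-logit model does with such an outcome, here is a minimal sketch of the cumulative-logit probabilities. The predictor value, coefficient, and cutpoints are made-up illustrative numbers, not estimates from any of these packages.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(x, beta, cutpoints):
    """Category probabilities under a cumulative logit model:
    P(Y <= k) = logistic(cutpoints[k] - x * beta)."""
    cum = [logistic(c - x * beta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Hypothetical values: predictor x = 0.4, slope 1.2, and two cutpoints
# separating three ordered responses (no interval assumption needed).
probs = ordered_logit_probs(x=0.4, beta=1.2, cutpoints=[-0.5, 1.0])
print([round(p, 3) for p in probs])  # three probabilities summing to 1
```

The model uses only the ordering of the categories via the cutpoints, which is the property the recoding exercise was worried about; multilevel versions (such as meologit, or an ordered-logistic likelihood in Stan) add group-level effects on top of this structure.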

We had a similar issue (we thought about estimating a fixed-effects model with an ordinal dependent variable), and I couldn’t find any implementation in Stata yet; it would have been necessary to program it with Mata or something like that (see chapter 9 here: http://people.stern.nyu.edu/wgreene/DiscreteChoice/Readings/OrderedChoiceSurvey.pdf). We didn’t follow up on it, though, so I did not research it further. I’d still like to know if anybody knows more about estimating fixed-effects ordered logistic regression.

Ordered logit mixed-effects models can be done with meologit in Stata, though, yes:

http://www.stata.com/features/overview/multilevel-ordered-logistic-models/

LJ, Daniel:

You should be able to fit any of these models easily in Stan.

I thought so, but I have not had the time yet to learn Stan. Since I usually can’t choose which program to use in collaborative work, I have to put up with the likes of Stata and even SPSS far too often. Learning another language (besides R), which most of my peers will not use, is not my highest priority at the moment.

Thanks for the information, Daniel and Andrew.

I liked the blog post, particularly this paragraph: “The usual happiness question in surveys is similar to the one asked in the General Social Survey (GSS), the one used for IIH: ‘Taken all together, how would you say things are these days – would you say that you are very happy, pretty happy, or not too happy?’ with responses coded as 1, 2, and 3. This question, labeled HAPPY, has been asked in every GSS since 1972, so there’s a history.” And then they show a chart that covers these years. I’m reciting this banal stuff because I thought it’s a nice point: you have to get into the questions rather than assume they actually measure the same thing over time. There is uncertainty.

When this kind of thing comes up, I remember Shakespeare’s “The lady doth protest too much, methinks.” The meaning of “protest” has changed over time, so we hear the line with a different meaning than “She’s being too loud.”

In European surveys and most Quality of Life research I know of, happiness is measured on four levels (very happy, quite happy, not very happy, not at all happy). A similar variable that is often asked is “Taken all together, …, how satisfied are you with your life on a scale from 1 to 10?” There is a somewhat heated debate over whether these two questions measure the same thing or not.

The question of whether a survey question is understood in the same way over time is highly complicated as well. Outside of psychology we’re often not concerned enough about the reliability and validity of the questions and indices we are using.

As an outsider, I think the crux of the problem is that the surveys ask somewhat stupid questions?

Ask “Taken all together, how would you say things are these days: happy / unhappy”

Or ask: “Rate your current happiness on a 5 point scale”. Or 10 point scale. Or some scale.

But adding vague, subjective categories like “somewhat happy” etc. seems destined for trouble. Is there a good reason to persist in asking such questions?

There is a literature in survey methodology (much of it back in the 1990s, with Norbert Schwarz, Nora Cate Schaeffer, and others … I worked with George Gaskell and Colm O’Muircheartaigh) asking questions like how often you were satisfied or dissatisfied with your life, and also questions with vague phrases for the response alternatives. This was done specifically to show what happens with these vague phrases. While the placement of points along a scale is one concern, another is that the vague phrases can mean different things to different people, which makes group comparisons difficult (and they can mean different things to the researchers). Writing good survey questions can be really difficult.

Dan:

Could you point me toward some seminal papers/books on that? My intuition was always that the more general 1-10 scales would fare better than the happiness questions, but I only worked on this topic in my undergraduate education, and I was quite happy ( ;-) ) to do some moderator and regression analysis on the association of religion, church attendance, and satisfaction with life.

Yea, off blog email me at dbrookswr at gmail dot com

By the original paper’s account, inequality has been trending consistently upward over the study period. And happiness is trending too – maybe consistently down (looking at “very happy” responses), or maybe up until 1990 and down thereafter (looking at other values).

Isn’t it very likely that we will see a correlation between any two time series with trending data, even if it is entirely spurious? Would one expect different results with anything else that has been trending upward since 1972?

Basically, how much more valid would this be if there were a perfectly objective metric of happiness (one not in any way subject to De Libero’s critique) and you got the same p < 0.05? Would the paper thereby be stronger than it is now?

“Isn’t it very likely that we will see a correlation between any two time series with trending data, even if entirely spurious?”

Yes, comparing any two monotonic trends will be “significant” given enough data points. Beyond that, the p-value tells us nothing, because zero correlation is a strawman null hypothesis. Disproving that null hypothesis would only be meaningful if we had theoretical reason to think happiness and income inequality were unrelated to each other (which is not the case; in fact, that would seem to go against common sense). Andrew has said he has a post critiquing the arguments of Meehl 1967, but I do not think he has published it yet:

“Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or nonmonotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10⁻⁶).”

Meehl, Paul E. (1967). "Theory-Testing in Psychology and Physics: A Methodological Paradox". Philosophy of Science 34 (2): 103–115. doi:10.1086/288135

And then there’s http://www.tylervigen.com/
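The trend-on-trend point is easy to demonstrate: two series that share nothing but a drift are strongly correlated. A small simulation sketch, with entirely made-up data:

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
n = 40
# Two independent series: each is just a linear trend plus noise.
series_a = [0.5 * t + random.gauss(0, 2) for t in range(n)]
series_b = [0.3 * t + random.gauss(0, 2) for t in range(n)]

print(round(pearson(series_a, series_b), 2))  # large, despite independent noise
```

The noise terms are completely unrelated, yet the shared upward drift alone produces a strong correlation – which is all a levels-on-levels regression of two trending series can be expected to pick up.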

Martha,

Some of those are very interesting, and I’m not sure they should be dismissed just because they are spurious (spurious meaning that some C influences both A and B). The first chart, of US spending on science vs. suicides, may be explained by increased population, inflation, etc. The two series look too similar not to be closely related somehow; perhaps I’m underestimating the sheer number of comparisons made though.

Also, Wikipedia makes a distinction between a spurious relationship and a spurious correlation:

https://en.wikipedia.org/wiki/Spurious_relationship

question said,

“perhaps I’m underestimating the sheer number of comparisons made though”

Yes, you seem to be underestimating the number of comparisons. The site allows you to choose two variables from a very large list and find their correlation. I got tired of counting at several hundred, but would guess that he has listed over 1,000 variables. That makes around 1,000,000 pairs of variables. If you did significance tests for correlation (at a .05 individual significance rate) on all those pairs, you would expect about 50,000 to be significant – so it shouldn’t be surprising if many of these 50,000 pairs are indeed highly correlated.
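That arithmetic can be checked with a quick simulation sketch: among pairs of completely independent series, roughly 5% clear the “significance” bar. The 0.361 cutoff below is the two-sided 5% critical value for a sample correlation at n = 30; everything else is simulated.

```python
import random

def corr(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(7)
n, pairs, crit = 30, 4000, 0.361  # crit: two-sided 5% cutoff for |r|, n = 30
hits = sum(
    abs(corr([random.gauss(0, 1) for _ in range(n)],
             [random.gauss(0, 1) for _ in range(n)])) > crit
    for _ in range(pairs)
)
print(hits / pairs)  # close to 0.05
```

Scaled up, a 5% false-positive rate over a million independent pairs gives the roughly 50,000 “significant” correlations mentioned above.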

I interpreted bxg’s use of “spurious” in what to me is the usual sense, meaning roughly that there is a correlation without any direct connection between the two variables. I had never heard of Pearson’s specialized usage before the Wikipedia page you linked. (But note that the Wikipedia page has a note saying it needs improvement.)

Well, that’s my concern. I was trying to be conservative in case there was some way they addressed the trend-on-trend idiocy somehow (and yes, I read the original paper, and did not to my eye see any such thing).

But the critique by De Libero, and the vast majority of comments here, are focused on the ordinal measurement of “happiness”. Maybe this is technically right, but if the result is just trend-on-trend, that focus misses the big picture to a ridiculous extent. I assume that people on this blog, as well as papers Professor Gelman cites approvingly, are generally not so blind – so I am sincerely curious whether I have missed something.

If this is just a trend-on-trend spurious correlation, De Libero’s critique is (IMO) about as shallow and off-point as criticizing, say, the font used in the article. But I’d like to be shown why it’s not that simple.

The ordinal-to-interval scale transformation they do in order to calculate an average is quite questionable. I agree that spurious correlation is a fundamental problem here, but if the data have been modified in nonsensical ways to begin with, that is an even more fundamental problem.

Even if the ordinal-interval issue were not present: if two time series are “actually” correlated, they should share similar structure. These don’t, so while I consider it plausible that the happiness of a population is related to income inequality, it is not apparent from this data.

@bxg

Reading your argument, is it then never appropriate to argue based on the correlation of two time series?

Or is there a stronger way to make a valid argument that the correlation between two time series means something – one not subject to the trend-on-trend critique?

@question

When you say “time series [that] are “actually” correlated should share similar structure”, what’s the rigorous way to quantify “structure”?

Rahul,

I am not sure. Perhaps someone else knows the term for this. Peaks, troughs, and inflection points should occur at the same time in each series, or with a consistent lag.
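One standard way to probe whether two trending series share structure beyond the trend itself is to correlate their first differences rather than their levels. A sketch, where the two series below are invented stand-ins, not the actual Gini or GSS data:

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def diff(series):
    """Year-to-year first differences."""
    return [b - a for a, b in zip(series, series[1:])]

random.seed(3)
years = range(40)
gini = [0.35 + 0.0025 * t + random.gauss(0, 0.003) for t in years]  # steady rise
happy = [2.20 - 0.004 * t + random.gauss(0, 0.03) for t in years]   # mild drift

r_levels = pearson(gini, happy)
r_diffs = pearson(diff(gini), diff(happy))
print(round(r_levels, 2), round(r_diffs, 2))  # strong in levels, weak in differences
```

The shared drift produces a large correlation in levels, while the year-to-year movements – which carry the “structure” (spikes, inflections) – are essentially unrelated here. That is the same diagnostic as eyeballing a plot of the differences.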

Granger causality?

Rahul,

I’ve never played around with that. It is not clear to me from the Wikipedia page what to do with non-stationary data (i.e., the Gini data used here). I will say that I don’t like the NHST aspect of it, since I think that exactly zero relationship between inequality and happiness is implausible even at long lags. It could be useful to calculate the coefficients and examine them without the NHST filtering step, but does that really tell us anything that looking at a plot of the differences would not? I think this plot shows that we should look elsewhere for what influences responses to the happiness survey. The happiness results from 1984 stand out; the field would probably benefit from speculating on the cause of that:

http://s16.postimg.org/6jkvv3vx1/Happiness005.png

We need some idea of what “effect size” is policy-relevant and some theory (making testable point/interval predictions) to allow us to extrapolate to future/other circumstances.

I’m confused by the figure. http://www.lettingthedataspeak.com/wp-content/uploads/2014/01/Fig2_IIH-300×263.png

On the ordinate, higher numbers mean more unhappiness, right? Since 1=Very Happy and 3=Not Too Happy?

If so, isn’t the downward-sloping line in that figure showing that as Gini increased over time (i.e., more inequality), people became happier? Not sadder.

What am I getting wrong?

Oishi et al. 2011, p. 1096

I don’t know how the variable is coded in the GSS originally but I’m pretty sure Oishi et al. recoded it such that higher numbers mean more happiness.

It doesn’t make sense to me to treat the three-point “happiness” variable as continuous. I would like to see the results if they treated the “happiness” variable as categorical and used dummy codes. I am not convinced that happiness increases incrementally (or linearly, for that matter), as this analysis assumes.

The analysis is interesting, and I always take the side of the data detective in these cases because, like many people, I have this strong belief that too many published results are not true. Two points:

1. de Libero finds this inflection point in the data. First, the inflection point comes out of thin air. We are told there is a regime change of some sort in 1990, and all the justification consists of the straight lines drawn on the data points. Get the data, redo the graph, and try to draw your own lines. You’ll find that you can find good-looking lines that show an inflection in 2000. So the whole explanation about the changes in 1990 is dodgy. Furthermore, there is no reason to draw both lines fixing 1972 at 1. These are magnitudes in the unit interval. Plot them on the original scale and try to spot the inflection.

2. I downloaded the data and ran simple regressions in Excel using a logit transformation of each of the happiness measures (I just deleted incomplete records). Using the logit transformation of, say, the “very happy” proportions as the dependent variable, with the Gini as an independent variable, is a way of measuring the effect of Gini on the binomial choice between answering “very happy” and answering something other than “very happy”. The coefficient is negative and significant (t-stat about -3.5). Running the regression using “pretty happy” as the dependent variable leads to a positive and significant coefficient (t-stat about 2.5). Does inequality make you more likely to be merely “pretty happy” rather than “very happy”? I don’t know.

[Note: I have done all this quickly and sleep-deprived, so something may be wrong. I encourage fellow stats nerds to download the data and try it out.]
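The logit-of-proportions regression described above can be sketched in a few lines. The two series below are invented placeholders, not the actual GSS shares or Census Gini values:

```python
import math

def logit(p):
    """Log-odds of a proportion."""
    return math.log(p / (1 - p))

def ols_slope(x, y):
    """Closed-form simple-regression slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

gini = [0.35, 0.36, 0.38, 0.40, 0.43, 0.45]        # made-up inequality series
very_happy = [0.36, 0.35, 0.33, 0.33, 0.31, 0.30]  # made-up response shares

slope = ols_slope(gini, [logit(p) for p in very_happy])
print(round(slope, 2))  # negative: higher Gini, lower odds of "very happy"
```

Treating each year’s share as a binomial proportion sidesteps the ordinal-coding question entirely, which is why the commenter’s check is an interesting complement to the paper’s averaged scores (though it does nothing about the trend-on-trend problem).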

You may be wondering where I am going with this. Although I find the paper methodologically poor, the critique is also quite poor. I agree with de Libero that there is more out there influencing the happiness metrics. But I think a good critique would have looked in more detail at how much inequality can explain the GSS results, rather than choosing arbitrary inflection points and plotting in a somewhat deceptive way.

More generally, I would like to know what is the best way to critique a paper like this.

Here is my go at it:

They write:

“In conclusion, Americans are happier when national wealth is distributed more evenly than when it is distributed less evenly. If the ultimate goal of society is to make its citizens happy (Bentham, 1789/2008), then it is desirable to consider policies that produce more income equality, fairness, and general trust.”

Does the data support this policy recommendation?

First, looking at Figure 2 from Oishi 2011, we see their “mean happiness” measure decreased by ~0.04 from 1972 to 2008, while the Gini coefficient increased by ~0.1:

http://www.lettingthedataspeak.com/wp-content/uploads/2014/01/Fig2_IIH.png

The increase in income inequality (~30% as measured by the Gini coefficient) is called “one of the most profound social changes in the United States over the last 40 years,” so this change is apparently considered substantial by those in the field. Is the decrease in “mean happiness” also considered substantial? It is ~2% of the interval they used to separate “Not Too Happy” and “Very Happy”.

The paper does not show the time series being compared. If we plot them, it is clear that any relationship here is weak at best:

http://s15.postimg.org/ru4phk4cb/Happiness001.png

The Gini coefficient has been clearly increasing from 1972 to 2012, while not much change has occurred in the happiness responses. We can also see relatively large increases in the Gini coefficient in 1981-1982 and 1992-1993. There are no corresponding spikes in the happiness data, which we would expect if one of the two were strongly influencing the other. It is the Gini data that is primarily responsible for the possible step function noted by de Libero.

If we take 1972 (the first data point) as a baseline, we can better see the trends in the happiness data:

http://s21.postimg.org/fgwbjflav/Happiness002.png

Overall, it looks like the percentage of people responding “Pretty Happy” has increased at the expense of those responding “Very Happy” since 1972; 1984 appears to have been an exceptional year in this respect. Consistent with de Libero’s exploration, both “Pretty Happy” and “Not Too Happy” appear non-monotonic with an inflection point around 1990, while “Very Happy” responses show a more consistent downward trend.

They chose to code the three possible responses as 1-3 and take averages. As Frank de Libero mentions, you can’t just treat an ordinal scale as an interval scale without justifying it. It seems nonsensical to me; at a minimum, the paper should include some justification for this choice. It would be better to simply plot each of the responses vs. the Gini coefficient:

http://s22.postimg.org/oam9arswx/Happiness003.png

From this plot we clearly see the data clustering pre-1982, 1982-1991, and post-1991 along the x-axis, as we would expect from my first figure above. I don’t see any strong relationship here that can be disentangled from the upward trend in the Gini coefficient.

I don’t think we can learn much from comparing just these two datasets.

1) They have arbitrarily chosen “happiness” as the dependent variable. Perhaps when people are happier they “spread the wealth,” and the relationship works the other way.

2) Treating the ordinal happiness data as interval is unjustified.

3) They do not justify their implication that the change in “mean happiness” is large enough to worry about.

4) It is not even clear from this data that there is any substantial link between the Gini coefficient and happiness. The time series do not share any structure (spikes, inflection points, etc.); any observed trends could be due to other influences. The other influences need not even be common to the two variables – this is the usual correlation-is-not-causation problem.

This seems compelling and impressive, but either you (and I too) are missing something, or else De Libero is rather embarrassing himself with his critique and his claim that “the post is an example of how an applied statistician thinks”. The paper seems utterly worthless, utterly wrong, beyond any hope of redemption. Did the authors ever plot anything? Perhaps they did, but then decided “p < 0.05” is an objective truth that trumps their own eyes. I don’t want an “applied statistician” to look at this and come up with some very technical criticism of how the paper interprets the ordinal scale (and show rather mild lack of robustness at that). In context, that’s just a silly objection and implicitly gives the paper undue credit. If an “expert” is not able or willing to tell us that this is “complete, unmitigated, bullshit” when it’s so obviously called for, would you hire such a person to teach? To “apply” his supposed knowledge anywhere? I wouldn’t.

There’s a serious problem if so-called “critics” are unable or unwilling to properly call out even this joke. If this gets such deference, what would not? (Or perhaps you and I are both rather wrong.)

> claim “the post is an example of how an applied statistician thinks”

Unfortunately that may be true of many or most applied statisticians – they don’t critically evaluate the full process that generated the current claim, or that should have generated other claims, or that apparently could not generate any valid claims at all – but instead focus on some favoured technical aspects.

By the way, to assess “apparently could not generate any valid claims at all,” how far do we need to investigate? Do we need to try to salvage things by becoming the investigator?

> There’s a serious problem

No: de Libero sent his blogged critique to Andrew, who shared it as a possible teaching example and an opportunity (for de Libero) to get critical feedback – which he has, assuming he is reading the responses.

One of the severe problems of the applied-statistics discipline is that many work in settings where no one feels capable of being critical of the work they do; it is invariably accepted as right and insightful. To become (and remain!) a good applied statistician, one needs to be surrounded by continuous critical scrutiny (with adequate breaks to reflect and revise). On the other hand, it is just human nature to avoid such criticism if one can. Where there are groups of statisticians, often the manager will monopolise criticism and keep it private, one on one, remaining mostly in control.

I also think it is a good teaching example.

Good teaching examples in applied statistics have apparently satisfying solutions (ideally even to the instructor) that self-destruct live as the instructor interacts with the students. That is what applied statistics is – continually discovering what’s wrong with your current attempt to represent and fully grasp uncertainties regarding empirical questions. You rest when those _wrongnesses_ are no longer apparent and/or no longer considered overly important.

As David Cox once put it, “You can’t prevent someone from making a comment that will totally change what you think of your analysis of a given study.”

Reading de Libero’s blog post and having a quick glance at the original paper, I think there are any number of issues with the paper I’d worry about more than averaging the codings of the ordinal categories – such as question wording, and particularly inferring causality from this kind of time-series comparison, however appropriately or inappropriately the ordinal categories are coded.

As far as I can see, the original authors only state p<0.5 regarding the originally discussed result, and if de Libero gets this to p=0.057 in his first attempt, that's not necessarily a big change, unless of course one is committed to a culture that interprets p-values in a strictly binary fashion, which is probably worse than averaging ordinals.

Generally, averaging Likert codings of ordinals is, from my point of view, a legitimate way of aggregating. It is certainly important to discuss the implicit assumption of equal distances between categories and to investigate the sensitivity of the conclusions to it, as de Libero did, but often this assumption is not too harmful and results are rather robust.

“often this assumption is not too harmful and results are rather robust.”

Do you have an example of this? Something like: a regression coefficient was found this way and then was similar with other/new data. I find it somewhat difficult to believe. Is there research on whether the way people respond to survey questions leads them to assign roughly equal intervals? Maybe the individual differences average out, but what does the distribution look like for this happiness question? If it has been in use since 1972, it seems like someone should have looked into this.

We do our course teaching-quality evaluations on 5-point Likert scales, and I tend to play a bit with these data. In the vast majority of cases, when comparing two courses, either one distribution is exactly or almost exactly stochastically larger than the other, or neither medians nor averages nor any other aggregation that makes sense as a measure of quality can tell them apart significantly. Many statisticians (including those in my department who officially evaluate these questionnaires) prefer to use medians, which are “admissible” for ordinal data, but very many courses have the same median while still dominating others in terms of their distribution – which is picked up by averages but not by medians. One may use rank sums instead, with proper handling of ties, but as far as I remember, this always gave pretty much the same result as averaging for these data.

My claim is not that people assign roughly equal intervals (I do not think the assumption makes sense that there is always a “true” interval behind such data), but rather that the average is often (!) a good way of summarising the general tendency of these distributions if you want to make 1-d comparisons (“course A was better/more popular than course B”).

Generally, comparing distributions of ordinal categorical data is a problem unless they are (at least approximately) stochastically ordered. If they’re not, any way of aggregating the values – including those supposedly admissible for ordinal data – can be legitimately criticised by pointing out that other legitimate aggregations may lead to different results. So sensitivity analysis and skepticism are fine, but it’s hard to find something specific that is better in general than taking averages of Likert codes.

PS: One could say that using averages implies that the *researcher* (not the respondent) aggregates the data implying equal intervals between categories, which the researcher can defend in other ways than saying that they believe that the respondents “really” use equal intervals. Of course, such arguments may be challenged.
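A tiny sketch of the median-vs-mean point: two 5-point rating distributions with identical medians where one is nonetheless stochastically larger, which the mean detects. The rating counts are invented for illustration.

```python
import statistics

# Invented counts of ratings 1-5 for two hypothetical courses.
course_a = {1: 2, 2: 5, 3: 18, 4: 10, 5: 5}
course_b = {1: 0, 2: 3, 3: 18, 4: 12, 5: 7}

def expand(counts):
    """Turn {rating: count} into the flat list of responses."""
    return [r for r, c in counts.items() for _ in range(c)]

a, b = expand(course_a), expand(course_b)
print(statistics.median(a), statistics.median(b))  # identical medians
print(statistics.mean(a), statistics.mean(b))      # means differ

def cdf(counts, r):
    """P(rating <= r) for a counts dictionary."""
    total = sum(counts.values())
    return sum(c for k, c in counts.items() if k <= r) / total

# course_b's CDF sits at or below course_a's everywhere: stochastic dominance.
print(all(cdf(course_b, r) <= cdf(course_a, r) for r in range(1, 6)))  # True
```

Here the median cannot separate the two courses even though one distribution dominates the other at every rating level, which is the pattern described in the evaluation data above.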

Is there a particular advantage to using Likert scales versus just asking students: “Score this teacher. Use a number between 0 and 5; fractions allowed.”

My point is, aggregating is OK when the respondents gave you numbers. But if they answered “somewhat happy,” it isn’t fair to assume that’s midway between happy and sad.

Rahul: I agree that having students give numbers is better in this case (actually, that’s what we do, though many social scientists would argue the result is still ordinal; it certainly is if you subscribe to the representational theory of measurement). I wrote before that there are a number of aspects of the discussed paper that I find more worthwhile to criticise, and this actually includes the category wordings.

Regarding the second point: given that the category wordings are what they are, my point was that using averages does not necessarily *assume* that the middle category is “really” exactly midway between the two extremes; rather, it implies that the researcher *decides* to treat it that way (because no other specific scoring may seem more appropriate, including those that are “admissible” for ordinal data according to Stevens), which is not the same thing.

Should have been “the original authors only state p<0.05", not 0.5.