Rick Wash writes:
A colleague at USC (Lian Jian) and I were recently discussing a statistical analysis issue that both of us have run into.
We both mostly do research about how people use online interactive websites. One property that most of these systems have is known as the “powerlaw of participation”: the distribution of the number of contributions from each person follows a powerlaw. This means that a few people contribute a TON and many, many people are in the “long tail” and contribute very rarely. For example, Facebook posts and Twitter posts both have this distribution, as do comments on blogs and many other forms of user contribution online.
This distribution has proven to be a problem when we analyze individual behavior. The basic problem is that we’d like to account for the fact that we have repeated data from many users, but a large number of users only have 1 or 2 data points. For example, Lian recently analyzed data about monetary contributions on the website Spot.Us and in her dataset, over 70% of the contributions were the sole contribution of the contributor. One of my past analyses of tags on delicious.com found similar patterns, with large numbers of tags being used only once or twice.
How would you analyze this, taking advantage of the knowledge that some data points come from the same individual? Some thoughts we’ve had:
1. Use a hierarchical model with contributions nested within people (AKA use a random effect for people). But doesn’t this have problems when the majority of people have exactly one data point?
2. Only analyze the first contribution from each person. This is what Lian ended up doing with her Spot.Us data, but this is throwing away lots of valuable and interesting data. Indeed, with a powerlaw, a large percentage of the data points come from the few high contributors.
3. Use a hierarchical model with contributions nested within people, but lump all of the “low contributors” into a single large category. This is what I did for my analysis of delicious tagging data, but it was unsatisfying because the relationship between data points in that category is different from the relationship between data points in the other categories (which each represent one individual).
We find that this problem appears over and over in our research, and were wondering if you have any thoughts about how to address it intelligently.
My reply: Do option 1. It’s fine that the majority of people have exactly one data point. (Hey, sometimes my advice is less nuanced than you might expect!)
Also use some other predictors in your model. Here are some free ones:
- A person-level predictor which is the total number of contributions from that person. (Or maybe the logarithm or reciprocal of this total number will work better as a predictor in a linear model.)
- If the contributions are time-ordered, the reciprocal of the time ranking of the contribution (so if someone has 3 contributions, this predictor will be 1, 1/2, and 1/3 for his or her contributions). This will catch whether, for people who post many times, the first few posts differ from the later ones.
- Any other person-level data you have.
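The first two “free” predictors can be computed directly from the contribution log. Here is a minimal sketch in plain Python, assuming each contribution is recorded as a (person, timestamp) pair with distinct timestamps within a person. The toy data, the function name `add_free_predictors`, and the column names are all made up for illustration and are not from the Spot.Us or delicious datasets.

```python
import math
from collections import defaultdict

# Hypothetical toy data: one (person_id, timestamp) pair per contribution.
contributions = [
    ("ann", 3), ("bob", 1), ("ann", 1), ("cat", 2), ("ann", 2), ("bob", 5),
]

def add_free_predictors(rows):
    """Return one dict per contribution with the person's total number of
    contributions (raw, log, reciprocal) and the reciprocal of the
    contribution's time rank within that person."""
    by_person = defaultdict(list)
    for person, t in rows:
        by_person[person].append(t)

    out = []
    for person, times in by_person.items():
        n = len(times)
        # Rank contributions from earliest (rank 1) to latest (rank n).
        for rank, t in enumerate(sorted(times), start=1):
            out.append({
                "person": person,
                "time": t,
                "n_total": n,
                "log_n_total": math.log(n),   # 0 for one-time contributors
                "inv_n_total": 1.0 / n,
                "inv_time_rank": 1.0 / rank,  # 1, 1/2, 1/3, ...
            })
    return out

rows = add_free_predictors(contributions)
for r in rows:
    print(r)
```

These columns would then go into the hierarchical model as ordinary predictors alongside the person-level intercepts; which transformation of the total works best is something to check against the data.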
The idea is that the multilevel analysis is not an end in itself; rather, it’s a tool to take away your worries and allow you to do some real modeling.
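To see why one data point per person is not a problem for option 1, it may help to look at the partial-pooling estimate of a single person’s intercept in the simplest normal model. The sketch below treats the population mean and both standard deviations as known, which is a simplifying assumption (in a real fit they are estimated from all the data), and the function name and numbers are illustrative only.

```python
def partial_pool(ybar_j, n_j, mu, sigma_y, sigma_alpha):
    """Precision-weighted estimate of person j's intercept.

    ybar_j      : mean of person j's observations
    n_j         : number of observations from person j
    mu          : population mean of the person-level intercepts
    sigma_y     : within-person standard deviation
    sigma_alpha : between-person standard deviation
    """
    w_data = n_j / sigma_y**2        # precision of the person's own mean
    w_prior = 1.0 / sigma_alpha**2   # precision of the population distribution
    return (w_data * ybar_j + w_prior * mu) / (w_data + w_prior)

mu, sigma_y, sigma_alpha = 0.0, 1.0, 1.0

# Same observed mean, very different amounts of data.
one = partial_pool(2.0, 1, mu, sigma_y, sigma_alpha)
many = partial_pool(2.0, 50, mu, sigma_y, sigma_alpha)
print(one, many)
```

A person with a single contribution is pulled strongly toward the population mean, while a heavy contributor’s estimate stays close to his or her own average. The singletons still help estimate the population-level parameters, so nothing needs to be thrown away or lumped together.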