## Analysis of Power Law of Participation

Rick Wash writes:

A colleague at USC (Lian Jian) and I were recently discussing a statistical analysis issue that both of us have run into.

We both mostly do research about how people use online interactive websites. One property that most of these systems have is known as the “power law of participation”: the distribution of the number of contributions from each person follows a power law. This means that a few people contribute a TON and many, many people are in the “long tail” and contribute very rarely. For example, Facebook posts and Twitter posts both have this distribution, as do comments on blogs and many other forms of user contribution online.

This distribution has proven to be a problem when we analyze individual behavior. The basic problem is that we’d like to account for the fact that we have repeated data from many users, but a large number of users only have 1 or 2 data points. For example, Lian recently analyzed data about monetary contributions on the website Spot.Us, and in her dataset over 70% of the contributions were the sole contribution of the contributor. One of my past analyses of tags on delicious.com found similar patterns, with large numbers of tags being used only once or twice.

How would you analyze this, taking advantage of the knowledge that some data points are the same individual? Some thoughts we’ve had:

1. Use a hierarchical model with contributions nested within people (i.e., use a random effect for people). But does this have problems when the majority of people have exactly one data point?

2. Only analyze the first contribution from each person. This is what Lian ended up doing with her Spot.Us data, but it throws away lots of valuable and interesting data. Indeed, with a power law, a large percentage of the data points come from the few high contributors.

3. Use a hierarchical model with contributions nested within people, but lump all of the “low contributors” into a single large category. This is what I did for my analysis of delicious tagging data, but it was unsatisfying because the relationship between data points in that category is different than the relationship between data points in the other categories (which each represent one individual).

We find that this problem appears over and over in our research, and we were wondering if you have any thoughts about how to address it intelligently.

My reply: Do option 1. It’s fine that the majority of people have exactly one data point. (Hey, sometimes my advice is less nuanced than you might expect!)
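One way to see why singletons are not a problem is the simplest varying-intercept model. The notation here is an illustrative assumption, not something from the original exchange: contribution $i$ from person $j[i]$ has outcome $y_i \sim \mathrm{N}(\alpha_{j[i]}, \sigma_y^2)$, and the person effects follow $\alpha_j \sim \mathrm{N}(\mu, \sigma_\alpha^2)$. Given the other parameters, the estimate for a person with $n_j$ contributions and sample mean $\bar{y}_j$ is approximately a precision-weighted average:

$$
\hat{\alpha}_j \approx \frac{\dfrac{n_j}{\sigma_y^2}\,\bar{y}_j + \dfrac{1}{\sigma_\alpha^2}\,\mu}{\dfrac{n_j}{\sigma_y^2} + \dfrac{1}{\sigma_\alpha^2}}
$$

With $n_j = 1$ the estimate is pulled strongly toward the population mean $\mu$, while for a prolific contributor it is determined almost entirely by that person's own data. In this simple model, the split between $\sigma_y$ and $\sigma_\alpha$ is informed mainly by the people who contribute more than once.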

Also use some other predictors in your model. Here are some free ones:

- A person-level predictor which is the total number of contributions from that person. (Or maybe the logarithm or reciprocal of this total number will work better as a predictor in a linear model.)

- If the contributions are time-ordered, the reciprocal of the time ranking of the contribution (so if someone has 3 contributions, this predictor will be 1, 1/2, and 1/3 for his or her contributions). This will catch whether something different is going on when people post many times, for example if their first few posts differ from their later ones.

- Any other person-level data you have.
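The first two "free" predictors above can be computed directly from the raw contribution records. Here is a minimal sketch in plain Python, assuming each record is a dict with hypothetical keys `person` and `time`; the function name and output column names are made up for illustration.

```python
import math
from collections import defaultdict


def add_free_predictors(rows):
    """Return copies of `rows` with person-level and rank-based predictors.

    Each row is a dict with (hypothetical) keys 'person' and 'time'.
    """
    # Group contributions by person.
    by_person = defaultdict(list)
    for row in rows:
        by_person[row["person"]].append(row)

    out = []
    for person, contribs in by_person.items():
        n = len(contribs)  # total number of contributions from this person
        # Order each person's contributions in time: rank 1 is the earliest.
        ordered = sorted(contribs, key=lambda r: r["time"])
        for rank, row in enumerate(ordered, start=1):
            out.append({
                **row,
                "n_total": n,
                "log_n_total": math.log(n),
                "inv_n_total": 1.0 / n,
                "inv_rank": 1.0 / rank,  # 1, 1/2, 1/3, ...
            })
    return out


# Toy data: person "a" has three contributions, person "b" has one.
rows = [
    {"person": "a", "time": 3},
    {"person": "a", "time": 1},
    {"person": "a", "time": 2},
    {"person": "b", "time": 5},
]
features = add_free_predictors(rows)
```

The resulting columns can then be passed as predictors to whatever multilevel modeling software is being used.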

The idea is that the multilevel analysis is not an end in itself; rather, it’s a tool to take away your worries and allow you to do some real modeling.

1. Ted Dunning says:

In practice, there are other very serious issues with data like these.

To wit,

- the prolific “people” are often spammers

- some tags/music/videos are watched by nearly everybody and contribute nearly zero information

- structural issues with the experience may cause some nearly universal behavior

- the total data size is truly massive. Billions of observations are common.

While each of these points can be dealt with at some level using a multilevel model, the scale of the data can make a full-on multilevel model infeasible.

In my own experience, here are some practical steps that help with these problems:

- make sure that you count only one interaction per person. It is tempting to count all interactions simply because counting unique users is expensive or because your model says you should, but doing so often gives way too much weight to spammers

- make sure your definition of one person is broad enough to include most spammers trying to evade detection

- don’t let spammers know when you have beaten them. If they don’t know, they won’t devise counter-measures and will be much easier to detect.

Statistically speaking, here are some good ideas:

- consider that the interface probably has multiple avenues for presenting data. As such, popular items likely have their place and personalized recommendations (or something like that) have their place. Don’t try to do all tasks in all places.

- Interactions with items that are ubiquitous can profitably be downsampled if you are analyzing in an item-centric way. If you are thinking user-centrically, then you can downsample the behavior of prolific users.

- consider the mismatch between what you really need and what most statistical techniques are trying to do. For instance, many techniques will output probabilities per item. What you really often want is probability per portfolio of items. This is very different.

- consider the need for exploration. Most website analysis systems are not nice clean scientific experiments. The actions derived from the analysis drive the data that are collected since people don’t interact with things that they never see. This means that it is critical to include an exploration component in any recommendations. I do this with dithering with good effect.
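The user-centric downsampling mentioned in the list above might look something like the following. This is only a sketch under stated assumptions: interactions are `(user, item)` pairs, and the cap `max_per_user` is a hypothetical tuning parameter, not something specified in the comment.

```python
import random
from collections import defaultdict


def downsample_prolific_users(interactions, max_per_user, seed=0):
    """Keep at most `max_per_user` randomly chosen interactions per user.

    `interactions` is a list of (user, item) pairs. Users at or below the
    cap keep all of their interactions.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append((user, item))

    kept = []
    for user, events in by_user.items():
        if len(events) > max_per_user:
            # Sample without replacement from the prolific user's history.
            events = rng.sample(events, max_per_user)
        kept.extend(events)
    return kept


# Toy data: one very prolific user and one light user.
interactions = [("heavy", f"item{i}") for i in range(1000)]
interactions += [("light", "item1"), ("light", "item2")]
sample = downsample_prolific_users(interactions, max_per_user=50)
```

The same idea applies item-centrically by grouping on the item instead of the user. At the data sizes described here, the grouping would normally be done in a distributed system rather than in memory.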

2. Ken says:

I wonder whether a GEE model is the way to go when you have many singletons.

3. Dean Eckles says:

There is also often so much data that option #1 is not really an option. If you are going to use all the data, then you need to use a method for getting confidence intervals that accounts for the dependence among observations. Art Owen and I wrote a paper about bootstrapping this kind of large-scale data (http://arxiv.org/abs/1106.2125); I use this method with Facebook data.
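To illustrate what "accounting for the dependence among observations" can mean, here is a sketch of a plain user-level (cluster) bootstrap for the mean of some outcome. This is a simpler, generic technique and is not claimed to be the method from the linked paper. The function name, the `(user, value)` input format, and the defaults are assumptions made for the example.

```python
import random
from collections import defaultdict


def user_bootstrap_ci(observations, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling whole users.

    `observations` is a list of (user, value) pairs. Resampling users
    rather than individual rows keeps each user's observations together,
    so within-user dependence is reflected in the interval.
    """
    rng = random.Random(seed)

    by_user = defaultdict(list)
    for user, value in observations:
        by_user[user].append(value)
    users = list(by_user)

    means = []
    for _ in range(n_boot):
        # Draw users with replacement; each draw brings all of its rows.
        drawn = rng.choices(users, k=len(users))
        total = sum(sum(by_user[u]) for u in drawn)
        count = sum(len(by_user[u]) for u in drawn)
        means.append(total / count)

    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy data: 30 single-observation users plus one prolific user.
observations = [(f"u{i}", float(i % 3)) for i in range(30)]
observations += [("heavy", 2.0)] * 20
lo, hi = user_bootstrap_ci(observations)
```

With a heavy-tailed participation distribution, whether the prolific user happens to be drawn has a large effect on each replicate, so an interval like this will typically be wider than one from a naive row-level bootstrap.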