Loren Maxwell writes:

I am trying to do some studies on the PageRank algorithm with applying a Bayesian technique. If you are not familiar with PageRank, it is the basis for how Google ranks their pages. It basically treats the internet as a large social network with each link conferring some value onto the page it links to. For example, if I had a webpage that had only one link to it, say from my friend’s webpage, then its PageRank would be dependent on my friend’s PageRank, presumably quite low. However, if the one link to my page was off the Google search page, then my PageRank would be quite high since there are undoubtedly millions of pages linking to Google and few pages that Google links to.

The end result of the algorithm, however, is that all the PageRank values of the nodes in the network sum to one and the PageRank of a specific node is the probability that a “random surfer” will end up on that node.

For example, in the attached spreadsheet, Column D shows each node while Column E shows the probability that a random surfer will land on the node. Columns F through H are used to calculate the PageRank, Columns K and L are the links (with Column K linking to Column L), while Column M is also used to calculate the PageRank. There is a macro in the workbook that is activated by pressing “Control-Shift-P” that will copy Column H to Column E 100 times, which is usually more than enough iterations for the PageRank to converge.

You can change the links around or add nodes if you'd like to play around with the spreadsheet. The only requirements are that each node links to at least one other node (not a true PageRank requirement, but it keeps the math easy) and that the initial values in Column E sum to 1 (usually simply set to 1/N each).

The dampening factor (Cell B3) represents the chance that a random surfer will follow a link from the current node; the chance of jumping to another page at random instead is 1 – the dampening factor.

In this specific example, a random surfer has a 38.2% chance to land on node A at any given time (Node A’s PageRank from Column E) and only a 3.8% chance to be on node D.
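The spreadsheet’s iterate-and-copy loop is just power iteration. Here is a minimal sketch of it in Python (NumPy assumed; the 4-node link list is made up for illustration and is not the spreadsheet’s actual data):

```python
import numpy as np

def pagerank(links, n, damping=0.85, iters=100):
    """Power iteration: apply the PageRank update repeatedly,
    like copying Column H back into Column E 100 times."""
    # Build a column-stochastic matrix M from the link list:
    # M[dst, src] = 1 / out_degree(src) for each link src -> dst.
    M = np.zeros((n, n))
    out_degree = np.zeros(n)
    for src, _ in links:
        out_degree[src] += 1
    for src, dst in links:
        M[dst, src] = 1.0 / out_degree[src]

    r = np.full(n, 1.0 / n)  # initial values 1/N, summing to 1
    for _ in range(iters):
        r = damping * M @ r + (1 - damping) / n
    return r

# Hypothetical graph: every node links to at least one other node.
links = [(0, 1), (0, 2), (1, 0), (2, 0), (3, 0)]
r = pagerank(links, n=4)
print(r.round(3), r.sum())  # the ranks form a probability distribution
```

Because every node here has at least one outgoing link, the matrix is column-stochastic and the ranks sum to one, matching the spreadsheet’s requirement.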

My question: if the data represent a certain period (in my specific case, the scheduling of sports teams for a specific season), how can I incorporate a Bayesian method into the PageRank algorithm to represent the probability of landing on a certain node in the following season, when only a few games have been played? Basically, I envision the old network as a list of nodes with their associated probabilities from last season, which can be compared with a corresponding list of nodes and their associated probabilities for the current season; however, I am unsure how I might reflect this using Bayesian analysis.

It would also be helpful to see your thoughts on dropping and adding nodes between the periods.

My reply:

I’d start by estimating the probabilities using a simple (non-network) model, getting posterior simulations from this Bayesian inference, then running your PageRank algorithm separately for each simulation. You can then use the output from these simulations to summarize your uncertainty. It’s not perfect—you’re not using the network model to get your uncertainties—but it seems like a reasonable start.
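As one way to sketch that suggestion (made-up game counts, a uniform Dirichlet prior, NumPy assumed; certainly not the only reasonable model): draw a transition matrix from a simple column-wise Dirichlet posterior, run PageRank on each draw, and summarize the spread of the resulting ranks.

```python
import numpy as np

rng = np.random.default_rng(0)

def pagerank_from_matrix(M, damping=0.85, iters=100):
    """Power iteration on a column-stochastic transition matrix."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * M @ r + (1 - damping) / n
    return r

# Hypothetical observed counts: counts[i, j] = links (games) from node j to node i.
counts = np.array([[0, 3, 2, 1],
                   [4, 0, 1, 0],
                   [1, 2, 0, 2],
                   [0, 1, 3, 0]], dtype=float)

n = counts.shape[0]
n_sims = 1000
ranks = np.empty((n_sims, n))
for s in range(n_sims):
    # One posterior draw per column: Dirichlet(counts + 1), i.e. a uniform prior.
    M = np.column_stack([rng.dirichlet(counts[:, j] + 1) for j in range(n)])
    ranks[s] = pagerank_from_matrix(M)

# Posterior summaries of each node's PageRank.
mean = ranks.mean(axis=0)
lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)
```

The intervals here reflect only the uncertainty in the transition probabilities, not any network-model uncertainty, which is the caveat noted above.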

A number of folks have looked at Bayesian algorithms as an alternative to, or an augmentation of, the PageRank algorithm. Probably the most common method is some variation of Latent Dirichlet Allocation (LDA). It actually works fairly well, and even the simplest implementation can handle billions of pages on a desktop computer. The key intuition comes from the consideration of the conditional relationship of mypage with yourpage; the context of mypage relative to yourpage is explicitly considered in LDA-type algorithms. The logic in your [Loren] discussion already captures this conditional dependence, and LDA captures the Bayesian structure in a more formal sense. My stuff is unfortunately not accessible and probably already dated relative to what’s out there, but here’s a link to a commercial venture in the area: http://www.scriptol.com/seo/lda.php . FWIW I’m not affiliated with SEO in any capacity, I just grabbed this as a typical example.

That scriptol page was very confused about what Bayesian inference is. As far as I know, no one’s carried out proper Bayesian inference (integrating over the posterior) with LDA because of the problem with modes. Usually they use approximations of a single mode for inference, often without incorporating any estimates of uncertainty.

There are really two issues brought up by Loren Maxwell in the original question. One is how to deal with the time evolution of pages and incorporate information about links last year to predict links this year. Presumably some kind of time-series model would be appropriate here with the assumption that this year looks like last year plus some randomness.

The second is how to make the model Bayesian, which would mean defining a full joint probability model and then computing the conditional probability of parameters given observed data. PageRank’s usually not presented this way. It’s usually defined as the result of fixing model parameters (a stochastic transition matrix based on link counts and a smoothing fresh-start vector), running a simulation, then computing empirical estimates of the percentage of time spent on a single page. If you run long enough, the empirical estimates converge to the fixed point of

$\pi = \lambda \theta \pi + (1 - \lambda) \epsilon$,

where $\pi$ is a simplex parameter representing the percentage of time spent on each page in the simulation, $\theta$ is a stochastic transition matrix, $\epsilon$ is the smoothing (background random jump or “dampening”) distribution, and $\lambda$ is the dampening factor. It would be easy enough to put a prior on $\theta$ here based on last year’s data. I’d also look into using predictors like team winningness, national TV appearances, home market, etc. One could also try to put priors on $\epsilon$ or the parameter of interest $\pi$ itself — you can think of the $\epsilon$ as a kind of prior on the transition matrix $\lambda \theta + (1 - \lambda) \epsilon \mathbf{1}^\top$.
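Numerically, that fixed point can be computed directly as a linear solve rather than by simulation (a toy sketch with a made-up 3-node transition matrix and uniform smoothing vector; NumPy assumed):

```python
import numpy as np

lam = 0.85                           # dampening factor
theta = np.array([[0.0, 1.0, 0.5],   # column-stochastic transition matrix
                  [0.5, 0.0, 0.5],
                  [0.5, 0.0, 0.0]])
eps = np.full(3, 1.0 / 3)            # uniform smoothing vector

# Fixed point of pi = lam * theta @ pi + (1 - lam) * eps, as a linear system.
pi = np.linalg.solve(np.eye(3) - lam * theta, (1 - lam) * eps)

# Power iteration converges to the same fixed point.
r = np.full(3, 1.0 / 3)
for _ in range(200):
    r = lam * theta @ r + (1 - lam) * eps
assert np.allclose(pi, r)
```

Summing both sides of the fixed-point equation shows the solution stays on the simplex, so no renormalization step is needed.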

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect007.htm

I have to say I’m a bit amused by this from the email: “It basically treats the internet as a large social network.”

Is a social network even a real thing? PageRank treats the web as a citation graph. Most models of social networks treat social interaction as a citation graph as well. The citation graph is the thing!

Calling it a “social network” perhaaps just let the writer talk loosely about graphs; most people understand that “social networks” are usually formalized as graphs anyway.

Given my understanding of the reader’s question, we have two Markov chains here: the one for last year’s data and the one for the early part of this year’s data (a random walk over a graph is just a Markov chain).

So to incorporate a prior, we can simply make a new chain that’s a convex combination of last year’s chain and this year’s chain. Or equivalently, take a convex combination of the probability transition matrices for the two Markov chains and run PageRank on the result.
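A sketch of that combination (hypothetical game counts and an arbitrary weighting rule based on how many games have been played so far; NumPy assumed):

```python
import numpy as np

def pagerank_from_matrix(M, damping=0.85, iters=200):
    """Power iteration on a column-stochastic transition matrix."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * M @ r + (1 - damping) / n
    return r

def normalize_columns(counts):
    """Turn a matrix of link (game) counts into a column-stochastic matrix."""
    return counts / counts.sum(axis=0, keepdims=True)

# Hypothetical counts: a full season last year, only a few games this year.
last = np.array([[0, 5, 2], [3, 0, 4], [2, 1, 0]], dtype=float)
this_yr = np.array([[0, 1, 0], [1, 0, 1], [1, 1, 0]], dtype=float)

# Weight toward last season early on; shift toward this season as games accrue.
w = this_yr.sum() / (this_yr.sum() + last.sum())
M = (1 - w) * normalize_columns(last) + w * normalize_columns(this_yr)
r = pagerank_from_matrix(M)
```

How to set the weight is the real modeling question; weighting by the share of games observed so far is just one crude choice.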

Does that do the trick?