Loren Maxwell writes:
I am trying to do some studies on the PageRank algorithm with applying a Bayesian technique. If you are not familiar with PageRank, it is the basis for how Google ranks their pages. It basically treats the internet as a large social network with each link conferring some value onto the page it links to. For example, if I had a webpage that had only one link to it, say from my friend’s webpage, then its PageRank would be dependent on my friend’s PageRank, presumably quite low. However, if the one link to my page was off the Google search page, then my PageRank would be quite high since there are undoubtedly millions of pages linking to Google and few pages that Google links to.
The end result of the algorithm, however, is that all the PageRank values of the nodes in the network sum to one and the PageRank of a specific node is the probability that a “random surfer” will end up on that node.
For example, in the attached spreadsheet, Column D shows each node while Column E shows the probability that a random surfer will land on the node. Columns F through H are used to calculate the PageRank, Columns K and L are the links (with Column K linking to Column L), while Column M is also used to calculate the PageRank. There is a macro in the workbook that is activated by pressing “Control-Shift-P” that will copy Column H to Column E 100 times, which is usually more than enough iterations for the PageRank to converge.
You can change the links around or add nodes if you like to play around with the spreadsheet. The only requirements are that each node links to at least one other node (not a true PageRank requirement, but it keeps that math easy) and that the initial values in Column E equal 1 (usually simply set to 1/N).
The dampening factor (Cell B3) represents the chance that a random surfer will jump to another page at random rather than follow a link from the current node. The chance is 1 – the dampening factor.
In this specific example, a random surfer has a 38.2% chance to land on node A at any given time (Node A’s PageRank from Column E) and only a 3.8% change to be on node D.
My question is that if the data represented a certain time, in my specific case the scheduling of sports teams for a specific season, how can I incorporate a Bayesian method into the PageRank algorithm that represents the probability of landing on a certain node for the following season when only a few games have been played. Basically, I envision the old network as a list of nodes with their associated probabilities for last season that can be compared with a list of corresponding nodes with their associated probabilities for the current season, however I am unsure as to how I might reflect this using Bayesian analysis.
It would be helpful to also see what your thoughts would be on dropping and adding nodes in between the periods as well.
I’d start by estimating the probabilities using a simple (non-network) model, getting posterior simulations from this Bayesian inference, then running your Page Rank algorithm separately for each simulation. You can then use the output from these simulations to summarize your uncertainty. It’s not perfect—you’re not using the network model to get your uncertainties—but it seems like a reasonable start.