
Google PageRank algorithm in Python
21st of March 2004
There are many articles on the net about how the PageRank algorithm works that all copy from the original paper written by the very founders of Google Larry Page and Sergey Brin. Google itself also has a very good article that explain it with no formulas or numerical explanations. Basically PageRank is like social networks. If you're mentioned by someone important, your importance increases and the people you mention gets upped as well.
We recently had a coursework in discrete mathematics to calculate PageRank values for all web pages in a web matrix. To be able to do this you have to do many simplifications and you're limited in terms of complexity to keep it possible to do "by hand". I wrote a little program that calculates the PageRank for any web with no simplifications. The outcome is that I can quickly calculate the PageRank values for each page.
Here's how to use it:
web = ((0, 1, 0, 0),
(0, 0, 1, 0),
(0, 0, 0, 1),
(1, 0, 0, 0))
pr = PageRanker(0.85, web)
pr.improve_guess(100)
print pr.getPageRank()
Think of the entries in the matrix as A to D along every row and which page it has a link to along the column. In the above example it means that A has a link to B, B as a link to C, C has a link to D, D has a link to A.

The PageRank values when you run 100 iterations with no random jumps
is:
Pretty obvious isn't it. One complication with the PageRank algorithm is that even if every page has an outgoing link, you don't always cover everything by just following links. That's why to sometimes need to random start over again from a randomly selected webpage. This is we we use 8.5 in the above example. That qualitativly means that there's a 15% chance that you randomly start on a random webpage and iterate from there.
Let's have a "more complex" web model:
(0, 0, 1, 0),
(0, 1, 0, 1),
(1, 1, 0, 0))

Running the algorithm again we find:
Notice how page B has the same PageRank as C and D even though page B has two links coming in to it. This is because it spreads it popularity to other pages. It also matters that the initial guess is that every page is equal initially.
Enough said, download the script yourself and make sure you have Python and the numarray Python module installed.
Comment
24 comments so far Show all 24 commentsCommenting is currently disabled in Mobile version