海归网首页   海归宣言   导航   博客   广告位价格  
海归论坛首页 会员列表 
收 藏 夹 
论坛帮助 
登录 | 登录并检查站内短信 | 个人设置 论坛首页 |  排行榜  |  在线私聊 |  专题 | 版规 | 搜索  | RSS  | 注册 | 活动日历
主题: 来篇纯技术的(ZT)。Google Pagerank Algorithm explained beautifully
回复主题   printer-friendly view    海归论坛首页 -> 海归商务           焦点讨论 | 精华区 | 嘉宾沙龙 | 白领丽人沙龙
  阅读上一个主题 :: 阅读下一个主题
作者 来篇纯技术的(ZT)。Google Pagerank Algorithm explained beautifully   
保尔
[博客]
[个人文集]





游客










文章标题: 来篇纯技术的(ZT)。Google Pagerank Algorithm explained beautifully (1764 reads)      时间: 2005-6-04 周六, 11:01      

作者:游客海归商务 发贴, 来自【海归网】 http://www.haiguinet.com

政治是狗屎,搅得越多越臭。还是来篇旧文,供龟友们复习、欣赏。链接



The Google Pagerank Algorithm
and How It Works




Ian Rogers
IPR Computing Ltd.
[email protected]




Introduction




Page
Rank is a topic much discussed by Search Engine Optimisation (SEO)
experts. At the heart of PageRank is a mathematical formula that seems
scary to look at but is actually fairly simple to understand.




Despite
this many people seem to get it wrong! In particular “Chris Ridings of
www.searchenginesystems.net” has written a paper entitled “PageRank
Explained: Everything you’ve always wanted to know about PageRank”,
pointed to by many people, that contains a fundamental mistake early on in the explanation! Unfortunately this means some of the recommendations in the paper are not quite accurate.




By showing code to correctly calculate real PageRank I hope to achieve several things in this response:





  1. Clearly explain how PageRank is calculated.
  2. Go
    through every example in Chris’ paper, and add some more of my own,
    showing the correct PageRank for each diagram. By showing the code used
    to calculate each diagram I’ve opened myself up to peer review - mostly
    in an effort to make sure the examples are correct, but also because
    the code can help explain the PageRank calculations.
  3. Describe some principles and observations on website design based on these correctly calculated examples.




Any
good web designer should take the time to fully understand how PageRank
really works - if you don’t then your site’s layout could be seriously
hurting your Google listings!




[Note: I have nothing in particular against Chris. If I find any other papers on the subject I’ll try to comment evenly]




How is PageRank Used?




PageRank
is one of the methods Google uses to determine a page’s relevance or
importance. It is only one part of the story when it comes to the
Google listing, but the other aspects are discussed elsewhere (and are
ever changing) and PageRank is interesting enough to deserve a paper of
its own.




PageRank is also displayed on the toolbar of your browser if you’ve installed the Google toolbar (https://toolbar.google.com/). But the Toolbar PageRank only goes from 0 – 10
and seems to be something like a logarithmic scale:




































Toolbar PageRank
(log base 10)



Real PageRank



0



0 - 10



1



100 - 1,000



2



1,000 - 10,000



3



10,000 - 100,000



4



and so on...





We
can’t know the exact details of the scale because, as we’ll see later,
the maximum PR of all pages on the web changes every month when Google
does its re-indexing! If we presume the scale is logarithmic (although
there is only anecdotal evidence for this at the time of writing) then
Google could simply give the highest actual PR page a toolbar PR of 10
and scale the rest appropriately.




Also
the toolbar sometimes guesses! The toolbar often shows me a Toolbar PR
for pages I’ve only just uploaded and cannot possibly be in the index
yet!




What seems to be happening
is that the toolbar looks at the URL of the page the browser is
displaying and strips off everything down the last “/” (i.e. it goes to
the “parent” page in URL terms). If Google has a Toolbar PR for that
parent then it subtracts 1 and shows that as the Toolbar PR for this
page. If there’s no PR for the parent it goes to the parent’s parent’s
page, but subtracting 2, and so on all the way up to the root of your
site.  If it can’t find a Toolbar PR to display in this way, that is if
it doesn’t find a page with a real calculated PR, then the bar is
greyed out.




Note
that if the Toolbar is guessing in this way, the Actual PR of the page
is 0 - though its PR will be calculated shortly after the Google spider
first sees it.




PageRank says
nothing about the content or size of a page, the language it’s written
in, or the text used in the anchor of a link!




Definitions




I’ve
started to use some technical terms and shorthand in this paper. Now’s
as good a time as any to define all the terms I’ll use:
























PR:



Shorthand
for PageRank: the actual, real, page rank for each page as calculated
by Google. As we’ll see later this can range from 0.15 to billions.



Toolbar PR:



The PageRank displayed in the Google toolbar in your browser. This ranges from 0 to 10.



Backlink:



If page A links out to page B, then page B is said to have a “backlink” from page A.





That’s enough of that, let’s get back to the meat…




So what is PageRank?




In
short PageRank is a “vote”, by all the other pages on the Web, about
how important a page is. A link to a page counts as a vote of support.
If there’s no link there’s no support (but it’s an abstention from
voting rather than a vote against the page).




Quoting from the original Google paper, PageRank is defined like this:





    We
    assume page A has pages T1...Tn which point to it (i.e., are
    citations). The parameter d is a damping factor which can be set
    between 0 and 1. We usually set d to 0.85. There are more details about
    d in the next section. Also C(A) is defined as the number of links
    going out of page A. The PageRank of a page A is given as follows:

    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

    Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

    PageRank
    or PR(A) can be calculated using a simple iterative algorithm, and
    corresponds to the principal eigenvector of the normalized link matrix
    of the web.





but that’s not too helpful so let’s break it down into sections.





  1. PR(Tn)
    - Each page has a notion of its own self-importance. That’s “PR(T1)”
    for the first page in the web all the way up to “PR(Tn)” for the last
    page
  2. C(Tn) - Each page
    spreads its vote out evenly amongst all of it’s outgoing links. The
    count, or number, of outgoing links for page 1 is “C(T1)”, “C(Tn)” for
    page n, and so on for all pages.
  3. PR(Tn)/C(Tn) - so if our page (page A) has a backlink from page “n” the share of the vote page A will get is “PR(Tn)/C(Tn)”
  4. d(...
    - All these fractions of votes are added together but, to stop the
    other pages having too much influence, this total vote is “damped down”
    by multiplying it by 0.85 (the factor “d”)
  5. (1 - d) - The (1 – d) bit at the beginning is a bit of probability math magic so the “sum of all web pages' PageRanks will be one”: it adds in the bit lost by the d(....
    It also means that if a page has no links to it (no backlinks) even
    then it will still get a small PR of 0.15 (i.e. 1 – 0.85). (Aside: the
    Google paper says “the sum of all pages” but they mean the “the
    normalised sum” – otherwise known as “the average” to you and me.




How is PageRank Calculated?




This
is where it gets tricky. The PR of each page depends on the PR of the
pages pointing to it. But we won’t know what PR those pages have until
the pages pointing to them have their PR calculated and so on…
And when you consider that page links can form circles it seems
impossible to do this calculation!




But actually it’s not that bad. Remember this bit of the Google paper:





    PageRank
    or PR(A) can be calculated using a simple iterative algorithm, and
    corresponds to the principal eigenvector of the normalized link matrix
    of the web.





What that means to us is that we can just go ahead and calculate a page’s PR without knowing the final value of the PR of the other pages.
That seems strange but, basically, each time we run the calculation
we’re getting a closer estimate of the final value. So all we need to
do is remember the each value we calculate and repeat the calculations
lots of times until the numbers stop changing much.




Lets take the simplest example network: two pages, each pointing to the other:











Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) = 1).




 




Guess 1




We don’t know what their PR should be to begin with, so let’s take a guess at 1.0 and do some calculations:



















d



= 0.85



PR(A)



= (1 – d) + d(PR(B)/1)



PR(B)



= (1 – d) + d(PR(A)/1)




i.e.















PR(A)



= 0.15 + 0.85 * 1
= 1



PR(B)



= 0.15 + 0.85 * 1
= 1




Hmm, the numbers aren’t changing at all! So it looks like we started out with a lucky guess!!!




Guess 2




No,
that’s too easy, maybe I got it wrong (and it wouldn’t be the first
time). Ok, let’s start the guess at 0 instead and re-calculate:

















PR(A)



= 0.15 + 0.85 * 0
= 0.15



 



PR(B)



= 0.15 + 0.85 * 0.15
= 0.2775



NB. we’ve already calculated a “next best guess” at PR(A) so we use it here




And again:















PR(A)



= 0.15 + 0.85 * 0.2775
= 0.385875



PR(B)



= 0.15 + 0.85 * 0.385875
= 0.47799375




And again















PR(A)



= 0.15 + 0.85 * 0.47799375
= 0.5562946875



PR(B)



= 0.15 + 0.85 * 0.5562946875
= 0.622850484375




and
so on. The numbers just keep going up. But will the numbers stop
increasing when they get to 1.0? What if a calculation over-shoots and
goes above 1.0?




 




Guess 3




Well let’s see. Let’s start the guess at 40 each and do a few cycles:





    PR(A) = 40
    PR(B) = 40





First calculation















PR(A)



= 0.15 + 0.85 * 40
= 34.25



PR(B)



= 0.15 + 0.85 * 0.385875
= 29.1775




And again















PR(A)



= 0.15 + 0.85 * 29.1775
= 24.950875



PR(B)



= 0.15 + 0.85 * 24.950875
= 21.35824375




Yup, those numbers are heading down alright! It sure looks the numbers will get to 1.0 and stop




Here’s the code used to calculate this example starting the guess at 0: Show the code | Run the program





  • Principle: it doesn’t matter where you start your guess, once the PageRank calculations have settled down, the “normalized probability distribution” (the average PageRank for all pages) will be 1.0




 




Getting the answer quicker




How
many times do we need to repeat the calculation for big networks?
That’s a difficult question; for a network as large as the World Wide
Web it can be many millions of iterations! The “damping factor” is
quite subtle. If it’s too high then it takes ages for the numbers to
settle, if it’s too low then you get repeated over-shoot, both above
and below the average - the numbers just swing about the average like a
pendulum and never settle down.




Also
choosing the order of calculations can help. The answer will always
come out the same no matter which order you choose, but some orders
will get you there quicker than others.




I’m
sure there’s been several Master’s Thesis on how to make this
calculation as efficient as possible, but, in the examples below, I’ve
used very simple code for clarity and roughly 20 to 40 iterations were
needed!




 




Example 1











I’m
not going to repeat the calculations here, but you can see them by
running the program (yes, if you click the link the program really is
re-run to do the calculations for you)







So the correct PR for the example is:











You can see it took about 20 iterations before the network began to settle on these values!




Look at Page D though - it has a PR of 0.15 even though no-one is voting for it (i.e. it has no incoming links)! Is this right?




The first part, or "term" to be techinal, of the PR equation is doing this:





    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))





So, for Page D, no backlinks means the equation looks like this:











PR(A)



= (1-d) + d * (0)
= 0.15




no matter what else is going on or how many times you do it.





    Observation: every
    page has at least a PR of 0.15 to share out. But this may only be in
    theory - there are rumours that Google undergoes a post-spidering phase
    whereby any pages that have no incoming links at all are completely
    deleted from the index...

     





Example 2




A simple hierarchy with some outgoing links





















As
you’d expect, the home page has the most PR – after all, it has the
most incoming links! But what’s happened to the average? It’s only
0.378!!! That doesn’t tie up with what I said earlier so something is
wrong somewhere!




Well
no, everything is fine. But take a look at the “external site” pages –
what’s happening to their PageRank? They’re not passing it on, they’re
not voting for anyone, they’re wasting their PR like so much pregnant
chad!!! (NB, a more accurate description of this issue can be found in
this thread)




 




Example 3




Let’s link those external sites back into our home page just so we can see what happens to the average…














That’s
better - it does work after all! And look at the PR of our home page!
All those incoming links sure make a difference – we’ll talk more about
that later.




 




Example 4




What happens to PR if we follow a suggestion about writing page reviews?





















 




 




Example 5




A simple hierarchy














Our home page has 2 and a half times as much PR as the child pages! Excellent!





  • Observation: a hierarchy concentrates votes and PR into one page




 




Example 6




Looping














This
is what we’d expect. All the pages have the same number of incoming
links, all pages are of equal importance to each other, all pages get
the same PR of 1.0 (i.e. the “average” probability).




 




Example 7




Extensive Interlinking – or Fully Meshed














Yes, the results are the same as the Looping example above and for the same reasons.




 




Example 8




Hierarchical – but with a link in and one out.




We’ll
assume there’s an external site that has lots of pages and links with
the result that one of the pages has the average PR of 1.0. We’ll also
assume the webmaster really likes us – there’s just one link from that
page and it’s pointing at our home page.














In
example 5 the home page only had a PR of 1.92 but now it is 3.31!
Excellent! Not only has site A contributed 0.85 PR to us, but the
raised PR in the “About”, “Product” and “More” pages has had a lovely
“feedback” effect, pushing up the home page’s PR even further!





  • Priciple: a well structured site will amplify the effect of any contributed PR




 




Example 9




Looping – but with a link in and a link out














Well, the PR of our home page has gone up a little, but what’s happened to the “More” page?




The
vote of the “Product” page has been split evenly between it and the
external site. We now value the external Site B equally with our “More”
page. The “More” page is getting only half the vote it had before –
this is good for Site B but very bad for us!




 




Example 10




Fully meshed – but with one vote in and one vote out














That’s
much better. The “More” page is still getting less share of the vote
than in example 7 of course, but now the “Product” page has kept three
quarters of its vote within our site - unlike example 10 where it was
giving away fully half of it’s vote to the external site!




Keeping
just this small extra fraction of the vote within our site has had a
very nice effect on the Home Page too – PR of 2.28 compared with just
1.66 in example 10.





  • Observation:
    increasing the internal links in your site can minimise the damage to
    your PR when you give away votes by linking to external sites.
  • Principle:
    • If a particular page is highly important – use a hierarchical structure with the important page at the “top”.
    • Where a group of pages may contain outward links – increase the number of internal links to retain as much PR as possible.
    • Where a group of pages do not contain outward links – the number of internal links in the site has no effect on the site’s average PR. You might as well use a link
      structure that gives the user the best navigational experience.




 




Site Maps




Site maps are useful in at least two ways:





  • If
    a user types in a bad URL most websites return a really unhelpful “404
    – page not found” error page. This can be discouraging. Why not
    configure your server to return a page that shows an error has been
    made, but also gives the site map? This can help the user enormously
  • Linking
    to a site map on each page increases the number of internal links in
    the site, spreading the PR out and protecting you against your vote
    “donations”




 




Example 11




Lets try to fix our site to artificially concentrate the PR into the home page.











That looks good, most of the links seem to be pointing up to page A so we should get a nice PR.




 




Try to guess what the PR of A will be before you scroll down or run the code.







 




 











Oh
dear, that didn’t work at all well – it’s much worse than just an
ordinary hierarchy! What’s going on is that pages C and D have such
weak incoming links that they’re no help to page A at all!





  • Principle: trying to abuse the PR calculation is harder than you think.




 




Example 12




A
common web layout for long documentation is to split the document into
many pages with a “Previous” and “Next” link on each plus a link back
to the home page. The home page then only needs to point to the first
page of the document.














In
this simple example, where there’s only one document, the first page of
the document has a higher PR than the Home Page! This is because page B
is getting all the vote from page A, but page A is only getting
fractions of pages B, C and D.





  • Principle:
    in order to give users of your site a good experience, you may have to
    take a hit against your PR. There’s nothing you can do about this - and
    neither should you try to or worry about it! If your site is a pleasure
    to use lots of other webmasters will link to it and you’ll get back
    much more PR than you lost.




Can
you also see the trend between this and the previous example? As you
add more internal links to a site it gets closer to the Fully Meshed
example where every page gets the average PR for the mesh.





  • Observation: as you add more internal links in your site, the PR will be spread out more evenly between the pages.




 




Example 13




Getting high PR the wrong way and the right way.




Just as an experiment, let’s see if we can get 1,000 pages pointing to our home page, but only have one link leaving it…














Yup, those spam pages are pretty worthless but they sure add up!





  • Observation:
    it doesn’t matter how many pages you have in your site, your average PR
    will always be 1.0 at best. But a hierarchical layout can strongly
    concentrate votes, and therefore the PR, into the home page!




This
is a technique used by some disreputable sites (mostly adult content
sites). But I can’t advise this - if Google’s robots decide you’re
doing this there’s a good chance you’ll be banned from Google! Disaster!




On the other hand there are at least two right ways to do this:




1. Be a Mega-site




Mega-sites, like https://news.bbc.co.uk
have tens or hundreds of editors writing new content – i.e. new pages -
all day long! Each one of those pages has rich, worthwile content of
its own and a link back to its parent or the home page! That’s why the
Home page Toolbar PR of these sites is 9/10 and the rest of us just get
pushed lower and lower by comparison…





  • Principle: Content Is King! There really is no substitute for lots of good content…




 




2. Give away something useful




www.phpbb.com has a Toolbar PR of 8/10 (at the time of writing) and it has no big money or marketing behind it! How can this be?




What
the group has done is write a very useful bulletin board system that is
becoming very popular on many websites. And at the bottom of every
page, in every installation, is this HTML code:




Powered by <a href="https://www.phpbb.com/" target="_blank">phpBB</a>



The administrator of each installation can remove that link, but most don’t because they want to return the favour…




Can you imagine all those millions of pages giving a fraction of a vote to www.phpbb.com? Wow!





  • Principle:
    Make it worth other people’s while to use your content or tools. If
    your give-away is good enough other site admins will gladly give you a
    link back.
  • Principle:
    it’s probably better to get lots (perhaps thousands) of links from
    sites with small PR than to spend any time or money desperately trying
    to get just the one link from a high PR page.




 




A Discussion on Averages




From the Brin and Page paper, the average Actual PR of all pages in the index is 1.0!




So
if you add pages to a site you’re building the total PR will go up by
1.0 for each page (but only if you link the pages together so the
equation can work), but the average will remain the same.




If
you want to concentrate the PR into one, or a few, pages then
hierarchical linking will do that. If you want to average out the PR
amongst the pages then "fully meshing" the site (lots of evenly
distributed links) will do that - examples 5, 6, and 7 in my above.
(NB. this is where Ridings’ goes wrong, in his MiniRank model feedback
loops will increase PR - indefinitely!)




Getting
inbound links to your site is the only way to increase your site's
average PR. How that PR is distributed amongst the pages on your site
depends on the details of your internal linking and which of your pages
are linked to.




If
you give outbound links to other sites then your site's average PR will
decrease (you're not keeping your vote "in house" as it were). Again
the details of the decrease will depend on the details of the linking.




Given
that the average of every page is 1.0 we can see that for every site
that has an actual ranking in the millions (and there are some!) there
must be lots and lots of sites who's Actual PR is below 1.0
(particularly because the absolute lowest Actual PR available is (1 -
d)).




It
may be that the Toolbar PR 1,2 correspond to Actual PR's lower than
1.0! E.g. the logbase for the Toolbar may be 10 but the Actual PR
sequence could start quite low: 0.01, 0.1, 1, 10, 100, 1,000 etc...




Finally




PageRank
is, in fact, very simple (apart from one scary looking formula). But
when a simple calculation is applied hundreds (or billions) of times
over the results can seem complicated.




PageRank
is also only part of the story about what results get displayed high up
in a Google listing. For example there’s some evidence to suggest that
Google is paying a lot of attention these days to the text in a link’s
anchor when deciding the relevance of a target page – perhaps more so
than the page’s PR…




PageRank is still part of the listings story though, so it’s worth your while as a good designer to make sure you understand it correctly.




 




Links







About the Author




Ian Rogers
first used the Internet in 1986 sending email on a University VAX
machine! He first installed a webserver in 1990, taught himself HTML
and perl CGI scripting. Since then he has been a Senior Research Fellow
in User Interface Design and a consultant in Network Security and
Database Backed Websites. He has had an informal interest in topology
and the mathematics and behaviour of networks for years and has also
been known to do a little Jive dancing.







作者:游客海归商务 发贴, 来自【海归网】 http://www.haiguinet.com









相关主题
纯技术贴:企业业务与财务对比分析中,如何选择benchmark?--来自于... 海归商务 2012-3-20 周二, 02:11
[不懂就问]请教有关玻璃的若干问题,纯学术探讨。 海归酒吧 2011-3-14 周一, 16:15
国内把马云,阿里巴巴捧得太高了,俺一直觉得不对劲,阿里巴巴没啥技术含量。马... 海归商务 2011-2-22 周二, 11:13
[讨论]已有美国绿卡,申请澳大利亚PR的必要和重要性--纯技术讨论(附艳照) 海归主坛 2010-1-27 周三, 07:13
纯技术讨论 海归酒吧 2009-12-21 周一, 07:30
再转贴一篇宋技术员有关文章,咱这回可真成为世界超级大国了 海归主坛 2009-8-23 周日, 11:45
[波儿的波]看小草女友征婚,关于女孩子避免受骗的一点技术提示。。。 海归酒吧 2009-3-12 周四, 11:34
[原创]纯技术建议----海归网的页面耗费资源过多 海归主坛 2008-7-05 周六, 10:23

返回顶端
  • 来篇纯技术的(ZT)。Google Pagerank Algorithm explained beautifully -- 保尔 - (64224 Byte) 2005-6-04 周六, 11:01 (1764 reads)
显示文章:     
回复主题   printer-friendly view    海归论坛首页 -> 海归商务           焦点讨论 | 精华区 | 嘉宾沙龙 | 白领丽人沙龙 所有的时间均为 北京时间


 
论坛转跳:   
不能在本论坛发表新主题, 不能回复主题, 不能编辑自己的文章, 不能删除自己的文章, 不能发表投票, 您 不可以 发表活动帖子在本论坛, 不能添加附件不能下载文件, 
   热门标签 更多...
   论坛精华荟萃 更多...
   博客热门文章 更多...


海归网二次开发,based on phpbb
Copyright © 2005-2026 Haiguinet.com. All rights reserved.