Many web sites allow users to cast votes on items. These votes are then often used to gauge the items’ “popularity” and rank the rated items accordingly.
And when ranking comes into play, things get tricky:
- The system can have inherent deficiencies in ranking items.
- That is mostly because developers tend to “re-invent the wheel” and throw in their own algorithms instead of basing their calculations on well-established statistical formulae (I’ll come to that in a moment, just bear with me).
- There will be people (e.g. spammers) trying to fool the system and push their submissions to the top.
- There will be system inefficiencies due to computational complexity.
In this article,
- I’ll try to give examples on how to approach the problem;
- describe the weakness of each particular approach;
- and explain how some well-known social community sites implement their ranking algorithms.
- Finally I’ll discuss the current ranking algorithm we use in linkibol.
So let us begin with the most straightforward (and thus the most error-prone) approach:
Votes Determine Popularity
We have a database record, whether it be a news story, picture, video, podcast, whatever.
We could order the “Popular” items by number of votes and be done with it.
Popularity = Votes
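As a sketch (field names and numbers invented for illustration), this amounts to a plain sort by vote count:

```ruby
# Naive popularity: order items by raw vote count, highest first.
items = [
  { title: "A", votes: 120 },
  { title: "B", votes: 45 },
  { title: "C", votes: 98 }
]

popular = items.sort_by { |item| -item[:votes] }
# popular.first is "A" with 120 votes
```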
This algorithm is far from being perfect.
Let me try to explain why:
Say last month was a very active month and popular records in that month received around 100 votes.
This month is not very active in terms of traffic (assume it’s a holiday season and your user base is rushing to beaches instead of hysterically hunting items and casting votes), and as a result each item has received an average of 30 votes.
Therefore, popular items from last month will probably never see the front page.
Let’s introduce the age of the record as another variable. If the record is newer it should have a higher prominence on the front page, yes?
Popularity = Votes / Record Age
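A minimal sketch of this formula (ages in days, numbers invented for illustration):

```ruby
# Popularity = votes / record age (age measured in days here).
def popularity(votes, age_in_days)
  votes.to_f / age_in_days
end

popularity(100, 30) # last month's hit:    ~3.33
popularity(30, 1)   # today's modest item: 30.0 -- the newer item wins
```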
The older the record, the more votes it requires to achieve popularity. But that’s not fair:
Again, assume someone woke up at 3 am and submitted an excellent item to the site while the rest of the community was asleep instead of being online and frantically casting votes.
This excellent item, which has caused our beloved user to wake up in the middle of the night, will not receive the credit it deserves as it fades away due to the dampening effect of time.
So let’s look at the age of each record:
Popularity = (V1/A1) + (V2/A2) + … + (Vn/An)
- Vn is a vote,
- An is the age of that vote (for example, in minutes).
The weighted values of all the votes are summed to produce a popularity score.
This seems to solve the previous problem, but introduces a new one:
If a single person votes on a year-old record, his vote will be worth more than 200 votes cast a year earlier when the record was created. Old material comes back to haunt the front page.
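The haunting effect can be seen with a quick calculation (every vote worth 1 point, ages in minutes):

```ruby
# Popularity = sum of V_i / A_i, where A_i is the age of each vote in minutes.
def popularity(vote_ages)
  vote_ages.sum { |age| 1.0 / age }
end

# 200 votes cast a year ago (~525,600 minutes old each)...
old_record = popularity(Array.new(200, 525_600))
# ...are outweighed by a single vote cast 10 minutes ago:
late_vote = popularity([10])
late_vote > old_record # => true
```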
Let’s take a look again at the age of the record…
Dampening The Weighted Votes By Record Age
So let us introduce the age of the record into the equation:
Popularity = [ (V1/A1) + (V2/A2) + … + (Vn/An) ] / Record Age
This formula dampens the vote score we’ve calculated above based on the item’s age, which prevents old stories from leaping back to the front page.
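As a sketch (ages in minutes, numbers invented):

```ruby
# Popularity = (sum of V_i / A_i) / record_age
def popularity(vote_ages, record_age)
  vote_ages.sum { |age| 1.0 / age } / record_age
end

# One vote cast 10 minutes ago on a year-old record...
old_record = popularity([10], 525_600)
# ...no longer beats a freshly submitted record with the same vote:
new_record = popularity([10], 60)
new_record > old_record # => true
```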
It seems to solve our problem but it’s computationally inefficient:
It is difficult to implement in a practical system.
Moreover, it’s overly dependent on time, which also adds to the algorithm’s inefficiency:
Every second, both the age of the votes and the record age change. Thus, the popularity needs to be recalculated for all the records in the database every few seconds (or minutes, depending on how active the site is). That’s intolerable in a high-traffic system where resource utilization is of premium importance.
Since this algorithm is impractical, let’s look at another basic example:
Using simple + and – Votes
If you like an item, rate it “plus”. If you don’t like it, give it a “minus”.
The rating of an item would then be: number of positive votes divided by number of total votes.
Now if you want to rank the items based on this simple equation, the following happens:
Assume you have an item with a rating of 0.93, based on hundreds of votes.
Now another, new item is rated by a total of 2 visitors (or even just one), and they rate it “plus”. The item goes straight to the first position in the ranking, as its rating is 100%!
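In numbers (hypothetical items):

```ruby
# Rating = positive votes / total votes.
def rating(positive, total)
  positive.to_f / total
end

veteran  = rating(93, 100) # 0.93 after a hundred votes
newcomer = rating(2, 2)    # 1.0 after just two votes -- and straight to #1
```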
What we want is this:
- If there are only a few votes, those votes should count for less;
- if there are many votes, we can trust that they reflect how the public feels about the item.
That is, the “certainty” or “believability” of the rating should depend on the number of votes the record has received.
Thus, we want to calculate a corrected rating that somehow takes the weight of votes into account:
The more votes an item has, the closer the corrected rating value would be to the uncorrected rating value.
That’s not a new thing, though. The math for this was worked out hundreds of years ago; it is called the “Bayesian average”.
The Bayesian rating can be given as:
br = ( (C * avg_rating) + (this_num_votes * this_rating) ) / (C + this_num_votes)
- C: an ad-hoc constant. The higher it is, the more votes it takes for an item’s adjusted (dampened) rating to approach its original unadjusted value.
- avg_rating: the average rating across all items (again, of those that have num_votes > 0)
- this_num_votes: number of votes for this item
- this_rating: the rating of this item
There is no point requiring 1000 votes for an item to rank at 60% when each item only gets a handful of votes on average. So if the system receives fewer votes in general, C should be smaller.
To make the system adaptive we can assign C a self-adjusting value, such as the “average number of votes”, which turns our formula into:
br = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)
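A sketch of the adaptive Bayesian rating (site-wide averages invented for illustration):

```ruby
# Adaptive Bayesian rating: C is the average number of votes per item.
def bayesian_rating(avg_num_votes, avg_rating, this_num_votes, this_rating)
  (avg_num_votes * avg_rating + this_num_votes * this_rating) /
    (avg_num_votes + this_num_votes)
end

# Site averages: 50 votes per item, mean rating 0.70.
bayesian_rating(50.0, 0.70, 2, 1.0)    # => ~0.71 (two perfect votes barely move it)
bayesian_rating(50.0, 0.70, 500, 0.93) # => ~0.91 (many votes dominate the prior)
```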
This formula works statistically better than the former one (i.e. dampening the votes by record age).
That dampening-by-record-age algorithm was very dependent on time. In contrast, this rating formula is completely independent of time.
It can be fine-tuned by adding
- Dampening effect of time,
- Trustworthiness (karma) of voters
- Page views and site activity at the time of vote
- If we’re voting on a link, metadata for that link such as Google PageRank, Alexa rank, etc.
Let’s move to another approach:
Popularity = Lower bound of Wilson score confidence interval for a Bernoulli parameter
Lower bound of what?!
We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. The lower bound of the Wilson score interval is:
Popularity = ( p + z^2/(2n) - z * sqrt( (p*(1 - p) + z^2/(4n)) / n ) ) / ( 1 + z^2/n )
- p: the average rating of the item, normalized to the [0, 1] interval.
- z(alpha/2): the 1 - (alpha/2) quantile of the standard normal distribution.
- n: the number of votes.
It can be calculated easily with a few lines of Ruby.
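A standard implementation (after Evan Miller’s widely circulated Ruby version, with z hard-coded at 1.96 for 95% confidence rather than computed from alpha):

```ruby
# Lower bound of the Wilson score confidence interval.
# pos: number of positive ratings, n: total ratings.
# z = 1.96 is the 0.975 quantile of the standard normal distribution
# (alpha = 0.05); compute z from alpha instead to tune the dampening.
def ci_lower_bound(pos, n, z = 1.96)
  return 0.0 if n == 0
  phat = pos.to_f / n
  (phat + z * z / (2 * n) -
   z * Math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) /
    (1 + z * z / n)
end

ci_lower_bound(2, 2)    # => ~0.34 -- two perfect votes rank modestly
ci_lower_bound(93, 100) # => ~0.86 -- many votes, high confidence
```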
The algorithm is sound, and computationally feasible. You can adjust the dampening effect of the normal distribution by changing alpha.
How Do We Rank Things in linkibol?
Our preference in linkibol is to use an adjusted Bayesian Rating algorithm.
I’ll be explaining linkibol’s popularity algorithm in a moment, but before that let’s have a look at how other giants calculate popularity.
Y Combinator’s Hacker News:
Popularity = (p – 1) / (t + 2)^1.5
Votes divided by age factor.
- p : votes (points) from users.
- t : time since submission in hours.
1 is subtracted from p to negate the submitter’s own vote.
The age factor is (time since submission in hours, plus two) to the power of 1.5.
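In Ruby:

```ruby
# Hacker News score: (p - 1) / (t + 2)^1.5
def hn_score(points, age_hours)
  (points - 1) / (age_hours + 2)**1.5
end

hn_score(100, 1)  # brand-new story: ~19
hn_score(100, 24) # one day later:   ~0.75
```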
Reddit:
First of all, the time 7:46:43 am on December 8th, 2005 is a constant used to determine the relative age of a submission.
The time the story was submitted minus the constant date is ts. ts works as the force that pulls the stories down the frontpage.
y represents the relationship of up votes to down votes.
45000 is the number of seconds in 12.5 hours. This constant is used in combination with y and ts to “water down” votes cast farther and farther from the time the article was submitted.
The issue is discussed thoroughly on Y Combinator’s Hacker News.
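Putting the pieces together, this matches the “hot” ranking published in Reddit’s open-source code; a Ruby sketch (treat the details as approximate):

```ruby
# The constant epoch: 7:46:43 on December 8th, 2005 (UTC).
EPOCH = Time.utc(2005, 12, 8, 7, 46, 43)

# y is the sign of the vote balance, ts the submission time in seconds
# relative to the epoch; 45,000 seconds = 12.5 hours.
def hot(ups, downs, submitted_at)
  score = ups - downs
  order = Math.log10([score.abs, 1].max)
  y = score <=> 0
  ts = submitted_at - EPOCH
  y * order + ts / 45_000.0
end

# Given equal votes, the newer submission ranks higher:
hot(100, 10, Time.utc(2010, 1, 2)) > hot(100, 10, Time.utc(2010, 1, 1)) # => true
```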
StumbleUpon:
Popularity = (Initial stumbler audience / # domain) + ((% stumbler audience / # domain) + organic bonus – nonfriend) – (% stumbler audience + organic bonus) + N
The initial stumbler “power” (Audience of the initial stumbler divided by the amount of times that stumbler has stumbled the given domain) is added to the sum of all the subsequent stumblers’ powers.
Subsequent stumbler power is ((Percentage of audience stumbler makes up divided by the number of times given stumbler has stumbled domain) + a predetermined power boost for using the toolbar – a predetermined power drain if stumblers are connected) + (% of the stumbler audience + a predetermined boost for using the toolbar)
N is a “safety variable” so that the assumed algorithm is flexible. It represents a random number.
Del.icio.us has the simplest formula of all.
Popularity = (Amount of times story has been bookmarked in the last 3600 seconds)
The algorithm is self explanatory. The more people bookmark an item in a given time window, the more popular it is.
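A sketch with hypothetical timestamps:

```ruby
# Popularity = bookmarks within the last 3600 seconds.
def popularity(bookmark_times, now, window = 3600)
  bookmark_times.count { |t| now - t >= 0 && now - t <= window }
end

now = Time.utc(2010, 6, 1, 12, 0, 0)
times = [now - 60, now - 1800, now - 7200] # 1 min, 30 min, 2 hours ago
popularity(times, now) # => 2
```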
How Do We Implement Popularity in linkibol?
To answer this question we need to give some details on how voting works in linkibol:
- Votes are not absolute in linkibol. Each linkibol user’s vote weight depends on their karma.
- Users who have more karma (i.e. popular and active users) can cast more points for votes.
- The sum of all vote points given to a link is that link’s “linkipoint”.
- Users can cast negative votes, as well as positive votes.
Here’s linkibol‘s popularity calculation formula:
A bit of a mouthful, I know. It took us years to simplify it to this form.
But it’s a robust algorithm. And it’s computationally efficient once you cache certain non-dynamic parts of it.
Let me explain the terms in this equation:
- nvotes : total number of votes so far
- nlinks : total number of links
- nvotes(r) : number of votes cast to rth link.
- deltarank(k, m) : rank increment caused by the mth vote cast on the kth link.
- nsaves(i) : number of users that save ith link to their linkibol.
- a : save exponent (an ad-hoc value close to 1)
- age(i) : the difference (in days) between the date the link was added and the current date.
- b : decay exponent (an ad-hoc value close to 0)
When you examine the formula closely you’ll see that it’s nothing more than an adaptive Bayesian rating formula multiplied by two coefficients:
- the number of users who save the link: the more an item is shared, the more popular it is;
- and the age of the link: the older an item, the less popular it becomes.
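The exact formula appears as an image in the original post; purely as a hypothetical sketch of the shape just described (an adaptive Bayesian rating times a save coefficient, dampened by age; the names, default exponents, and +1 offsets are all my assumptions, not linkibol’s actual values):

```ruby
# Hypothetical sketch only -- not linkibol's actual formula.
# a: save exponent (close to 1), b: decay exponent (close to 0).
# The +1 offsets avoid a zero base for unsaved or brand-new links.
def popularity(avg_num_votes, avg_rating, num_votes, rating,
               nsaves, age_days, a = 0.9, b = 0.1)
  br = (avg_num_votes * avg_rating + num_votes * rating) /
       (avg_num_votes + num_votes)
  br * (nsaves + 1)**a / (age_days + 1)**b
end

# More saves raise the rank; more age lowers it:
popularity(50.0, 0.7, 20, 0.9, 10, 1) > popularity(50.0, 0.7, 20, 0.9, 2, 1) # => true
```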
Ideally, the adjusted ratings should be recalculated regularly for “all” the links in linkibol. However, since this would impose a huge computational burden, we sidestep the problem by skipping unaltered links.
That is, if a link’s linkipoint does not change at all, we don’t recalculate its popularity rank. In other words, if a link does not receive any votes, we trust its most recently calculated rank.
For this we have a “dirty” flag on the links table. When someone (even an anonymous visitor) casts a vote on a link, the link becomes dirty.
A batch thread then periodically scans all dirty links, recalculates their ranks, and updates them.
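A minimal sketch of the dirty-flag scheme (in-memory hashes stand in for the links table, and rank_of is a stand-in for the popularity formula):

```ruby
# Stand-in ranking function for the sketch.
def rank_of(link)
  link[:linkipoints].to_f
end

# Recalculate only links whose linkipoints changed since the last pass.
def recalculate!(links)
  links.each do |link|
    next unless link[:dirty]
    link[:rank]  = rank_of(link)
    link[:dirty] = false
  end
end

links = [
  { linkipoints: 9, dirty: true,  rank: 0.0 },
  { linkipoints: 5, dirty: false, rank: 4.2 } # untouched: keep cached rank
]
recalculate!(links)
# links[0][:rank] is now 9.0; links[1][:rank] stays 4.2
```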
Calculating popularity is not as easy as it seems.