How to Build a Popularity Algorithm You can be Proud of
Many web sites allow users to casts vote on items. These visitors’ votes are then often used to detect the items’ “popularity” and hence rank the rated items accordingly.
And when “rank” comes into play things gets tricky:
- The system can have inherent deficiencies in ranking items.
- That is mostly because developers tend to “re-invent the wheel” and throw in their own algorithms instead of basing their calculations on well-established statistical formulae (I’ll come to that in a moment, just bear with me
). - There will be people (i.e. spammers) trying to fool the system and try to take their submissions to top.
- There will be system inefficiencies due to computational complexity.
In this article,
- I’ll try to give examples on how to approach the problem;
- describe the weakness of each particular approach;
- and explain how some well-known social community sites implement their ranking algorithms.
- Finally I’ll discuss the current ranking algorithm we use in linkibol.
So let us begin with the most straigh-forward (and thus the most error-prone) approach:
Votes Determine Popularity
We have a database record, whether it be a news story, picture, video, podcast, whatever.
We could order the “Popular” item by number of votes and be done with it.
Popularity = Votes
This algorithm is far from being perfect.
Let me try to explain why:
Say last month was a very active month and popular records in that month received around 100 votes.
This month is not very active in terms of traffic (assume it’s a holiday season and your user base is rushing to beaches instead of hysterically seeking for items and casting votes) and as a result you’ve received an avarage of 30 votes for each item.
Therefore, popular items from last month will probably never see the front page.
Aging Records
Let’s introduce the age of the record as another variable. If the record is newer it should have a higher prominence on the front page, yes?
Popularity = Votes / Record Age
The older the record, the more votes it requires to achieve popularity. But that’s not fair:
Again assume someone woke up at 3am and submitted an excellent item to the site, where the rest of community are sleeping instead of being on-line an frantically casting votes.
This excellent item, which has caused our beloved user to wake up in the middle of the night, will not receive the credit it deserves as it fades away due to the dampening effect of time.
So let’s look at the age of each record:
Popularity = (V1/A1) + (V2/A2) + … + (Vn/An)
Where
- Vn is a vote,
- An is the age of that vote (for example, in minutes).
All the values of all the votes are added to achieve a popularity score.
This seems to solve the previous problem, but introduces a new one:
If a single person votes on a record a year old, his vote will be worh more than 200 votes casted a year ago when the record was created. Old material comes back to haunt the front page.
Let’s take a look again at the age of the record…
Dampening The Weighted Votes By Record Age
So let us introduce the age of the record into the equation:
Popularity = [ (V1/A1) + (V2/A2) + … + (Vn/An) ] / Record Age
This formula dampens votes we’ve calculated above basd on the items’s age, which prevents old stories from leaping back to the front page.
It seems to solve our problem but it’s computationally inefficient:
It is difficult to implement in a practical system.
Moreover it’s overly-dependent on time, which also adds to the algorithm’s inefficiency:
Every second, both the age of votes and the record age change. Thus, the popularity needs to be calculated for all the records in the database for every few seconds (or minutes depending on how active the site is). That’s intolerable in a high-traffic system where resource utilization is of premium importance.
Since this algorithm is inapplicable let’s look at another basic example:
Using simple + and – Votes
If you like an item, rate it “plus”. If you don’t like it, give it a “minus”.
The rating of an item would then be: number of positive votes divided by number of total votes.
Now if you want to rank the items based on this simple equation, the following happens:
Assume you have on item with a rating of 0.93, based on hundreds of votes.
Now another new item is rated by a total of 2 visitors (or even just one), and they rate it +. The vote goes straight to the first position in the ranking, as its rating is 100%!
What we want is this:
- If there is only few votes, then these votes should count less,
- when there are many votes and we can trust that this is how the public feels about it.
That is “certainty” or “believability” of the vote should depend on the number of votes that record has received.
Thus, we want to calculate a corrected rating that somehow takes the weight of votes into account:
The more votes an item has, the closer the corrected rating value would be to the uncorrected rating value.
That’s not a new thing though. The math for this has been worked out hundreds of years ago. It is called “bayesian average”.
The bayesian rating can be given as:
br = ( (C* avg_rating) + (this_num_votes * this_rating) ) / (C + this_num_votes)
Where
- C: is an ad-hoc constant. If it’s high. It will require more votes for the adjusted (dampened) rating of an item to approach its original unadjusted value.
- avg_rating: The average rating of each item (again, of those that have num_votes>0)
- this_num_votes: number of votes for this item
- this_rating: the rating of this item
There is no point requiring 1000 votes for the item to rank 60% when each item only gets a handful of votes in average. So if the system receives less votes in general, C should be smaller.
To make the system adaptive we can assign C a self-adjusting value, such as “average number of votes”. Which makes our formula.
br = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)
This formula works statistically better than the former one (i.e. dampenint the votes by record age).
That damping by record age algorithm was very dependent on time. On the contrary, this rating formula is totally independent of time.
It can be fine-tuned by adding
- Dampening effect of time,
- Trustablility (karma) of voters
- Page views and site activity at the time of vote
- If we’re voting a link, metadata for that link such as google pagerank, alexa rank etc.
Let’s move to another approach:
Popularity = Lower bound of Wilson score confidence interval for a Bernoulli parameter
Lower bound of what?!
We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson
:
Popularity is:
Where:
- p: is the average rating of the item normalized to [0, 1] interval.
- z(alpha/2): is the 1-(alpha/2) th quantile of the standard normal distribution.
It can be calculated easily using the following ruby method:
Statistics2.pnormaldist(1-alpha/2)
- n: is the number of votes.
The algorithm is sound, and computationally feasible. You can adjust the dampening effect of the normal distribution by changing alpha.
How Do We Rank Things in linkibol?
Our preference in linkibol is to use an adjusted Bayesian Rating algorithm.
I’ll be explaining linkibol’s popularity algorithm in a moment, but before that let’s how a look at how other giants calculate popularity.
Y Combinator’s Hacker News:
Popularity = (p – 1) / (t + 2)^1.5
Votes divided by age factor.
Where
- p : votes (points) from users.
t : time since submission in hours.
p is subtracted by 1 to negate submitter’s vote.
Age factor is (time since submission in hours plus two) to the power of 1.5.
Reddit:
Description:
First of all, the time 7:46:43 am on December 8th 2005 is a constant used to determine the relative age of a submission.
The time the story was submitted minus the constant date is ts. ts works as the force that pulls the stories down the frontpage.
y represents the relationship of up votes to down votes.
45000 is the amount of seconds in 12.5 hours. This constant is used in combination with yts to “water down” votes as they are made farther and farther from the time the article was submitted.
The issue is discussed thoroughly in Y Combinator.
StumbleUpon
Popularity = (Initial stumbler audience / # domain) + ((% stumbler audience / # domain) + organic bonus – nonfriend) – (% stumbler audience + organic bonus) + N
Description:
The initial stumbler “power” (Audience of the initial stumbler divided by the amount of times that stumbler has stumbled the given domain) is added to the sum of all the subsequent stumblers’ powers.
Subsequent stumbler power is ((Percentage of audience stumbler makes up divided by the number of times given stumbler has stumbled domain) + a predetermined power boost for using the toolbar – a predetermined power drain if stumblers are connected) + (% of the stumbler audience + a predetermined boost for using the toolbar)
N is a “safety variable” so that the assumed algorithm is flexible. It represents a random number.
Del.icio.us:
Del.icio.us has the simplest formula of all.
Popularity = (Amount of times story has been bookmarked in the last 3600 seconds)
The algorithm is self explanatory. The more people bookmark an item in a given time window, the more popular it is.
How Do We Implement Popularity in linkibol?
To answer this question we need to give some details on how voting works in linkibol:
- Votes are not absolute in linkibol. Each linkibol user’s vote weight depend on their karma.
- Users who have more karma (i.e. popular and active users) can cast more points for votes.
- The sum of all vote points given to a link is that link’s “linkipoint”.
- Users can cast negative votes, as well as positive votes.
Here’s linkibol‘s popularity calculation formula:
(click on the image for an enlarged version)
A bit mouthful I know
. It took us years to simplify it to this form.
But it’s a robust algorithm. And it’s computatioanly efficient once you cache certain non-dynamic parts of it.
Let me explain the terms in this equation:
- nvotes : total number of votes so far
- nlinks : total number of links
- nvotes(r) : number of votes cast to rth link.
- deltarank(k, m) : rank increment caused by mth vote that is casted to kth link.
- nsaves(i) : number of users that save ith link to their linkibol.
- a : save exponent (an ad-hoc value close to 1)
- age(i) : the difference (in days) between date link added and current date.
- b : decay exponent (an ad-hoc value close to 0)
When you examine the formula closely you’ll see that it’ nothing more than an adaptive bayesian rating formula multiplied by two coefficients.
- the number of users who save the link. The more an item is shared, the more popular it is.
- and the age of the link: the older an item the less popular it becomes.
Ideally the adjusted ratings should be calculated regularly for “all” the links in linkibol. However, since this will bring a huge computational burden, we eliminate the problem by discarding unaltered links.
That is: if a link’s linkipoint does not change at all, we don’t calculate its popularity rank. In other words, if a link does not receive any votes, we trust its most recently calculated rank.
For this we have a “dirty” flag on the links table. When someone (even someone anonymous) casts a vote to a link, it becomes dirty.
Then a batch thread scans all dirty links and recalculates and updates their ranks periodically.
Bottom Line:
Calculating popularity is not as easy as it sems
.
May 10th, 2010 at 1:01 pm
Wonderful write-up! It’s unquestionably the most all-embracing site within the topic. Thank you your time and effort, I will revisit read the revisions
May 11th, 2010 at 1:38 am
Nice post…Thank you for sharing some good things!!
May 11th, 2010 at 5:43 am
Brilliantly written. Many thanks.
May 11th, 2010 at 5:50 am
[...] How to Build a Popularity Algorithm You can be Proud of [...]
May 12th, 2010 at 12:27 am
[...] How to Build a Popularity Algorithm You can be Proud of [...]
May 13th, 2010 at 7:22 am
Fantastic article !
May 28th, 2010 at 4:28 am
Is this what Google uses?
June 3rd, 2010 at 11:48 am
An enlightening article indeed.
Somehow I am getting a hunch though that a better “damping” process can be achieved through using ideas similar to ones used in “ant colony optimisztion” algorithms.
November 4th, 2010 at 9:44 pm
This reminds the “exact” same issues with portfolio management performance calculation, nicely done.
December 23rd, 2010 at 10:22 am
Really cool explanation! I had already read How Not To Sort By Average Rating, which talks about the Wilson formula. What are its downsides that made you pick the Bayes formula over the Wilson formula? (the latter, to me, seems a littlebit more statistically sound).
Of course you tweaked the Bayes formula to account for age and karma, but that’s possible for the Wilson formula as well. Plus, if you do it right, the tweaks would be statistically meaningful as well (which is good for robustness, among other things).
January 13th, 2011 at 8:55 pm
Awesome article.
Any ideas what approach to take for a series of debates, where you want to know which debate is most actively discussed?
Imagine there can be 100′s of debates a day. And each can have 0+ messages posted against it.
Basically, we want to tell users what is being actively discussed – what’s popular right now.
We also want to do ageing, because if a debate falls quiet then it’s obviously not currently popular.
May 26th, 2011 at 8:19 am
Very nice and useful post! I was wondering exactly of this stuff works and you explained me in such clear fashion. Thank you!
May 28th, 2011 at 2:57 pm
This is a very detailed and helpful article. Not many will justify things in as much detail as you have, thank you
July 30th, 2011 at 6:38 am
fantastic post, very informative. I wonder why the other experts of this sector don’t notice this. You should continue your writing. I am sure, you have a great readers’ base already!
August 21st, 2011 at 3:18 pm
Hi guys,
I have been truly busy and was not looking at this blog… ahem… “for a while”.
Thank you for your wonderful comments.
@tripzilch:
as far as my observations are concerned, wilson’s formula is a little computationally expensive than baye’s. But it all depends on the dataset and the change in that data set (amount of new votes per day, amount of new links added… etc).
@Annabel
I plan to write as soon as I have time.
Cheers.
December 7th, 2011 at 4:07 am
How was the StackOverflow reputation system designed?…
I can’t speak to their algorithm specifically, but I’ve been doing some work creating a similar algorithm for a web-based project. Basically, you want to count the number of “votes” that a post gets, then divide it by the amount of time that the po…
February 25th, 2012 at 1:22 pm
[...] * How to Build a Popularity Algorithm You can be Proud of [...]
February 26th, 2012 at 2:16 am
[...] * How to Build a Popularity Algorithm You can be Proud of [...]
May 7th, 2012 at 1:08 pm
Great breakdown of some of the different voting systems. Any idea how ranker.com works with the ranking of the ultimate lists, taking into account the number of times items from those lists are mentioned in user submitted lists + public votes?
May 10th, 2012 at 2:13 pm
Hi B,
Thank you!
And I don’t have a specific suggestion for you, as your formula will totally depend on how your audience behaves.
You may want to start by keeping thing simple and implement a simple bayesian rating scheme, and ajust it iteratively by using at the usage data.
p.s.:
If the service is popular, there will always be people to hack the algorithm to gain some of that popularity juice.
And that’s a good thing in deed. Those “hacker” guys will enable you strengthen your algorithm if you are able to successfully analyze their behavior.
Hope that helps,
Volkan.
May 13th, 2012 at 8:24 pm
Hey I was wondering if this algorithm would work/makes sense
({(average number of votes x avg rating) + (number of votes for this post x the rating of this post)}/(avg number of votes + this number of votes)/(number of submitted links since this post)) x 1000
- where the average number of votes is the avg number of votes a link recieves on the site
- the avg rating is the avg (up votes – down votes) a link recieves on the site
- and then i multiplied it by 1000 just to make the numbers look bigger
May 14th, 2012 at 6:52 am
@duncan
What you propose seems like an “adaptive” bayesian rating calculation.
It most probably should work if the “average number of votes” follows a predictable patterns.
If the “average number of votes” changes abruptly, then you may need to modify your algorithm further.
May 22nd, 2012 at 3:18 am
I genuinely like your article. It’s evident that you have a whole lot understanding on this topic. Your details are well produced and relate-able. Thanks for producing participating and exciting material.
May 22nd, 2012 at 5:57 am
Thanks Zaida. Glad that it was useful.
May 30th, 2012 at 9:35 am
[...] * How to Build a Popularity Algorithm You can be Proud of [...]
June 7th, 2012 at 5:56 am
Nicely written article! Helped me a lot when I was thinking how to design it today!
June 7th, 2012 at 3:59 pm
Hi Jingjle, I hope the article helped you create a better popularity algorithm for your app.
Cheers.
June 17th, 2012 at 10:26 pm
Great article and thanks for explaining things in a way engineers can understand. I’m just wondering if you could give some thoughts on how to meld other metrics into the popularity equation. For example, I have a dataset where impressions over time can also be considered, as well as “liking” and “tagging” of individual articles on the site. I suppose it’s possible to analyse the individual metric, but I’m just wondering if there is a sane way to bring more than one metric together. Thanks in advance.
June 19th, 2012 at 10:38 am
Nice article .. U r a good Mathematician too…
June 20th, 2012 at 5:10 pm
@rohit I have an Engineering major, and by that token I am adequately good in Mathematics.
But calling myself a good “Mathematician” will be an overstatement
June 20th, 2012 at 5:15 pm
Hi Andrew,
When more than a handful of metrics are considered things get more and more messier.
An approach that first comes to my mind would be to generated a weighted average of each dimension.
Where some Matrix Theory/Linear algebra/orthogonality may also pop up in your thought process.
As a rule of thumb though, the simpler your approach is, the easier to maintain and alter it.
Try to show love to Ocham’s Razor as much as you can
HTH.
June 27th, 2012 at 6:57 am
Great post! I learned a lot!
June 27th, 2012 at 7:02 am
[...] * How to Build a Popularity Algorithm You can be Proud of [...]
June 27th, 2012 at 7:10 am
[...] * How to Build a Popularity Algorithm You can be Proud of [...]
July 4th, 2012 at 7:12 am
I’m wondering what would you say about my average rating algorithm: Calculating Average Rating using PHP and MySQL
I think it’s worth checking it out and discussing it…
July 4th, 2012 at 1:01 pm
Hi Martin.
Thanks for sharing your article. Will definitely have a look.
Cheers.
July 16th, 2012 at 12:28 pm
[...] [...]
August 10th, 2012 at 1:05 pm
[...] [...]
August 10th, 2012 at 1:13 pm
[...] [...]
September 15th, 2012 at 8:08 pm
[...] How to Build a Popularity Algorithm You can be Proud ofMay 10, 2010 … Finally I’ll discuss the current ranking algorithm we use in linkibol. So let us begin with the most straigh-forward (and thus the most error-prone) … [...]
November 30th, 2012 at 10:12 am
Y Combinator’s Hacker News:
Popularity = (p – 1) / (t + 2)^1.5
Votes divided by age factor.
How do I implement this in actual DB?
Do I calculate the score in regular interval and index the score field?
Or, do I supply that function to ORDER BY clause for each request?
December 11th, 2012 at 4:08 am
[...] How to Build a Popularity Algorithm You can be Proud of [...]
December 12th, 2012 at 6:09 am
Eugene, that all depends on your scalability requirements.
I’d go by implementing the thing and worry about caching, scalability, replication, sharding, etc afterwards.
If you are not going to calculate the ratings in real time, and some background process is regularly updating the ratings table, you won’t have any significant performance impact.
Experience shows me that 90%+ of the traffic is generally “read” operations, and only <10% are actual content creations and only 10% of that %10 dare to vote the items. So if you are not reddit, chances are that you won’t feel any impact.
My advice would be not to try to solve a problem that you don’t have; and just go out an build the darn thing
January 10th, 2013 at 10:20 pm
Firstly, Thanks for the wonderful article.
Limited knowledge I have, i feel Wilson Score is something that matches my requirement. I wanted to ask you if want to add age factor to Wilson Score formula http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
Do you think dividing the whole formula with age ^ b is that how would you have done? And what is the b value you are using and do you change it from time to time?
January 14th, 2013 at 2:26 pm
Great!!!!
January 20th, 2013 at 6:46 pm
Hi Mayank, I feel like dividing by age ^ b, or maybe dividing by log ( somefunctionof ( age ) ) might work.
As a matter of fact each community has its own dynamics, and those dynamics change over time.
I tend to adjust the formula and numbers from time to time.
Especially if the community is active, looking at the logs help you a lot.
Because there will be always people to trick the system — if there’s a rule, it’s meant to be bent.
And by analyzing how they trick the system, you can optimize your algorithm further to take those factors into account.
It’s mostly a cat and mouse race which might result in a continuous improvement of the system if the community is active, sane, reasonable, and intelligent enough to take care of it.
Though not all communities are so
February 12th, 2013 at 4:50 pm
[...] How to Build a Popularity Algorithm You can be Proud of [...]
March 15th, 2013 at 7:13 pm
This is some good info! I’ll need to read over the whole thing again later. I’m trying to have a “Hot” (currently popular) and “Top” (popular over time) listing, based either on favorites or rank (1-5 stars) so not sure where this would fall. Guess it’s more “hot”.
March 17th, 2013 at 8:51 pm
I’m glad you find it useful Thomas.
Popularity is a hard concept to model and implement properly.