linkibol Version 2.0 is Out

August 21st, 2011 No Comments »

Welcome to linkibol version 2.0

linkibol v.2.0

I know, I know… it has been a long time since my last post. I wish days were 72 hours so that I’d have more time to add value to the community.

For more than a year, I’ve been working on the new version of linkibol. It took so long, because it is really hard to find adequate time while being busy shaping the mobile future of a startup, and trying to create a JavaScript framework, while sharing my experience along.

It’s like juggling with three big glass orbs: You do not want to drop any of them, but you have limited time to concentrate on each one.

And add all the other smaller but yet time-consuming tasks into this hassle, and oh damn… it’s hard to preserve a “mind like water”.

… Did I mention that I was working in a startup :) ?

All of these led me master a vi-based geeky GTD system of my own — there’s nothing better than command line and plain text files when it comes to productivity — which deserves a separate blog post on its own :) .

Anyways, let’s go on with what’s new in linkibol:

Missing Features to be Implemented

Having seen the development time required to release the new version of linkibol is going to infinity, while the amount of time I can allocate is getting lesser and lesser, I radically decided to release linkibol, whatever it takes, with a minimum viable set of features.

Therefore some of the features that were available in the former version are nonexistent/disabled in this version:

  • You cannot order links by popularity,
  • You cannot cast votes on links,
  • You cannot view or edit the links that you’ve formerly added,
  • The only way to add a link is via using the linkibol favelet,
  • Even some of the links on the footer (help, terms of use, privacy policy etc.) are missing…

Yeah, I know, it’s a total mess, and I will be adding these features one by one, in the upcoming weeks.

What’s new in linkibol Version 2.0

There are plenty of new things in this version:

  • linkibol is now bilingual: It supports both English, and Turkish.
    There are minor localization issues, which I’m working on. I also plan to create a “translation API” to allow localization of linkibol in your favorite language of choice.
  • The design of linkibol has changed: It is more organized, aligned, with less clutter. I’d be glad to hear your ideas on making the new design more usable.
  • linkibol is now damn fast: Its backend code and database queries are thoroughly optimized. Its front-end code (i.e. JavaScript, CSS) is optimized as well.
    Besides, it utilizes CSS and JavaScript minification, caching, gzip compression, CSS sprites, and much more cool stuff to make it load faster. Yet I’m still working to make linkibol even faster :) .
  • Most of the bugs are fixed (and some new bugs are introduced :) as I’m writing this post exception messages are filling the log files ;) — which means there will be a really exciting bug pursuit in the upcoming days :D
  • This new version will set the groundwork for further improvement. It was almost a total rewrite, but it was really worth it :) . It may take “some” time to add missing features, and for things to settle down.

    In the meantime, if you find a bug. Or have a great idea to share, feel free to write a comment. I’d be glad to hear your invaluable inputs, comments, ideas, suggestions, and overall support :)

    Cheers,
    Volkan Özçelik,
    Founder and Sole Developer, linkibol.com

Did You Know that jQuery Leaks Memory like a Fountain? — and there’s a solution for it.

May 7th, 2010 8 Comments »

jquery is a very mature JavaScript library that is fast and concise. So you’d assume that it should have put a nail onto the JavaScript memory leakage coffin. And you will be wrong in your assumption ;)

Actually the problem of JavaScript memory leakage is not new, it has been around for more than 5 years.

There are several reasons of memory leak:

  • 99% of the time it’s closures between DOM world, and JS world.

  • and the rest of the time its XMLHttpRequest object’s memory leakage due to a similar (but nastier) circular dependency in its onreadystatechange event handler.

The latter is nastier, because setting onreadystatechange closure of the ajax object to null does not always release the circular reference.

And jquery 1.4.2 leaks like a fountain in Internet Explorer 7 and above.

Don’t you believe me? Then give this tiny little code snippet a try:

<html>
<head>
<script src="jquery.js"></script>
<script>
	var counter = 0;
	var pipe = null;
	function nill(){}

	function openPipe(){
		if(counter>10000){ return; }
		document.getElementById('TestDiv').innerHTML = counter++;

		/* try to remove memory leak by relasing the circular reference. */
		if (pipe !== null) {
			pipe.onreadystatechange = nill;
			pipe.abort();
			pipe = null;
		}

		createPipe();
	}

	function createPipe(){
		pipe = $.ajax({ url: '/index.php',
				cache: false,
				type: 'POST',
				success: onSuccess,
				error: onError
			});
	}

	function onSuccess(data) {
		/* try to remove memory leak by relasing the circular reference. */
		if (pipe!== null) {
			pipe.onreadystatechange = nill;
			pipe.abort();
			pipe = null;
		}
		setTimeout(openPipe,1);
	}	

	function onError(xhr, textStatus, errorThrown) {}

	window.onload = openPipe;
</script>
</head>
<body>
	<div id="TestDiv"></div>
</body>
</html>

When executing the above snippet, the memory utilization in IE7(Windows) will be something like this:

Each horizontal line denotes 100MB of memory. That is, jquery has leaked around 200MB of memory in 10,000 ajax calls. The steep decline at the end of the graph is due to closing the browser window and hence freeing up the leaked memory.

If you are building a single-page ajax application, that will heavily use ajax calls, and will stay open for several hours that’s a lot of memory leak, which will make Internet Explorer throw an out of memory error and crash eventually.

So how can you be sure that the leak is due to jQuery’s ajax implementation?

By creating a similar test snippet by using native ajax calls.

<html>
<head>
<script>
var xhr = new XMLHttpRequest();
var url = 'index.php';
var counter = 0;

function openXHR(){
	xhr.open('POST',url, false);
	xhr.setRequestHeader('X-Requested-With','XMLHttpRequest');
	xhr.setRequestHeader('Accept', 'text/javascript, text/html, application/xml, text/xml, */*');
	xhr.onreadystatechange = readyStateChanged;
	xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
	xhr.send('');
}

function nill(){}
function readyStateChanged(){
	if(xhr.readyState === 4){
			if(counter>10000){ return; }
			var theDiv = document.getElementById('TestDiv');
			if(theDiv){ theDiv.innerHTML = (counter++);	}
			xhr.onreadystatechange = nill;
			xhr.abort();
			setTimeout(openXHR,1);
	}
}

window.onload = openXHR;
</script>
</head>
<body>
	<div id="TestDiv"></div>
</body>
</html>

And here’s the corresponding memory utilization graph for 10,000 consecutive runs:

Smooth as butter on bread ;)

Calling native XmlHttpRequests instead of jQuery’s $.ajax got rid of the leak for good.

But wait a second, doesn’t jQuery use browser’s native XmlHttpRequest object when available as well? Actuall it does:

Here’s the related code part from jquery1.4.2 (between lines 4948 and 4960)

// Create the request object; Microsoft failed to properly
// implement the XMLHttpRequest in IE7 (can't request local files),
// so we use the ActiveXObject when it is available
// This function can be overriden by calling jQuery.ajaxSetup
xhr: window.XMLHttpRequest && (window.location.protocol !== "file:" || !window.ActiveXObject) ?
function () {
	return new window.XMLHttpRequest();
} :
function () {
	try {
		return new window.ActiveXObject("Microsoft.XMLHTTP");
	} catch (e) { }
},

As the above code implies, even so jQuery does inherently use native XmlHttpRequest, it fails to prevent the leak. That may (most probably) be due to a circulary depencency or closure hidden inside jQuery core.

This leaking situation demonstrates itself only in Microsoft Internet Explorer (version 7 and above) as far as my tests are concerned. And, oddly enough, using the ActiveX counterpart, instead of the brand new native XmlHttpRequest object prevents the leak.

So if we modify the above lines as follows, jQuery will not leak at all:

// Create the request object; Microsoft failed to properly
// implement the XMLHttpRequest in IE7 (can't request local files),
// so we use the ActiveXObject when it is available
// This function can be overriden by calling jQuery.ajaxSetup
xhr: window.XMLHttpRequest && (window.location.protocol !== "file:" || !window.ActiveXObject) ?
function () {
	/* v.o.: begin memory leak fix. */
	if (window.ActiveXObject) {
		try {
			return new window.ActiveXObject("Microsoft.XMLHTTP");
		} catch (e) { }
	}
	/* v.o.: end memory leak fix. */
	return new window.XMLHttpRequest();
} :

The above hack simply tries to use an ActiveX object for the AJAX request if there’s one available, and fall back to XmlHttpRequest object if we fail to initialize the ActiveX object.

You can download a (90KB zipped) sample test bundle of the above codes, open task manager, run the codes, test and experience the leak for yourself ;)

It’s surprising and intriguing that the world’s most widely adopted JavaScript library’s implementation has overlooked this memory leakage issue.

If you’ve been experiencing a similar problem, I hope that this article has helped you save some development time that you can spare on more useful things than debugging the jQuery core…

…like writing some actual jQuery code ;)

How to Build a Popularity Algorithm You can be Proud of

May 7th, 2010 16 Comments »

Many web sites allow users to casts vote on items. These visitors’ votes are then often used to detect the items’ “popularity” and hence rank the rated items accordingly.

And when “rank” comes into play things gets tricky:

  • The system can have inherent deficiencies in ranking items.
  • That is mostly because developers tend to “re-invent the wheel” and throw in their own algorithms instead of basing their calculations on well-established statistical formulae (I’ll come to that in a moment, just bear with me ;) ).
  • There will be people (i.e. spammers) trying to fool the system and try to take their submissions to top.
  • There will be system inefficiencies due to computational complexity.

In this article,

  • I’ll try to give examples on how to approach the problem;
  • describe the weakness of each particular approach;
  • and explain how some well-known social community sites implement their ranking algorithms.
  • Finally I’ll discuss the current ranking algorithm we use in linkibol.

So let us begin with the most straigh-forward (and thus the most error-prone) approach:

Votes Determine Popularity

We have a database record, whether it be a news story, picture, video, podcast, whatever.

We could order the “Popular” item by number of votes and be done with it.

Popularity = Votes

This algorithm is far from being perfect.

Let me try to explain why:

Say last month was a very active month and popular records in that month received around 100 votes.

This month is not very active in terms of traffic (assume it’s a holiday season and your user base is rushing to beaches instead of hysterically seeking for items and casting votes) and as a result you’ve received an avarage of 30 votes for each item.

Therefore, popular items from last month will probably never see the front page.

Aging Records

Let’s introduce the age of the record as another variable. If the record is newer it should have a higher prominence on the front page, yes?

Popularity = Votes / Record Age

The older the record, the more votes it requires to achieve popularity. But that’s not fair:

Again assume someone woke up at 3am and submitted an excellent item to the site, where the rest of community are sleeping instead of being on-line an frantically casting votes.

This excellent item, which has caused our beloved user to wake up in the middle of the night, will not receive the credit it deserves as it fades away due to the dampening effect of time.

So let’s look at the age of each record:

Popularity = (V1/A1) + (V2/A2) + … + (Vn/An)

Where

  • Vn is a vote,
  • An is the age of that vote (for example, in minutes).

All the values of all the votes are added to achieve a popularity score.

This seems to solve the previous problem, but introduces a new one:

If a single person votes on a record a year old, his vote will be worh more than 200 votes casted a year ago when the record was created. Old material comes back to haunt the front page.

Let’s take a look again at the age of the record…

Dampening The Weighted Votes By Record Age

So let us introduce the age of the record into the equation:

Popularity = [ (V1/A1) + (V2/A2) + … + (Vn/An) ] / Record Age

This formula dampens votes we’ve calculated above basd on the items’s age, which prevents old stories from leaping back to the front page.

It seems to solve our problem but it’s computationally inefficient:

It is difficult to implement in a practical system.

Moreover it’s overly-dependent on time, which also adds to the algorithm’s inefficiency:

Every second, both the age of votes and the record age change. Thus, the popularity needs to be calculated for all the records in the database for every few seconds (or minutes depending on how active the site is). That’s intolerable in a high-traffic system where resource utilization is of premium importance.

Since this algorithm is inapplicable let’s look at another basic example:

Using simple + and – Votes

If you like an item, rate it “plus”. If you don’t like it, give it a “minus”.

The rating of an item would then be: number of positive votes divided by number of total votes.

Now if you want to rank the items based on this simple equation, the following happens:

Assume you have on item with a rating of 0.93, based on hundreds of votes.

Now another new item is rated by a total of 2 visitors (or even just one), and they rate it +. The vote goes straight to the first position in the ranking, as its rating is 100%!

What we want is this:

  • If there is only few votes, then these votes should count less,
  • when there are many votes and we can trust that this is how the public feels about it.

That is “certainty” or “believability” of the vote should depend on the number of votes that record has received.

Thus, we want to calculate a corrected rating that somehow takes the weight of votes into account:

The more votes an item has, the closer the corrected rating value would be to the uncorrected rating value.

That’s not a new thing though. The math for this has been worked out hundreds of years ago. It is called “bayesian average”.

The bayesian rating can be given as:

br = ( (C* avg_rating) + (this_num_votes * this_rating) ) / (C + this_num_votes)

Where

  • C: is an ad-hoc constant. If it’s high. It will require more votes for the adjusted (dampened) rating of an item to approach its original unadjusted value.
  • avg_rating: The average rating of each item (again, of those that have num_votes>0)
  • this_num_votes: number of votes for this item
  • this_rating: the rating of this item

There is no point requiring 1000 votes for the item to rank 60% when each item only gets a handful of votes in average. So if the system receives less votes in general, C should be smaller.

To make the system adaptive we can assign C a self-adjusting value, such as “average number of votes”. Which makes our formula.

br = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)

This formula works statistically better than the former one (i.e. dampenint the votes by record age).

That damping by record age algorithm was very dependent on time. On the contrary, this rating formula is totally independent of time.

It can be fine-tuned by adding

  • Dampening effect of time,
  • Trustablility (karma) of voters
  • Page views and site activity at the time of vote
  • If we’re voting a link, metadata for that link such as google pagerank, alexa rank etc.

Let’s move to another approach:

Popularity = Lower bound of Wilson score confidence interval for a Bernoulli parameter

Lower bound of what?!

We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson

:

Popularity is:

Where:

  • p: is the average rating of the item normalized to [0, 1] interval.
  • z(alpha/2): is the 1-(alpha/2) th quantile of the standard normal distribution.

    It can be calculated easily using the following ruby method:

    Statistics2.pnormaldist(1-alpha/2)

  • n: is the number of votes.

The algorithm is sound, and computationally feasible. You can adjust the dampening effect of the normal distribution by changing alpha.

How Do We Rank Things in linkibol?

Our preference in linkibol is to use an adjusted Bayesian Rating algorithm.

I’ll be explaining linkibol’s popularity algorithm in a moment, but before that let’s how a look at how other giants calculate popularity.

Y Combinator’s Hacker News:

Popularity = (p – 1) / (t + 2)^1.5

Votes divided by age factor.

Where

  • p : votes (points) from users.
  • t : time since submission in hours.

p is subtracted by 1 to negate submitter’s vote.

Age factor is (time since submission in hours plus two) to the power of 1.5.

Reddit:

Description:

First of all, the time 7:46:43 am on December 8th 2005 is a constant used to determine the relative age of a submission.

The time the story was submitted minus the constant date is ts. ts works as the force that pulls the stories down the frontpage.

y represents the relationship of up votes to down votes.

45000 is the amount of seconds in 12.5 hours. This constant is used in combination with yts to “water down” votes as they are made farther and farther from the time the article was submitted.

The issue is discussed thoroughly in Y Combinator.

StumbleUpon

Popularity = (Initial stumbler audience / # domain) + ((% stumbler audience / # domain) + organic bonus – nonfriend) – (% stumbler audience + organic bonus) + N

Description:

The initial stumbler “power” (Audience of the initial stumbler divided by the amount of times that stumbler has stumbled the given domain) is added to the sum of all the subsequent stumblers’ powers.

Subsequent stumbler power is ((Percentage of audience stumbler makes up divided by the number of times given stumbler has stumbled domain) + a predetermined power boost for using the toolbar – a predetermined power drain if stumblers are connected) + (% of the stumbler audience + a predetermined boost for using the toolbar)

N is a “safety variable” so that the assumed algorithm is flexible. It represents a random number.

Del.icio.us:

Del.icio.us has the simplest formula of all.

Popularity = (Amount of times story has been bookmarked in the last 3600 seconds)

The algorithm is self explanatory. The more people bookmark an item in a given time window, the more popular it is.

How Do We Implement Popularity in linkibol?

To answer this question we need to give some details on how voting works in linkibol:

  • Votes are not absolute in linkibol. Each linkibol user’s vote weight depend on their karma.
  • Users who have more karma (i.e. popular and active users) can cast more points for votes.
  • The sum of all vote points given to a link is that link’s “linkipoint”.
  • Users can cast negative votes, as well as positive votes.

Here’s linkibol‘s popularity calculation formula:

(click on the image for an enlarged version)

A bit mouthful I know :) . It took us years to simplify it to this form.

But it’s a robust algorithm. And it’s computatioanly efficient once you cache certain non-dynamic parts of it.

Let me explain the terms in this equation:

  • nvotes : total number of votes so far
  • nlinks : total number of links
  • nvotes(r) : number of votes cast to rth link.
  • deltarank(k, m) : rank increment caused by kth vote that is casted to mth link.
  • nsaves(i) : number of users that save ith link to their linkibol.
  • a : save exponent (an ad-hoc value close to 1)
  • age(i) : the difference (in days) between date link added and current date.
  • b : decay exponent (an ad-hoc value close to 0)

When you examine the formula closely you’ll see that it’ nothing more than an adaptive bayesian rating formula multiplied by two coefficients.

  • the number of users who save the link. The more an item is shared, the more popular it is.
  • and the age of the link: the older an item the less popular it becomes.

Ideally the adjusted ratings should be calculated regularly for “all” the links in linkibol. However, since this will bring a huge computational burden, we eliminate the problem by discarding unaltered links.

That is: if a link’s linkipoint does not change at all, we don’t calculate its popularity rank. In other words, if a link does not receive any votes, we trust its most recently calculated rank.

For this we have a “dirty” flag on the links table. When someone (even someone anonymous) casts a vote to a link, it becomes dirty.

Then a batch thread scans all dirty links and recalculates and updates their ranks periodically.

Bottom Line:

Calculating popularity is not as easy as it sems ;) .

linkiblog has a new home :)

May 7th, 2010 No Comments »

Due to the excessive traffic we receive (thank you!), we’ve migrated linkibol to a faster web server (by the way, I’d be glad to hear your comments on the performance of the new server: Is it better? Are the any unexpected hiccups you experience while browsing linkibol?)

Anyway, it was almost a flawless migration, except for this blog :( :

We’ve realized that the former blogging framework we’d been using did not quite suit our needs, so we’ve migrated linkiblog to the de-facto blogging software “wordpress“.

Although none of the blog posts were lost during this transition period, I will need to reload them incrementally from backups.

I just wanted to inform that full recovery of posts may take some time; and the permalinks to the posts may change.

Kindest regards,
Volkan Özçelik,
Founder, linkibol.com