Building a URL Shortener

With all the talk of URL shortening services, I decided to add a quick service into Snook.ca, which is run on CakePHP, to redirect a short URL to a post. Because my static content already has short URLs and all I have are posts, creating a short URL handler for it was very easy.

To give you some context, I route my posts through a specific structure:

/archives/:category
/archives/:category/:articlename

In this case, I have a couple routes that route everything to my Posts controller and the bycat or view actions. These action take the named parameters and pulls out the appropriate content. Easy peasy.

The key thing here is that my articles have two identifiers: one is the slug, the other is the post ID. The process for the short service just takes a post ID and redirects it to its fully expanded URL.

Here's the Shortening Route:

Router::connect('/s/:id', 
   array('controller'=>'posts', 'action'=>'shorturl'), 
   array('id'=>'\\d+'));

As you can see, it looks for a specific pattern (it must be all numbers) and then passes that into my posts controller to my shorturl action.

function shorturl () {
   $id = $this->params['id'];
   $this->Post->unbindModel(array('hasMany'=>array('Comment')));
   $post = $this->Post->findById($id);
   $url = '/archives/' 
          . $post['Tag'][0]['safetag'] . '/' 
          . $post['Post']['slug'];
   $this->redirect($url);
}

I grab the named parameter. I unbind my Comment model to prevent the findById call in the next line from returning too much. Then I find my post which will return the associated tags (which are my "categories"). I build the URL and then redirect the user onwards.

I haven't exposed the short URL in any way, yet. For now, it's more to allow myself quick posts to Twitter without having to use another service and to see if people are retweeting the link.

And now it's really easy to find the first blog post (based on ID): https://snook.ca/s/1

Building your own

With a single model and a single action taking a single parameter, wiring up a URL shortener was very simple. How could you do it with a more complicated system?

Multiple Routes

Another easy way to extend this concept is to simply map each prefix to each model's view action that needs to be shortened. You could have a Posts model on /p/ and a Comments model on /c/. the :id for each of them would simply point to the view page for each one. That offers up a little more flexibilty but not much.

Automatically Creating and Caching Short Links

In thinking this through, especially for an established site, you could have it automatically create a short link for any URL on your site once it has been visited once. First, create a new model called Short (or whatever you feel like it should be called).The short model will consist of two fields: the primary key (id) and character field to store the URL.

CREATE TABLE shorts (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url varchar(100) NOT NULL UNIQUE
)

Within your AppController, grab the current URL (available via $this->url). With the URL being a unique key as well as the ID, you'll only have a single ID for each URL.

If you want to find the short URL for an existing URL, just look it up in the database.

$this->Short->findByUrl($this->url);

If a URL is not found, you'll need create a new record for it. It'd be advantageous for you to create a method on your model that'll do the find/not found/create process.

You can use your ID as your short form (as I did) which, given most sites, will be quite small. If you have 4000 unique URLs, you're using 4 characters. What if you wanted to optimize that even further?

You could convert that integer into a hexidecimal value. Anything under 4096 items will only take 3 characters. That's not bad. Anything over 4096 and you're back to 4 characters.

Creating a super-compressed URL

But what if you wanted to optimize that even further? The trick is to create your own base system with a custom set of characters. This next bit of code isn't CakePHP-specific.

$codeset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
$base = strlen($codeset);
$n = 300;
$converted = "";

while ($n > 0) {
  $converted = substr($codeset, ($n % $base), 1) . $converted;
  $n = floor($n/$base);
}

echo $converted; // 4Q

It loops over the original number, converting it into the base that you want. In my particular example, it converts the 300 into 4Q. But you'll get up to 3844 before you need more than 2 characters. And up to over 238,000 before you get past 3 characters. Precious bytes.

If you were setting up a route for this, you can use the following regex pattern:

Router::connect('/s/:id', 
   array('controller'=>'shorts', 'action'=>'retrieve'), 
   array('id'=>'[0-9a-zA-Z]+'));

Of course, feel free to customize the acceptable character list — some people drop 0, O, 1 and l to avoid confusion.

Converting the Compressed Version back to Decimal

Going back is straightforward. In the retrieve method of the shorts controller that we set the route up for, we need to take our compressed ID and uncompress it into an integer we can search the database for.

$converted = $this->params['id'];
$c = 0;
for ($i = strlen($converted); $i; $i--) {
$c += strpos($codeset, substr($converted, (-1 * ( $i - strlen($converted) )),1)) * pow($base,$i-1); }
$this->Short->id = $c; // 300 from our earlier example
$this->redirect( $this->Short->field('url') );

Each point in the string is multiplied by the base to the power of that position. Then it grabs the URL field for that item. Finally, it redirects them off to their final destination.

Wrapping it up

This blog post twisted and turned but ended up in a great place. The principles of the shortening system could be applied to any system whether it's CakePHP or not. If you're a CakePHP fan, feel free to take this example and build it into a component or plugin.

Published April 27, 2009

Conversation

34 Comments · RSS feed
Marc Grabanski said on April 27, 2009

Hey! nice job Snook. I think I'll be adding this to my blog "short"ly. And since it's a CakePHP, it'll be "cake" to implement. Man, puns at 1:30AM are easy to come by.

Ulf said on April 27, 2009

Hey Snook! Thanks for that article, I loved the part with the explanation about the precious bytes - nice detail!
Did you know about the lilURL-Project? It's kind of the same approach.

Ulf

Ivan said on April 27, 2009

OK, this is sooo sweet :-) I used to simply md5 the ID and substr the first 4 characters, guess this is better ;-)

Maciej Grajcarek said on April 27, 2009

That is really nice! I was thinking how to shorten links in a smart way. Your way is really good :). Thanks!

Jonathan Snook said on April 27, 2009

@Ulf: I hadn't seen the lilURL project. Thanks for pointing it out. Admittedly, I prefer my approach as it leaves the database to what it does best: incrementing integers for primary keys. Converting from one base to another is fairly easy (although it did take me a bit to track it down and put together a JavaScript-based proof).

The other limiting factor with lilURL is that it only seems to take advantage of lower-case letters (from what I can tell). However, maybe I'll look to contribute my ideas back into the project.

On a separate note, I took a quick look at what other characters you could use to expand the character set and possibly compress your URL further, here are the other unreserved characters: -_.!~*'(). I think that'd make the URL less readable but just wanted to share.

Jonathan Snook said on April 27, 2009

Just doing a little research after finding the lilURL library, I found the tighturl project, which is also open source. It does spam filtering, as well.

Finally, and this is something I wasn't aware of, PHP has a base_convert function that'll convert anything up to a base 36 [0-9a-z]. Being able to do upper and lower case, as I've done it, certainly gives you greater compression. Of course, use what works best for you.

Zachary Johnson said on April 27, 2009

Having just made my own URL shortener for personal use a couple weeks ago (in action here: http://a.stronaut.com/z1), I thought it was important that I share a couple of things I learned:

1. Check out the PHP built-in function:


base_convert()

2. The default HTTP header that PHP throws when you do a Location redirect is 302, which is bad because you will end up with your pages possibly twice in Google search results, one of those listings being under the shortened URL. You want to throw a 301 code before the location code:


header('HTTP/1.1 301 Moved Permanently');
header('Location: ' . $url);

where $url is the place that the short url is redirecting to.

Chris Wallace said on April 27, 2009

Did you just recently add comment numbers? They are so smooth I thought they were being replaced with Cufon or sIFR, but alas, they are not.

In an attempt to stay on topic, enjoy this FAQ from u.nu, a URL shortening service. It's funny.

Jonathan Snook said on April 27, 2009

@Chris Wallace: I just recently styled the comment numbers. :) They used to be a plainly styled element right next to the commenter's name. My comments are also now highlighted within the list.

Chris Shiflett said on April 27, 2009

Have you considered adding rev="canonical" support? I wrote about it here:

http://shiflett.org/blog/2009/apr/save-the-internet-with-rev-canonical

Here's an API call that returns the advertised short URL for your post, if one exists:

http://revcanonical.appspot.com/api?url=http://snook.ca/archives/php/url-shortener

Simon Willison has a nifty bookmarklet for it:

http://simonwillison.net/2009/Apr/11/revcanonical/

Shimon said on April 27, 2009

I've wrote a short post about using base_convert not so long time ago, check it out - http://baaltshuvaslowly.blogspot.com/2009/02/tinyurl-with-one-line-of-php-easy.html

Andrew Woods said on April 27, 2009

Nicely Done! I've also thought about building a url shortener, and would've done it much the same way. Although, it'd be great to know: how often do urls get shortened?, and how often a particular shortened url gets used?

JR said on April 27, 2009

I took the liberty of converting your encode and decode functions into Python:

import math

codeset = 'Ts87HNB2US1dxhgMWCpAKmRXO0rnG4lDZkcFLqutzEYbfv6JQo3Pea5iw9VyjI'
base = len(codeset)
n = 300

def encode(id):

    encoded = ''

    while id > 0:
        position = int(id % base)
        encoded = ''.join([ codeset[position:position+1], encoded ])
        id = math.floor(id/base)

    return encoded

def decode(encoded):

    id = 0

    for index, char in enumerate(encoded[::-1]):
        id += codeset.find(char) * math.pow(base,index)

    return int(id)

I was recently working on something very similar, but my method wasn't nearly as clean.

I also randomized the codeset string to prevent sequential shortened URLs.

Jakob Heuser said on April 27, 2009

@Zach: To prevent duplicate links, you can also use the canonical "link" tag that google endorses:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html As per google, it's a hint that they "honor strongly". It's now also supported by Yahoo, Ask.com, and MS Live Search.

I agree we should be mindful of sending 301 v 302, but I haven't seen too many browsers that change behavior based on the redirect code.

Jamie Rumbelow said on April 27, 2009

Hey Jonathan,

Have you considered using rev="canonical"? Drew McLellan pointed me to it when I was recording my podcast - it's a semantic way of representing a shorter link. You simply embed a link tag into each page, with the target pointing to the shorter link.

PHP.net makes good use of it - if you view source on every single function page you can see the shorter URL. And if you visit http://revcanonical.appspot.com/, you can see it working in action.

I guess it's the next step for the semantic web - but what you've already accomplished is sweet, and I reckon rev="canonical" will make it even sweeter.

Anyway, if you've got any questions about rev="canonical", semantics or Microformats, give me an email - I'll be happy to help!

Thanks,

Jamie Rumbelow

Kevin Thompson said on April 27, 2009

Great article. Really serves to show how simple it can be to setup your own short URL service. However, what really stood out to me is what Chris Wallace noticed...

Your comments are slick!

As simple as they are, the variation in color that highlights your comments, and the number placement greatly compliment the current design. Bravo.

Kevin Thompson said on April 27, 2009

Blindsided by the newly styled comments, I failed to include my original question...

Now that you have a method for easily identifying your shortened URLs in social media, are you considering including some form of "tweetbacks" in your blog ?

Matt said on April 27, 2009

Nice.
Thanks!

SharonHill said on April 27, 2009

Nice way of shortening URL. I generally use http://aafter.us/ that shrinks large URL. It’s an open and free source.

Regards,
SharonHill

Maxime Perron Caissy said on April 29, 2009

Nice article snook, the super compressed url part is especially interesting!

ps. the comment numbering is brilliant, I first thought you had used images but that would have been a pain in the ass. Great use of fonts!

Michael Kozakewich said on May 01, 2009

I made a similar post a day or two ago (with far less graceful code), but the whole setup is kind of useless on my site (the domain alone has seventeen characters). I may be able to fit up to 40 links in my bar, using the trick where you don't need a question mark. I spent hours looking at different ways of doing it, though.

The best thing of all: I don't even use Twitter.

Matt said on May 01, 2009

Anyone interested in hosting their own URL Shortener should check out the project I manage, urlShort at http://urlshort.sourceforge.net. We’re aiming to have it include all the good features of URL shorteners, but be free, open-source, and easy to use/setup.

Tim Johannessen said on May 03, 2009

For the compression of the ID to work properly, the value should be at least two digits.

Compressing a value of 1 will remain 1 and so forth, until at least two digits are used.

I was looking for a way to "scramble" a string, and yet make it URI friendly - only problem is, that your input will grow to at least twice the size of the original input.

The scrambled string, has a self contained cipher, used when "unsrambling" the string.

http://sandbox.datapimp.dk/scrambler/

Zeb said on May 03, 2009

great job.. there is a similar URL shortener plugin over here: http://www.colly.com/comments/ee_shortener_plugin_your_own_short_urls_using_revcanonical_and_permanent_re/

Cloud Freak said on May 06, 2009

This is a very nice tutorial.
@Ivan: you will have collision soon if you do MD5

Abhisek said on May 11, 2009

WOW!!! Cool... Months back I made a tutorial on this too: http://ad1987.blogspot.com/2008/12/create-your-own-tinyurl-with-php-and.html

Travell Perkins said on May 26, 2009

Hello Jonathan,

Your writeup was great. The main gem for me was the base conversion code. I wrapped it in a static class and started experimenting. Your solution, base 58/60, is a much better solution than base 36 because codes. With 4 character code you can represent 0 to approximately 11.3M vs only 1.7M in base 36. Things get crazy at 6 codes, 58^6 vs 36^6.

I started experimenting with really large numbers to encode and your code fell over. It lost precision and was not able to accurately encode large integers though the decoding process worked fine. Basically after MAX_SIGNED_INT 32 bit (approximately 2.7T) it would fall over. Encoding and then decoding back would result in rounding errors.

Working in finance, I knew that I would need to convert your algorithm to use a big math library. PHP's floor blows up with big numbers and of course modulo and division. I had to implement floor since bcmath doesn't come with that baked in. The class works with insanely huge numbers now. Like all of the atoms in the universe size numbers. You can use it as a basis for an arbitrary base encoding by just augmenting the character set. Base 60 is great if you really aren't worried about people having to ever read the codes over the phone. Base 61 basically adds a hyphen to the mix, though leading,trailing, and multiple hyphens would look weird to most people so it's probably a bad idea. "---" would be a valid number.

I'm using the class in a project to include canonical short urls in content pages on my site. The added benefit is that I can just use the existing integer table ids. I tack on a prefix letter to identify the type of content (table type) and then I can convert the trailing code, get the id, and then lookup the SEO friendly long URL. I return that back to the browser with a 301 permanent redirect. Hope this helps peeps.

Here is the new code:


class BaseIntEncoder {

        //const $codeset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        //readable character set excluded (0,O,1,l)
        const codeset = "23456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ";

        static function encode($n){
            $base = strlen(self::codeset);
            $converted = '';

            while ($n > 0) {
                $converted = substr(self::codeset, bcmod($n,$base), 1) . $converted;
                $n = self::bcFloor(bcdiv($n, $base));
            }

            return $converted ;
        }

        static function decode($code){
            $base = strlen(self::codeset);
            $c = '0';
            for ($i = strlen($code); $i; $i--) {
                $c = bcadd($c,bcmul(strpos(self::codeset, substr($code, (-1 * ( $i - strlen($code) )),1))
                        ,bcpow($base,$i-1)));
            }

            return bcmul($c, 1, 0);
        }

        static private function bcFloor($x)
        {
            return bcmul($x, '1', 0);
        }

        static private function bcCeil($x)
        {
            $floor = bcFloor($x);
            return bcadd($floor, ceil(bcsub($x, $floor)));
        }

        static private function bcRound($x)
        {
            $floor = bcFloor($x);
            return bcadd($floor, round(bcsub($x, $floor)));
        }
    }

Travell Perkins

marvell said on May 28, 2009

GKAA8m hi! mi site is http://norffg.com
see you!

Dave said on June 05, 2009

Is there irony in the fact that Zeb points out Colly's article on short urls and posts the long url, so long that it runs into the next column?

FReNeTiC said on January 06, 2010

Yesterday i was thinking about "how does a url shortner works?".
I got the same conclusion as you got when you wrote the piece of code from the post :

$codeset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
$base = strlen($codeset);
$n = 300;
$converted = "";

while ($n > 0) {
$converted = substr($codeset, ($n % $base), 1) . $converted;
$n = floor($n/$base);
}
echo $converted; // 4Q

When I started trying the algorithm I saw that really large numbers where not working =/
Reading PHP manuals I got the same conclusion as |Travell Perkins.
But, I also saw that numbers higher than PHP integer limit are automatic converted to float, that has an even higher limit.
The problem of this conversion is that float numbers cant perform the % algorithm.
So, I came with a solution, that I THOUGHT (only my opinion =P) is a simple way of solving this integer-float problem, and working with a high numbers.

The code is this:


<?php

/*    recieve a number to convert to the new base    */
if ( !isset($_GET["n"]) || !is_numeric($_GET["n"]) || $_GET['n'] < 1 )
 exit;
   $number = $_GET["n"];
$stack = "";

$base = array("0", "1", "2", "3", "4", "5", "6" , "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
               "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
               "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");

/*    while the number is higher than the new base    */
while ($number >= 62) {
	/*    divide the number and split the float into 2 parts    */      $divided = explode(".", $number / 62);

	/*    the first part is an integer value, that is the new number for the loop    */
    $number = $divided[0];

/*    the other half will be used to calculate the rest of the division    */
    if ( isset($divided[1]) ) {
       $rest = "0." . $divided[1];
       $rest = round( $rest * 62);
    }
    else
        $rest = 0;
                                 	/*    whe take the rest of the division and find a base62 number for it    */
    $stack = $base[$rest] . $stack;
}

if ($number != 0)
 $stack = $base[(int)$number] . $stack;
   var_dump($stack);

?>

this algorithm have the same idea of yours, but it has a larger range.
Testing it i could see that PHP FLOAT LIMIT is a number with 14 digits.
Its like 11,111,111,111,111 \o/
And the algorithm converts this 14 digits number to a 8 characters key.

Hello! eagccke interesting eagccke site! said on February 10, 2011

Hello! eagccke interesting eagccke site!

Very nice site! said on February 10, 2011

Very nice site!

Hello! kddcfdg interesting kddcfdg site! said on February 10, 2011

Hello! kddcfdg interesting kddcfdg site!

Very nice site! said on February 10, 2011

Very nice site!

Sorry, comments are closed for this post. If you have any further questions or comments, feel free to send them to me directly.