URL design - hackable URLs and search engine optimisation

As part of this website’s long overdue redesign, where nothing is being left unscrutinised, I have been thinking about my URL structure. I want a solution that treats the URL as a command line (so it’s hackable), but I also want it to be well optimised for search engines. Here’s my thinking on the design of the URL structure I’m likely to use.

What is a hackable URL?

A hackable URL is one that makes sense to a human reader, and where the reader can guess what to change to get to another page. A hackable URL has no magic voodoo in the address bar, unlike, say, http://some-site.ext/index.php?query=3&foo=bar. My site hasn’t done this for years, thanks to adopting a ‘pretty URL’ philosophy with the last re-design.

But a hackable URL is not just a pretty URL. For example, http://some-site.ext/blog/2008/04/20/my_blog_title/ is pretty and understandable, but you can’t hack it. You’ve no way of knowing the date of the previous blog post, or what it was called, so you can’t just change the URL to see it.

A hackable URL might be something like this: http://some-site.ext/blog/100. You can guess from this that you are reading the one hundredth blog post from some-site.ext, and you can take a tiny imaginative leap and type ‘99’ at the end with a large degree of confidence that you’ll get the previous blog post. You hacked the URL. A hackable URL lets you apply logic and common sense to find the data you want, without needing to know the exact URL beforehand.
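To make the ‘type 99 instead of 100’ idea concrete, here is a minimal sketch in Python (chosen purely for illustration) of the guess a reader makes in their head. The function name `neighbouring_post_url` is made up for this example, and it assumes the last path segment is a plain post number, as in the some-site.ext example.

```python
from urllib.parse import urlsplit, urlunsplit


def neighbouring_post_url(url: str, step: int = -1) -> str:
    """Return the URL `step` posts away from a URL that ends in a post number."""
    parts = urlsplit(url)
    segments = parts.path.rstrip("/").split("/")
    post_id = int(segments[-1])  # ValueError if the last segment is not a number
    target = post_id + step
    if target < 1:
        raise ValueError("there is no post before the first one")
    segments[-1] = str(target)
    # Note: any trailing slash on the original URL is dropped in this sketch.
    return urlunsplit(parts._replace(path="/".join(segments)))


print(neighbouring_post_url("http://some-site.ext/blog/100"))
```

The same trick is impossible with the date-and-title URL, because nothing in it tells you what the neighbouring URL should be.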

What is a search engine optimised URL?

Search engines are complicated things, and they do an awful lot of work analysing your webpage to decide how important it is, which words it is relevant to, and where it should rank in their result set when someone searches for content on the page. One part of the calculation involves looking at the URL of the page being indexed. As a rough rule of thumb, if the URL contains some of the keywords from the webpage itself, and the page sits at a shallow depth in the URL structure, then you score more points. For example, http://some-site.ext/blog/2008/04/20/ isn’t likely to rank well for the term “find me”, because the term is not in the URL. http://some-site.ext/blog/2008/04/20/find-me would be better, but it looks like it’s buried in a sub-directory, which isn’t as optimised as it might be.

There’s not a great deal of consensus on how URL structure and keywords within the URL affect rankings, but if you want to read more, SEOmoz have a long opinion piece on the subject. None of it is gospel: search engines keep their ranking criteria secret precisely so spammers can’t abuse them. One of the only concrete things to be aware of is that each page should have only one URL that points to it. You should not have ‘duplicate content’ on your site (the same page with more than one URL leading to it).

Hitching the two together

First things first: you can no longer use www. in my domain. Any address that includes it will forward to the same URL with the www. removed. That’s a way of ensuring there’s only one URL to get to the page you’re looking at, and as it was really simple to implement I’ve already done this. All the other URL tweaks are going to have to wait for the launch of the re-design (which hasn’t even been started in code yet).
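The post doesn’t say how the www. forwarding is implemented, so the following is only a sketch of the decision involved, written in Python for illustration. The function name `strip_www` and the choice of a 301 status are assumptions, not a description of the real server setup.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url: str):
    """Return (status, location) for a request to `url`.

    Assumption: hosts starting with "www." get a permanent (301) redirect to
    the same URL without it; anything else is served as-is (200).
    """
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        return 301, urlunsplit(parts._replace(netloc=host[4:]))
    return 200, url


print(strip_www("http://www.mattwilcox.net/all/990/"))
```

Either way, the point is that every page ends up with exactly one host name in front of it.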

With the redesign the URL structures are going to change. For permalinks to blog posts, my URL structure is going to be like this:
http://mattwilcox.net/all/990/ or perhaps
http://mattwilcox.net/all/990:blog-title/
Hopefully the second URL still looks hackable: the colon makes it feel like the number is defining the id, and in the implementation the URL parser would completely ignore the colon and anything after it, so you could still hack it. I may forgo the ‘keywords in the URL’ element because I’m not convinced it matters much, and because it can quickly look spammy. I’ve not made up my mind yet.
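A minimal sketch of that ‘ignore the colon and everything after it’ rule, in Python for illustration only. The function name `parse_post_segment` is hypothetical; the real parser hasn’t been written yet.

```python
def parse_post_segment(segment: str) -> int:
    """Extract the post id from a segment such as '990' or '990:blog-title'."""
    # partition() splits on the first colon; whatever follows it is discarded.
    post_id, _colon, _ignored_title = segment.partition(":")
    return int(post_id)  # ValueError if the part before the colon is not a number


print(parse_post_segment("990"))
print(parse_post_segment("990:blog-title"))
```

Note that under this rule any text after the colon resolves to the same post, since only the number is ever read.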

Here’s how I think the URLs are going to work for the rest of the site:

With the exception of reserved words (about, contact, preferences, tag, and media) the first part of the URL will be a category name:

http://mattwilcox.net/all/
will spit out an archive page listing all posts.

http://mattwilcox.net/all/y:2008/
will spit out an archive page containing all posts from 2008.

http://mattwilcox.net/all/y:2008/m:12/
will spit out an archive page for all posts in December 2008.

http://mattwilcox.net/all/1/
will spit out the very first blog post.

http://mattwilcox.net/web-development/1/
will spit out the first blog post filed under Web Development.

http://mattwilcox.net/tag/tag-name/
will spit out an archive of posts assigned the tag ‘tag-name’.

http://mattwilcox.net/tag/tag-name/y:2008/
will spit out an archive of posts assigned the tag ‘tag-name’ from 2008.

http://mattwilcox.net/tag/tag-name/12/
will spit out the twelfth post assigned the tag ‘tag-name’.

I hope it’s logical enough to follow exactly how you might find, for example, the first post assigned a tag of ‘design’ in February 2007 (that’d be http://mattwilcox.net/tag/design/y:2007/m:2/1/ by the way).
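The scheme above can be sketched as a small path parser. This is Python for illustration, not the site’s actual code: the `Query` class, its field names and the `parse_path` function are all invented for this example, and it assumes the segments always arrive in the order shown in the examples (category or tag first, then optional `y:` and `m:` filters, then an optional post number).

```python
from dataclasses import dataclass
from typing import Optional

# Reserved first segments, as listed in the post.
RESERVED = {"about", "contact", "preferences", "tag", "media"}


@dataclass
class Query:
    category: Optional[str] = None
    tag: Optional[str] = None
    year: Optional[int] = None
    month: Optional[int] = None
    position: Optional[int] = None  # nth post within the filtered set


def parse_path(path: str) -> Query:
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("empty path")
    query = Query()
    head = segments.pop(0)
    if head == "tag":
        if not segments:
            raise ValueError("tag name missing")
        query.tag = segments.pop(0)
    elif head in RESERVED:
        raise ValueError(f"{head!r} is a reserved word, not a category")
    else:
        query.category = head
    for segment in segments:
        if segment.startswith("y:"):
            query.year = int(segment[2:])
        elif segment.startswith("m:"):
            query.month = int(segment[2:])
        else:
            # A bare number selects the nth post; text after a colon is ignored.
            query.position = int(segment.partition(":")[0])
    return query


print(parse_path("/all/y:2008/m:12/"))
print(parse_path("/tag/tag-name/12/"))
```

Each field left as `None` simply means ‘no filter’, so `/all/` on its own yields the full archive.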

However, hackable URLs like this provide many ways of accessing the same page:

http://mattwilcox.net/tag/design/y:2007/14/
http://mattwilcox.net/tag/eric-meyer/13/
http://mattwilcox.net/all/671/

and a few dozen other URLs could all point to the same page, and search engines are very likely to see this as spamming and penalise accordingly. To get around this problem I’m going to use a little PHP logic so that all blog post URLs get a ‘noindex’ robots meta tag to stop search engines from indexing their content, except when the post is accessed through its permalink URL ( http://mattwilcox.net/all/990:url-design/ ). To further help search engine rankings, I’m going to redirect requests that arrive from an external referrer to the permalink URL. So if Joe Bloggs has a link on his blog pointing to http://mattwilcox.net/design/y:2008/14/ (because that’s how he found the page while looking through my website), my server will transparently redirect to the permalink URL instead, using a 301 redirect so search engines still follow the link and index the page.
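Here is a rough sketch of that decision logic. The plan is to do this in PHP; the version below is Python purely for illustration, and everything in it is an assumption: the function name `respond`, the shape of its return value, and the referrer host `joebloggs.example`. Working out which permalink a hackable URL resolves to needs the post database, so the sketch simply takes the permalink path as a parameter.

```python
from urllib.parse import urlsplit

SITE_HOST = "mattwilcox.net"
NOINDEX_TAG = '<meta name="robots" content="noindex">'


def respond(request_path: str, permalink_path: str, referrer: str = ""):
    """Return (status, redirect_location, robots_meta_tag) for a blog post request."""
    if request_path == permalink_path:
        # The permalink is the one URL search engines are allowed to index.
        return 200, None, None
    referrer_host = urlsplit(referrer).hostname if referrer else None
    if referrer_host and referrer_host != SITE_HOST:
        # Arrived from someone else's site: send them on to the permalink.
        return 301, permalink_path, None
    # Browsing within the site (or no referrer): serve the page, but mark it noindex.
    return 200, None, NOINDEX_TAG


print(respond("/design/y:2008/14/", "/all/990:url-design/",
              "http://joebloggs.example/links"))
```

In this sketch a request with no referrer at all is served with the noindex tag rather than redirected, which is one possible reading of the plan rather than a settled detail.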

Entry Information

Posted: Sun, 20th Apr 2008 at 19:04 UTC

Filed under: Web Development

Comments

Andrew Ingram posted 8 days, 14hrs, 0mins after the entry and said:
Having your parser ignore the text after the id in http://mattwilcox.net/all/990:url-design/ is bad because you've done the very thing you earlier set out to avoid, having multiple paths to the same resource. You can change "url-design" to say anything you want and still get the same thing so you've actually allowed effectively unlimited ways of getting to article 990.

Matt Wilcox posted 8 days, 21hrs, 8mins after the entry and said:
Good point Andrew! I had missed that completely.

That being the case I'll not bother with the colon. Thanks for pointing that out

Dean Landolt posted 23 days, 18hrs, 51mins after the entry and said:
Sure, but it shouldn't matter. Just as you're doing with referrer urls, you can just rewrite the url for the permalinks if they get the slug wrong.
