Update: Rewrite Rules on Apache 1.3 are Greedy By Default

Well, that was annoying! Some of you may have noticed intermittent outages on my blog while I was trying to fix the URLs. Scary stuff. Anyway, this post will be gibberish if you don’t understand regular expressions. If you’re one of those people, I suggest you turn back now. 🙂 This is, after all, a technical blog too. 😛

What I discovered was that I was incorrect in my post last night about the catch-all .htaccess entry that would redirect all traffic from michikono.com/blog. Here’s the wrong rewrite rule:

RewriteRule ^/?(.*) http://www.michiknows.com/$1/ [R=301,L]

The goal was to take whatever text came after “michikono.com/blog” (such as the post name), and stick it on the end of “michiknows.com” so that all the old articles translate over without outages. Unfortunately, I noticed a few problems.

The correct code is as follows:

# match just blog
RewriteCond %{REQUEST_URI} blog/?$
RewriteRule . http://www.michiknows.com/ [R,L] 

# match blog posts
RewriteRule (.*) http://www.michiknows.com/$1 [R,L]

Why two regular expressions? Well, I couldn’t use the reluctant modifier (“?”) to make a single pattern handle the case where there was no trailing slash. No matter what I did, it would act as if there was no trailing slash. This broke things such as the RSS feed!

The problem was that Apache 1.3 uses a greedy catch-all by default that cannot be disabled. In other words, the “*” can’t be made non-greedy by adding a “?” after it, which is possible in most Perl-style regex implementations. The warning sign is that when you put a question mark after a “+” or “*”, Apache gives you an error!
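To make the greedy versus reluctant difference concrete, here is a small sketch in Python. Python’s re module is not the regex engine Apache 1.3 uses, so this only illustrates the behavior I was hoping for. The patterns and the helper name capture are made up for this example.

```python
import re

# Greedy: (.*) swallows the trailing slash, so the optional /? matches nothing.
greedy = re.compile(r"^(.*)/?$")
# Reluctant: (.*?) hands the slash to /?, so the capture has no trailing slash.
reluctant = re.compile(r"^(.*?)/?$")


def capture(pattern, path):
    """Return group 1 of pattern matched against path (hypothetical helper)."""
    match = pattern.match(path)
    return match.group(1) if match else None


for path in ("feed", "feed/"):
    print(path, "->", capture(greedy, path), "|", capture(reluctant, path))
```

With the reluctant version, “feed” and “feed/” both capture as “feed”, which is what makes a single rule possible in engines that support it.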

So my new solution breaks the problem into two steps.

  1. First, I check specifically for a hard link to the blog home page, which may or may not contain a trailing slash. If the request matches, it is simply forwarded to this site with a trailing slash.
  2. Then I set up a second catch-all rule that just does a straight search and replace. It doesn’t bother with the trailing slash at all, since it just snips the entire URL and tacks it on.
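The two steps can be sketched in Python to see which URL each kind of request ends up at. This is only a rough model and not how Apache evaluates the rules internally. The function name redirect_target is hypothetical, and the sketch assumes the .htaccess file sits in the old /blog directory, so that prefix is stripped by hand here.

```python
import re

NEW_SITE = "http://www.michiknows.com/"
OLD_PREFIX = "/blog/"  # assumption: the rules live in the old /blog directory


def redirect_target(request_uri):
    """Model the two-step redirect (hypothetical helper, not Apache itself)."""
    # Step 1: mirrors "RewriteCond %{REQUEST_URI} blog/?$".
    # The blog home page, with or without a trailing slash, goes to the new home.
    if re.search(r"blog/?$", request_uri):
        return NEW_SITE
    # Step 2: mirrors "RewriteRule (.*) http://www.michiknows.com/$1".
    # Everything after the old prefix is appended unchanged, slash or no slash.
    if request_uri.startswith(OLD_PREFIX):
        rest = request_uri[len(OLD_PREFIX):]
    else:
        rest = request_uri.lstrip("/")
    return NEW_SITE + rest


for uri in ("/blog", "/blog/", "/blog/feed/", "/blog/my-post"):
    print(uri, "->", redirect_target(uri))
```

One thing the sketch makes visible: the condition is not anchored at the start, so any request whose path happens to end in “blog” would also be sent to the home page.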

It may seem like the second rule could replace the first. It can’t, because the greedy operator acts weird and doesn’t behave as I want. I tried for an hour straight. Believe me. No matter what hack or workaround I used, there was always a case that no longer worked (usually the home page bug). And to be honest, regular expressions with Apache are just plain horrible to work with. This solution finally worked, and it is what I will settle on. Even as I wrote this post, I tried three other solutions that should work in any other regular expression environment, and all of them failed.

So if you ever decide to move your blog, try the above solutions before giving up.
