Base 64 URL encoding/counting
Several years ago I had the need to automate rate management for a system running the Harris Protocall calling card platform. Never heard of it? You’re probably a better person for it.
Previously the system was maintained by hand which obviously was extremely error prone. Just to give you a quick idea how easily it was to screw up, consider this… the system was completely powered by FoxPro “databases”. Not even Visual FoxPro - we’re talking old school, early 90s FoxPro 2.6 for DOS. The rate updates required manually updating things across 4 or 5 “tables” via an ASCII interface prettied up with some ANSI coloring.
That mostly just sounds antiquated, not difficult. Now consider that we’re talking about at least 20 or 30 Rate Plans in the system and to squeeze every penny out of international telephone rates, you’re talking about upto 3500 rates per plan, split up by country and city codes. Updated by hand. From a spreadsheet. Possibly several spreadsheets. Ugh.
Anywho, back to the point - this old ass system was NOT built to support that many rates. To fit that many rates into the system, I needed to be able to convert some 4 and 5 digit numbers into a 3 character field. Without some massaging, that obviously does not work. My initial thought was to try converting decimal to hex since that’s something pretty universally supported across languages. No dice, that simply didn’t compress the numbers enough. That was definitely the correct approach, though, so I put in a little work and started doing Base32 “encoding” or “counting”. It worked perfectly.
Skip ahead to late 2008/early 2009 and I found myself needing to create shortened URLs for a couple sites thanks to Twitter. I could have simply used the Base32 encoding I’d previously created, but I really wanted the most bang for my buck, so I decided to start playing with Base64 “encoding”/”counting”. The biggest issue here is that you have to be very careful about the character set you use since they need to be characters that do not require being encoded in URLs, so things like “=”, “/”, “?” and so on are not viable options. You may immediately thing “oh, sure, I have a function for Base64 encoding” and yes, you do, but it includes unsafe characters and is not actually for compressing things - in fact, it will make a string/number longer. As you’ll see below, I quickly filled up 62 characters by using “a-z”, “A-Z”, and “0-9″. I then picked “-” and “_” since they should be safe and appear in tons of URLs.
Ok, so here is the original PHP code I used:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | /** * Decimal to pseudo-Base64 encoding/counting */ public static function dec2bsf($int){ $chars = array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z', 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z', '0','1','2','3','4','5','6','7','8','9','-','_'); $str=""; for($i=0; $int >= 0; $i=$i){ $str = $chars[$int%64].$str; $int=floor(($int/64)); if ($int == 0){ break; } } return $str; } /** * pseudo-Base64 encoding/counting to Decimal */ public static function bsf2dec($str){ $chars = array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z', 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z', '0','1','2','3','4','5','6','7','8','9','-','_'); $chars = array_flip($chars); $int = 0; for($i=0;$i<strlen($str);$i++){ $ch = $str[$i]; if ($i==strlen($str)-1){ $int += $chars[$ch]; } else { $pow = pow(64,strlen($str)-($i+1)); $mult = $chars[$ch]; $int += $pow * $mult; } } return $int; } |
I also had the opportunity/need to use this in a Rails project, so I ported that same code over to Ruby:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | # Decimal to pseudo-Base64 encoding/counting def self.dec2bsf int chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z', 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z', '0','1','2','3','4','5','6','7','8','9','-','_'] str="" while int > 0 str = chars[int%64]+str; int=(int/64).floor end str end # pseudo-Base64 encoding/counting to Decimal def self.bsf2dec str return '' unless !str.nil? chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z', 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z', '0','1','2','3','4','5','6','7','8','9','-','_'] int = 0; i = 0 str.each_char do |ch| if i==str.length-1 int += chars.index(ch) else pow = 64**(str.length-(i+1)) mult = chars.index(ch) int += pow * mult; end i = i + 1 end int end |
Alrighty then. To finish this up, there are a couple of gotchas that I’ve run into, though they aren’t too major:
- Because “a” stands for “0″, when you are decoding these slugs, “y”, “ay”, “aay”, and so on are all the same decimal value : 24. That’s because that’s like decoding into “24″, “024″, and “0024″, which, when dealt with as ints, are all the same number. In reality, this is not a big deal b/c the encoding will never create anything starting with an “a”.
- Early on in using this I saw some services decide that the trailing “-” or “_” was not actually part of the URL and took it upon themselves to remove them. I haven’t seen that happen in quite a while, but consider yourself warned. If you are really worried about that, try swapping out those two characters with something you feel better with. If you do, near the end of Section 2.2 of RFC1739 the additional “safe” characters are listed.
Finally, as a disclaimer, I’ll simply say that this works. I guarantee you these routines can be optimized a good bit through some more sophisticated math.
