How the jsquash compressor works
The algorithm I used is fairly simple – you find groups of characters that are the same, and replace them with “code” characters which stand for those characters.
For example, say you want to compress the following line (only using lowercase letters and a few punctuation marks, for simplicity):
alert(‘peter piper picked a peck of pickled pepper.\na peck of pickled pepper peter piper picked.\nif peter piper picked a peck of pickled pepper,\nwhere’s the peck of pickled pepper peter piper picked?’);
Using the application, we come up with this:
var a='/-+*&%$'.split('') var b='$ p$eter piper picked$eck of pickled pepper$\\\\$\\'$'.split(a[a.length-1]),c='alert(%p+ a-*.&na-*-+.&nif-+ a-*,&nwhere%s the-*-+?%);' for(e in a)c=c.split(a[e]).join(b[e]) eval(c);
The first thing the application does is to get rid of all the unnecessary spaces, comments and semi-colons in the original. I won’t bother explaining that, as it’s simple to understand from the source, and can’t really be improved much (well, the implementation probably can, but I think the results are pretty much ideal). The goal of this article is to explain the compression algorithm, so that readers can potentially help improve the compression.
The first thing the algorithm did was to find all characters from ASCII character 32 (space) upwards to 255, that did not appear in the original. In this case, this included ‘/-+*&%$’ and a whole lot more.
From those characters, a delimiter is chosen, a replacement for backslashes (if they’re used), and a replacement for the apostrophe, which will be used to contain strings in the encoder ($, & and %). Then the compression starts.
This method follows the algorithm used by Chris’s application here. His method basically starts finding strings of length
x (user-defined – 15 is recommended), and figuring out how much space would be saved if the copies found were replaced by single characters. Then, the algorithm starts again with
x-1, and so on.
There is a problem with that approach, which becomes clear when you try it from the other way around (start with 2 and move up from there).
So, we start at 2. First, we check if the string composed of the first and second characters of the original (‘al’, in our example) appear multiple times. If not, we move on and test the 2nd and 3rd letters (‘le’).
‘le’ does appear a few times, so we can run a quick calculation to see how many characters would be saved if we replaced ‘el’ with ‘*’ (the next unused character in our list). This is 2 (the length of ‘le’) multiplied by 5 (the amount of occurances of the string) minus 5 (the number of replacement characters) minus 2 (two delimiters) – 3 (dunno how to describe this one – the replacement character plus the replaced string, in the end-results encoded strings).
The same test is then run with the 3rd and 4th characters, etc. Whenever a string replacement is found that comes up with a higher score than the current highest, it becomes new highest, and is recorded as the “best” replacement, for when we do the actual replacement.
Note that there is already some inefficiency here – there is no point doing the calculation test when you are testing the third occurance of ‘li’, for example – it’s already been calculated. So, we only do the calculation test if the occurance you’re testing is the second one. This ensures that A) the occurance is not unique (which would be a waste of time calculating), and B) that you only ever test a particular string once. Another way to do this would be to keep a hash array of already-tested strings, but it’s debateable whether that would be quicker (and I can’t be bothered testing it…).
Another point of inefficiency, is that if the original is 20 characters long, and your current highest is a 2 character string that occurs 3 times, then there is no point testing strings made from the 15th and 16th characters and beyond, as there’s no way this could improve the score – so, we record a “stop searching at” variable, which tells where to stop looking from. This point is the length of the string minus the length of the best match multipled by the amount of occurances of the best match.
After reaching the end of the test for strings of length 2, we start again with 3.
Another efficiency note, which explains why it’s better to start from 2 and work upwards, than to start from 15 and work down – since we already know that the string ‘al’ only occurs once in the original, there is no point even looking at the string ‘ale’. To work with this, we don’t just start with the first character, test that, move to the second, test that, etc – instead, we start off at 2 characters with an array of starting points, which coincidentally, happens to be every single character in the original. A second array is created for the next “level” (when we start testing length 3), and each time a multiple-occuring string is found, its position is added to the second array. At the end of the length 2 round, we have an array of starting points for length 3.
This dramatically speeds up the search, but also tells when the algorithm can’t proceed any further (when the new array is empty, there is no point continuing).
When we come across the point where the new array is empty, we then do the actual replacement, with the current “best” match. In our case, this is “eck of pickled pepper”. I’m not sure why it’s not “peck of pickled pepper”, but that’s probably why I need help with the algorithm (I’ve developed it beyond my own ability to understand 😉 ).
After that, we simply repeat, until the “best” match results in a negative score, at which point any further replacements will make the result grow instead of shrink.
If there are any questions or suggestions, please comment.
Thanks, that makes it a lot clearer.
I think the algorithm can be improved by recalculating the array of possible replacement-Characters again in every round. Every round may completely cut away the occurrence of some characters. That way you usually won’t run out of replacement charactes below 127 and don’t necessarily need the problematic non ASCII characters. We pay with some aditional calculations, but only characters in the last replaced string need to be checked and a single run through the string for every compression round would be sufficient. I think that would be acceptable.
Good idea. I think that idea was almost in my head at some point, but it vanished when I got distracted by something (as happens a lot with me!).
I will have a subversion repository up and running tomorrow. I’m just waiting for some hardware to arrive so I can upgrade my home computer (it keeps freezing), then the repository will be up and running.
I’ll put your idea into the algorithm later today.
I have some workling code. It produces JS-code that is as short as the one produced by your JS-Programm and still works:
You should only need the C++ std-lib to compile it.
Actually I couldn’t simply transfer your code to C++. That’s why I took your description here and did my own implementation. I also added my Idear of recalculating the array of free characters after each round but I didn’t do it very efficient.
I’m planing to improve the code with better algorithms and using boost Libraries. When that is done I’ll improve the API which is very basic at the moment.
The License ist GPL. Use it as you like.
Well, only very simpe JS-code did work with my compressor. Last night I have fixed a bug and now complex code should also work. Just use the same link as above.
sorry I haven’t had time to work on this recently – we’re porting a huge number of websites to our latest CMS in work, so I’m fairly busy. I will hopefully be able to get back to this soon.
I’ll be simply advancing untill you can come back.
I’ve registered a sourceforge-Project now and added you as a member:
When I find the time in the next days I’ll try to polish my current code a bit before I check it in.