The algorithm I used is fairly simple – you find groups of characters that are the same, and replace them with “code” characters which stand for those characters.
For example, say you want to compress the following line (only using lowercase letters and a few punctuation marks, for simplicity):
alert(‘peter piper picked a peck of pickled pepper.\na peck of pickled pepper peter piper picked.\nif peter piper picked a peck of pickled pepper,\nwhere’s the peck of pickled pepper peter piper picked?’);
Using the application, we come up with this:
var a='/-+*&%$'.split('') var b='$ p$eter piper picked$eck of pickled pepper$\\\\$\\'$'.split(a[a.length-1]),c='alert(%p+ a-*.&na-*-+.&nif-+ a-*,&nwhere%s the-*-+?%);' for(e in a)c=c.split(a[e]).join(b[e]) eval(c);
The first thing the application does is to get rid of all the unnecessary spaces, comments and semi-colons in the original. I won’t bother explaining that, as it’s simple to understand from the source, and can’t really be improved much (well, the implementation probably can, but I think the results are pretty much ideal). The goal of this article is to explain the compression algorithm, so that readers can potentially help improve the compression.
The first thing the algorithm did was to find all characters from ASCII character 32 (space) upwards to 255, that did not appear in the original. In this case, this included ‘/-+*&%$’ and a whole lot more.
From those characters, a delimiter is chosen, a replacement for backslashes (if they’re used), and a replacement for the apostrophe, which will be used to contain strings in the encoder ($, & and %). Then the compression starts.
This method follows the algorithm used by Chris’s application here. His method basically starts finding strings of length
x (user-defined – 15 is recommended), and figuring out how much space would be saved if the copies found were replaced by single characters. Then, the algorithm starts again with
x-1, and so on.
There is a problem with that approach, which becomes clear when you try it from the other way around (start with 2 and move up from there).
So, we start at 2. First, we check if the string composed of the first and second characters of the original (‘al’, in our example) appear multiple times. If not, we move on and test the 2nd and 3rd letters (‘le’).
‘le’ does appear a few times, so we can run a quick calculation to see how many characters would be saved if we replaced ‘el’ with ‘*’ (the next unused character in our list). This is 2 (the length of ‘le’) multiplied by 5 (the amount of occurances of the string) minus 5 (the number of replacement characters) minus 2 (two delimiters) – 3 (dunno how to describe this one – the replacement character plus the replaced string, in the end-results encoded strings).
The same test is then run with the 3rd and 4th characters, etc. Whenever a string replacement is found that comes up with a higher score than the current highest, it becomes new highest, and is recorded as the “best” replacement, for when we do the actual replacement.
Note that there is already some inefficiency here – there is no point doing the calculation test when you are testing the third occurance of ‘li’, for example – it’s already been calculated. So, we only do the calculation test if the occurance you’re testing is the second one. This ensures that A) the occurance is not unique (which would be a waste of time calculating), and B) that you only ever test a particular string once. Another way to do this would be to keep a hash array of already-tested strings, but it’s debateable whether that would be quicker (and I can’t be bothered testing it…).
Another point of inefficiency, is that if the original is 20 characters long, and your current highest is a 2 character string that occurs 3 times, then there is no point testing strings made from the 15th and 16th characters and beyond, as there’s no way this could improve the score – so, we record a “stop searching at” variable, which tells where to stop looking from. This point is the length of the string minus the length of the best match multipled by the amount of occurances of the best match.
After reaching the end of the test for strings of length 2, we start again with 3.
Another efficiency note, which explains why it’s better to start from 2 and work upwards, than to start from 15 and work down – since we already know that the string ‘al’ only occurs once in the original, there is no point even looking at the string ‘ale’. To work with this, we don’t just start with the first character, test that, move to the second, test that, etc – instead, we start off at 2 characters with an array of starting points, which coincidentally, happens to be every single character in the original. A second array is created for the next “level” (when we start testing length 3), and each time a multiple-occuring string is found, its position is added to the second array. At the end of the length 2 round, we have an array of starting points for length 3.
This dramatically speeds up the search, but also tells when the algorithm can’t proceed any further (when the new array is empty, there is no point continuing).
When we come across the point where the new array is empty, we then do the actual replacement, with the current “best” match. In our case, this is “eck of pickled pepper”. I’m not sure why it’s not “peck of pickled pepper”, but that’s probably why I need help with the algorithm (I’ve developed it beyond my own ability to understand 😉 ).
After that, we simply repeat, until the “best” match results in a negative score, at which point any further replacements will make the result grow instead of shrink.
If there are any questions or suggestions, please comment.