blz78suf/README.md

3.0 KiB

blz78suf

blz78suf - suffix trie-based optimal LZ78 parsing for Brainfuck code generation. Makes no assumptions on cell size, cell wrapping behaviours or tape wrapping behaviours of the underlying interpreter. Asymptotically optimal for highly redundant inputs.Released to the public domain by Kamila Szewczyk - see COPYING.

Project homepage: https://github.com/kspalaiologos/blz78suf

Building

# If using a git clone (not needed for source packages), first...
$ ./bootstrap

# All...
$ ./configure
$ make
$ sudo make install

Algorithm description

Standard brainfuck text generation algorithms are usually stateless (i.e. load a byte value, output, clear, repeat) or use otherwise low order finite state. Using optimal constants for such a generator renders it optimal on IID inputs. The generator is still however unable to efficiently exploit the redundancy in such an input stemming from the underlying probability distribution of the source.

Low-order finite state generators usually encode transitions between cell states that are desired to be output. As an example, redundancy on the 2-wide sliding window level can be exploited by computing a 256x256 table of optimal transition phrases between cell values.

This however does not approximate real, variable order redundancy in the source. Techniques that implement data compressors in brainfuck and load the compressed data to memory, decompressing it at the runtime, generally exhibit poor performance characteristics due to the high overhead of random memory access in Brainfuck (fastest algorithms are quintic-time).

blz78suf operates by finding long phrases in the input that can be encoded using procedural logic. For example:

void the() { printf("the"); }
int main() {
  the(); printf(" quick brown fox jumps over ");
  the(); printf(" lazy dog");
}

Could be shorter than:

int main() {
  printf("the quick brown fox jumps over the lazy dog");
}

Had an approperiate, sufficiently repetitive phrase been chosen. The benefit of this approach is that we can deduplicate repeating phrases and delegate the more granular, lower order redundancy to a stateless generator.

blz78suf builds on a stateless text generator nicked from the CodeGolf StackExchange website for the Brainfuck Golf challenge. The same generator is used by copy.sh. Phrases are found by constructing the suffix trie and ranking potential replacements by their frequency and length. Then, the individual messages are encoded and the output is generated.

Future improvements

blz78suf is a prototype and as such, it is not optimized for speed. The algorithm could be sped up by using an efficient exclusion algorithm for suffix tries and by using Ukkonen's algorithm for linear-time compressed structures. The procedural structure of the output could be optimized by allowing phrasal chaining. Further, a more efficient low order generator could be used. For example, such a desirable tool would detect patterns via delta encoding and run length encoding.