Proof of concept entropy encoder with output size below the Shannon limit.
Introduction
This repository contains a basic proof-of-concept implementation of what I'd like to call Valli encoding, an entropy encoder whose output is below the Shannon limit (lossless compression per the source coding theorem). It fits the definition of an entropy encoder in that it can losslessly encode all permutations of a given set of symbol frequencies in a fixed amount of space. The symbol locations are treated as i.i.d.; no information about the structure/order of the data is used (as opposed to Lempel-Ziv style compressors). It does, however, leverage the symbol frequencies in a way that typical entropy encoders do not, requiring them to be exact counts.
I'd be happy to be politely corrected if this already exists; as far as I can tell it is a novel approach, but I am just a problem solver for fun. On the other hand, if you're thinking output this small is impossible, so did I at the start.
For a mathematical explanation: compare the total size dictated by the Shannon limit to the size it would take to store the number of permutations (the multinomial coefficient) of the same symbol frequencies, since by the pigeonhole principle each unique permutation can be assigned its own unique integer value. The Shannon limit agrees with this count only under Stirling's approximation, an asymptotic approximation for factorials; as such the approximation is too large for small values and becomes more accurate as the factorial sizes (i.e. message length and symbol counts) approach infinity.
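To make the connection explicit, apply Stirling's approximation $\ln n! \approx n \ln n - n$ to the multinomial for symbol counts $A, B, C, \ldots$ summing to $T$ (the $-n$ terms cancel because the counts sum to $T$):

$$\log_2 \frac{T!}{A!\,B!\,C!\cdots} \approx T\log_2 T - A\log_2 A - B\log_2 B - \cdots = T\left[-\tfrac{A}{T}\log_2\tfrac{A}{T} - \tfrac{B}{T}\log_2\tfrac{B}{T} - \cdots\right]$$

The bracketed expression is the Shannon entropy per symbol, so the right-hand side is exactly the Shannon limit; the lower-order terms Stirling drops account for the gap exploited here.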
As an example, take symbol frequencies of 1, 2, and 3, for a total of 6 symbols.
Shannon limit
[-(1/6)*log2(1/6)-(2/6)*log2(2/6)-(3/6)*log2(3/6)]*6 = 8.75 => 9 bits
Non-Asymptotic/Multinomial limit
log2(6!/(1!*2!*3!)) = log2(60) = 5.91 => 6 bits
For a more concrete example of how this is possible: given a message with two symbols in a 1:1 ratio, typical entropy encoders (Huffman, ANS, etc.) would require 1 bit per symbol, with 0 representing one symbol and 1 representing the other. If the total message length is 8 symbols with a 4:4 ratio, the Shannon source coding theorem would require an output size of 8 bits total, with 4 zeros and 4 ones. Eight bits are capable of storing $2^8=256$ different values; however, the total number of permutations of our input/output is only 8 choose 4, or $8!/(4!*4!) = 70$. Seventy unique values require only 7 bits to be stored ($2^7=128$). Stated another way, typical entropy encoders require our example to have exactly 4 bits set in the output; if output values with set-bit counts of 0-3 and 5-8 are utilized as well, those extra values can be used to store more information in a smaller space.
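One way to realize this counting argument is the classic combinatorial number system, which maps each 4-of-8 bit pattern to a unique index in $[0, 70)$, i.e. 7 bits. The sketch below is illustrative only and is not necessarily how the encoder in this repository works:

```cpp
// Illustrative sketch: rank an 8-bit pattern with exactly 4 set bits to a
// unique index in [0, 70), which fits in 7 bits. This is the classic
// combinatorial number system, not necessarily this repo's exact algorithm.
#include <cstdint>
#include <cstdio>

// Small binomial coefficient; exact for these tiny inputs.
uint64_t choose(unsigned n, unsigned k) {
    if (k > n) return 0;
    uint64_t c = 1;
    for (unsigned i = 0; i < k; ++i)
        c = c * (n - i) / (i + 1);  // divides exactly at every step
    return c;
}

// Lexicographic rank among all n-bit patterns with exactly k bits set,
// scanning from the most significant bit down.
uint64_t rank_pattern(uint8_t bits, unsigned n, unsigned k) {
    uint64_t rank = 0;
    for (unsigned pos = n; pos-- > 0 && k > 0; ) {
        if (bits & (1u << pos)) {
            rank += choose(pos, k);  // patterns with a 0 here come first
            --k;
        }
    }
    return rank;
}

int main() {
    uint8_t msg = 0b10110100;  // 8 symbols, 4 ones and 4 zeros
    printf("rank %llu of %llu\n",
           (unsigned long long)rank_pattern(msg, 8, 4),
           (unsigned long long)choose(8, 4));  // prints: rank 53 of 70
}
```

Decoding runs the same scan in reverse, subtracting binomials greedily to recover the bit positions.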
The Non-Asymptotic/Multinomial Lower Bound
Let A,B,C,... = symbol counts
T = total count of symbols = A + B + C + ...
Bit size = log2( T! / (A! * B! * C! * ...) )
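Since the repository already links against GMP, this bound can be computed exactly with big integers. A minimal sketch (the file name and structure here are illustrative, not taken from the repo):

```cpp
// Compute the multinomial bound log2(T! / (A! * B! * C! * ...)) exactly.
// Build like the repo's examples: clang++ -std=c++17 -O2 bound.cpp -lgmp
#include <gmp.h>
#include <cstdio>

int main() {
    unsigned long counts[] = {1, 2, 3};        // A, B, C from the example
    unsigned long total = 0;
    for (unsigned long c : counts) total += c; // T = 6

    mpz_t perms, fact;
    mpz_init(perms);
    mpz_init(fact);

    mpz_fac_ui(perms, total);                  // T!
    for (unsigned long c : counts) {
        mpz_fac_ui(fact, c);                   // A!, B!, C!, ...
        mpz_divexact(perms, perms, fact);      // quotient is always exact
    }

    // Bits needed to index every permutation: ceil(log2(perms)).
    // mpz_sizeinbase(x, 2) is the exact bit length, so taking it on
    // perms - 1 yields ceil(log2(perms)) for perms > 1.
    mpz_sub_ui(fact, perms, 1);
    gmp_printf("permutations = %Zd, bits = %zu\n",
               perms, mpz_sizeinbase(fact, 2)); // permutations = 60, bits = 6

    mpz_clear(perms);
    mpz_clear(fact);
}
```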
If you have more questions, check the FAQ.
The Algorithm
Follow this link to see how the code works in a step-by-step example walkthrough.
Running the Code
Required Dependencies:
GMP library - Used for large-integer math. Unfortunately there isn't one simple install command for every platform; use your favorite search engine or LLM with your OS version specified.
Recommended (optional):
Clang - GCC should work too, but I haven't tested it.
C++17 - known to be working; other C++ standards should work but are untested. Some shortcuts like "auto" are used, which require at least C++11 but could be rewritten.
Linux Commands
git clone git@github.com:Peter-Ebert/Valli-Encoding.git
clang++ -std=c++17 -O2 poc-compress.cpp -lgmp -o poc-compress
clang++ -std=c++17 -O2 poc-decompress.cpp -lgmp -o poc-decompress
To compress the test data:
./poc-compress testfiles/input1
The console output is intentionally verbose to help with understanding the calculations performed for each symbol; it slows execution somewhat and can be commented out if desired.
Two files are created in the same directory as the compressed file:
input1.vli - the compressed output
input1.freq - contains the frequency count for each symbol/character, sorted ascending and uncompressed. Each record is 7 bytes of count followed by 1 byte for the symbol, repeating. The file size is the same for any input (8 bytes * 256 = 2048 bytes). The data could easily be compressed further (even zero counts are included), but this simple format shows that nothing is being hidden inside; a hedged reader sketch follows below.
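For illustration, here is a minimal reader for the .freq layout as described above. The little-endian byte order of the 7-byte count is an assumption on my part; consult the repo source if in doubt:

```cpp
// Hedged sketch: dump the .freq records (7 count bytes + 1 symbol byte,
// 256 records). The count's little-endian byte order is assumed, not
// confirmed against the repo's source.
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file.freq\n", argv[0]); return 1; }
    FILE* f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char rec[8];
    while (fread(rec, 1, sizeof rec, f) == sizeof rec) {
        uint64_t count = 0;
        for (int i = 6; i >= 0; --i)       // assemble the 7-byte count
            count = (count << 8) | rec[i]; // assumed little-endian
        if (count)                         // zero counts are stored too
            printf("symbol 0x%02x: count %llu\n", rec[7], (unsigned long long)count);
    }
    fclose(f);
    return 0;
}
```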
To decompress:
./poc-decompress testfiles/input1.vli
This will output "[filename].decom" so that the input and output can be compared. The decompressor assumes the associated frequency file (created previously by the compressor) is in the same folder with the same name, but with ".vli" replaced by ".freq". As with the compressor, the console output will show much of the math involved in decoding the compressed file.
To verify the input matches the output:
diff -s testfiles/input1 testfiles/input1.decom
Expected output:
Files testfiles/input1 and testfiles/input1.decom are identical
Support
Ideally knowledge would be free, but it takes time and effort to create and communicate, and one cannot live on ideas and dissertations alone. With the emergence of sites like Patreon and Twitch/YouTube that enable direct support from viewers, I hope exchanging this content for financial support isn't asking too much. If you enjoyed this as much as a drink, a movie, or more, and have the ability to help out, I'd greatly appreciate it.
Support this project
The receipt will have an email address if you want to send questions/requests; if you've donated, I'll do my best to answer anything not already covered in the README/FAQ.
I have more ideas I'd like to explore, but I quit my job to work on this and have spent the funds I put aside. I'm grateful and lucky to have had the chance to set aside time for this, and I never intended to make money off it, so any donations will encourage further research and improvements.
Final Thoughts
This approach to encoding is admittedly not very practical, given its polynomial/factorial nature and the fact that encoding requires changing every bit along the entire length of the compressed value (bad for CPU caches), but it is fun to think about how this is the absolute smallest we can possibly get for random (i.i.d.) symbol locations. The dataset I originally set out to compress was bit sets from hash values (e.g. probabilistic counts, Bloom filters), so it may have some applications there, since those datasets don't normally compress very well. Memory and storage are ample these days, so the few bytes saved may not seem significant, but it's worth noting that even a 1-bit reduction means we've halved the number of possible values.
Another noteworthy feature is that this algorithm can be parallelized per unique symbol; as far as I know, no other compression algorithm is parallelizable without sacrificing compression ratio (not counting large lookup tables, which can compress multiple symbols at once but are still sequential in nature). We could also consider not combining the individual symbols' binomial sums and instead storing each unique symbol separately, as combining only saves <1 bit per unique symbol and uses expensive large-number multiplication.
I have more ideas, but implementing them would make the code harder to read, so I've held off for now. If there's more interest or support, I might get to them. For now, I hope you enjoyed this as much as I did making it.