├── .gitignore
├── FAQ.md
├── README.md
├── frequency-table.hpp
├── poc-compress.cpp
├── poc-decompress.cpp
├── testfiles
    ├── panagram
    ├── sparse
    ├── tongue_twister
    ├── walkthrough
    └── wizard_of_oz
├── the-algorithm.md
└── utility-functions.hpp


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Prerequisites
 2 | *.d
 3 | 
 4 | # Compiled Object files
 5 | *.slo
 6 | *.lo
 7 | *.o
 8 | *.obj
 9 | 
10 | # Precompiled Headers
11 | *.gch
12 | *.pch
13 | 
14 | # Compiled Dynamic libraries
15 | *.so
16 | *.dylib
17 | *.dll
18 | 
19 | # Fortran module files
20 | *.mod
21 | *.smod
22 | 
23 | # Compiled Static libraries
24 | *.lai
25 | *.la
26 | *.a
27 | *.lib
28 | 
29 | # Executables
30 | *.exe
31 | *.out
32 | *.app
33 | 


--------------------------------------------------------------------------------
/FAQ.md:
--------------------------------------------------------------------------------
 1 | # FAQ
 2 | 
 3 | ## Why isn't the output smaller than gzip/zstd/etc...?
 4 | Most commonly used compression tools have 2 types of compression (there are more of course).  This compressor proof of concept is most like an [entropy encoder](https://en.wikipedia.org/wiki/Entropy_coding), which focuses on the frequency of characters, so its performance will be close to those.  The other common type is [dictionary/substitution coders](https://en.wikipedia.org/wiki/Dictionary_coder), which focus on the structure/relationship between symbols, like [LZ77/78](https://en.wikipedia.org/wiki/LZ77_and_LZ78).  If you used the output of an LZ style compressor as an input to this encoder, then you would see similar sizes to other common tools.  The goal of this code is to provide a proof of concept illustration, not a replacement for other common compression tools, as the code is not highly optimized.
 5 | 
 6 | ## Is there any input this implementation cannot compress?
 7 | The compression algorithm can work on any dataset, but this specific implementation aborts on 2 kinds of input: (1) If the input has only 0 or 1 unique symbols, e.g. '' or 'aaaaa', there is nothing to compress; the frequency table alone is essentially RLE, which is not new, so this case is not implemented. (2) The input cannot contain all 256 byte values, since this implementation picks one unused byte value to act as a 'null' value.  The code could be altered to handle both cases.
 8 | 
 9 | ## Why does the program exit on file sizes larger than 64KB?
10 | It's just a precaution to prevent long runtimes.  First of all, this is just a proof of concept that isn't highly optimized and secondly the encoding math requires changing bits along the entire length of an arbitrarily large integer, once this size exceeds the CPU cache it can become quite slow.  If you want to experiment with bigger inputs simply comment out the line that does this check.
11 | 
12 | 
13 | 
14 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Valli-Encoding
  2 | A compression algorithm that uses combinatorics (binomials/multinomials).
  3 | 
  4 | **Table of Contents**  
  5 | * [Introduction](#introduction)  
  6 | * [Comparison to Others](#comparison-to-other-entropy-encoder-implementations)
  7 | * [The Algorithm](#the-algorithm)  
  8 | * [Running the Code](#running-the-code)  
  9 | * [Support](#support)  
 10 | * [Some Parting Thoughts](#some-parting-thoughts)  
 11 | 
 12 | ## Introduction
 13 | This repository contains a basic *proof of concept* implementation of what I'd like to call Valli encoding, which leverages the exact count of the symbol frequencies to compress the input with combinatorics.  The output will be some number between 0 (inclusive) and the number of permutations of symbols in the frequency table (exclusive).  The input is processed one unique symbol at a time using the number of combinations (binomials) that could occur before placing each symbol, the sum of which generates a single symbol's encoding.  These symbol encodings can then be combined together based on the number of permutations of each preceding symbol.  For a more detailed walkthrough see [how the code works](the-algorithm.md).
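
The sum of binomials for one symbol can be sketched as follows. This is a minimal illustration based on the compressor's verbose output (`position choose k`), not the repository's code: the names `binomial` and `rank_symbol` are hypothetical, it uses 64-bit integers instead of GMP (so it only suits short inputs), and it ranks one symbol against the whole text, whereas the compressor also skips positions already claimed by earlier symbols.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: n choose k with plain 64-bit integers.
// Overflows for large n; the real code uses GMP big integers.
uint64_t binomial(uint64_t n, uint64_t k) {
    if (k > n) return 0;
    uint64_t result = 1;
    for (uint64_t i = 1; i <= k; ++i) {
        // result holds C(n-k+i-1, i-1); this step makes it C(n-k+i, i),
        // and the division is always exact.
        result = result * (n - k + i) / i;
    }
    return result;
}

// Hypothetical helper: encode the positions of one symbol as a single number.
// The k-th occurrence (1-based) found at 0-based position p adds C(p, k).
uint64_t rank_symbol(const std::string& text, char symbol) {
    uint64_t rank = 0;
    uint64_t k = 0;
    for (size_t pos = 0; pos < text.size(); ++pos) {
        if (text[pos] == symbol) {
            ++k;
            rank += binomial(pos, k);
        }
    }
    assert(k <= text.size());
    return rank;
}
```

For a 4 character text with two `a`s there are C(4,2) = 6 arrangements, so `rank_symbol` returns a value from 0 to 5: `"aabb"` gives 0 and `"bbaa"` gives 5.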
 14 | 
 15 | I'd be happy to be politely corrected if this already exists, as far as I can tell this is a novel approach, but I am just a problem solver for fun.  Feel free to raise an issue for that or other feedback.
 16 | 
 17 | The output size is always the number of bits needed to represent the total number of distinct permutations of the symbols given the symbol frequency table, also known as the [multinomial coefficient](https://en.wikipedia.org/wiki/Multinomial_theorem#Number_of_unique_permutations_of_words):  
 18 | Let A,B,C,... = symbol counts  
 19 | T = total count of symbols = A + B + C + ...  
 20 | Bit size = log2( T! / (A! * B! * C! * ...) )  
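
As a quick way to estimate that bit size without big integers, here is a minimal sketch. The function name `multinomial_bits` is an assumption for illustration and is not part of this repository; it uses `std::lgamma` (the natural log of the gamma function, where `lgamma(n + 1) = ln(n!)`), so the result is a floating point approximation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns log2( T! / (A! * B! * C! * ...) ),
// where `counts` holds A, B, C, ... and T is their sum.
double multinomial_bits(const std::vector<uint64_t>& counts) {
    uint64_t total = 0;
    double log_denominator = 0.0;
    for (uint64_t c : counts) {
        total += c;
        log_denominator += std::lgamma(static_cast<double>(c) + 1.0);  // ln(c!)
    }
    assert(!counts.empty());
    double log_numerator = std::lgamma(static_cast<double>(total) + 1.0);  // ln(T!)
    // Convert from natural log to log base 2.
    return (log_numerator - log_denominator) / std::log(2.0);
}
```

For example, counts of 2, 1 and 1 give 4! / (2! * 1! * 1!) = 12 permutations, so `multinomial_bits({2, 1, 1})` is log2(12), a little under 3.6 bits.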
 21 | 
 22 | ## Comparison to Other Entropy Encoder Implementations
 23 | In several cases this algorithm's output is smaller than other entropy encoder implementations I can find online.  **Please do send links to better entropy encoder (arithmetic/ANS/etc.) implementations that produce smaller output**.  Please do not link LZ/LZW, PPM, Neural Nets, etc. that leverage information about the relationship between symbols, those are expected to do better as this implementation does not include those techniques.  You can share links through the "[issues](https://github.com/Peter-Ebert/Valli-Encoding/issues)" option at the top of the page.  
 24 | 
 25 | #### Data: wizard_of_oz (chapter 5) - 9905 bytes
 26 | A moderately sized input that uses many different letters and symbols.
 27 | 
 28 | | Algorithm                                                                                                               | Freq Table | Encoding | Total Size (bytes) |
 29 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ |
 30 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py)            | 101**      | 5423     | 5524               |
 31 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A        | 5611     | 5611               |
 32 | | [rANS](https://github.com/rygorous/ryg_rans)                                                                            | 101**      | 5428     | 5529               |
 33 | | Valli                                                                                                                   | 101        | 5395     | **5496**           |
 34 | 
 35 | 
 36 | \*\*Their code does not compress the frequency table, so I've used my implementation's smaller bit packed table size instead.
 37 | 
 38 | #### Data: pangram - 43 bytes
 39 | Worst case scenario for frequency table size vs message size.
 40 | 
 41 | | Algorithm                                                                                                               | Freq Table | Encoding | Total Size (bytes) |
 42 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ |
 43 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py)            | 34**       | 25       | 59                 |
 44 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A        | 42       | **42**             |
 45 | | Valli                                                                                                                   | 34         | **19**   | 53                 |
 46 | 
 47 | #### Data: tongue_twister - 35 bytes
 48 | Many repeated syllables leads to a smaller frequency table.
 49 | 
 50 | | Algorithm                                                                                                               | Freq Table | Encoding | Total Size (bytes) |
 51 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ |
 52 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py)            | 16**       | 15       | 48                 |
 53 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A        | 32       | 32                 |
 54 | | Valli                                                                                                                   | 16         | 12       | **28**             |
 55 | 
 56 | #### Data: sparse - 110 bytes
 57 | Data is mostly a single symbol with a few others.
 58 | 
 59 | | Algorithm                                                                                                               | Freq Table | Encoding | Total Size (bytes) |
 60 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ |
 61 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py)            | 6**        | 4        | 10                 |
 62 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A        | 44       | 44                 |
 63 | | Valli                                                                                                                   | 6          | 3        | **9**              |
 64 | 
 65 | ## The Algorithm
 66 | Follow this link to see [how the code works](the-algorithm.md) in a step by step example walkthrough.
 67 | 
 68 | If you have more questions, check the [FAQ](FAQ.md).
 69 | 
 70 | ## The Frequency Table
 71 | I've combined the frequency table and encoding together into a single file, you will see the size of each at the end of the compressor's console output.  The frequency table uses only very basic bit packing for the counts, the characters are stored as raw bytes.  Some additional compression could make it even smaller, but seems excessive.
 72 | 
 73 | ## Running the Code
 74 | #### Required Dependencies:
 75 | * [GMP library](https://gmplib.org/) - Used for large integer math. Unfortunately there isn't one simple command for this, use your favorite search engine or LLM with your OS version specified.
 76 | * A 64 bit CPU that supports LZCNT.  Most modern Intel/AMD 64 bit CPUs do, but Apple's M series (ARM) does not appear to.  This is for fast bit packing of the frequency table.  If anyone wants to contribute an LZCNT equivalent for ARM please do reach out.
 77 | #### Recommended (optional):
 78 | * Clang - GCC should work too but has not been tested.
 79 | * C++17 - known to be working, other C++ standards should work but are not tested.  Some shortcuts like "auto" are used which require at least C++11 but could be rewritten.
 80 | 
 81 | #### Linux Commands
 82 | ```
 83 | git clone git@github.com:Peter-Ebert/Valli-Encoding.git
 84 | clang++ -std=c++17 -O2 poc-compress.cpp -lgmp -o poc-compress
 85 | clang++ -std=c++17 -O2 poc-decompress.cpp -lgmp -o poc-decompress
 86 | ```
 87 | 
 88 | To compress the test data:
 89 | ```
 90 | ./poc-compress testfiles/input1
 91 | ```
 92 | 
 93 | The output to console is intentionally verbose to help with understanding the calculations performed for each symbol, it slows the output some and can be commented out if desired.
 94 | 
 95 | Two files are created in the same directory as the compressed file:
 96 | * input1.vli - the compressed output
 97 | * input1.freq - contains frequency counts for each symbol/character, it is sorted ascending and uncompressed.  The first 7 bytes are the count and the next 1 byte indicates the symbol, repeating.  File size is the same for any input (64bits * 256=2048 bytes).  The data could be compressed easily (even zero counts are included), but this simple format shows that nothing is being hidden inside.
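
The 7 byte count plus 1 byte symbol layout can be sketched like this. It is a minimal illustration modeled on the `FreqChar` struct in `frequency-table.hpp`; the function names below are hypothetical and do not exist in the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: count in the upper 7 bytes, symbol in the lowest byte.
uint64_t pack_entry(uint64_t count, uint8_t symbol) {
    assert(count < (1ull << 56));  // the count must fit in 7 bytes
    return (count << 8) | symbol;
}

uint64_t entry_count(uint64_t entry) { return entry >> 8; }
uint8_t entry_symbol(uint64_t entry) { return static_cast<uint8_t>(entry & 0xFF); }

// Hypothetical helper: count every byte value, then sort ascending.
// Because the count sits in the high bits, sorting the raw 64-bit values
// orders entries by count first and by symbol second.
std::vector<uint64_t> build_sorted_table(const std::vector<uint8_t>& input) {
    std::vector<uint64_t> table(256);
    for (int s = 0; s < 256; ++s) {
        table[s] = pack_entry(0, static_cast<uint8_t>(s));
    }
    for (uint8_t b : input) {
        table[b] += 1ull << 8;  // add 1 to the count without touching the symbol
    }
    std::sort(table.begin(), table.end());
    return table;
}
```

The table always holds 256 entries of 8 bytes each, which is where the fixed 2048 byte size comes from; the most frequent symbol ends up in the last entry.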
 98 | 
 99 | To decompress:
100 | ```
101 | ./poc-decompress testfiles/input1.vli
102 | ```
103 | 
104 | This will output "\[filename\].decom", so that the input and output can be compared. The decompressor will assume the associated frequency file is in the same folder with the same name but replaces ".vli" with ".freq" (created previously by the compressor).  As with the compressor, the console output will show much of the math involved to decode the compressed file.
105 | 
106 | To verify the input matches the output:
107 | ```
108 | diff -s testfiles/input1 testfiles/input1.decom
109 | ```
110 | Expected output:
111 | ```
112 | Files testfiles/input1 and testfiles/input1.decom are identical
113 | ```
114 | 
115 | ## Support
116 | Ideally knowledge would be free, but it takes time and effort to create and communicate, all the while one cannot live on ideas and dissertations alone.  With the emergence of sites that enable direct support from viewers like Patreon and Twitch/Youtube, I hope the idea of exchanging content for financial support isn't asking too much.  If you enjoyed this as much as a drink, a movie, or more and have the ability to help out I'd greatly appreciate it and it would help encourage future posts.  
117 | 
118 | [Support this project](https://www.paypal.com/donate/?business=S7Q76A99VU44W&no_recurring=0&currency_code=USD)  
119 | The receipt will have an email address if you want to send questions/requests, if you've donated I'll do my best to answer them if not in the README/[FAQ](FAQ.md).  
120 | 
121 | I have more ideas I'd like to explore, but I quit my job to work on this and have spent the funds I put aside.  I'm grateful and lucky to have had the chance to set aside time to work on this and never intended to make money off it, so any donations would encourage further research or improvements.
122 | 
123 | ## Some Parting Thoughts
124 | This approach for encoding is admittedly not very practical given its factorial / polynomial nature and the fact that encoding requires changing every bit along the entire length of the compressed value (bad once outside of CPU caches).  Memory and storage are ample these days so the few bytes it saves may not be worth it.
125 | 
126 | However, I would still assert that mathematically it is interesting.  The encoding is optimal in that the encoded data will not exceed the size of the number of permutations.  I am not an expert but afaik this puts it in the same category as only two other encoders: Arithmetic coding and Asymmetric numeral systems (ANS).  Though we only save a few bytes/bits compared to a fast encoder with a static frequency table, it's worth noting mathematically that a one bit reduction means we've reduced the number of encodable values in half.  The dataset that originally interested me in compression was bit sets generated from hash values (e.g. probabilistic counts, bloom filters), so it may have some applications there since those datasets don't normally compress very well.
127 | 
128 | #### Addition is All you Need
129 | Though you could boil down many different operations like multiplication into addition with loops, the binomials that make up the encoding are deeply related to repeated addition as they make up pascal's triangle.  Likewise, those binomials are then added together to produce an encoding for a single symbol.  This seems somewhat unique, as in ANS's math the state itself is divided and multiplied to create the next state.  Therefore, it is possible to implement this entire encoding using addition almost exclusively.
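
To illustrate the link to Pascal's triangle, here is a minimal sketch that computes a binomial using nothing but addition. The function name `binomial_by_addition` is hypothetical and this is not how the repository computes binomials; it uses 64-bit integers, so it only suits small values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: n choose k built row by row from Pascal's rule
// C(i, j) = C(i-1, j-1) + C(i-1, j), using only additions.
uint64_t binomial_by_addition(uint64_t n, uint64_t k) {
    if (k > n) return 0;
    std::vector<uint64_t> row(k + 1, 0);
    row[0] = 1;  // C(i, 0) is always 1
    for (uint64_t i = 1; i <= n; ++i) {
        // Walk right to left so row[j - 1] still holds the previous row's value.
        for (uint64_t j = std::min(i, k); j >= 1; --j) {
            row[j] += row[j - 1];
        }
    }
    assert(row[0] == 1);
    return row[k];
}
```

Only the first `k + 1` entries of each row are kept, so memory stays proportional to `k` rather than `n`.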
130 | 
131 | #### Parallelizable
132 | Another noteworthy feature is this algorithm can be parallelized per unique symbol, as far as I know no other compression algorithm is parallelizable without sacrificing the compression ratio (not counting large lookup tables which can compress multiple symbols at once, which are still sequential in nature).  We could also consider not combining individual symbol encodings (binomial sums) together, instead storing each unique symbol separately, as combining only saves <1 bit per unique symbol and combining uses expensive large number multiplication.
133 | 
134 | #### Next
135 | I have more ideas, especially around optimization, but have run out of time and resources and figured it was best to share the idea first to see if it was new.  If there's more interest or support I might get to those ideas.  For now I hope you enjoyed this as much as I did making it.
136 | 
137 | 


--------------------------------------------------------------------------------
/frequency-table.hpp:
--------------------------------------------------------------------------------
  1 | // Frequency Table Implementation w/ basic compression:
  2 | //     diff encoding, variable length integers, bit packing
  3 | 
  4 | // Stores the symbols and frequencies used for compression/decompression
  5 | // Serialization performs some basic bit packing, leveraging the sorted counts
  6 | // followed by corresponding byte symbols, no compression
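  7 | #include <algorithm>    // sort
  8 | #include <cstdint>      // uint64_t, uint8_t, int8_t
  9 | #include <fstream>      // ifstream,ofstream
 10 | #include <immintrin.h>  // lzcnt
 11 | #include <vector>       // vector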
 12 | // Simple structure to contain the dictionary information.
 13 | // Array of 64 bit numbers,
 14 | // The first 7 most significant bytes store the count
 15 | // and the last byte stores the character being counted
 16 | // This allows for sorting directly on the full 64 bits
 17 | // Size is always 256*8 = 2048 bytes 
 18 | struct FreqChar {
 19 |     uint64_t data[256] = {0};
 20 |     void setChar(unsigned char i, unsigned char symbol) {
 21 |         data[i] = (data[i] & 0xFFFFFFFFFFFFFF00) | symbol;
 22 |     }
 23 |     unsigned char getChar(unsigned char i) {
 24 |         return data[i];
 25 |     }
 26 |     void setCount(unsigned char i, size_t count) {
 27 |         data[i] += count << 8;
 28 |     }
 29 |     uint64_t getCount(unsigned char i) {
 30 |         return data[i] >> 8;
 31 |     }
 32 |     void incrCount(unsigned char i) {
 33 |         data[i] += 1ull << 8;
 34 |         //warning: no overflow detection
 35 |     }
 36 |     //sort the dictionary by frequency, ascending
 37 |     void sortData() {
 38 |         std::sort(data, data+(sizeof(data)/sizeof(data[0])));
 39 |     }
 40 | 
 41 |     // A very basic freq table serialization
 42 |     // bit packing of sorted counts, when count==0, we also know the number of byte symbols to read
 43 |     // returns count of bytes written to file
 44 |     uint64_t serialize(std::ofstream& out_file) {
 45 |         uint64_t output_byte_count = 0;        
 46 |         // write the bit length of the largest count, max 6 bits
 47 |         uint64_t count = (data[255] >> 8);
 48 |         int8_t bit_length = (64 - _lzcnt_u64(count));
 49 |         // !!! use last bit length, not current
 50 |         int8_t last_bit_length = (64 - _lzcnt_u64(count));
 51 |         uint8_t byte_buffer = last_bit_length;
 52 |         uint8_t bit_offset = 6;
 53 |         int non_zero = 0;
 54 |         //bit pack the counts
 55 |         for(int i=255; i>=0; i--) {
 56 |             count = this->getCount(i);
 57 |             bit_length = (64 - _lzcnt_u64(count));
 58 |             int8_t bits_output = 0;
 59 |             while(last_bit_length>bits_output) {
 60 |                 // if byte would fill, write and go next byte
 61 |                 if((last_bit_length-bits_output) >= (8 - bit_offset)) {
 62 |                     byte_buffer |= count << bit_offset;
 63 |                     count = count >> (8 - bit_offset);
 64 |                     out_file << byte_buffer;
 65 |                     output_byte_count++;
 66 |                     bits_output += (8 - bit_offset);
 67 |                     byte_buffer = 0;
 68 |                     bit_offset = 0;
 69 |                 } else {
 70 |                     // no overflow, insert and move offset
 71 |                     byte_buffer |= count << bit_offset;
 72 |                     bit_offset += last_bit_length-bits_output; //can't be >= 8 based on if
 73 |                     bits_output += last_bit_length-bits_output; //can't be >= 8 based on if
 74 |                 }
 75 |             }
 76 |             if(this->getCount(i)==0) {
 77 |                 //exit loop
 78 |                 //todo: test with freqs that end here
 79 |                 break;
 80 |             }
 81 |             non_zero++;
 82 |             last_bit_length = bit_length;
 83 |             // use the length of the current count to set the next one
 84 |         }
 85 |         // if unwritten bits in buffer, flush
 86 |         // this will waste at most 7 bits, could be used by encoding, skipping optimization for now
 87 |         if(bit_offset != 0) {
 88 |             out_file << byte_buffer;
 89 |             output_byte_count++;
 90 |         }
 91 | 
 92 |         //output non zero count symbols (bytes) in the same order as the sort (desc)
 93 |         //todo: assert non_zero >= 1
 94 |         for(int i=255; i>=(255-non_zero); i--) {
 95 |             if(this->getCount(i)) {
 96 |                 out_file << this->getChar(i);
 97 |                 output_byte_count++;
 98 |             } else {
 99 |                 break;
100 |             }
101 |         }
102 |         return output_byte_count;
103 |     }
104 | 
105 |     // basic deserializer
106 |     // returns bytes read
107 |     uint64_t deserialize(std::ifstream& input_file) { 
108 |         //for legacy reasons, keep all zero counts in place
109 |         //todo: resize array for non-zero counts
110 |         //todo: assert file len > 1
111 |         uint64_t read_byte_count = 1;
112 |         // read first 6 bits for count bit length
113 |         uint8_t byte_buffer;
114 |         input_file.read((char*)&byte_buffer, 1);
115 |         uint8_t bit_length = byte_buffer & 0b00111111;
116 | 
117 |         uint8_t bit_offset = 6;
118 |         int symbol_count = 0;
119 |         // count loop
120 |         do {
121 |             int64_t count = 0;
122 |             int8_t bits_read = 0;
123 |             // read bytes loop
124 |             while(bit_length>bits_read) {
125 |                 if((bit_length-bits_read) >= (8 - bit_offset)) {
126 |                     // overflow, load then read next byte
127 |                     //load and read next byte, loop until done
128 |                     count |= ((uint64_t)(byte_buffer >> bit_offset)) << bits_read;
129 |                     bits_read += (8 - bit_offset);
130 |                     input_file.read((char*)&byte_buffer, 1);
131 |                     read_byte_count++;
132 |                     bit_offset = 0;
133 |                     //update bits read
134 |                 } else {
135 |                     // fits in current byte
136 |                     // load count and update offset
137 |                     count |= ((uint64_t)(byte_buffer >> bit_offset)) << bits_read;
138 |                     //zero out bits beyond the length
139 |                     count &= ((1ull << (bit_length))-1);
140 |                     bit_offset += bit_length-bits_read;
141 |                     bits_read = bit_length;
142 |                 }
143 |             }
144 |             //set the count, ascending order
145 |             this->setCount(255-symbol_count, count);
146 |             // exit loop if the count is 0 (last in sequence)
147 |             if(count==0) {
148 |                 break;
149 |             }
150 |             symbol_count++;
151 |             // update bit_length with current length
152 |             bit_length = (64 - _lzcnt_u64(count));
153 | 
154 |         } while(symbol_count < 255); //for safety, can be removed, unreachable
155 | 
156 |         // if bit_offset == 0, it contains a symbol
157 |         // else unset bits, read next byte
158 |         if(bit_offset != 0) {
159 |             input_file.read((char*)&byte_buffer, 1);
160 |             read_byte_count++;
161 |         }
162 |         // assert: symbol_count > 1
163 |         this->setChar(255, byte_buffer);
164 |         std::vector<uint8_t> found_symbols(symbol_count);
165 |         found_symbols.at(0) = byte_buffer;
166 |         // set characters
167 |         for(int i=254; i>(255-symbol_count); i--) {
168 |             input_file.read((char*)&byte_buffer, 1);
169 |             read_byte_count++;
170 |             this->setChar(i, byte_buffer);
171 |             found_symbols.at(255-i) = byte_buffer;
172 |         }
173 | 
174 |         std::sort(found_symbols.begin(), found_symbols.end());
175 |         // print vector
176 |         int found_idx = 0;
177 |         for(int i=0; i<(256); i++) {
178 |             if(found_idx < symbol_count && i==found_symbols.at(found_idx)) {
179 |                 //skip
180 |                 found_idx++;
181 |             } else {
182 |                 // update char
183 |                 this->setChar(i-found_idx, i);
184 |             }
185 | 
186 |         }
187 |         // legacy code: back fill other symbols
188 |         // vector symbol_count
189 | 
190 | 
191 |         return read_byte_count;
192 |         //unique_symbols += 1; // 0 is never used
193 |     }    
194 | 
195 | };
196 | 
197 | 
198 | 


--------------------------------------------------------------------------------
/poc-compress.cpp:
--------------------------------------------------------------------------------
  1 | // Valli Entropy Encoder
  2 | // Quick proof of concept
  3 | // To build:
  4 | // -requires: GMP lib https://gmplib.org/
  5 | // clang++ -std=c++17 -O2 poc-compress.cpp -lgmp -o poc-compress
  6 | 
  7 | #include <iostream>  // cout
  8 | #include <fstream>   // ifstream,ofstream
  9 | #include <bitset> // bitset
 10 | #include <immintrin.h> // intrinsics
 11 | #include <chrono> // timer
 12 | #include <math.h>       /* log2 */
 13 | #include <gmp.h> // bigint mpz_t
 14 | #include <algorithm> // sort
 15 | 
 16 | #include "utility-functions.hpp"
 17 | 
 18 | 
 19 | using namespace std;
 20 | 
 21 | int main(int argc, char* argv[]) {
 22 | 
 23 |     // parse file name
 24 |     if (argc != 2) {
 25 |         cout << "Specify a single file, example: " << argv[0] << " <file>" << std::endl;
 26 |         return 1;
 27 |     }
 28 | 
 29 |     // variable to set file output
 30 |     bool write_file = true;
 31 | 
 32 |     string source_path_file = argv[1];
 33 |     string filename = source_path_file.substr(source_path_file.find_last_of("/\\") + 1);
 34 |     string filename_entropy = source_path_file + ".vli";
 35 |     string filename_freq_table = source_path_file + ".freq";
 36 | 
 37 |     // read file into memory
 38 |     std::vector<char> buffer;
 39 |     if(!FileToCharVector(source_path_file, buffer)) {
 40 |         cout << "File read error, most likely it does not exist." << endl;
 41 |         return 1;
 42 |     }
 43 |     cout << "File size: " << buffer.size() << " bytes" << endl;
 44 |     // Warn & exit in case someone accidentally submits a large file
 45 |     // File sizes much larger than your CPU cache can be quite slow
 46 |     // This implementation is designed as a POC and not highly optimized
 47 |     if(buffer.size() > 64000) {
 48 |         cout << "The filesize is greater than 64kb, this compressor implementation is not highly optimized, so compression times may be long once outside of L1 cache, you can comment out this if statement if you wish to proceed." << endl;
 49 |         return 0;
 50 |     }
 51 | 
 52 |     // count the frequencies of each symbol (char/byte)
 53 |     FreqChar freqs;
 54 |     CalcFrequencyPairs(buffer, freqs);
 55 | 
 56 |     //sort the frequency table, ascending by count then symbol
 57 |     sort(freqs.data, freqs.data+sizeof(freqs.data)/sizeof(freqs.data[0]));
 58 | 
 59 |     printf("Sorted Frequencies:\n");
 60 |     cout << "idx : chr : int : count" << endl;
 61 |     // print non-zero frequencies and count unique symbols
 62 |     uint64_t unique_symbols = 0;
 63 |     for (int i = 0; i < 256; i++) {
 64 |         // cout << freqs.getCount(i) << endl;
 65 |         if(freqs.getCount(i)) {
 66 |             unique_symbols++;
 67 |             cout << i << " : '" <<  freqs.getChar(i) << "' : " << (uint)freqs.getChar(i) << " : " << freqs.getCount(i) << endl;
 68 |         }
 69 |     }
 70 |     uint64_t total_symbols = buffer.size();
 71 | 
 72 |     cout << "==============================" << endl;
 73 |     cout << "Total symbols: " << total_symbols << endl;
 74 |     cout << "Unique symbols: " << unique_symbols << endl;
 75 |     cout << "------------------------------" << endl;
 76 | 
 77 |     if(unique_symbols < 2) {
 78 |         // todo: handle this case
 79 |         cout << "Less than 2 unique symbols, nothing to encode, aborting." << endl;
 80 |         return 1;
 81 |     }
 82 | 
 83 |     uint64_t remaining_loc = total_symbols;
 84 |     // select the least common character
 85 |     char null_symbol = (char)freqs.getChar(0);
 86 |     // This simple demonstration implementation selects one character as a 'null' symbol to take the place of
 87 |     // symbols which have already been encoded.
 88 |     // As a result, all 256 byte values cannot be used in the input, 
 89 |     // since this is only likely with random or already compressed data, shouldn't be an issue for a POC
 90 |     if(freqs.getCount(0) != 0) {
 91 |         cout << "Unhandled: This implementation requires at least one symbol in the input to be unused ('null' symbol).  This input has all byte values used (0-255)." << endl;
 92 |         return -1;
 93 |     }
 94 |     // todo: to process inputs where all 256 byte values are used:
 95 |     // if unique_symbols==256: pick lowest frequency and encode solo, then set that as the null symbol
 96 |     // alternatively (advanced), can rotate the 256 bytes w/ addition so that the max value occurs at the max byte value,
 97 |     // then use less than to test for placement or not, can be done in parallel threads this way as the array can be static
 98 | 
 99 |     // Making an assumption about the gmp library (not verified)
100 |     // Heavy reuse of mpz_t variables that are similar in size, 
101 |     // with the assumption that there will be fewer allocations needed (and fewer inits)
102 |     mpz_t num_product_seq, denom_fact, combo_result, symbol_accumulator;
103 |     mpz_inits(num_product_seq, denom_fact, combo_result, symbol_accumulator, NULL);
104 | 
105 |     uint64_t symbol_count;
106 |     uint64_t symbol_idx;
107 |     
108 |     mpz_t multiply_combiner, data_accumulator;
109 |     mpz_inits(multiply_combiner, data_accumulator, NULL);
110 |     mpz_set_ui(multiply_combiner, 1);
111 |     mpz_set_ui(data_accumulator, 0);
112 |     
113 |     // Loop through each possible symbol
114 |     // 256-1 because the last symbol (asc sort) does not need to be encoded/decoded
115 |     for (int i = 0; i < 256-1; i++) { 
116 |         // if character exists in message
117 |         if(freqs.getCount(i)) {
118 |             // reset for new symbol
119 |             mpz_set_ui(symbol_accumulator, 0);
120 |             mpz_set_ui(denom_fact, 1);
121 |             // reset symbol count
122 |             symbol_count = 1;
123 | 
124 |             //calculate location for first item
125 |             cout << "--- " << freqs.getChar(i) << ":" << freqs.getCount(i) << " (" << (uint)freqs.getChar(i) << ")" << " ---" << endl;
126 |             
127 |             //find symbol location
128 |             size_t removed_loc = 0;
129 | 
130 |             // encode current symbol by looping through buffer to find each location
131 |             // can exit loop when last instance is found k = symbol_count
132 |             for(size_t byte_loc = 0; byte_loc < buffer.size(); byte_loc++) {
133 |                 if(buffer[byte_loc]==(char)freqs.getChar(i)) { 
134 |                     //found instance of symbol
135 |                     // verbose: combination calculation for location choose symbol_count
136 |                     cout << " + " << byte_loc-removed_loc << " choose " << symbol_count << endl;
137 |                     
138 |                     encode_symbol_location_reuse(byte_loc-removed_loc, symbol_count, symbol_accumulator, num_product_seq, denom_fact, combo_result);
139 |                     buffer[byte_loc] = null_symbol;
140 |                     if(symbol_count==freqs.getCount(i)) {
141 |                         // all symbols have been found
142 |                         // exit loop
143 |                         break;
144 |                     }
145 |                     // increment k and multiply into denom_fact for next loop
146 |                     symbol_count += 1;
147 |                     mpz_mul_ui(denom_fact, denom_fact, symbol_count);
148 |                     
149 |                 } else if(buffer[byte_loc]==null_symbol) {
150 |                     // count the number of symbols that have been removed between the last byte location and the next one
151 |                     removed_loc++; 
152 |                 }
153 |             }
154 | 
155 |             // verbose output: sum of symbols and combiner multiple
156 |             gmp_printf("Sum of Binomials: %Zd \n", symbol_accumulator);
157 |             gmp_printf("Multiply combiner: %Zd \n", multiply_combiner);
158 |             mpz_mul(combo_result, multiply_combiner, symbol_accumulator);
159 |             mpz_add(data_accumulator, data_accumulator, combo_result);
160 | 
161 |             // calculation not needed for last symbol since it's not encoded
162 |             // however it is needed for the 'max bit length' calculation at the end
163 |             // otherwise can wrap with if(i != 254) {}
164 |             next_multiply_combiner(multiply_combiner, remaining_loc, symbol_count, combo_result, num_product_seq, denom_fact);            
165 | 
166 |             //track how many possible locations remain without the current symbol
167 |             remaining_loc -= freqs.getCount(i);
168 |         }
169 |     }
170 | 
171 |     // Verbose: output the final integer and statistics around the output
172 |     cout << "----------Final Data----------" << endl;
173 |     gmp_printf("%Zd \n", data_accumulator);
174 |     cout << "------------------------------" << endl;
175 | 
176 |     size_t bit_length = mpz_sizeinbase(data_accumulator, 2);
177 |     cout << "Current byte length: " << ceil(bit_length/8.0) << endl;
178 |     cout << "Current bit length: " << bit_length << endl;
179 |     // Use combiner to calc max bit len (total # of permutations of symbol frequencies)
180 |     size_t max_bit_length = mpz_sizeinbase(multiply_combiner, 2);
181 |     cout << "Max bit length: " << max_bit_length << endl;
182 |     
183 |     // Calculate the Shannon minimum bit length, static frequency table
184 |     // = shannon entropy per symbol * message length
185 |     double shannon_entropy = 0.0;
186 |     for (int i = 0; i < 256; i++) {
187 |         // cout << freqs.getCount(i) << endl;
188 |         if(freqs.getCount(i)) {
189 |             double probability = static_cast<double>(freqs.getCount(i)) / total_symbols;
190 |             shannon_entropy -= probability * log2(probability);
191 |         }
192 |     }
193 |     shannon_entropy = shannon_entropy * total_symbols;
194 | 
195 |     cout << "Static frequency Shannon limit: " << ceil(shannon_entropy) << endl;
196 |     cout << "Bits saved: " << ceil(shannon_entropy)-max_bit_length << endl;
197 |     cout << "Relative Size: " << 100 * (max_bit_length / ceil(shannon_entropy)) << "%" << endl;
198 | 
199 |     // Write compressed data and frequencies table
200 |     if(write_file) {
201 |         cout << "Writing compressed data to: " << filename_entropy << endl;
202 |         ofstream out_file(filename_entropy, ios::binary);
203 |         //write frequency table
204 |         cout << "Frequency table size (bytes): " << freqs.serialize(out_file) << endl;
205 |         // write encoded data
206 |         size_t out_size = mpz_sizeinbase(data_accumulator, 256);
207 |         cout << "Encoded data (bytes): " << out_size << endl;
208 |         unsigned char *output_array = new unsigned char[out_size];
209 |         //         output_array, word_count, order, size, endian, nails, data
210 |         mpz_export(output_array, NULL,     1,    1,     -1,     0, data_accumulator);
211 |         if(mpz_cmp_ui(data_accumulator, 0) == 0) {
212 |             // if output == 0, gmp will not write to the array
213 |             output_array[0] = 0;
214 |         }
215 |         out_file.write((char *)output_array, out_size);
216 |         delete[] output_array;
217 |         out_file.close();
218 |     } else {
219 |         cout << "Skipping data write." << endl;
220 |     }
221 | 
222 |     return 0;
223 | }
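
The inner loop of poc-compress.cpp above adds one binomial per occurrence of a symbol ("position choose occurrence number"), where the position is counted among the slots not yet taken by earlier symbols. The following is a minimal sketch of that step, assuming inputs small enough for 64-bit integers in place of GMP's `mpz_t`. The names `choose_small` and `encode_symbol` are hypothetical and do not appear in the repository.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// n choose k in plain 64-bit arithmetic; only safe for small inputs
// (assumption: the real code uses GMP to avoid overflow).
std::uint64_t choose_small(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        // Before this step result == C(n-k+i-1, i-1), so the division is exact.
        result = result * (n - k + i) / i;
    }
    return result;
}

// Sum of binomials for one symbol. The j-th occurrence (1-based) found at
// relative position p contributes C(p, j). Occurrences are overwritten with
// null_symbol so later symbols can skip those slots.
std::uint64_t encode_symbol(std::string& buffer, char symbol, char null_symbol) {
    std::uint64_t sum = 0;
    std::uint64_t k = 1;       // which occurrence is being encoded
    std::size_t removed = 0;   // slots consumed by previously encoded symbols
    for (std::size_t pos = 0; pos < buffer.size(); ++pos) {
        if (buffer[pos] == symbol) {
            sum += choose_small(pos - removed, k);
            buffer[pos] = null_symbol;
            ++k;
        } else if (buffer[pos] == null_symbol) {
            ++removed;
        }
    }
    return sum;
}
```

For the buffer "baa" and symbol 'a', the occurrences sit at positions 1 and 2, so the sum is C(1,1) + C(2,2) = 2, the largest of the C(3,2) = 3 possible values (0, 1, 2).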


--------------------------------------------------------------------------------
/poc-decompress.cpp:
--------------------------------------------------------------------------------
  1 | // Valli Decompression - Proof of concept
  2 | // clang++ -std=c++17 -O2 poc-decompress.cpp -lgmp -o poc-decompress
  3 | 
  4 | // This implementation is more complicated than the naive approach in the documentation.
  5 | // It uses an approximate calculation to estimate the binomial, then adjusts it from there.
  6 | // Also some shortcuts for trivial values and other minor optimizations.
  7 | #include <iostream>     // cout
  8 | #include <fstream>      // ifstream,ofstream
  9 | #include <bitset>       // bitset
 10 | #include <immintrin.h>  // intrinsics
 11 | #include <chrono>       // timer
 12 | #include <math.h>       // log2
 13 | #include <gmp.h>        // bigint mpz_t
 14 | #include <algorithm>    // sort
 15 | 
 16 | #include "utility-functions.hpp"
 17 | 
 18 | 
 19 | using namespace std;
 20 | 
 21 | int main(int argc, char* argv[]) {
 22 | 
 23 |     // verify args
 24 |     if (argc != 2) {
 25 |         cout << "Specify a compressed file ending in .vli, example: " << argv[0] << " <file>" << endl;
 26 |         return 1;
 27 |     }
 28 | 
 29 |     // variable to set file output
 30 |     bool write_file = true;
 31 | 
 32 |     // validate and set filenames
 33 |     string compressed_path_file = argv[1];
 34 |     string file_ending = ".vli";
 35 |     // Ensure file ending is .vli, no other validation performed.
 36 |     // Will error out (vector out of bounds) if the encoded data value is too large for the frequency counts.
 37 |     // Since this *should* only be caused by user error/manipulation, leaving unhandled for now.
 38 |     // *could also be caused by bugs, please report any issues
 39 |     // Else, every value smaller than the max permutations will decompress to some permutation of symbols.
 40 |     if (!(compressed_path_file.length() >= file_ending.length() && compressed_path_file.compare(compressed_path_file.length() - file_ending.length(), file_ending.length(), file_ending) == 0)) {
 41 |         cout << "Invalid filename, must end with '.vli'" << endl;
 42 |         return 1;
 43 |     } 
 44 | 
 45 |     // assume frequencies file name & location based on source file name
 46 |     string basename = compressed_path_file.substr(0, compressed_path_file.length() - file_ending.length());
 47 |     string filename_freq_table = basename + ".freq";
 48 |     // create a new output file so that they can be compared to the source
 49 |     string filename_out = basename + ".decom";
 50 | 
 51 |     cout << "Compressed file: " << compressed_path_file << endl;
 52 |     cout << "Frequencies file: " << filename_freq_table << endl;
 53 |     cout << "Output file: " << filename_out << endl;
 54 | 
 55 |     // read file
 56 |     std::ifstream input_file(compressed_path_file, std::ios::binary);
 57 |     if (!input_file) {
 58 |         // Handle file open error
 59 |         std::cerr << "Error opening file, most likely file does not exist." << std::endl;
 60 |         return 1;
 61 |     }
 62 |     // deserialize frequency table at start of file
 63 |     FreqChar freqs;
 64 |     size_t freq_byte_count = freqs.deserialize(input_file);
 65 |     if (!input_file) {
 66 |         std::cerr << "Error reading frequency table." << std::endl;
 67 |         return 1;
 68 |     }
 69 |     cout << "Frequency table size (bytes):" << freq_byte_count << endl;
 70 |     
 71 |     cout << "Frequencies:" << endl;
 72 |     cout << "int : char : count" << endl;
 73 |     uint64_t total_symbols = 0;
 74 |     uint64_t unique_symbols = 0;
 75 |     for (int i = 0; i < 256; i++) {
 76 |         if(freqs.getCount(i)) {
 77 |             cout << (uint)freqs.getChar(i) << " : " << freqs.getChar(i) << " : " << freqs.getCount(i) << endl;
 78 |             total_symbols += freqs.getCount(i);
 79 |             unique_symbols++;
 80 |         }
 81 |     }
 82 |     
 83 |     // load encoding into input_buffer
 84 |     std::vector<char> input_buffer((std::istreambuf_iterator<char>(input_file)), std::istreambuf_iterator<char>());
 85 |     // Check successful file read
 86 |     if (!input_file && !input_file.eof()) {
 87 |         std::cerr << "Error reading encoding." << std::endl;
 88 |         return 1;
 89 |     }
 90 |     input_file.close();
 91 |     cout << "Encoding size (bytes): " << input_buffer.size() << endl;
 92 | 
 93 |     mpz_t     compressed_data, symbol_combo, extracted_combo, root_result, binomial, numerator, denominator, factorial, uncombiner, est_binomial;
 94 |     mpz_inits(compressed_data, symbol_combo, extracted_combo, root_result, binomial, numerator, denominator, factorial, uncombiner, est_binomial, NULL);
 95 | 
 96 |     // export and import must match
 97 |     // mpz_export(output_array, NULL,                1, 1, -1, 0, data_accumulator);
 98 |     mpz_import(compressed_data, input_buffer.size(), 1, 1, -1, 0, input_buffer.data());
 99 | 
100 |     // verbose info
101 |     gmp_printf("Imported Integer: %Zd\n", compressed_data);
102 |     if(unique_symbols < 2) {
103 |         // todo: handle this case, not implemented
104 |         cout << "Not enough unique symbols, 2 required in current implementation, aborting." << endl;
105 |         return 1;
106 |     }
107 |     
108 |     // This implementation fills the output message buffer with the
109 |     // last symbol and uses it as an 'empty' location indicator.
110 |     // After all other symbols are placed correctly the
111 |     // last symbol is already in the correct locations.
112 |     char last_symbol = (char)freqs.getChar(255);
113 |     // allocate buffer for decoded output, filled with the most common character
114 |     std::vector<char> output_buffer(total_symbols, last_symbol);
115 |     uint64_t remaining_locations = total_symbols;
116 | 
117 |     mpz_set_ui(uncombiner, 1);
118 |     size_t symbol_start=256; //starting out of bounds
119 |     // symbols 0..254 are processed; the last symbol (255) isn't encoded
120 | 
121 |     // symbol index
122 |     size_t symbol_idx = 0;
123 |     // advance index to start of populated symbols
124 |     while(freqs.getCount(symbol_idx) == 0) {
125 |         symbol_idx++;
126 |     } 
127 | 
128 |     // Loop through each symbol, except the last 
129 |     while(symbol_idx < 255) {
130 |         // verbose output
131 |         cout << "------------------------------------" << endl;
132 |         char current_symbol = (char)freqs.getChar(symbol_idx);
133 |         cout << "Current symbol: " << current_symbol << " (" << (uint)(unsigned char)current_symbol << ")" << endl;
134 |         cout << "Locations remaining: " << remaining_locations << endl;
135 |         // To extract symbol_combo from compressed_data
136 |         // we mod the compressed_data by the number of combinations for that symbol
137 |         // then subtract out that remainder and repeat for next symbol
138 | 
139 |         // 2nd to last value, calculation & extraction not needed
140 |         if(symbol_idx != 254) {
141 |             // calculate permutations of symbol="uncombiner" to extract symbol combination
142 |             choose_reuse(remaining_locations, freqs.getCount(symbol_idx), uncombiner, numerator, denominator);
143 |             // compressed data = quotient
144 |             // extracted combo = remainder
145 |             mpz_tdiv_qr(compressed_data, extracted_combo, compressed_data, uncombiner);
146 |         } else {
147 |             // Last encoded value
148 |             // no more extraction needed, what remains is the 254th symbol's sum of binomials
149 |             // because uncombiner == 1 so extracted_combo = compressed_data
150 |             mpz_set(extracted_combo, compressed_data);
151 |         }
152 |         
153 |         // verbose output
154 |         gmp_printf("Uncombiner: %Zd \n", uncombiner);    
155 |         gmp_printf("Extracted Binomial Sum: %Zd \n", extracted_combo);
156 | 
157 |         // setup loop to deconstruct the extracted binomial sum
158 |         // largest index location extracted first
159 |         size_t symbol_count = freqs.getCount(symbol_idx);
160 |         size_t insert_offset = total_symbols - remaining_locations;
161 |         // zero based index for locations
162 |         size_t loc_idx;
163 |         size_t last_loc_idx = total_symbols - 1; 
164 |         // size_t removed_symbols = total_symbols - remaining_locations;
165 |         // calculate factorial 
166 |         mpz_fac_ui(factorial, symbol_count);
167 | 
168 |         // Symbol extraction inner loop
169 |         // continue while extracted_combo > symbol_count
170 |         // this avoids estimates that are below zero
171 |         while(mpz_cmp_ui(extracted_combo, symbol_count) > 0) {
172 |             // use sum of binomials property, combined with Newton's method
173 |             // combo ~= (n-k//2)^k / k!
174 |             // (combo*k!)^(1/k) + k//2 ~= n
175 |             
176 |             // multiply in k!
177 |             mpz_mul(symbol_combo, extracted_combo, factorial);
178 |             
179 |             // find the root
180 |             mpz_root(root_result, symbol_combo, symbol_count);
181 |             // add symbol_count//2 to go from near the middle to near the top=n
182 |             size_t loc_idx = mpz_get_ui(root_result) + symbol_count/2;
183 |             
184 |             // calculate estimated binomial value
185 |             choose_reuse(loc_idx, symbol_count, est_binomial, numerator, denominator);
186 |             // verbose output
187 |             cout << "Estimated: " << loc_idx << " choose " << symbol_count << endl;
188 | 
189 |             // Validate estimation (1 & 2) //
190 |             // (1) ensure estimate is not greater than the target value, some even values of k can overestimate
191 |             if(mpz_cmp(est_binomial, extracted_combo) > 0) {
192 |                 // if overestimated, shift N down by 1
193 |                 // N-1 choose K: multiply by N-K; divide by N
194 |                 mpz_mul_ui(est_binomial, est_binomial, loc_idx-symbol_count);
195 |                 mpz_divexact_ui(est_binomial, est_binomial, loc_idx);
196 |                 loc_idx -= 1;
197 |                 cout << "Adjusted down" << endl;
198 |             }
199 |             // (2) ensure estimate is not too low by looking at the delta
200 |             // todo: opt: can skip if corrected down
201 |             // subtract estimate from binomial
202 |             mpz_sub(extracted_combo, extracted_combo, est_binomial);
203 |             // calculate diff between N choose K and N+1 choose K => N choose K-1 => est_binomial *K then /(N-K+1)
204 |             mpz_mul_ui(est_binomial, est_binomial, symbol_count);
205 |             // todo: should be removable, checked for safety
206 |             // avoid div by zero
207 |             if(loc_idx-symbol_count+1 != 0) {
208 |                 // est_binomial / (N-K+1)
209 |                 mpz_divexact_ui(est_binomial, est_binomial, loc_idx-symbol_count+1); 
210 |             }
211 |             
212 |             // tracking for information purposes only, not needed for calculation
213 |             size_t adjust_up_count = 0;
214 |             cout << "---Adjustment loop---" << endl;
215 |             
216 |             // Estimate adjustment loop
217 |             // while(delta to next binomial < remaining sum of binomials)
218 |             while(mpz_cmp(est_binomial, extracted_combo) <= 0 && mpz_cmp_ui(extracted_combo, symbol_count) > 0) {
219 |                 // estimate too small, adjust estimated N up by 1
220 |                 // subtract diff (N Choose K-1)
221 |                 mpz_sub(extracted_combo, extracted_combo, est_binomial);
222 |                 loc_idx += 1;
223 |                 // calculate next diff and confirm
224 |                 // to continue calculation up, next diff = (N+1 choose K-1) = current diff * (N+1) / (N+1-(K-1))
225 |                 // if estimate is zero, cannot shift up
226 |                 if(loc_idx<=symbol_count) {
227 |                     //then previous estimate is 0, set est_binomial = 1
228 |                     mpz_set_ui(est_binomial, 1);
229 |                     loc_idx = symbol_count;
230 |                 } else {
231 |                     mpz_mul_ui(est_binomial, est_binomial, loc_idx);
232 |                 }
233 |                 //avoid division by zero
234 |                 if(loc_idx-symbol_count != 0) { 
235 |                     mpz_divexact_ui(est_binomial, est_binomial, loc_idx-(symbol_count-1));
236 |                     // mpz_divexact_ui(est_binomial, est_binomial, loc_idx-symbol_count+1)); // cleaner? VERIFY
237 |                 }
238 |                 adjust_up_count += 1;             
239 |             }
240 |             // verbose output
241 |             cout << "Adjusted up: " << adjust_up_count << "x" << endl;
242 | 
243 |             // calculate offset based on previously placed symbols
244 |             // optimization: start with the total count of placed symbols, subtract already placed symbols as we move backwards
245 |             // for each non-last symbol found, reduce the offset
246 |             // cout << "loop: " << last_loc_idx << " to " << loc_idx+insert_offset << endl;
247 |             for(size_t i=last_loc_idx; i>=loc_idx+insert_offset && insert_offset!=0; i--) {
248 |                 if(output_buffer[i] != last_symbol) {
249 |                     insert_offset--;
250 |                 }
251 |             }
252 |             // verbose output
253 |             cout << "Location offset: " << insert_offset << endl;
254 |             last_loc_idx = loc_idx+insert_offset-1; // -1 because current placement location already checked
255 |             //update character in output buffer
256 |             cout << "Symbol placed at: " << loc_idx+insert_offset << endl; 
257 |             output_buffer.at(loc_idx+insert_offset) = current_symbol;
258 | 
259 |             // Setup next loop
260 |             // calculate next lower factorial=(N-1)!=(N!)/N
261 |             mpz_divexact_ui(factorial, factorial, symbol_count);
262 |             symbol_count -= 1;            
263 |         }
264 | 
265 |         // if extracted_combo <= symbol_count, decoding is trivial
266 |         if(symbol_count != 0 && mpz_cmp_ui(extracted_combo, symbol_count) <= 0) {
267 |             cout << "remaining sum of binomials <= symbol_count: " << symbol_count << endl;
268 |             // gmp_printf("extracted_combo: %Zd\n", extracted_combo);
269 |             // cout << "symbol_count: " << symbol_count << endl;
270 |             size_t null_idx = 0;
271 |             do {
272 |                 // seek to next empty (unfilled) index
273 |                 while(output_buffer[null_idx] != last_symbol) {
274 |                     null_idx += 1;
275 |                 }
276 |                 if(mpz_cmp_ui(extracted_combo, symbol_count) < 0) {
277 |                     // if symbol_count > extracted_combo, place each remaining symbol in the first empty location
278 |                     output_buffer.at(null_idx) = current_symbol;
279 |                     cout << "Symbol placed at: " << null_idx << endl; 
280 |                     symbol_count -= 1;
281 |                 } else if(mpz_cmp_ui(extracted_combo, symbol_count) == 0) {
282 |                     // if equal skip one location
283 |                     null_idx += 1;
284 |                     while(output_buffer[null_idx] != last_symbol) {
285 |                         null_idx += 1;
286 |                     }
287 |                     output_buffer.at(null_idx) = current_symbol;
288 |                     cout << "Symbol placed at: " << null_idx << endl;
289 |                     symbol_count -= 1;
290 |                 } else {
291 |                     // if lt, continue placing
292 |                     output_buffer.at(null_idx) = current_symbol;
293 |                     cout << "Symbol placed at: " << null_idx << endl;
294 |                     symbol_count -= 1;
295 |                 }
296 |             } while(symbol_count > 0);
297 |         }
298 |         //used for next loop and location placement later
299 |         remaining_locations -= freqs.getCount(symbol_idx);
300 |         symbol_idx++;
301 |     }
302 | 
303 |     // decompression done, final output:
304 |     cout << "----------------------------------" << endl;
305 |     // print decompressed data to console
306 |     for (const auto& value : output_buffer) {
307 |         cout << value;
308 |     }
309 |     cout << endl;
310 | 
311 |     cout << "Writing decompressed data to: " << filename_out << endl;
312 |     // check setting for file output
313 |     if (write_file) {
314 |         //output to file
315 |         ofstream output_file(filename_out, ios::binary);
316 |         if (!output_file.is_open()) {
317 |             std::cerr << "Failed creating output file." << std::endl;
318 |             return 1;
319 |         }
320 |         for (const auto& value : output_buffer) {
321 |             output_file << value;
322 |         }
323 |         output_file.close();
324 |     }
325 | 
326 |     return 0;
327 | }
328 | 
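
The header comment of poc-decompress.cpp contrasts its root-based estimate with a naive approach. As a point of comparison, here is a minimal sketch of a plain linear-search decode of one symbol's sum of binomials. It assumes the standard greedy inverse (largest position first) and 64-bit integers in place of GMP, and it may not match the documented naive approach exactly. The names `choose_small` and `decode_positions` are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// n choose k in plain 64-bit arithmetic; only safe for small inputs.
std::uint64_t choose_small(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        result = result * (n - k + i) / i;  // exact at every step
    }
    return result;
}

// Recover k strictly increasing relative positions from a sum of binomials
// sum = C(p_1, 1) + C(p_2, 2) + ... + C(p_k, k), extracting p_k first.
std::vector<std::uint64_t> decode_positions(std::uint64_t sum, std::uint64_t k) {
    std::vector<std::uint64_t> positions(k);
    for (std::uint64_t j = k; j >= 1; --j) {
        // Find the largest n with C(n, j) <= sum, starting where C(n, j) == 0.
        std::uint64_t n = j - 1;
        while (choose_small(n + 1, j) <= sum) {
            ++n;
        }
        sum -= choose_small(n, j);
        positions[j - 1] = n;
    }
    return positions;
}
```

The returned positions are relative to the slots that are still empty; mapping them to absolute offsets in the output buffer requires skipping slots filled by earlier symbols, which is what `insert_offset` tracks in the listing above. The linear search costs one binomial evaluation per candidate position, which is the work the root-based estimate is meant to avoid.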


--------------------------------------------------------------------------------
/testfiles/panagram:
--------------------------------------------------------------------------------
1 | The quick brown fox jumps over the lazy dog


--------------------------------------------------------------------------------
/testfiles/sparse:
--------------------------------------------------------------------------------
1 | aaaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaaaaaaacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaa


--------------------------------------------------------------------------------
/testfiles/tongue_twister:
--------------------------------------------------------------------------------
1 | She sells seashells by the seashore


--------------------------------------------------------------------------------
/testfiles/walkthrough:
--------------------------------------------------------------------------------
1 | hidehohedehe


--------------------------------------------------------------------------------
/testfiles/wizard_of_oz:
--------------------------------------------------------------------------------
  1 | Title: The Wonderful Wizard of Oz
  2 | Author: L. Frank Baum
  3 | 
  4 | Chapter V
  5 | The Rescue of the Tin Woodman
  6 | When Dorothy awoke the sun was shining through the trees and Toto had long been out chasing birds around him and squirrels. She sat up and looked around her. There was the Scarecrow, still standing patiently in his corner, waiting for her.
  7 | 
  8 | "We must go and search for water," she said to him.
  9 | 
 10 | "Why do you want water?" he asked.
 11 | 
 12 | "To wash my face clean after the dust of the road, and to drink, so the dry bread will not stick in my throat."
 13 | 
 14 | "It must be inconvenient to be made of flesh," said the Scarecrow thoughtfully, "for you must sleep, and eat and drink. However, you have brains, and it is worth a lot of bother to be able to think properly."
 15 | 
 16 | They left the cottage and walked through the trees until they found a little spring of clear water, where Dorothy drank and bathed and ate her breakfast. She saw there was not much bread left in the basket, and the girl was thankful the Scarecrow did not have to eat anything, for there was scarcely enough for herself and Toto for the day.
 17 | 
 18 | When she had finished her meal, and was about to go back to the road of yellow brick, she was startled to hear a deep groan near by.
 19 | 
 20 | "What was that?" she asked timidly.
 21 | 
 22 | "I cannot imagine," replied the Scarecrow; "but we can go and see."
 23 | 
 24 | Just then another groan reached their ears, and the sound seemed to come from behind them. They turned and walked through the forest a few steps, when Dorothy discovered something shining in a ray of sunshine that fell between the trees. She ran to the place and then stopped short, with a little cry of surprise.
 25 | 
 26 | One of the big trees had been partly chopped through, and standing beside it, with an uplifted axe in his hands, was a man made entirely of tin. His head and arms and legs were jointed upon his body, but he stood perfectly motionless, as if he could not stir at all.
 27 | 
 28 | Dorothy looked at him in amazement, and so did the Scarecrow, while Toto barked sharply and made a snap at the tin legs, which hurt his teeth.
 29 | 
 30 | "Did you groan?" asked Dorothy.
 31 | 
 32 | "Yes," answered the tin man, "I did. I've been groaning for more than a year, and no one has ever heard me before or come to help me."
 33 | 
 34 | "What can I do for you?" she inquired softly, for she was moved by the sad voice in which the man spoke.
 35 | 
 36 | "Get an oil-can and oil my joints," he answered. "They are rusted so badly that I cannot move them at all; if I am well oiled I shall soon be all right again. You will find an oil-can on a shelf in my cottage."
 37 | 
 38 | Dorothy at once ran back to the cottage and found the oil-can, and then she returned and asked anxiously, "Where are your joints?"
 39 | 
 40 | "Oil my neck, first," replied the Tin Woodman. So she oiled it, and as it was quite badly rusted the Scarecrow took hold of the tin head and moved it gently from side to side until it worked freely, and then the man could turn it himself.
 41 | 
 42 | "Now oil the joints in my arms," he said. And Dorothy oiled them and the Scarecrow bent them carefully until they were quite free from rust and as good as new.
 43 | 
 44 | The Tin Woodman gave a sigh of satisfaction and lowered his axe, which he leaned against the tree.
 45 | 
 46 | "This is a great comfort," he said. "I have been holding that axe in the air ever since I rusted, and I'm glad to be able to put it down at last. Now, if you will oil the joints of my legs, I shall be all right once more."
 47 | 
 48 | So they oiled his legs until he could move them freely; and he thanked them again and again for his release, for he seemed a very polite creature, and very grateful.
 49 | 
 50 | "I might have stood there always if you had not come along," he said; "so you have certainly saved my life. How did you happen to be here?"
 51 | 
 52 | "We are on our way to the Emerald City to see the Great Oz," she answered, "and we stopped at your cottage to pass the night."
 53 | 
 54 | "Why do you wish to see Oz?" he asked.
 55 | 
 56 | "I want him to send me back to Kansas, and the Scarecrow wants him to put a few brains into his head," she replied.
 57 | 
 58 | The Tin Woodman appeared to think deeply for a moment. Then he said:
 59 | 
 60 | "Do you suppose Oz could give me a heart?"
 61 | 
 62 | "Why, I guess so," Dorothy answered. "It would be as easy as to give the Scarecrow brains."
 63 | 
 64 | "True," the Tin Woodman returned. "So, if you will allow me to join your party, I will also go to the Emerald City and ask Oz to help me."
 65 | 
 66 | "Come along," said the Scarecrow heartily, and Dorothy added that she would be pleased to have his company. So the Tin Woodman shouldered his axe and they all passed through the forest until they came to the road that was paved with yellow brick.
 67 | 
 68 | The Tin Woodman had asked Dorothy to put the oil-can in her basket. "For," he said, "if I should get caught in the rain, and rust again, I would need the oil-can badly."
 69 | 
 70 | It was a bit of good luck to have their new comrade join the party, for soon after they had begun their journey again they came to a place where the trees and branches grew so thick over the road that the travelers could not pass. But the Tin Woodman set to work with his axe and chopped so well that soon he cleared a passage for the entire party.
 71 | 
 72 | Dorothy was thinking so earnestly as they walked along that she did not notice when the Scarecrow stumbled into a hole and rolled over to the side of the road. Indeed he was obliged to call to her to help him up again.
 73 | 
 74 | "Why didn't you walk around the hole?" asked the Tin Woodman.
 75 | 
 76 | "I don't know enough," replied the Scarecrow cheerfully. "My head is stuffed with straw, you know, and that is why I am going to Oz to ask him for some brains."
 77 | 
 78 | "Oh, I see," said the Tin Woodman. "But, after all, brains are not the best things in the world."
 79 | 
 80 | "Have you any?" inquired the Scarecrow.
 81 | 
 82 | "No, my head is quite empty," answered the Woodman. "But once I had brains, and a heart also; so, having tried them both, I should much rather have a heart."
 83 | 
 84 | "And why is that?" asked the Scarecrow.
 85 | 
 86 | "I will tell you my story, and then you will know."
 87 | 
 88 | So, while they were walking through the forest, the Tin Woodman told the following story:
 89 | 
 90 | "I was born the son of a woodman who chopped down trees in the forest and sold the wood for a living. When I grew up, I too became a woodchopper, and after my father died I took care of my old mother as long as she lived. Then I made up my mind that instead of living alone I would marry, so that I might not become lonely.
 91 | 
 92 | "There was one of the Munchkin girls who was so beautiful that I soon grew to love her with all my heart. She, on her part, promised to marry me as soon as I could earn enough money to build a better house for her; so I set to work harder than ever. But the girl lived with an old woman who did not want her to marry anyone, for she was so lazy she wished the girl to remain with her and do the cooking and the housework. So the old woman went to the Wicked Witch of the East, and promised her two sheep and a cow if she would prevent the marriage. Thereupon the Wicked Witch enchanted my axe, and when I was chopping away at my best one day, for I was anxious to get the new house and my wife as soon as possible, the axe slipped all at once and cut off my left leg.
 93 | 
 94 | "This at first seemed a great misfortune, for I knew a one-legged man could not do very well as a wood-chopper. So I went to a tinsmith and had him make me a new leg out of tin. The leg worked very well, once I was used to it. But my action angered the Wicked Witch of the East, for she had promised the old woman I should not marry the pretty Munchkin girl. When I began chopping again, my axe slipped and cut off my right leg. Again I went to the tinsmith, and again he made me a leg out of tin. After this the enchanted axe cut off my arms, one after the other; but, nothing daunted, I had them replaced with tin ones. The Wicked Witch then made the axe slip and cut off my head, and at first I thought that was the end of me. But the tinsmith happened to come along, and he made me a new head out of tin.
 95 | 
 96 | "I thought I had beaten the Wicked Witch then, and I worked harder than ever; but I little knew how cruel my enemy could be. She thought of a new way to kill my love for the beautiful Munchkin maiden, and made my axe slip again, so that it cut right through my body, splitting me into two halves. Once more the tinsmith came to my help and made me a body of tin, fastening my tin arms and legs and head to it, by means of joints, so that I could move around as well as ever. But, alas! I had now no heart, so that I lost all my love for the Munchkin girl, and did not care whether I married her or not. I suppose she is still living with the old woman, waiting for me to come after her.
 97 | 
 98 | "My body shone so brightly in the sun that I felt very proud of it and it did not matter now if my axe slipped, for it could not cut me. There was only one danger--that my joints would rust; but I kept an oil-can in my cottage and took care to oil myself whenever I needed it. However, there came a day when I forgot to do this, and, being caught in a rainstorm, before I thought of the danger my joints had rusted, and I was left to stand in the woods until you came to help me. It was a terrible thing to undergo, but during the year I stood there I had time to think that the greatest loss I had known was the loss of my heart. While I was in love I was the happiest man on earth; but no one can love who has not a heart, and so I am resolved to ask Oz to give me one. If he does, I will go back to the Munchkin maiden and marry her."
 99 | 
100 | Both Dorothy and the Scarecrow had been greatly interested in the story of the Tin Woodman, and now they knew why he was so anxious to get a new heart.
101 | 
102 | "All the same," said the Scarecrow, "I shall ask for brains instead of a heart; for a fool would not know what to do with a heart if he had one."
103 | 
104 | "I shall take the heart," returned the Tin Woodman; "for brains do not make one happy, and happiness is the best thing in the world."


--------------------------------------------------------------------------------
/the-algorithm.md:
--------------------------------------------------------------------------------
  1 | # The Algorithm
  2 | Note:  
  3 |     // indicates integer division (truncation)  
  4 |     N *choose* K is the [binomial coefficient](https://en.wikipedia.org/wiki/Binomial_coefficient) (N,K) 
  5 | ## Encoding
  6 | 
  7 | #### Input
  8 | 1. A message composed of some number of symbols (characters/bytes in the input file).
  9 | #### Steps
 10 | 1. Create a frequency table by counting occurrences of each symbol in the message.
 11 | 2. Set **encoding_accumulator** = 0; this value accumulates the encoding data for each symbol and contains the final encoding value.
 12 | 3. Iterate through each symbol in the frequency table except for the last one[^1].
 13 |     1. Set **symbol_binomial_sum** = 0; holds the binomial sums for the current symbol being encoded
  14 |     2. Set **combiner** = 1; the number of permutations of all symbols encoded before the current symbol (see the examples for more info)
 15 |     3. Iterate through the message until the current symbol is encountered.
  16 |         1. Calculate the binomial N *choose* K where:
  17 |             N = the zero-based index of the symbol location and
  18 |             K = the current symbol count (1,2,3,...)
  19 |         2. Continue iterating through the message, summing all binomials calculated in the previous step into **symbol_binomial_sum**.
 20 |     4. When the end of the message is reached for a symbol:
 21 |         1. Set **encoding_accumulator = encoding_accumulator + symbol_binomial_sum * combiner**
 22 |         2. Set **combiner = combiner * (message_size *choose* symbol_count)**; which will be used in the next symbol iteration.
 23 |         3. Remove the encoded symbols from the message.  (This is optimized away in the code as array resizing is expensive, instead a 'null' value is used to indicate the location should no longer be counted.)
  24 |     5. Continue with the next symbol until you reach the last symbol in the set.  The last symbol is not encoded since its locations can be inferred from the locations of all the other symbols.
  25 | 4. The **encoding_accumulator** is now the final encoded data.
 26 | 
 27 | [^1]: The order of iteration through the frequency table must be consistent and repeatable for decoding.  The provided code iterates in order of ascending symbol count followed by symbol itself to resolve duplicate counts.  This sort is used since more symbol occurrences means more operations required to encode.
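
The encoding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's C++/GMP implementation: the names `symbol_order` and `encode` are made up for this sketch, Python's built-in arbitrary-precision integers stand in for `mpz_t`, and encoded symbols are physically removed from a list instead of being replaced by a 'null' value.

```python
from collections import Counter
from math import comb


def symbol_order(message):
    """Return (symbol, count) pairs sorted by ascending count, then by symbol."""
    counts = Counter(message)
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


def encode(message):
    """Encode the symbol locations of `message` into a single integer."""
    remaining = list(message)
    encoding_accumulator = 0
    combiner = 1
    # Every symbol except the last one in the table is encoded.
    for symbol, count in symbol_order(message)[:-1]:
        symbol_binomial_sum = 0
        seen = 0
        for index, current in enumerate(remaining):
            if current == symbol:
                seen += 1
                # N = zero-based index, K = running count of this symbol.
                symbol_binomial_sum += comb(index, seen)
        encoding_accumulator += symbol_binomial_sum * combiner
        combiner *= comb(len(remaining), count)
        # Remove the encoded symbol so later indexes ignore it.
        remaining = [s for s in remaining if s != symbol]
    return encoding_accumulator
```

With the message used in the example that follows, `encode("hidehohedehe")` returns 311041.
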
 28 | ##  Encoding Example 
 29 | 
 30 | | Message | h   | i   | d   | e   | h   | o   | h   | e   | d   | e   | h   | e   |
 31 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 32 | | Index   | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
 33 | 
 34 | Organize the data into the following frequency table, sorted by count and symbol:
 35 | 
 36 | | Count | Symbol |
 37 | | ----- | ------ |
 38 | | 1     | i      |
 39 | | 1     | o      |
 40 | | 2     | d      |
 41 | | 4     | e      |
 42 | | 4     | h      |
 43 | 
 44 | **Initial Values**
 45 | ```
 46 | message_length = 12
 47 | combiner = 1
 48 | encoding_accumulator = 0
 49 | ```
 50 | **Symbol 'i'**
 51 | ```
 52 | symbol_binomial_sum = 0
 53 | # First 'i' in message at index 1
 54 | symbol_binomial_sum = symbol_binomial_sum + (location choose count) = 0 + (1 choose 1) = 1
  55 | # no more 'i' locations
 56 | encoding_accumulator = encoding_accumulator + combiner * symbol_binomial_sum
 57 | encoding_accumulator = 0 + 1 * 1 = 1
 58 | combiner = combiner * (message_length choose symbol_count)
 59 | combiner = 1 * (12 choose 1) = 12
  60 | message_length = message_length - symbol_count
 61 | message_length = 12 - 1 = 11
 62 | # Remove 'i' from message
 63 | ```
 64 | 
 65 | | Message | h   | d   | e   | h   | o   | h   | e   | d   | e   | h   | e   |
 66 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 67 | | Index   | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  |
 68 | 
 69 | **Symbol 'o'** 
 70 | ```
 71 | symbol_binomial_sum = 0
 72 | # first 'o' in message at index 4
 73 | symbol_binomial_sum = 0 + 4 choose 1 = 4
 74 | # no more 'o' locations
 75 | encoding_accumulator = 1 + 12 * 4 = 49
 76 | combiner = 12 * (11 choose 1) = 132
 77 | message_length = 11 - 1 = 10
 78 | # remove 'o' from message
 79 | ```
 80 | 
 81 | | Message | h   | d   | e   | h   | h   | e   | d   | e   | h   | e   |
 82 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 83 | | Index   | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   |
 84 | 
 85 | **Symbol 'd'**
 86 | ```
 87 | # First 'd' at index 1
 88 | symbol_binomial_sum = 0 + 1 choose 1 = 1
 89 | # Second 'd' at index 6
 90 | symbol_binomial_sum = 1 + 6 choose 2 = 16
 91 | # no more 'd' locations
 92 | encoding_accumulator = 49 + 132 * 16 = 2161
 93 | combiner = 132 * (10 choose 2) = 5940
 94 | message_length = 10 - 2 = 8
 95 | # Remove 'd' from message
 96 | ```
 97 | 
 98 | | Message | h   | e   | h   | h   | e   | e   | h   | e   |
 99 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- |
100 | | Index   | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
101 | 
102 | **Symbol 'e'**
103 | ```
 104 | # First 'e' at index 1
 105 | symbol_binomial_sum = 0 + 1 choose 1 = 1
 106 | # Second 'e' at index 4
 107 | symbol_binomial_sum = 1 + 4 choose 2 = 7
 108 | # Third 'e' at index 5
 109 | symbol_binomial_sum = 7 + 5 choose 3 = 17
 110 | # Fourth 'e' at index 7
 111 | symbol_binomial_sum = 17 + 7 choose 4 = 52
 112 | # no more 'e' locations
113 | encoding_accumulator = 2161 + 5940 * 52 = 311041
114 | ```
115 | 
116 | | Message | h   | h   | h   | h   |
117 | | ------- | --- | --- | --- | --- |
118 | | Index   | 0   | 1   | 2   | 3   |
119 | 
 120 | The remaining locations are all 'h', which does not need to be encoded since only 1 permutation is possible.  This would also be the case if the message consisted of a single repeated character; in that case the algorithm is essentially run-length encoding using the frequency table.
121 | 
 122 | Thus encoding is complete; the final value to store is 311041.  This requires 3 bytes / 19 bits to store, while the original message was 12 bytes / 96 bits.
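
One way to sanity-check the 19-bit figure: the product of the binomials that feed the combiner, (12 choose 1) × (11 choose 1) × (10 choose 2) × (8 choose 4), is the number of distinct arrangements of the message's symbols, and every encoded value is smaller than that product. The Python sketch below computes this bound; `arrangement_count` is a hypothetical helper name, not something from the repository. It only measures the encoded integer; the frequency table needed for decoding is extra.

```python
from math import comb


def arrangement_count(counts):
    """Product of (remaining choose count) over the symbols in encoding order."""
    remaining = sum(counts)
    total = 1
    for count in counts:
        total *= comb(remaining, count)
        remaining -= count
    return total


bound = arrangement_count([1, 1, 2, 4, 4])  # counts of i, o, d, e, h
print(bound)                      # 415800 distinct arrangements
print((bound - 1).bit_length())   # 19 bits cover the largest possible value
print((311041).bit_length())      # 19 bits for this message's value
```
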
123 | ## Decoding
124 | Strike that, reverse it.
125 | #### Inputs 
126 | 1. Encoded output.
127 | 2. Frequency table containing exact counts of each symbol/character.
128 | 
129 | #### Steps
130 | 1. Load the encoded data into an integer and the frequency table.
 131 | 2. Create an array/buffer to hold the message; its length is the sum of the frequency counts.
132 |     1. Fill this array with the last symbol to be decoded.  In the provided code this is the most common symbol, which means fewer locations to place in total.
133 | 3. Iterate through the symbols in the same order as the encoding, not including the last symbol as those locations are inferred.
134 |     1. If the current symbol is the second to last symbol ('e' in the example message)
135 |         1. The symbol_binomial_sum = encoded data
136 |     2. Else
137 |         1. Calculate the uncombiner = (total positions - decoded positions) *choose* current symbol count.
138 |         2. Divide with remainder the encoded data by the uncombiner.
139 |             1. The remainder is the sum of binomials for the current symbol.
140 |             2. The quotient is the encoded data without the current symbol included, which will be used in the next iteration.
141 |     3. For each symbol count, starting with K=total_symbol_count and working backwards to 1
 142 |         1. Find the N for which the binomial N *choose* K, where K = the current symbol count, satisfies
 143 |             1. (N *choose* K) <= symbol_binomial_sum and 
 144 |             2. symbol_binomial_sum < ((N+1) *choose* K)
 145 |         2. Count the offset caused by previously placed symbols. During encoding, symbols were removed from the message after being encoded, so the decoded location N is relative to a message without those symbols; the offset converts it back to a location in the full message. A naïve[^2] but straightforward approach (with 'h' as the last symbol, as in the example) would be:
 146 |             1. `offset = 0`
 147 |             2. `for i = 0 to N+offset: if message[i]!='h': offset++`
148 |         3. Place the K'th symbol in the message array in the N+offset zero based index location
149 |     4. Continue placing symbols until symbol count == 1, then no estimate is needed and N = symbol_binomial_sum
150 | 4. When the symbol iteration completes, the message is decoded in the array contents.
 151 | [^2]: In the provided code this is optimized: the offset is set to the total number of symbols placed and, working backwards from the end of the message, every placed symbol encountered is subtracted from that offset, arriving at the same result as the naive solution.
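
The decoding steps above can be sketched in Python as follows. This is a simplified illustration under a few assumptions, not the repository's code: `largest_n` and `decode` are hypothetical names, the location is found by a plain linear search instead of an estimate, the naive offset scan is used, and the division with remainder is applied to every symbol (for the second-to-last symbol of a valid encoding the quotient is simply 0).

```python
from math import comb


def largest_n(binomial_sum, k):
    """Largest N with (N choose k) <= binomial_sum, found by linear search."""
    n = k - 1  # (k - 1 choose k) is 0, which never exceeds binomial_sum
    while comb(n + 1, k) <= binomial_sum:
        n += 1
    return n


def decode(encoded, table):
    """Rebuild the message from `encoded` and a list of (symbol, count) pairs
    given in the same order the encoder used."""
    total = sum(count for _, count in table)
    last_symbol = table[-1][0]
    message = [last_symbol] * total  # unplaced slots hold the last symbol
    remaining_positions = total
    for symbol, count in table[:-1]:
        uncombiner = comb(remaining_positions, count)
        encoded, binomial_sum = divmod(encoded, uncombiner)
        for k in range(count, 0, -1):
            n = largest_n(binomial_sum, k)
            binomial_sum -= comb(n, k)
            # Naive offset: skip over slots taken by already placed symbols.
            offset = 0
            i = 0
            while i <= n + offset:
                if message[i] != last_symbol:
                    offset += 1
                i += 1
            message[n + offset] = symbol
        remaining_positions -= count
    return message
```

For the example message, `"".join(decode(311041, [("i", 1), ("o", 1), ("d", 2), ("e", 4), ("h", 4)]))` gives `hidehohedehe`.
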
 152 | ## Decoding Example
153 | 
154 | **Initial values**
155 | Frequency table
156 | 
157 | | Count | Symbol |
158 | | ----- | ------ |
159 | | 1     | i      |
160 | | 1     | o      |
161 | | 2     | d      |
162 | | 4     | e      |
163 | | 4     | h      |
164 | 
165 | ```
166 | encoded_data = 311041
167 | remaining_positions = message_length = 12
168 | ```
169 | Message Buffer/Array, filled with last symbol
170 | 
171 | | Message | h   | h   | h   | h   | h   | h   | h   | h   | h   | h   | h   | h   |
172 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
173 | | Index   | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
174 | 
175 | **Symbol 'i'**
176 | ```
177 | uncombiner = remaining_positions choose count = 12 choose 1 = 12
 178 | symbol_binomial_sum = encoded_data % uncombiner
179 | symbol_binomial_sum = 311041 % 12 = 1 
180 | encoded_data = encoded_data // uncombiner 
181 | encoded_data = 311041 // 12 = 25920
 182 | # decode first 'i'
183 | # location estimate & verification not needed for count==1
184 | # for an example go to symbol 'd'
185 | location_estimate = symbol_binomial_sum = 1
186 | # calculate offset
187 | offset = 0
188 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
189 | # no non-'h' symbols found, offset = 0
190 | message[location_estimate + offset => 1 + 0 => 1] = 'i'
 191 | # no more 'i' locations
192 | remaining_positions = remaining_positions - count = 12 - 1 = 11
193 | ```
194 | 
195 | | Message | h   | **i**   | h   | h   | h   | h   | h   | h   | h   | h   | h   | h   |
196 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
197 | | Index   | 0   | **1**   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
198 | 
199 | **Symbol 'o'**
200 | ```
201 | uncombiner = 11 choose 1 = 11
202 | symbol_binomial_sum = 25920 % 11 = 4 
203 | encoded_data = 25920 // 11 = 2356
204 | # decode first 'o'
205 | # location estimate & verification not needed for count==1
206 | # for an example go to symbol 'd'
207 | location_estimate = symbol_binomial_sum = 4
208 | # calculate offset
209 | offset = 0
210 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
211 | # found i; offset = 1
212 | message[4+1=>5] = 'o'
213 | # no more 'o' locations
214 | remaining_positions = 11 - 1 = 10
215 | ```
216 | 
217 | | Message | h   | i   | h   | h   | h   | **o** | h   | h   | h   | h   | h   | h   |
218 | | ------- | --- | --- | --- | --- | --- | ----- | --- | --- | --- | --- | --- | --- |
219 | | Index   | 0   | 1   | 2   | 3   | 4   | **5** | 6   | 7   | 8   | 9   | 10  | 11  |
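
The uncombining arithmetic is a mixed-radix decomposition: each division by a binomial peels off one symbol's binomial sum and leaves the rest in the quotient. A short Python sketch of the divisions in this example follows; the variable names are illustrative, and the third division is the one used for symbol 'd' next.

```python
from math import comb

encoded_data = 311041
remaining_positions = 12
binomial_sums = []

# Symbol counts for 'i', 'o' and 'd', in encoding order.
for count in (1, 1, 2):
    uncombiner = comb(remaining_positions, count)
    # Quotient: data for the remaining symbols; remainder: this symbol's sum.
    encoded_data, symbol_binomial_sum = divmod(encoded_data, uncombiner)
    binomial_sums.append(symbol_binomial_sum)
    remaining_positions -= count

print(binomial_sums)  # [1, 4, 16]
print(encoded_data)   # 52, the binomial sum left over for 'e'
```
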
220 | 
221 | **Symbol 'd'**
222 | ```
223 | uncombiner = 10 choose 2 = 45
224 | symbol_binomial_sum = 2356 % 45 = 16 
225 | encoded_data = 2356 // 45 = 52
226 | # decode first 'd'
227 | # symbol count > 1, estimate location
228 | location_estimate = (symbol_binomial_sum * count!)^(1/count) + count//2
229 | location_estimate = (16*2!)^(1/2)+2//2 = 6
230 | # verify estimate
231 | # check overestimate, can only overestimate by 1 with this approach
232 | current_binomial = location_estimate choose count = 6 choose 2 = 15
233 | if current_binomial > symbol_binomial_sum:
 234 |     # 15 > 16 => false, no overestimation
235 |     location_estimate -= 1
236 | else:
237 |     # if the next location's (N+1) binomial is larger than symbol_binomial_sum
238 |     # then the estimate is correct
239 |     # else, loop until larger is found
240 |     next_binomial = location_estimate+1 choose count = 7 choose 2 = 21
 241 |     while(next_binomial <= symbol_binomial_sum):
 242 |         # 21 <= 16 => false, no correction needed, no underestimation
 243 |         location_estimate += 1
 244 |         next_binomial = (location_estimate+1) choose count
 245 | # location_estimate = 6
246 | # calculate offset
247 | offset = 0
248 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
 249 | # found i,o; offset = 2
250 | message[location_estimate+offset=6+2=>8] = 'd'
251 | symbol_binomial_sum -= location_estimate choose count = 15
252 | symbol_binomial_sum = 1
253 | # decode second 'd'
254 | # location estimate & verification not needed for count==1
255 | location_estimate = symbol_binomial_sum = 1
256 | # calculate offset
257 | offset = 0
258 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
259 | # found i; offset = 1
260 | message[1+1=>2] = 'd'
261 | # no more 'd' locations
262 | remaining_positions = 10 - 2 = 8
263 | ```
264 | 
265 | | Message | h   | i   | **d** | h   | h   | o   | h   | h   | **d** | h   | h   | h   |
266 | | ------- | --- | --- | ----- | --- | --- | --- | --- | --- | ----- | --- | --- | --- |
267 | | Index   | 0   | 1   | **2** | 3   | 4   | 5   | 6   | 7   | **8** | 9   | 10  | 11  |
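
The estimate-then-verify step can be sketched in Python as below. This is an illustration only: `locate` is a hypothetical name, the estimate uses floating point (which would lose precision or overflow for the very large sums a real input produces), and both corrections are written as loops so the result stays correct even if the estimate is off by more than one.

```python
from math import comb, factorial


def locate(symbol_binomial_sum, count):
    """Largest N with (N choose count) <= symbol_binomial_sum."""
    # Estimate: (sum * count!)^(1/count) + count // 2, truncated to an integer.
    n = int((symbol_binomial_sum * factorial(count)) ** (1.0 / count)) + count // 2
    # Correct an overestimate.
    while comb(n, count) > symbol_binomial_sum:
        n -= 1
    # Correct an underestimate.
    while comb(n + 1, count) <= symbol_binomial_sum:
        n += 1
    return n


print(locate(16, 2))  # 6, the location found for the first decoded 'd'
```
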
268 | 
269 | **Symbol 'e'**
270 | ```
271 | # for the second to last symbol (last to be decoded)
272 | # the uncombiner is not needed and symbol_binomial_sum = encoded_data
273 | symbol_binomial_sum = encoded_data = 52
274 | 
275 | # decode first 'e'
276 | location_estimate = (52*4!)^(1/4)+4//2 = 7
277 | # check overestimate
278 | current_binomial = 7 choose 4 = 35
279 | # 35 > 52 => false, no overestimation
280 | if current_binomial > symbol_binomial_sum:
281 |     location_estimate -= 1
282 | else:
283 |     # check underestimate
284 |     next_binomial = location_estimate+1 choose count = 8 choose 4 = 70
 285 |     # 70 <= 52 => false, no correction needed, no underestimation
 286 |     while(next_binomial <= symbol_binomial_sum):
 287 |         location_estimate += 1
 288 |         next_binomial = (location_estimate+1) choose count
289 | # location_estimate = 7
290 | # calculate offset
291 | offset = 0
292 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
 293 | # found i,d,o,d; offset = 4
294 | message[7+4=>11] = 'e'
295 | symbol_binomial_sum -= location_estimate choose count => 7 choose 4 => 35
296 | symbol_binomial_sum = 17
297 | 
298 | # decode second 'e'
299 | location_estimate = (17*3!)^(1/3)+3//2 = 5
300 | # check overestimate
301 | current_binomial = 5 choose 3 = 10
302 | #   10 > 17 => false, no overestimation
303 | # check underestimate
304 | next_binomial = 6 choose 3 = 20
305 | #   20 < 17 => false, no underestimation
306 | # calculate offset
307 | offset = 0
308 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
 309 | # found i,d,o,d; offset = 4
310 | message[5+4=>9] = 'e'
311 | symbol_binomial_sum -= 5 choose 3 => 10
312 | symbol_binomial_sum = 7
313 | 
314 | # decode third 'e'
315 | location_estimate = (7*2!)^(1/2)+2//2 = 4
316 | # check overestimate
317 |     current_binomial = 4 choose 2 = 6
318 | #   6 > 7 => false, no overestimation
319 | # check underestimate
320 |     next_binomial = 5 choose 2 = 10
321 | #   10 < 7 => false, no underestimation
322 | # calculate offset
323 | offset = 0
324 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
325 | # found i,d,o; offset = 3
326 | message[4+3=>7] = 'e'
327 | symbol_binomial_sum -= 4 choose 2 => 6
328 | symbol_binomial_sum = 1
329 | 
 330 | # decode fourth 'e'
331 | # location estimate & verification not needed for count==1
332 | location_estimate = symbol_binomial_sum = 1
333 | # calculate offset
334 | offset = 0
335 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++
336 | # found i,d; offset = 2
337 | message[1+2=>3] = 'e'
338 | 
339 | # no more 'e' locations
340 | # remaining_positions no longer needed
341 | # decoding complete
342 | ```
343 | 
344 | | Message | h   | i   | d   | e   | h   | o   | h   | e   | d   | e   | h   | e   |
345 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
346 | | Index   | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
347 | 
348 | There and back again, decoding is complete.
349 | 
350 | 
351 | If you enjoyed that and would like to help me make more you could consider [Supporting](https://github.com/Peter-Ebert/Valli-Encoding#support).  
352 | More info in the [README](README.md).
353 | 


--------------------------------------------------------------------------------
/utility-functions.hpp:
--------------------------------------------------------------------------------
 1 | // useful functions for binomials and other calculations
 2 | 
 3 | #include <cstdint>   // uint64_t
 4 | #include <fstream>
 5 | #include <iostream>  // std::cout
 6 | #include <string>
 7 | #include <vector>
 8 | #include <gmp.h>  //mpz_t
 9 | #include "frequency-table.hpp"
10 | // returns if read is valid: true = valid
11 | bool FileToCharVector(const std::string& filename, std::vector<char>& buffer) {
12 |     std::ifstream file(filename, std::ios::binary | std::ios::ate);
13 |     if (!file) {
14 |         // Handle file open error
15 |         return false;
16 |     }
17 | 
18 |     std::streamsize fileSize = file.tellg();
19 |     file.seekg(0, std::ios::beg);
20 | 
21 |     buffer.resize(fileSize);
22 |     if (file.read(buffer.data(), fileSize)) {
23 |         return true;
24 |     } else {
25 |         return false;
26 |     }
27 | 
28 | }
29 | 
30 | void CalcFrequencyPairs(std::vector<char> &buffer, FreqChar &freqs) {
31 |     // initialize values
32 |     for(int i=0; i<256; i++) {
33 |         freqs.data[i] = i;
34 |     }
35 |     // loop through buffer
36 |     for (size_t i=0; i < buffer.size(); i++) {
37 |         freqs.incrCount(buffer[i]);
38 |         if(buffer[i] < 0) {
39 |             std::cout << "index: " << i << " value:" << buffer[i] << std::endl;
40 |         }
41 |     }
42 | }
43 | 
44 | void choose_reuse(uint64_t n, uint64_t k, mpz_t &c, mpz_t &c1, mpz_t c2) {
45 |     // mpz_t's must already be initialized
46 |     // mpz_t c1,c2;
47 |     // mpz_inits(c1,c2, NULL); //opt: move inline to avoid cost of initialization every time
48 |     //assert k > 0
49 |     //if n<k return 0
50 |     if(n<k) {
51 |         mpz_set_ui(c, 0);
52 |         return;
53 |     }
54 |     mpz_set_ui(c1, n);
55 |     for(uint64_t i=n-1;i>n-k;--i) {
56 |         mpz_mul_ui(c1, c1, i);
57 |     }
58 |     //opt: if decrementing k by 1 every time, can reuse and divide, as in encode_symbol_location_reuse
59 |     mpz_fac_ui(c2, k);
60 |     mpz_cdiv_q(c, c1, c2);
61 | }
62 | 
63 | //clean-ver
64 | void encode_symbol_location_reuse(uint64_t n, uint64_t k, mpz_t symbol_accumulator, mpz_t &numerator, mpz_t &denom_fact, mpz_t &combo_result) {
65 |     //moved
66 |     // // increment k and multiply into denom_fact
67 |     // k += 1;
68 |     // mpz_mul_ui(denom_fact, denom_fact, k);
69 |     
70 |     //if n<k return 0
71 |     if(n<k) {
72 |         mpz_set_ui(combo_result, 0);
73 |         return;
74 |     }
75 | 
76 |     mpz_set_ui(numerator, n);
77 |     for(uint64_t i=n-1;i>n-k;--i) {
78 |         mpz_mul_ui(numerator, numerator, i);
79 |     }
80 |     mpz_divexact(combo_result, numerator, denom_fact);
81 |     mpz_add(symbol_accumulator, symbol_accumulator, combo_result);
82 | }
83 | 
84 | //warning: overwrites combo_result
85 | void next_multiply_combiner(mpz_t &multiply_combiner, uint64_t n, uint64_t k, mpz_t &combo_result, mpz_t &numerator, mpz_t &denom_fact) {
86 |     //remaining_locations choose symbol count
87 |     mpz_set_ui(numerator, n);
88 |     for(uint64_t i=n-1;i>n-k;--i) {
89 |         mpz_mul_ui(numerator, numerator, i);
90 |     }
91 |     mpz_divexact(combo_result, numerator, denom_fact);
92 |     mpz_mul(multiply_combiner, multiply_combiner, combo_result);
93 | }           
94 | 


--------------------------------------------------------------------------------