├── .gitignore ├── FAQ.md ├── README.md ├── frequency-table.hpp ├── poc-compress.cpp ├── poc-decompress.cpp ├── testfiles ├── panagram ├── sparse ├── tongue_twister ├── walkthrough └── wizard_of_oz ├── the-algorithm.md └── utility-functions.hpp /.gitignore: -------------------------------------------------------------------------------- 1 | # Prerequisites 2 | *.d 3 | 4 | # Compiled Object files 5 | *.slo 6 | *.lo 7 | *.o 8 | *.obj 9 | 10 | # Precompiled Headers 11 | *.gch 12 | *.pch 13 | 14 | # Compiled Dynamic libraries 15 | *.so 16 | *.dylib 17 | *.dll 18 | 19 | # Fortran module files 20 | *.mod 21 | *.smod 22 | 23 | # Compiled Static libraries 24 | *.lai 25 | *.la 26 | *.a 27 | *.lib 28 | 29 | # Executables 30 | *.exe 31 | *.out 32 | *.app 33 | -------------------------------------------------------------------------------- /FAQ.md: -------------------------------------------------------------------------------- 1 | # FAQ 2 | 3 | ## Why isn't the output smaller than gzip/zstd/etc...? 4 | Most commonly used compression tools combine 2 types of compression (there are more of course). This compressor proof of concept is most like an [entropy encoder](https://en.wikipedia.org/wiki/Entropy_coding), which focuses on the frequency of characters, so its performance will be close to other entropy encoders. The other common type is [dictionary/substitution coders](https://en.wikipedia.org/wiki/Dictionary_coder), which focus on the structure/relationship between symbols, like [LZ77/78](https://en.wikipedia.org/wiki/LZ77_and_LZ78). If you used the output of an LZ style compressor as an input to this encoder, then you would see similar sizes to other common tools. The goal of this code is to provide a proof of concept illustration, not a replacement for other common compression tools, as the code is not highly optimized. 5 | 6 | ## Is there any input this implementation cannot compress? 
7 | The compression algorithm can work on any dataset, however there are 2 datasets this specific implementation will abort on: (1) If there is only 1 or 0 symbols in the input, e.g. 'aaaaa' or '' there is nothing to compress, the frequency table is essentially RLE which is not new, so this is not implemented. (2) The input cannot contain all 256 byte values since this implementation picks one byte value to act as a 'null' value. The code could be altered to handle both cases. 8 | 9 | ## Why does the program exit on file sizes larger than 64KB? 10 | It's just a precaution to prevent long runtimes. First of all, this is just a proof of concept that isn't highly optimized and secondly the encoding math requires changing bits along the entire length of an arbitrarily large integer, once this size exceeds the CPU cache it can become quite slow. If you want to experiment with bigger inputs simply comment out the line that does this check. 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Valli-Encoding 2 | A compression algorithm that uses combinatorics (binomials/multinomials). 3 | 4 | **Table of Contents** 5 | * [Introduction](#introduction) 6 | * [Comparison to Others](#comparison-to-other-entropy-encoder-implementations) 7 | * [The Algorithm](#the-algorithm) 8 | * [Running the Code](#running-the-code) 9 | * [Support](#support) 10 | * [Final Thoughts](#final-thoughts) 11 | 12 | ## Introduction 13 | This repository contains a basic *proof of concept* implementation of what I'd like to call Valli encoding, which leverages the exact count of the symbol frequencies to compress the input with combinatorics. The output will be some number between 0 (inclusive) and the number of permutations of symbols in the frequency table (exclusive). 
The input is processed one unique symbol at a time using the number of combinations (binomials) that could occur before placing each symbol, the sum of which generates a single symbol's encoding. These symbol encodings can then be combined together based on the number of permutations of each preceding symbol. For a more detailed walkthrough see [how the code works](the-algorithm.md). 14 | 15 | I'd be happy to be politely corrected if this already exists; as far as I can tell this is a novel approach, but I am just a problem solver for fun. Feel free to raise an issue for that or other feedback. 16 | 17 | The size of the output is always determined by the total number of permutations of the symbols given the symbol frequency table, also known as a [multinomial](https://en.wikipedia.org/wiki/Multinomial_theorem#Number_of_unique_permutations_of_words) coefficient: 18 | Let A,B,C,... = symbol counts 19 | T = total count of symbols = A + B + C + ... 20 | Bit size = log2( T! / (A! * B! * C! * ...) ) 21 | 22 | ## Comparison to Other Entropy Encoder Implementations 23 | In several cases this algorithm's output is smaller than other entropy encoder implementations I can find online. **Please do send links to better entropy encoder (arithmetic/ANS/etc.) implementations that produce smaller output**. Please do not link LZ/LZW, PPM, Neural Nets, etc. that leverage information about the relationship between symbols; those are expected to do better as this implementation does not include those techniques. You can share links through the "[issues](https://github.com/Peter-Ebert/Valli-Encoding/issues)" option at the top of the page. 24 | 25 | #### Data: wizard_of_oz (chapter 5) - 9905 bytes 26 | A moderately sized input that uses many different letters and symbols. 
27 | 28 | | Algorithm | Freq Table | Encoding | Total Size (bytes) | 29 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ | 30 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py) | 101** | 5423 | 5524 | 31 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A | 5611 | 5611 | 32 | | [rANS](https://github.com/rygorous/ryg_rans) | 101** | 5428 | 5529 | 33 | | Valli | 101 | 5395 | **5496** | 34 | 35 | 36 | \*\*Their code does not compress the frequency table, so I've used my implementation's smaller bit packed table size instead. 37 | 38 | #### Data: pangram - 43 bytes 39 | Worst case scenario for frequency table size vs message size. 40 | 41 | | Algorithm | Freq Table | Encoding | Total Size (bytes) | 42 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ | 43 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py) | 34** | 25 | 59 | 44 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A | 42 | **42** | 45 | | Valli | 34 | **19** | 53 | 46 | 47 | #### Data: tongue_twister - 35 bytes 48 | Many repeated syllables leads to a smaller frequency table. 
49 | 50 | | Algorithm | Freq Table | Encoding | Total Size (bytes) | 51 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ | 52 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py) | 16** | 15 | 48 | 53 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A | 32 | 32 | 54 | | Valli | 16 | 12 | **28** | 55 | 56 | #### Data: sparse - 110 bytes 57 | Data is mostly a single symbol with a few others. 58 | 59 | | Algorithm | Freq Table | Encoding | Total Size (bytes) | 60 | | ----------------------------------------------------------------------------------------------------------------------- | ---------- | -------- | ------------------ | 61 | | [Static AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/arithmetic-compress.py) | 6** | 4 | 10 | 62 | | [Adaptive AC](https://github.com/nayuki/Reference-arithmetic-coding/blob/master/python/adaptive-arithmetic-compress.py) | N/A | 44 | 44 | 63 | | Valli | 6 | 3 | **9** | 64 | 65 | ## The Algorithm 66 | Follow this link to see [how the code works](the-algorithm.md) in a step by step example walkthrough. 67 | 68 | If you have more questions, check the [FAQ](FAQ.md). 69 | 70 | ## The Frequency Table 71 | I've combined the frequency table and encoding together into a single file; you will see the size of each at the end of the compressor's console output. The frequency table uses only very basic bit packing for the counts; the characters are stored as raw bytes. Some additional compression could make it even smaller, but that seems excessive. 72 | 73 | ## Running the Code 74 | #### Required Dependencies: 75 | * [GMP library](https://gmplib.org/) - Used for large integer math. 
Unfortunately there isn't one simple command for this; use your favorite search engine or LLM with your OS version specified. 76 | * A 64 bit CPU that supports LZCNT. Most modern Intel/AMD 64 bit CPUs do; Apple's M series, it seems, does not. This is for fast bit packing of the frequency table. If anyone wants to contribute an LZCNT equivalent for ARM please do reach out. 77 | #### Recommended (optional): 78 | * Clang - GCC should work too, I just haven't tested it. 79 | * C++17 - known to be working; other C++ standards should work but are not tested. Some shortcuts like "auto" are used which require at least C++11 but could be rewritten. 80 | 81 | #### Linux Commands 82 | ``` 83 | git clone git@github.com:Peter-Ebert/Valli-Encoding.git 84 | clang++ -std=c++17 -O2 poc-compress.cpp -lgmp -o poc-compress 85 | clang++ -std=c++17 -O2 poc-decompress.cpp -lgmp -o poc-decompress 86 | ``` 87 | 88 | To compress the test data: 89 | ``` 90 | ./poc-compress testfiles/input1 91 | ``` 92 | 93 | The output to console is intentionally verbose to help with understanding the calculations performed for each symbol; it slows the program somewhat and can be commented out if desired. 94 | 95 | Two files are created in the same directory as the compressed file: 96 | * input1.vli - the compressed output 97 | * input1.freq - contains frequency counts for each symbol/character, sorted ascending and uncompressed. The first 7 bytes are the count and the next 1 byte indicates the symbol, repeating. File size is the same for any input (64 bits * 256 = 2048 bytes). The data could be compressed easily (even zero counts are included), but this simple format shows that nothing is being hidden inside. 98 | 99 | To decompress: 100 | ``` 101 | ./poc-decompress testfiles/input1.vli 102 | ``` 103 | 104 | This will output "\[filename\].decom", so that the input and output can be compared. 
The decompressor will assume the associated frequency file is in the same folder with the same name, with ".vli" replaced by ".freq" (created previously by the compressor). As with the compressor, the console output will show much of the math involved to decode the compressed file. 105 | 106 | To verify the input matches the output: 107 | ``` 108 | diff -s testfiles/input1 testfiles/input1.decom 109 | ``` 110 | Expected output: 111 | ``` 112 | Files testfiles/input1 and testfiles/input1.decom are identical 113 | ``` 114 | 115 | ## Support 116 | Ideally knowledge would be free, but it takes time and effort to create and communicate, all the while one cannot live on ideas and dissertations alone. With the emergence of sites that enable direct support from viewers like Patreon and Twitch/Youtube, I hope the idea of exchanging content for financial support isn't asking too much. If you enjoyed this as much as a drink, a movie, or more and have the ability to help out I'd greatly appreciate it and it would help encourage future posts. 117 | 118 | [Support this project](https://www.paypal.com/donate/?business=S7Q76A99VU44W&no_recurring=0&currency_code=USD) 119 | The receipt will have an email address if you want to send questions/requests; if you've donated I'll do my best to answer them if they are not already covered in the README/[FAQ](FAQ.md). 120 | 121 | I have more ideas I'd like to explore, but I quit my job to work on this and have spent the funds I put aside. I'm grateful and lucky to have had the chance to set aside time to work on this and never intended to make money off it, so any donations would encourage further research or improvements. 122 | 123 | ## Some Parting Thoughts 124 | This approach for encoding is admittedly not very practical given its factorial / polynomial nature and the fact that encoding requires changing every bit along the entire length of the compressed value (bad once outside of CPU caches). 
Memory and storage are ample these days so the few bytes it saves may not be worth it. 125 | 126 | However, I would still assert that mathematically it is interesting. The encoding is optimal in that the encoded data will not exceed the size of the number of permutations. I am not an expert but afaik this puts it in the same category as only two other encoders: Arithmetic coding and Asymmetric numeral systems (ANS). Though we only save a few bytes/bits compared to a fast encoder with a static frequency table, it's worth noting mathematically that a one bit reduction means we've reduced the number of encodable values in half. The dataset that originally interested me in compression was bit sets generated from hash values (e.g. probabilistic counts, bloom filters), so it may have some applications there since those datasets don't normally compress very well. 127 | 128 | #### Addition is All you Need 129 | Though you could boil down many different operations like multiplication into addition with loops, the binomials that make up the encoding are deeply related to repeated addition as they make up pascal's triangle. Likewise, those binomials are then added together to produce an encoding for a single symbol. This seems somewhat unique, as in ANS's math the state itself is divided and multiplied to create the next state. Therefore, it is possible to implement this entire encoding using addition almost exclusively. 130 | 131 | #### Parallelizable 132 | Another noteworthy feature is this algorithm can be parallelized per unique symbol, as far as I know no other compression algorithm is parallelizable without sacrificing the compression ratio (not counting large lookup tables which can compress multiple symbols at once, which are still sequential in nature). 
We could also consider not combining individual symbol encodings (binomial sums) together, instead storing each unique symbol separately, as combining only saves <1 bit per unique symbol and combining uses expensive large number multiplication. 133 | 134 | #### Next 135 | I have more ideas, especially around optimization, but have run out of time and resources and figured it was best to share the idea first to see if it was new. If there's more interest or support I might get to those ideas. For now I hope you enjoyed this as much as I did making it. 136 | 137 | -------------------------------------------------------------------------------- /frequency-table.hpp: -------------------------------------------------------------------------------- 1 | // Frequency Table Implementation w/ basic compression: 2 | // diff encoding, variable length integers, bit packing 3 | 4 | // Stores the symbols and frequencies used for compression/decompression 5 | // Serialization performs some basic bit packing, leveraging the sorted counts 6 | // followed by corresponding byte symbols, no compression 7 | 8 | 9 | #include <fstream> // ifstream,ofstream 10 | #include <immintrin.h> // lzcnt 11 | 12 | // Simple structure to contain the dictionary information. 
13 | // Array of 64 bit numbers, 14 | // The first 7 most significant bytes store the count 15 | // and the last byte stores the character being counted 16 | // This allows for sorting directly on the full 64 bits 17 | // Size is always 256*8 = 2048 bytes 18 | struct FreqChar { 19 | uint64_t data[256] = {0}; 20 | void setChar(unsigned char i, unsigned char symbol) { 21 | data[i] = (data[i] & 0xFFFFFFFFFFFFFF00) | symbol; 22 | } 23 | unsigned char getChar(unsigned char i) { 24 | return data[i]; 25 | } 26 | void setCount(unsigned char i, size_t count) { 27 | data[i] += count << 8; 28 | } 29 | uint64_t getCount(unsigned char i) { 30 | return data[i] >> 8; 31 | } 32 | void incrCount(unsigned char i) { 33 | data[i] += 1ull << 8; 34 | //warning: no overflow detection 35 | } 36 | //sort the dictionary by frequency, ascending 37 | void sortData() { 38 | std::sort(data, data+(sizeof(data)/sizeof(data[0]))); 39 | } 40 | 41 | // A very basic freq table serialization 42 | // bit packing of sorted counts, when count==0, we also know the number of byte symobls to read 43 | // returns count of bytes written to file 44 | uint64_t serialize(std::ofstream& out_file) { 45 | uint64_t output_byte_count = 0; 46 | // write the bit length of the largest count, max 6 bits 47 | uint64_t count = (data[255] >> 8); 48 | int8_t bit_length = (64 - _lzcnt_u64(count)); 49 | // !!! 
use last bit length, not current 50 | int8_t last_bit_length = (64 - _lzcnt_u64(count)); 51 | uint8_t byte_buffer = last_bit_length; 52 | uint8_t bit_offset = 6; 53 | int non_zero = 0; 54 | //bit pack the counts 55 | for(int i=255; i>=0; i--) { 56 | count = this->getCount(i); 57 | bit_length = (64 - _lzcnt_u64(count)); 58 | int8_t bits_output = 0; 59 | while(last_bit_length>bits_output) { 60 | // if byte would fill, write and go next byte 61 | if((last_bit_length-bits_output) >= (8 - bit_offset)) { 62 | byte_buffer |= count << bit_offset; 63 | count = count >> (8 - bit_offset); 64 | out_file << byte_buffer; 65 | output_byte_count++; 66 | bits_output += (8 - bit_offset); 67 | byte_buffer = 0; 68 | bit_offset = 0; 69 | } else { 70 | // no overflow, insert and move offset 71 | byte_buffer |= count << bit_offset; 72 | bit_offset += last_bit_length-bits_output; //can't be >= 8 based on if 73 | bits_output += last_bit_length-bits_output; //can't be >= 8 based on if 74 | } 75 | } 76 | if(this->getCount(i)==0) { 77 | //exit loop 78 | //todo: test with freqs that end here 79 | break; 80 | } 81 | non_zero++; 82 | last_bit_length = bit_length; 83 | // use the lenght of the current count to set the next one 84 | } 85 | // if unwritten bits in buffer, flush 86 | // this will waste at most 7 bits, could be used by encoding, skipping optimization for now 87 | if(bit_offset != 0) { 88 | out_file << byte_buffer; 89 | output_byte_count++; 90 | } 91 | 92 | //output non zero count symbols (bytes) in the same order as the sort (desc) 93 | //todo: assert non_zero >= 1 94 | for(int i=255; i>=(255-non_zero); i--) { 95 | if(this->getCount(i)) { 96 | out_file << this->getChar(i); 97 | output_byte_count++; 98 | } else { 99 | break; 100 | } 101 | } 102 | return output_byte_count; 103 | } 104 | 105 | // basic deserializer 106 | // returns bytes read 107 | uint64_t deserialize(std::ifstream& input_file) { 108 | //for legacy reasons, keep all zero counts in place 109 | //todo: resize array for 
non-zero counts 110 | //todo: assert file len > 1 111 | uint64_t read_byte_count = 1; 112 | // read first 6 bits for count bit length 113 | uint8_t byte_buffer; 114 | input_file.read((char*)&byte_buffer, 1); 115 | uint8_t bit_length = byte_buffer & 0b00111111; 116 | 117 | uint8_t bit_offset = 6; 118 | int symbol_count = 0; 119 | // count loop 120 | do { 121 | int64_t count = 0; 122 | int8_t bits_read = 0; 123 | // read bytes loop 124 | while(bit_length>bits_read) { 125 | if((bit_length-bits_read) >= (8 - bit_offset)) { 126 | // overflow, load then read next byte 127 | //load and read next byte, loop until done 128 | count |= ((uint64_t)(byte_buffer >> bit_offset)) << bits_read; 129 | bits_read += (8 - bit_offset); 130 | input_file.read((char*)&byte_buffer, 1); 131 | read_byte_count++; 132 | bit_offset = 0; 133 | //update bits read 134 | } else { 135 | // fits in current byte 136 | // load count and update offset 137 | count |= ((uint64_t)(byte_buffer >> bit_offset)) << bits_read; 138 | //zero out bits beyond the length 139 | count &= ((1ull << (bit_length))-1); 140 | bit_offset += bit_length-bits_read; 141 | bits_read = bit_length; 142 | } 143 | } 144 | //set the count, ascending order 145 | this->setCount(255-symbol_count, count); 146 | // exit loop if the count is 0 (last in sequence) 147 | if(count==0) { 148 | break; 149 | } 150 | symbol_count++; 151 | // update bit_length with current length 152 | bit_length = (64 - _lzcnt_u64(count)); 153 | 154 | } while(symbol_count < 255); //for safety, can be removed, unreachable 155 | 156 | // if bit_offset == 0, it contains a symbol 157 | // else unset bits, read next byte 158 | if(bit_offset != 0) { 159 | input_file.read((char*)&byte_buffer, 1); 160 | read_byte_count++; 161 | } 162 | // assert: symbol_count > 1 163 | this->setChar(255, byte_buffer); 164 | std::vector<uint8_t> found_symbols(symbol_count); 165 | found_symbols.at(0) = byte_buffer; 166 | // set characters 167 | for(int i=254; i>(255-symbol_count); i--) { 168 | 
input_file.read((char*)&byte_buffer, 1); 169 | read_byte_count++; 170 | this->setChar(i, byte_buffer); 171 | found_symbols.at(255-i) = byte_buffer; 172 | } 173 | 174 | std::sort(found_symbols.begin(), found_symbols.end()); 175 | // print vector 176 | int found_idx = 0; 177 | for(int i=0; i<(256); i++) { 178 | if(found_idx < symbol_count && i==found_symbols.at(found_idx)) { 179 | //skip 180 | found_idx++; 181 | } else { 182 | // update char 183 | this->setChar(i-found_idx, i); 184 | } 185 | 186 | } 187 | // legacy code: back fill other symbols 188 | // vector symbol_count 189 | 190 | 191 | return read_byte_count; 192 | //unique_symbols += 1; // 0 is never used 193 | } 194 | 195 | }; 196 | 197 | 198 | -------------------------------------------------------------------------------- /poc-compress.cpp: -------------------------------------------------------------------------------- 1 | // Valli Entropy Encoder 2 | // Quick proof of concept 3 | // To build: 4 | // -requires: GMP lib https://gmplib.org/ 5 | // clang++ -std=c++17 -O2 poc-compress.cpp -lgmp -o poc-compress 6 | 7 | #include <iostream> // cout 8 | #include <fstream> // ifstream,ofstream 9 | #include <bitset> // bitset 10 | #include <immintrin.h> // intrinsics 11 | #include <chrono> // timer 12 | #include <cmath> /* log2 */ 13 | #include <gmp.h> // bigint mpz_t 14 | #include <algorithm> // sort 15 | 16 | #include "utility-functions.hpp" 17 | 18 | 19 | using namespace std; 20 | 21 | int main(int argc, char* argv[]) { 22 | 23 | // parse file name 24 | if (argc != 2) { 25 | cout << "Specify a single file, example: " << argv[0] << " " << std::endl; 26 | return 1; 27 | } 28 | 29 | // variable to set file output 30 | bool write_file = true; 31 | 32 | string source_path_file = argv[1]; 33 | string filename = source_path_file.substr(source_path_file.find_last_of("/\\") + 1); 34 | string filename_entropy = source_path_file + ".vli"; 35 | string filename_freq_table = source_path_file + ".freq"; 36 | 37 | // read file into memory 38 | std::vector<char> buffer; 39 | if(!FileToCharVector(source_path_file, 
buffer)) { 40 | cout << "File read error, most likely it does not exist." << endl; 41 | return 1; 42 | } 43 | cout << "File size: " << buffer.size() << " bytes" << endl; 44 | // Warn & exit in case someone accidentally submits a large file 45 | // File sizes much larger than your CPU cache can be quite slow 46 | // This implementation is designed as a POC and not highly optimized 47 | if(buffer.size() > 64000) { 48 | cout << "The filesize is greater than 64kb, this compressor implementation is not highly optimized, so compression times may be long once outside of L1 cache, you can comment out this if statement if you wish to proceed." << endl; 49 | return 0; 50 | } 51 | 52 | // count the frequencies of each symbol (char/byte) 53 | FreqChar freqs; 54 | CalcFrequencyPairs(buffer, freqs); 55 | 56 | //sort the frequency table, ascending by count then symbol 57 | sort(freqs.data, freqs.data+sizeof(freqs.data)/sizeof(freqs.data[0])); 58 | 59 | printf("Sorted Frequencies:\n"); 60 | cout << "idx : chr : int : count" << endl; 61 | // print non-zero frequencies and count unique symbols 62 | uint64_t unique_symbols = 0; 63 | for (int i = 0; i < 256; i++) { 64 | // cout << freqs.getCount(i) << endl; 65 | if(freqs.getCount(i)) { 66 | unique_symbols++; 67 | cout << i << " : '" << freqs.getChar(i) << "' : " << (uint)freqs.getChar(i) << " : " << freqs.getCount(i) << endl; 68 | } 69 | } 70 | uint64_t total_symbols = buffer.size(); 71 | 72 | cout << "==============================" << endl; 73 | cout << "Total symbols: " << total_symbols << endl; 74 | cout << "Unique symbols: " << unique_symbols << endl; 75 | cout << "------------------------------" << endl; 76 | 77 | if(unique_symbols < 2) { 78 | // todo: handle this case 79 | cout << "Less than 2 unique symbols, nothing to encode, aborting." 
<< endl; 80 | return 1; 81 | } 82 | 83 | uint64_t remaining_loc = total_symbols; 84 | // select the least common character 85 | char null_symbol = (char)freqs.getChar(0); 86 | // This simple demonstration implementation select one character as a 'null' symbol to take the place of 87 | // symbols which have already been encoded. 88 | // As a result, all 256 byte values cannot be used in the input, 89 | // since this is only likely with random or already compressed data, shouldn't be an issue for a POC 90 | if(freqs.getCount(0) != 0) { 91 | cout << "Unhandled: This implementation requires at least one symbol in the input to be unused ('null' symbol). This input has all byte values used (0-255)." << endl; 92 | return -1; 93 | } 94 | // todo: to process inputs with all 256 characters are used: 95 | // if unique_symbols==256: pick lowest frequency and encode solo, then set that as the null symbol 96 | // alternatively (advanced), can rotate the 256 bytes w/ addition so that the max value occurs at the max byte value, 97 | // then use less than to test for placement or not, can be done in parallel threads this way as the array can be static 98 | 99 | // Making an assmumption about the gmp library (not verified) 100 | // Heavy reuse of mpz_t variables that are similar in size, 101 | // with the assumption that there will be fewer allocatitons needed (and less inits) 102 | mpz_t num_product_seq, denom_fact, combo_result, symbol_accumulator; 103 | mpz_inits(num_product_seq, denom_fact, combo_result, symbol_accumulator, NULL); 104 | 105 | uint64_t symbol_count; 106 | uint64_t symbol_idx; 107 | 108 | mpz_t multiply_combiner, data_accumulator; 109 | mpz_inits(multiply_combiner, data_accumulator, NULL); 110 | mpz_set_ui(multiply_combiner, 1); 111 | mpz_set_ui(data_accumulator, 0); 112 | 113 | // Loop through each possible symbol 114 | // 256-1 because the last symbol (asc sort) does not need to be encoded/decoded 115 | for (int i = 0; i < 256-1; i++) { 116 | // if character 
exists in message 117 | if(freqs.getCount(i)) { 118 | // reset for new symbol 119 | mpz_set_ui(symbol_accumulator, 0); 120 | mpz_set_ui(denom_fact, 1); 121 | // reset symbol count 122 | symbol_count = 1; 123 | 124 | //calculate location for first item 125 | cout << "--- " << freqs.getChar(i) << ":" << freqs.getCount(i) << " (" << (uint)freqs.getChar(i) << ")" << " ---" << endl; 126 | 127 | //find symbol location 128 | size_t removed_loc = 0; 129 | 130 | // encode current symbol by looping through buffer to find each location 131 | // can exit loop when last instance is found k = symbol_count 132 | for(size_t byte_loc = 0; byte_loc < buffer.size(); byte_loc++) { 133 | if(buffer[byte_loc]==(char)freqs.getChar(i)) { 134 | //found instance of symbol 135 | // verbose: combination calculation for location choose symbol_count 136 | cout << " + " << byte_loc-removed_loc << " choose " << symbol_count << endl; 137 | 138 | encode_symbol_location_reuse(byte_loc-removed_loc, symbol_count, symbol_accumulator, num_product_seq, denom_fact, combo_result); 139 | buffer[byte_loc] = null_symbol; 140 | if(symbol_count==freqs.getCount(i)) { 141 | // all symbols have been found 142 | // exit loop 143 | break; 144 | } 145 | // increment k and multiply into denom_fact for next loop 146 | symbol_count += 1; 147 | mpz_mul_ui(denom_fact, denom_fact, symbol_count); 148 | 149 | } else if(buffer[byte_loc]==null_symbol) { 150 | // count the number of symbols that have been removed between the last byte location and the next one 151 | removed_loc++; 152 | } 153 | } 154 | 155 | // verbose output: sum of symbols and combiner multiple 156 | gmp_printf("Sum of Binomials: %Zd \n", symbol_accumulator); 157 | gmp_printf("Multiply combiner: %Zd \n", multiply_combiner); 158 | mpz_mul(combo_result, multiply_combiner, symbol_accumulator); 159 | mpz_add(data_accumulator, data_accumulator, combo_result); 160 | 161 | // calculation not needed for last symbol since it's not encoded 162 | // however it is needed 
for the 'max bit length' calculation at the end 163 | // otherwise can wrap with if(i != 254) {} 164 | next_multiply_combiner(multiply_combiner, remaining_loc, symbol_count, combo_result, num_product_seq, denom_fact); 165 | 166 | //track how many possible locations remain without the current symbol 167 | remaining_loc -= freqs.getCount(i); 168 | } 169 | } 170 | 171 | // Verbose: output the final integer and statistics around the output 172 | cout << "----------Final Data----------" << endl; 173 | gmp_printf("%Zd \n", data_accumulator); 174 | cout << "------------------------------" << endl; 175 | 176 | size_t bit_length = mpz_sizeinbase(data_accumulator, 2); 177 | cout << "Current byte length: " << ceil(bit_length/8.0) << endl; 178 | cout << "Current bit length: " << bit_length << endl; 179 | // Use combiner to calc max bit len (total # of permutations of symbol frequencies) 180 | size_t max_bit_length = mpz_sizeinbase(multiply_combiner, 2); 181 | cout << "Max bit length: " << max_bit_length << endl; 182 | 183 | // Calculate the Shannon minimum bit length, static frequency table 184 | // = shannon entropy per symbol * message length 185 | double shannon_entropy = 0.0; 186 | for (int i = 0; i < 256; i++) { 187 | // cout << freqs.getCount(i) << endl; 188 | if(freqs.getCount(i)) { 189 | double probability = static_cast<double>(freqs.getCount(i)) / total_symbols; 190 | shannon_entropy -= probability * log2(probability); 191 | } 192 | } 193 | shannon_entropy = shannon_entropy * total_symbols; 194 | 195 | cout << "Static frequency Shannon limit: " << ceil(shannon_entropy) << endl; 196 | cout << "Bits saved: " << ceil(shannon_entropy)-max_bit_length << endl; 197 | cout << "Relative Size: " << 100 * (max_bit_length / ceil(shannon_entropy)) << "%" << endl; 198 | 199 | // Write compressed data and frequencies table 200 | if(write_file) { 201 | cout << "Writing compressed data to: " << filename_entropy << endl; 202 | ofstream out_file(filename_entropy); 203 | //write frequency table 
204 | cout << "Frequency table size (bytes): " << freqs.serialize(out_file) << endl; 205 | // write encoded data 206 | size_t out_size = mpz_sizeinbase(data_accumulator, 256); 207 | cout << "Encoded data (bytes): " << out_size << endl; 208 | unsigned char *output_array = new unsigned char[out_size]; 209 | // output_array, word_count, order, size, endian, nails, data 210 | mpz_export(output_array, NULL, 1, 1, -1, 0, data_accumulator); 211 | if(mpz_cmp_ui(data_accumulator, 0) == 0) { 212 | // if output == 0, gmp will not write to the array 213 | output_array[0] = 0; 214 | } 215 | out_file.write((char *)output_array, out_size); 216 | delete[] output_array; 217 | out_file.close(); 218 | } else { 219 | cout << "Skipping data write." << endl; 220 | } 221 | 222 | return 0; 223 | } -------------------------------------------------------------------------------- /poc-decompress.cpp: -------------------------------------------------------------------------------- 1 | // Valli Decompression - Proof of concept 2 | // clang++ -std=c++17 -O2 poc-decompress.cpp -lgmp -o poc-decompress 3 | 4 | // This implementation is more complicated than the naive approach in the documentation. 5 | // It uses an an approximate calculation to estimate the binomial, then adjusts it from there. 6 | // Also some shortcuts for trivial values and other minor optimizations. 
7 | #include <iostream> // cout 8 | #include <fstream> // ifstream,ofstream 9 | #include <bitset> // bitset 10 | #include // intrinsics 11 | #include // timer 12 | #include <cmath> // log2 13 | #include <gmp.h> // bigint mpz_t 14 | #include <algorithm> // sort 15 | 16 | #include "utility-functions.hpp" 17 | 18 | 19 | using namespace std; 20 | 21 | int main(int argc, char* argv[]) { 22 | 23 | // verify args 24 | if (argc != 2) { 25 | cout << "Specify a compressed file ending in .vli, example: " << argv[0] << " " << endl; 26 | return 1; 27 | } 28 | 29 | // variable to set file output 30 | bool write_file = true; 31 | 32 | // validate and set filenames 33 | string compressed_path_file = argv[1]; 34 | string file_ending = ".vli"; 35 | // Ensure file ending is .vli, no other validation performed. 36 | // Will error out (vector out of bounds) if the encoded data value is too large for the frequency counts. 37 | // Since this *should* only be caused by user error/manipulation, leaving unhandled for now. 38 | // *could also be caused by bugs, please report any issues 39 | // Otherwise, every value smaller than the max permutations will decompress to some permutation of symbols. 
40 | if (!(compressed_path_file.length() >= file_ending.length() && compressed_path_file.compare(compressed_path_file.length() - file_ending.length(), file_ending.length(), file_ending) == 0)) { 41 | cout << "Invalid filename, must end with '.vli'" << endl; 42 | return 1; 43 | } 44 | 45 | // assume frequencies file name & location based on source file name 46 | string basename = compressed_path_file.substr(0, compressed_path_file.length() - file_ending.length()); 47 | string filename_freq_table = basename + ".freq"; 48 | // create a new output file so that they can be compared to the source 49 | string filename_out = basename + ".decom"; 50 | 51 | cout << "Compressed file: " << compressed_path_file << endl; 52 | cout << "Frequencies file: " << filename_freq_table << endl; 53 | cout << "Output file: " << filename_out << endl; 54 | 55 | // read file 56 | std::ifstream input_file(compressed_path_file, std::ios::binary); 57 | if (!input_file) { 58 | // Handle file open error 59 | std::cerr << "Error opening file, most likely file does not exist." << std::endl; 60 | return 1; 61 | } 62 | // deserialize frequency table at start of file 63 | FreqChar freqs; 64 | size_t freq_byte_count = freqs.deserialize(input_file); 65 | if (!input_file) { 66 | std::cerr << "Error reading frequency table." 
<< std::endl; 67 | return 1; 68 | } 69 | cout << "Frequency table size (bytes):" << freq_byte_count << endl; 70 | 71 | cout << "Frequencies:" << endl; 72 | cout << "int : char : count" << endl; 73 | uint64_t total_symbols = 0; 74 | uint64_t unique_symbols = 0; 75 | for (int i = 0; i < 256; i++) { 76 | if(freqs.getCount(i)) { 77 | cout << (uint)freqs.getChar(i) << " : " << freqs.getChar(i) << " : " << freqs.getCount(i) << endl; 78 | total_symbols += freqs.getCount(i); 79 | unique_symbols++; 80 | } 81 | } 82 | 83 | // load encoding into input_buffer 84 | std::vector input_buffer((std::istreambuf_iterator(input_file)), std::istreambuf_iterator()); 85 | // Check successful file read 86 | if (!input_file && !input_file.eof()) { 87 | std::cerr << "Error reading encoding." << std::endl; 88 | return 1; 89 | } 90 | input_file.close(); 91 | cout << "Encoding size (bytes): " << input_buffer.size() << endl; 92 | 93 | mpz_t compressed_data, symbol_combo, extracted_combo, root_result, binomial, numerator, denominator, factorial, uncombiner, est_binomial; 94 | mpz_inits(compressed_data, symbol_combo, extracted_combo, root_result, binomial, numerator, denominator, factorial, uncombiner, est_binomial, NULL); 95 | 96 | // export and import must match 97 | // mpz_export(output_array, NULL, 1, 1, -1, 0, data_accumulator); 98 | mpz_import(compressed_data, input_buffer.size(), 1, 1, -1, 0, input_buffer.data()); 99 | 100 | // verbose info 101 | gmp_printf("Imported Integer: %Zd\n", compressed_data); 102 | if(unique_symbols < 1) { 103 | // todo: handle this case, not implemented 104 | cout << "Not enough unique symbols, 2 required in current implementation, aborting." << endl; 105 | return 1; 106 | } 107 | 108 | // This implementation fills the output message buffer with the 109 | // last symbol and uses it as an 'empty' location indicator. 110 | // After all other symbols are placed correctly the 111 | // last symbol is already in the correct locations. 
112 | char last_symbol = (char)freqs.getChar(255); 113 | // allocate buffer for decoded output, fill with most common character 114 | std::vector output_buffer(total_symbols, last_symbol); 115 | uint64_t remaining_locations = total_symbols; 116 | 117 | mpz_set_ui(uncombiner, 1); 118 | size_t symbol_start=256; //starting out of bounds 119 | // start at 254 because last symbol isn't encoded 120 | 121 | // symbol index 122 | size_t symbol_idx = 0; 123 | // advance index to start of populated symbols 124 | while(freqs.getCount(symbol_idx) == 0) { 125 | symbol_idx++; 126 | } 127 | 128 | // Loop through each symbol, except the last 129 | while(symbol_idx < 255) { 130 | // verbose output 131 | cout << "------------------------------------" << endl; 132 | char current_symbol = (char)freqs.getChar(symbol_idx); 133 | cout << "Current symbol: " << current_symbol << " (" << (uint)current_symbol << ")" << endl; 134 | cout << "Locations remaining: " << remaining_locations << endl; 135 | // To extract symbol_combo from compressed_data 136 | // we mod the compressed_data by the number of combinations for that symbol 137 | // then keep the quotient as the new compressed_data and repeat for the next symbol 138 | 139 | // 2nd to last value, calculation & extraction not needed 140 | if(symbol_idx != 254) { 141 | // calculate permutations of symbol="uncombiner" to extract symbol combination 142 | choose_reuse(remaining_locations, freqs.getCount(symbol_idx), uncombiner, numerator, denominator); 143 | // compressed data = quotient 144 | // extracted combo = remainder 145 | mpz_tdiv_qr(compressed_data, extracted_combo, compressed_data, uncombiner); 146 | } else { 147 | // Last encoded value 148 | // no more extraction needed, what remains is the 254th symbol's sum of binomials 149 | // because uncombiner == 1 so extracted_combo = compressed_data 150 | mpz_set(extracted_combo, compressed_data); 151 | } 152 | 153 | // verbose output 154 | gmp_printf("Uncombiner: %Zd \n", uncombiner); 155 | gmp_printf("Extracted 
Binomial Sum: %Zd \n", extracted_combo); 156 | 157 | // setup loop to deconstruct the extracted binomial sum 158 | // largest index location extracted first 159 | size_t symbol_count = freqs.getCount(symbol_idx); 160 | size_t insert_offset = total_symbols - remaining_locations; 161 | // zero based index for locations 162 | size_t loc_idx; 163 | size_t last_loc_idx = total_symbols - 1; 164 | // size_t removed_symbols = total_symbols - remaining_locations; 165 | // calculate factorial 166 | mpz_fac_ui(factorial, symbol_count); 167 | 168 | // Symbol extraction inner loop 169 | // continue while extracted_combo > symbol_count 170 | // this avoids estimates that are below zero 171 | while(mpz_cmp_ui(extracted_combo, symbol_count) > 0) { 172 | // use sum of binomials property, combined with Newton's method 173 | // combo ~= (n-k//2)^k / k! 174 | // (combo*k!)^(1/k) + k//2 ~= n 175 | 176 | // multiply in k! 177 | mpz_mul(symbol_combo, extracted_combo, factorial); 178 | 179 | // find the root 180 | mpz_root(root_result, symbol_combo, symbol_count); 181 | // add symbol_count//2 to go from near the middle to near the top=n 182 | size_t loc_idx = mpz_get_ui(root_result) + symbol_count/2; 183 | 184 | // calculate estimated binomial value 185 | choose_reuse(loc_idx, symbol_count, est_binomial, numerator, denominator); 186 | // verbose output 187 | cout << "Estimated: " << loc_idx << " choose " << symbol_count << endl; 188 | 189 | // Validate estimation (1 & 2) // 190 | // (1) ensure the estimate does not exceed the target value; some even values of k can overestimate 191 | if(mpz_cmp(est_binomial, extracted_combo) > 0) { 192 | // if overestimated, shift N down by 1 193 | // N-1 choose K: multiply by N-K; divide by N 194 | mpz_mul_ui(est_binomial, est_binomial, loc_idx-symbol_count); 195 | mpz_divexact_ui(est_binomial, est_binomial, loc_idx); 196 | loc_idx -= 1; 197 | cout << "Adjusted down" << endl; 198 | } 199 | // (2) ensure estimate is not too low by looking at the delta 200 | // 
todo: opt: can skip if corrected down 201 | // subtract estimate from binomial 202 | mpz_sub(extracted_combo, extracted_combo, est_binomial); 203 | // calculate diff between N choose K and N+1 choose K => N choose K-1 => est_binomial *K then /(N-K+1) 204 | mpz_mul_ui(est_binomial, est_binomial, symbol_count); 205 | // todo: should be removable, checked for safety 206 | // avoid div by zero 207 | if(loc_idx-symbol_count+1 != 0) { 208 | // est_binomial / (N-K+1) 209 | mpz_divexact_ui(est_binomial, est_binomial, loc_idx-symbol_count+1); 210 | } 211 | 212 | // tracking for information purposes only, not needed for calculation 213 | size_t adjust_up_count = 0; 214 | cout << "---Adjustment loop---" << endl; 215 | 216 | // Estimate adjustment loop 217 | // while(delta to next binomial < remaining sum of binomials) 218 | while(mpz_cmp(est_binomial, extracted_combo) <= 0 && mpz_cmp_ui(extracted_combo, symbol_count) > 0) { 219 | // estimate too small, adjust estimated N up by 1 220 | // subtract diff (N Choose K-1) 221 | mpz_sub(extracted_combo, extracted_combo, est_binomial); 222 | loc_idx += 1; 223 | // calculate next diff and confirm 224 | // to continue calculation up, estimate N+1 choose K = current estimate *N+1; /N-K to current estimate 225 | // if estimate is zero, cannot shift up 226 | if(loc_idx<=symbol_count) { 227 | //then previous estimate is 0, set est_binomial = 1 228 | mpz_set_ui(est_binomial, 1); 229 | loc_idx = symbol_count; 230 | } else { 231 | mpz_mul_ui(est_binomial, est_binomial, loc_idx); 232 | } 233 | //avoid division by zero 234 | if(loc_idx-symbol_count != 0) { 235 | mpz_divexact_ui(est_binomial, est_binomial, loc_idx-(symbol_count-1)); 236 | // mpz_divexact_ui(est_binomial, est_binomial, loc_idx-symbol_count+1)); // cleaner? 
VERIFY 237 | } 238 | adjust_up_count += 1; 239 | } 240 | // verbose output 241 | cout << "Adjusted up: " << adjust_up_count << "x" << endl; 242 | 243 | // calculate offset based on previously placed symbols 244 | // optimization: start with the total count of placed symbols, subtract already placed symbols as we move backwards 245 | // for each non-last symbol found, reduce the offset 246 | // cout << "loop: " << last_loc_idx << " to " << loc_idx+insert_offset << endl; 247 | for(size_t i=last_loc_idx; i>=loc_idx+insert_offset && insert_offset!=0; i--) { 248 | if(output_buffer[i] != last_symbol) { 249 | insert_offset--; 250 | } 251 | } 252 | // verbose output 253 | cout << "Location offset: " << insert_offset << endl; 254 | last_loc_idx = loc_idx+insert_offset-1; // -1 because current placement location already checked 255 | //update character in output buffer 256 | cout << "Symbol placed at: " << loc_idx+insert_offset << endl; 257 | output_buffer.at(loc_idx+insert_offset) = current_symbol; 258 | 259 | // Setup next loop 260 | // calculate next lower factorial=(N-1)!=(N!)/N 261 | mpz_divexact_ui(factorial, factorial, symbol_count); 262 | symbol_count -= 1; 263 | } 264 | 265 | // if extracted_combo <= symbol_count, decoding is trivial 266 | if(symbol_count != 0 && mpz_cmp_ui(extracted_combo, symbol_count) <= 0) { 267 | cout << "symbol_count <= remaining sum of binomials" << symbol_count << endl; 268 | // gmp_printf("extracted_combo: %Zd\n", extracted_combo); 269 | // cout << "symbol_count: " << symbol_count << endl; 270 | size_t null_idx = 0; 271 | do { 272 | // seek to the next empty ('null') location, i.e. one still holding last_symbol 273 | while(output_buffer[null_idx] != last_symbol) { 274 | null_idx += 1; 275 | } 276 | if(mpz_cmp_ui(extracted_combo, symbol_count) < 0) { 277 | // if symbol count gt extracted combo, place the remaining symbols in the first empty locations 278 | output_buffer.at(null_idx) = current_symbol; 279 | cout << "Symbol placed at: " << null_idx << endl; 280 | symbol_count -= 1; 
281 | } else if(mpz_cmp_ui(extracted_combo, symbol_count) == 0) { 282 | // if equal skip one location 283 | null_idx += 1; 284 | while(output_buffer[null_idx] != last_symbol) { 285 | null_idx += 1; 286 | } 287 | output_buffer.at(null_idx) = current_symbol; 288 | cout << "Symbol placed at: " << null_idx << endl; 289 | symbol_count -= 1; 290 | } else { 291 | // if lt, continue placing 292 | output_buffer.at(null_idx) = current_symbol; 293 | cout << "Symbol placed at: " << null_idx << endl; 294 | symbol_count -= 1; 295 | } 296 | } while(symbol_count > 0); 297 | } 298 | //used for next loop and location placement later 299 | remaining_locations -= freqs.getCount(symbol_idx); 300 | symbol_idx++; 301 | } 302 | 303 | // decompression done, final output: 304 | cout << "----------------------------------" << endl; 305 | // print decompressed data to console 306 | for (const auto& value : output_buffer) { 307 | cout << value; 308 | } 309 | cout << endl; 310 | 311 | cout << "Writing decompressed data to: " << filename_out << endl; 312 | // check setting for file output 313 | if (write_file) { 314 | //output to file 315 | ofstream output_file(filename_out); 316 | if (!output_file.is_open()) { 317 | std::cerr << "Failed creating output file." 
<< std::endl; 318 | return 1; 319 | } 320 | for (const auto& value : output_buffer) { 321 | output_file << value; 322 | } 323 | output_file.close(); 324 | } 325 | 326 | return 0; 327 | } 328 | -------------------------------------------------------------------------------- /testfiles/panagram: -------------------------------------------------------------------------------- 1 | The quick brown fox jumps over the lazy dog -------------------------------------------------------------------------------- /testfiles/sparse: -------------------------------------------------------------------------------- 1 | aaaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaaaaaaacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaa -------------------------------------------------------------------------------- /testfiles/tongue_twister: -------------------------------------------------------------------------------- 1 | She sells seashells by the seashore -------------------------------------------------------------------------------- /testfiles/walkthrough: -------------------------------------------------------------------------------- 1 | hidehohedehe -------------------------------------------------------------------------------- /testfiles/wizard_of_oz: -------------------------------------------------------------------------------- 1 | Title: The Wonderful Wizard of Oz 2 | Author: L. Frank Baum 3 | 4 | Chapter V 5 | The Rescue of the Tin Woodman 6 | When Dorothy awoke the sun was shining through the trees and Toto had long been out chasing birds around him and squirrels. She sat up and looked around her. There was the Scarecrow, still standing patiently in his corner, waiting for her. 7 | 8 | "We must go and search for water," she said to him. 9 | 10 | "Why do you want water?" he asked. 11 | 12 | "To wash my face clean after the dust of the road, and to drink, so the dry bread will not stick in my throat." 
13 | 14 | "It must be inconvenient to be made of flesh," said the Scarecrow thoughtfully, "for you must sleep, and eat and drink. However, you have brains, and it is worth a lot of bother to be able to think properly." 15 | 16 | They left the cottage and walked through the trees until they found a little spring of clear water, where Dorothy drank and bathed and ate her breakfast. She saw there was not much bread left in the basket, and the girl was thankful the Scarecrow did not have to eat anything, for there was scarcely enough for herself and Toto for the day. 17 | 18 | When she had finished her meal, and was about to go back to the road of yellow brick, she was startled to hear a deep groan near by. 19 | 20 | "What was that?" she asked timidly. 21 | 22 | "I cannot imagine," replied the Scarecrow; "but we can go and see." 23 | 24 | Just then another groan reached their ears, and the sound seemed to come from behind them. They turned and walked through the forest a few steps, when Dorothy discovered something shining in a ray of sunshine that fell between the trees. She ran to the place and then stopped short, with a little cry of surprise. 25 | 26 | One of the big trees had been partly chopped through, and standing beside it, with an uplifted axe in his hands, was a man made entirely of tin. His head and arms and legs were jointed upon his body, but he stood perfectly motionless, as if he could not stir at all. 27 | 28 | Dorothy looked at him in amazement, and so did the Scarecrow, while Toto barked sharply and made a snap at the tin legs, which hurt his teeth. 29 | 30 | "Did you groan?" asked Dorothy. 31 | 32 | "Yes," answered the tin man, "I did. I've been groaning for more than a year, and no one has ever heard me before or come to help me." 33 | 34 | "What can I do for you?" she inquired softly, for she was moved by the sad voice in which the man spoke. 35 | 36 | "Get an oil-can and oil my joints," he answered. 
"They are rusted so badly that I cannot move them at all; if I am well oiled I shall soon be all right again. You will find an oil-can on a shelf in my cottage." 37 | 38 | Dorothy at once ran back to the cottage and found the oil-can, and then she returned and asked anxiously, "Where are your joints?" 39 | 40 | "Oil my neck, first," replied the Tin Woodman. So she oiled it, and as it was quite badly rusted the Scarecrow took hold of the tin head and moved it gently from side to side until it worked freely, and then the man could turn it himself. 41 | 42 | "Now oil the joints in my arms," he said. And Dorothy oiled them and the Scarecrow bent them carefully until they were quite free from rust and as good as new. 43 | 44 | The Tin Woodman gave a sigh of satisfaction and lowered his axe, which he leaned against the tree. 45 | 46 | "This is a great comfort," he said. "I have been holding that axe in the air ever since I rusted, and I'm glad to be able to put it down at last. Now, if you will oil the joints of my legs, I shall be all right once more." 47 | 48 | So they oiled his legs until he could move them freely; and he thanked them again and again for his release, for he seemed a very polite creature, and very grateful. 49 | 50 | "I might have stood there always if you had not come along," he said; "so you have certainly saved my life. How did you happen to be here?" 51 | 52 | "We are on our way to the Emerald City to see the Great Oz," she answered, "and we stopped at your cottage to pass the night." 53 | 54 | "Why do you wish to see Oz?" he asked. 55 | 56 | "I want him to send me back to Kansas, and the Scarecrow wants him to put a few brains into his head," she replied. 57 | 58 | The Tin Woodman appeared to think deeply for a moment. Then he said: 59 | 60 | "Do you suppose Oz could give me a heart?" 61 | 62 | "Why, I guess so," Dorothy answered. "It would be as easy as to give the Scarecrow brains." 63 | 64 | "True," the Tin Woodman returned. 
"So, if you will allow me to join your party, I will also go to the Emerald City and ask Oz to help me." 65 | 66 | "Come along," said the Scarecrow heartily, and Dorothy added that she would be pleased to have his company. So the Tin Woodman shouldered his axe and they all passed through the forest until they came to the road that was paved with yellow brick. 67 | 68 | The Tin Woodman had asked Dorothy to put the oil-can in her basket. "For," he said, "if I should get caught in the rain, and rust again, I would need the oil-can badly." 69 | 70 | It was a bit of good luck to have their new comrade join the party, for soon after they had begun their journey again they came to a place where the trees and branches grew so thick over the road that the travelers could not pass. But the Tin Woodman set to work with his axe and chopped so well that soon he cleared a passage for the entire party. 71 | 72 | Dorothy was thinking so earnestly as they walked along that she did not notice when the Scarecrow stumbled into a hole and rolled over to the side of the road. Indeed he was obliged to call to her to help him up again. 73 | 74 | "Why didn't you walk around the hole?" asked the Tin Woodman. 75 | 76 | "I don't know enough," replied the Scarecrow cheerfully. "My head is stuffed with straw, you know, and that is why I am going to Oz to ask him for some brains." 77 | 78 | "Oh, I see," said the Tin Woodman. "But, after all, brains are not the best things in the world." 79 | 80 | "Have you any?" inquired the Scarecrow. 81 | 82 | "No, my head is quite empty," answered the Woodman. "But once I had brains, and a heart also; so, having tried them both, I should much rather have a heart." 83 | 84 | "And why is that?" asked the Scarecrow. 85 | 86 | "I will tell you my story, and then you will know." 
87 | 88 | So, while they were walking through the forest, the Tin Woodman told the following story: 89 | 90 | "I was born the son of a woodman who chopped down trees in the forest and sold the wood for a living. When I grew up, I too became a woodchopper, and after my father died I took care of my old mother as long as she lived. Then I made up my mind that instead of living alone I would marry, so that I might not become lonely. 91 | 92 | "There was one of the Munchkin girls who was so beautiful that I soon grew to love her with all my heart. She, on her part, promised to marry me as soon as I could earn enough money to build a better house for her; so I set to work harder than ever. But the girl lived with an old woman who did not want her to marry anyone, for she was so lazy she wished the girl to remain with her and do the cooking and the housework. So the old woman went to the Wicked Witch of the East, and promised her two sheep and a cow if she would prevent the marriage. Thereupon the Wicked Witch enchanted my axe, and when I was chopping away at my best one day, for I was anxious to get the new house and my wife as soon as possible, the axe slipped all at once and cut off my left leg. 93 | 94 | "This at first seemed a great misfortune, for I knew a one-legged man could not do very well as a wood-chopper. So I went to a tinsmith and had him make me a new leg out of tin. The leg worked very well, once I was used to it. But my action angered the Wicked Witch of the East, for she had promised the old woman I should not marry the pretty Munchkin girl. When I began chopping again, my axe slipped and cut off my right leg. Again I went to the tinsmith, and again he made me a leg out of tin. After this the enchanted axe cut off my arms, one after the other; but, nothing daunted, I had them replaced with tin ones. The Wicked Witch then made the axe slip and cut off my head, and at first I thought that was the end of me. 
But the tinsmith happened to come along, and he made me a new head out of tin. 95 | 96 | "I thought I had beaten the Wicked Witch then, and I worked harder than ever; but I little knew how cruel my enemy could be. She thought of a new way to kill my love for the beautiful Munchkin maiden, and made my axe slip again, so that it cut right through my body, splitting me into two halves. Once more the tinsmith came to my help and made me a body of tin, fastening my tin arms and legs and head to it, by means of joints, so that I could move around as well as ever. But, alas! I had now no heart, so that I lost all my love for the Munchkin girl, and did not care whether I married her or not. I suppose she is still living with the old woman, waiting for me to come after her. 97 | 98 | "My body shone so brightly in the sun that I felt very proud of it and it did not matter now if my axe slipped, for it could not cut me. There was only one danger--that my joints would rust; but I kept an oil-can in my cottage and took care to oil myself whenever I needed it. However, there came a day when I forgot to do this, and, being caught in a rainstorm, before I thought of the danger my joints had rusted, and I was left to stand in the woods until you came to help me. It was a terrible thing to undergo, but during the year I stood there I had time to think that the greatest loss I had known was the loss of my heart. While I was in love I was the happiest man on earth; but no one can love who has not a heart, and so I am resolved to ask Oz to give me one. If he does, I will go back to the Munchkin maiden and marry her." 99 | 100 | Both Dorothy and the Scarecrow had been greatly interested in the story of the Tin Woodman, and now they knew why he was so anxious to get a new heart. 101 | 102 | "All the same," said the Scarecrow, "I shall ask for brains instead of a heart; for a fool would not know what to do with a heart if he had one." 
103 | 104 | "I shall take the heart," returned the Tin Woodman; "for brains do not make one happy, and happiness is the best thing in the world." -------------------------------------------------------------------------------- /the-algorithm.md: -------------------------------------------------------------------------------- 1 | # The Algorithm 2 | Note: 3 | // indicates integer division (truncation) 4 | N *choose* K is the [binomial coefficient](https://en.wikipedia.org/wiki/Binomial_coefficient) (N,K) 5 | ## Encoding 6 | 7 | #### Input 8 | 1. A message composed of some number of symbols (characters/bytes in the input file). 9 | #### Steps 10 | 1. Create a frequency table by counting occurrences of each symbol in the message. 11 | 2. Set **encoding_accumulator** = 0; this value accumulates the encoding data for each symbol and contains the final encoding value. 12 | 3. Iterate through each symbol in the frequency table except for the last one[^1]. 13 | 1. Set **symbol_binomial_sum** = 0; this holds the sum of binomials for the current symbol being encoded. 14 | 2. For the first symbol only, set **combiner** = 1; it holds the number of permutations of all symbols encoded before the current symbol and carries over between symbol iterations (see examples for more info). 15 | 3. Iterate through the message until the current symbol is encountered. 16 | 1. Calculate the binomial N *choose* K where: 17 | N = the zero based index of the symbol location and 18 | K = the running count of occurrences of the current symbol found so far (1,2,3,...) 19 | 1. Continue iterating through the message, summing all binomials calculated in the previous step together into **symbol_binomial_sum**. 20 | 4. When the end of the message is reached for a symbol: 21 | 1. Set **encoding_accumulator = encoding_accumulator + symbol_binomial_sum * combiner** 22 | 2. Set **combiner = combiner * (message_size *choose* symbol_count)**, where message_size is the current length of the message; this will be used in the next symbol iteration. 23 | 3. Remove the encoded symbols from the message. 
(This is optimized away in the code as array resizing is expensive, instead a 'null' value is used to indicate the location should no longer be counted.) 24 | 5. Continue with the next symbol until you reach the last symbol in the set. The last symbol is not encoded since its locations can be inferred from the locations of all the other symbols. 25 | 4. The encoding_accumulator is now the final encoded data. 26 | 27 | [^1]: The order of iteration through the frequency table must be consistent and repeatable for decoding. The provided code iterates in order of ascending symbol count, followed by the symbol itself to resolve duplicate counts. This sort is used since more symbol occurrences means more operations required to encode. 28 | ## Encoding Example 29 | 30 | | Message | h | i | d | e | h | o | h | e | d | e | h | e | 31 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 32 | | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 33 | 34 | Organize the data into the following frequency table, sorted by count and symbol: 35 | 36 | | Count | Symbol | 37 | | ----- | ------ | 38 | | 1 | i | 39 | | 1 | o | 40 | | 2 | d | 41 | | 4 | e | 42 | | 4 | h | 43 | 44 | **Initial Values** 45 | ``` 46 | message_length = 12 47 | combiner = 1 48 | encoding_accumulator = 0 49 | ``` 50 | **Symbol 'i'** 51 | ``` 52 | symbol_binomial_sum = 0 53 | # First 'i' in message at index 1 54 | symbol_binomial_sum = symbol_binomial_sum + (location choose count) = 0 + (1 choose 1) = 1 55 | # no more 'i' locations 56 | encoding_accumulator = encoding_accumulator + combiner * symbol_binomial_sum 57 | encoding_accumulator = 0 + 1 * 1 = 1 58 | combiner = combiner * (message_length choose symbol_count) 59 | combiner = 1 * (12 choose 1) = 12 60 | message_length = message_length - symbol_count 61 | message_length = 12 - 1 = 11 62 | # Remove 'i' from message 63 | ``` 64 | 65 | | Message | h | d | e | h | o | h | e | d | e | h | e | 66 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | 
--- | --- | --- | 67 | | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 68 | 69 | **Symbol 'o'** 70 | ``` 71 | symbol_binomial_sum = 0 72 | # first 'o' in message at index 4 73 | symbol_binomial_sum = 0 + 4 choose 1 = 4 74 | # no more 'o' locations 75 | encoding_accumulator = 1 + 12 * 4 = 49 76 | combiner = 12 * (11 choose 1) = 132 77 | message_length = 11 - 1 = 10 78 | # remove 'o' from message 79 | ``` 80 | 81 | | Message | h | d | e | h | h | e | d | e | h | e | 82 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 83 | | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 84 | 85 | **Symbol 'd'** 86 | ``` 87 | # First 'd' at index 1 88 | symbol_binomial_sum = 0 + 1 choose 1 = 1 89 | # Second 'd' at index 6 90 | symbol_binomial_sum = 1 + 6 choose 2 = 16 91 | # no more 'd' locations 92 | encoding_accumulator = 49 + 132 * 16 = 2161 93 | combiner = 132 * (10 choose 2) = 5940 94 | message_length = 10 - 2 = 8 95 | # Remove 'd' from message 96 | ``` 97 | 98 | | Message | h | e | h | h | e | e | h | e | 99 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | 100 | | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 101 | 102 | **Symbol 'e'** 103 | ``` 104 | First 'e' at index 1 105 | symbol_binomial_sum = 0 + 1 choose 1 = 1 106 | Second 'e' at index 4 107 | symbol_binomial_sum = 1 + 4 choose 2 = 7 108 | Third 'e' at index 5 109 | symbol_binomial_sum = 7 + 5 choose 3 = 17 110 | Fourth 'e' at index 7 111 | symbol_binomial_sum = 17 + 7 choose 4 = 52 112 | no more 'e' locations 113 | encoding_accumulator = 2161 + 5940 * 52 = 311041 114 | ``` 115 | 116 | | Message | h | h | h | h | 117 | | ------- | --- | --- | --- | --- | 118 | | Index | 0 | 1 | 2 | 3 | 119 | 120 | The remaining locations are all 'h' which does not need to be encoded since there is only 1 permutation possible. This would also be the case if the message was all a single character, in that case the algorithm is essentially run length encoding using the frequency table. 
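
The walkthrough above can be reproduced with a small C++ sketch. This is not the repository's GMP implementation: the names `choose` and `encode_sketch` are hypothetical, plain 64-bit integers stand in for the big integer, and the message is physically shortened after each symbol instead of using the 'null' marker optimization. Because 64-bit arithmetic overflows quickly, the sketch asserts a message length of at most 20 symbols (20! still fits in 64 bits).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Binomial coefficient n choose k with 64-bit integers (hypothetical helper).
// Returns 0 when k > n, which matches the convention used for location sums.
std::uint64_t choose(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        // After this step result == C(n-k+i, i), so the division is exact.
        result = result * (n - k + i) / i;
    }
    return result;
}

// Encodes a short message following the steps in this document.
// Assumption: the message has at most 20 symbols so nothing overflows.
std::uint64_t encode_sketch(std::string message) {
    assert(message.size() <= 20);

    // Step 1: frequency table.
    std::array<std::uint64_t, 256> counts{};
    for (unsigned char c : message) counts[c]++;

    // Sort present symbols by ascending count, then by symbol value.
    std::vector<std::pair<std::uint64_t, unsigned char>> order;
    for (int s = 0; s < 256; ++s) {
        if (counts[s] > 0) order.emplace_back(counts[s], static_cast<unsigned char>(s));
    }
    std::sort(order.begin(), order.end());

    std::uint64_t encoding_accumulator = 0;
    std::uint64_t combiner = 1;  // set once, carried across symbols

    // Encode every symbol except the last one in the sorted table.
    for (std::size_t s = 0; s + 1 < order.size(); ++s) {
        const char symbol = static_cast<char>(order[s].second);
        std::uint64_t symbol_binomial_sum = 0;
        std::uint64_t seen = 0;  // running count K of this symbol
        for (std::size_t idx = 0; idx < message.size(); ++idx) {
            if (message[idx] == symbol) {
                ++seen;
                symbol_binomial_sum += choose(idx, seen);  // N = index, K = seen
            }
        }
        encoding_accumulator += symbol_binomial_sum * combiner;
        combiner *= choose(message.size(), seen);
        // Remove the encoded symbol so later indexes ignore it.
        message.erase(std::remove(message.begin(), message.end(), symbol), message.end());
    }
    return encoding_accumulator;
}
```

Calling `encode_sketch("hidehohedehe")` follows the same per-symbol sums as the tables above (1, 4, 16 and 52) and returns 311041.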
121 | 122 | Thus encoding is complete, the final value to store = 311041. This requires 3 bytes / 19 bits to store, while the original message was 12 bytes / 96 bits. 123 | ## Decoding 124 | Strike that, reverse it. 125 | #### Inputs 126 | 1. Encoded output. 127 | 2. Frequency table containing exact counts of each symbol/character. 128 | 129 | #### Steps 130 | 1. Load the encoded data into an integer and load the frequency table. 131 | 2. Create an array/buffer to hold the message, whose length is the sum of the frequency counts. 132 | 1. Fill this array with the last symbol to be decoded. In the provided code this is the most common symbol, which means fewer locations to place in total. 133 | 3. Iterate through the symbols in the same order as the encoding, not including the last symbol as those locations are inferred. 134 | 1. If the current symbol is the second to last symbol ('e' in the example message) 135 | 1. The symbol_binomial_sum = encoded data 136 | 2. Else 137 | 1. Calculate the uncombiner = (total positions - decoded positions) *choose* current symbol count. 138 | 2. Divide with remainder the encoded data by the uncombiner. 139 | 1. The remainder is the sum of binomials for the current symbol. 140 | 2. The quotient is the encoded data without the current symbol included, which will be used in the next iteration. 141 | 3. For each symbol count, starting with K=total_symbol_count and working backwards to 1 142 | 1. Find the binomial N *choose* K, where K = the current symbol count, that satisfies the conditions below; then subtract (N *choose* K) from symbol_binomial_sum before moving on to K-1 143 | 1. (N *choose* K) <= symbol_binomial_sum and 144 | 2. symbol_binomial_sum < ((N+1) *choose* K) 145 | 2. Count the offset caused by previously placed symbols. This mirrors how encoding removes symbols from the message after encoding them: we must calculate the final location of the symbol as though those symbols were not there. A naïve[^2] but straightforward approach would be: 146 | 1. `offset = 0` 147 | 2. `for i = 0 to N+offset: if message[i]!='h': offset++` 148 | 3. 
Place the K'th symbol in the message array at zero-based index N+offset, then subtract (N *choose* K) from symbol_binomial_sum. 149 | 4. Continue placing symbols until symbol count == 1; at that point no estimate is needed and N = symbol_binomial_sum. 150 | 4. When the symbol iteration completes, the array contents are the decoded message. 151 | [^2]: The provided code optimizes this: the offset is set to the total number of symbols placed and, working backwards from the end of the message, every placed symbol passed is subtracted from that offset, arriving at the same result as the naive solution. 152 | ### Decoding Example 153 | 154 | **Initial values** 155 | Frequency table 156 | 157 | | Count | Symbol | 158 | | ----- | ------ | 159 | | 1 | i | 160 | | 1 | o | 161 | | 2 | d | 162 | | 4 | e | 163 | | 4 | h | 164 | 165 | ``` 166 | encoded_data = 311041 167 | remaining_positions = message_length = 12 168 | ``` 169 | Message Buffer/Array, filled with last symbol 170 | 171 | | Message | h | h | h | h | h | h | h | h | h | h | h | h | 172 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 173 | | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 174 | 175 | **Symbol 'i'** 176 | ``` 177 | uncombiner = remaining_positions choose count = 12 choose 1 = 12 178 | symbol_binomial_sum = encoded_data % uncombiner 179 | symbol_binomial_sum = 311041 % 12 = 1 180 | encoded_data = encoded_data // uncombiner 181 | encoded_data = 311041 // 12 = 25920 182 | # decode first 'i' 183 | # location estimate & verification not needed for count==1 184 | # for an example go to symbol 'd' 185 | location_estimate = symbol_binomial_sum = 1 186 | # calculate offset 187 | offset = 0 188 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 189 | # no non-'h' symbols found, offset = 0 190 | message[location_estimate + offset => 1 + 0 => 1] = 'i' 191 | # no more 'i' locations 192 | remaining_positions = remaining_positions - count = 12 - 1 = 11 193 | ``` 194 | 195 | | Message | h | **i** | h | h | h
| h | h | h | h | h | h | h | 196 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 197 | | Index | 0 | **1** | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 198 | 199 | **Symbol 'o'** 200 | ``` 201 | uncombiner = 11 choose 1 = 11 202 | symbol_binomial_sum = 25920 % 11 = 4 203 | encoded_data = 25920 // 11 = 2356 204 | # decode first 'o' 205 | # location estimate & verification not needed for count==1 206 | # for an example go to symbol 'd' 207 | location_estimate = symbol_binomial_sum = 4 208 | # calculate offset 209 | offset = 0 210 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 211 | # found i; offset = 1 212 | message[4+1=>5] = 'o' 213 | # no more 'o' locations 214 | remaining_positions = 11 - 1 = 10 215 | ``` 216 | 217 | | Message | h | i | h | h | h | **o** | h | h | h | h | h | h | 218 | | ------- | --- | --- | --- | --- | --- | ----- | --- | --- | --- | --- | --- | --- | 219 | | Index | 0 | 1 | 2 | 3 | 4 | **5** | 6 | 7 | 8 | 9 | 10 | 11 | 220 | 221 | **Symbol 'd'** 222 | ``` 223 | uncombiner = 10 choose 2 = 45 224 | symbol_binomial_sum = 2356 % 45 = 16 225 | encoded_data = 2356 // 45 = 52 226 | # decode first 'd' 227 | # symbol count > 1, estimate location 228 | location_estimate = (symbol_binomial_sum * count!)^(1/count) + count//2 229 | location_estimate = (16*2!)^(1/2)+2//2 = 6 230 | # verify estimate 231 | # check overestimate, can only overestimate by 1 with this approach 232 | current_binomial = location_estimate choose count = 6 choose 2 = 15 233 | if current_binomial > symbol_binomial_sum: 234 | # 15 > 16 => false, no overestimation 235 | location_estimate -= 1 236 | else: 237 | # if the next location's (N+1) binomial is larger than symbol_binomial_sum 238 | # then the estimate is correct 239 | # else, loop until larger is found 240 | next_binomial = location_estimate+1 choose count = 7 choose 2 = 21 241 | while(next_binomial <= symbol_binomial_sum): 242 | # 21 <= 16 => false, no correction
needed, no underestimation 243 | location_estimate += 1 244 | next_binomial = location_estimate+1 choose count 245 | # location_estimate = 6 246 | # calculate offset 247 | offset = 0 248 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 249 | # found i,o; offset = 2 250 | message[location_estimate+offset=6+2=>8] = 'd' 251 | symbol_binomial_sum -= location_estimate choose count = 15 252 | symbol_binomial_sum = 1 253 | # decode second 'd' 254 | # location estimate & verification not needed for count==1 255 | location_estimate = symbol_binomial_sum = 1 256 | # calculate offset 257 | offset = 0 258 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 259 | # found i; offset = 1 260 | message[1+1=>2] = 'd' 261 | # no more 'd' locations 262 | remaining_positions = 10 - 2 = 8 263 | ``` 264 | 265 | | Message | h | i | **d** | h | h | o | h | h | **d** | h | h | h | 266 | | ------- | --- | --- | ----- | --- | --- | --- | --- | --- | ----- | --- | --- | --- | 267 | | Index | 0 | 1 | **2** | 3 | 4 | 5 | 6 | 7 | **8** | 9 | 10 | 11 | 268 | 269 | **Symbol 'e'** 270 | ``` 271 | # for the second to last symbol (last to be decoded) 272 | # the uncombiner is not needed and symbol_binomial_sum = encoded_data 273 | symbol_binomial_sum = encoded_data = 52 274 | 275 | # decode first 'e' 276 | location_estimate = (52*4!)^(1/4)+4//2 = 7 277 | # check overestimate 278 | current_binomial = 7 choose 4 = 35 279 | # 35 > 52 => false, no overestimation 280 | if current_binomial > symbol_binomial_sum: 281 | location_estimate -= 1 282 | else: 283 | # check underestimate 284 | next_binomial = location_estimate+1 choose count = 8 choose 4 = 70 285 | # 70 <= 52 => false, no correction needed, no underestimation 286 | while(next_binomial <= symbol_binomial_sum): 287 | location_estimate += 1 288 | next_binomial = location_estimate+1 choose count 289 | # location_estimate = 7 290 | # calculate offset 291 | offset = 0 292 | for i = 0 to location_estimate+offset: if
message[i]!='h': offset++ 293 | # found i,d,o,d; offset = 4 294 | message[7+4=>11] = 'e' 295 | symbol_binomial_sum -= location_estimate choose count => 7 choose 4 => 35 296 | symbol_binomial_sum = 17 297 | 298 | # decode second 'e' 299 | location_estimate = (17*3!)^(1/3)+3//2 = 5 300 | # check overestimate 301 | current_binomial = 5 choose 3 = 10 302 | # 10 > 17 => false, no overestimation 303 | # check underestimate 304 | next_binomial = 6 choose 3 = 20 305 | # 20 <= 17 => false, no underestimation 306 | # calculate offset 307 | offset = 0 308 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 309 | # found i,d,o,d; offset = 4 310 | message[5+4=>9] = 'e' 311 | symbol_binomial_sum -= 5 choose 3 => 10 312 | symbol_binomial_sum = 7 313 | 314 | # decode third 'e' 315 | location_estimate = (7*2!)^(1/2)+2//2 = 4 316 | # check overestimate 317 | current_binomial = 4 choose 2 = 6 318 | # 6 > 7 => false, no overestimation 319 | # check underestimate 320 | next_binomial = 5 choose 2 = 10 321 | # 10 <= 7 => false, no underestimation 322 | # calculate offset 323 | offset = 0 324 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 325 | # found i,d,o; offset = 3 326 | message[4+3=>7] = 'e' 327 | symbol_binomial_sum -= 4 choose 2 => 6 328 | symbol_binomial_sum = 1 329 | 330 | # decode fourth 'e' 331 | # location estimate & verification not needed for count==1 332 | location_estimate = symbol_binomial_sum = 1 333 | # calculate offset 334 | offset = 0 335 | for i = 0 to location_estimate+offset: if message[i]!='h': offset++ 336 | # found i,d; offset = 2 337 | message[1+2=>3] = 'e' 338 | 339 | # no more 'e' locations 340 | # remaining_positions no longer needed 341 | # decoding complete 342 | ``` 343 | 344 | | Message | h | i | d | e | h | o | h | e | d | e | h | e | 345 | | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 346 | | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 347 | 348 | There and back
again, decoding is complete. 349 | 350 | 351 | If you enjoyed that and would like to help me make more you could consider [Supporting](https://github.com/Peter-Ebert/Valli-Encoding#support). 352 | More info in the [README](README.md). 353 | -------------------------------------------------------------------------------- /utility-functions.hpp: -------------------------------------------------------------------------------- 1 | // useful functions for binomials and other calculations 2 | 3 | #include 4 | #include 5 | #include //mpz_t 6 | 7 | #include "frequency-table.hpp" 8 | 9 | 10 | // returns if read is valid: true = valid 11 | bool FileToCharVector(const std::string& filename, std::vector& buffer) { 12 | std::ifstream file(filename, std::ios::binary | std::ios::ate); 13 | if (!file) { 14 | // Handle file open error 15 | return false; 16 | } 17 | 18 | std::streamsize fileSize = file.tellg(); 19 | file.seekg(0, std::ios::beg); 20 | 21 | buffer.resize(fileSize); 22 | if (file.read(buffer.data(), fileSize)) { 23 | return true; 24 | } else { 25 | return false; 26 | } 27 | 28 | } 29 | 30 | void CalcFrequencyPairs(std::vector &buffer, FreqChar &freqs) { 31 | // initialize values 32 | for(int i=0; i<256; i++) { 33 | freqs.data[i] = i; 34 | } 35 | // loop through buffer 36 | for (size_t i=0; i < buffer.size(); i++) { 37 | freqs.incrCount(buffer[i]); 38 | if(buffer[i] < 0) { 39 | std::cout << "index: " << i << " value:" << buffer[i] << std::endl; 40 | } 41 | } 42 | } 43 | 44 | void choose_reuse(uint64_t n, uint64_t k, mpz_t &c, mpz_t &c1, mpz_t c2) { 45 | // mpz_t's must already be initialized 46 | // mpz_t c1,c2; 47 | // mpz_inits(c1,c2, NULL); //opt: move inline to avoid cost of initialization every time 48 | //assert k > 0 49 | //if nn-k;--i) { 56 | mpz_mul_ui(c1, c1, i); 57 | } 58 | //opt: if decrementing k by 1 every time, can reuse and divide, as in encode_symbol_location_reuse 59 | mpz_fac_ui(c2, k); 60 | mpz_cdiv_q(c, c1, c2); 61 | } 62 | 63 | //clean-ver 64 | 
void encode_symbol_location_reuse(uint64_t n, uint64_t k, mpz_t symbol_accumulator, mpz_t &numerator, mpz_t &denom_fact, mpz_t &combo_result) { 65 | //moved 66 | // // increment k and multiply into denom_fact 67 | // k += 1; 68 | // mpz_mul_ui(denom_fact, denom_fact, k); 69 | 70 | //if nn-k;--i) { 78 | mpz_mul_ui(numerator, numerator, i); 79 | } 80 | mpz_divexact(combo_result, numerator, denom_fact); 81 | mpz_add(symbol_accumulator, symbol_accumulator, combo_result); 82 | } 83 | 84 | //warning: overwrites combo_result 85 | void next_multiply_combiner(mpz_t &multiply_combiner, uint64_t n, uint64_t k, mpz_t &combo_result, mpz_t &numerator, mpz_t &denom_fact) { 86 | //remaining_locations choose symbol count 87 | mpz_set_ui(numerator, n); 88 | for(uint64_t i=n-1;i>n-k;--i) { 89 | mpz_mul_ui(numerator, numerator, i); 90 | } 91 | mpz_divexact(combo_result, numerator, denom_fact); 92 | mpz_mul(multiply_combiner, multiply_combiner, combo_result); 93 | } 94 | --------------------------------------------------------------------------------