├── .gitignore
├── LICENSE
├── README.md
└── source
    ├── Decode
    ├── Decoding.cpp
    ├── Demo.txt
    ├── Demo.txt.spd
    ├── Demo.txt.spd.txt
    ├── Encode
    ├── Encoding.cpp
    ├── compression.h
    ├── huffman
    └── huffman_test.cpp


/.gitignore:
--------------------------------------------------------------------------------
.vscode

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Shashi Prakash

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
**File Compression Using Huffman's Algorithm**
=========================


About
=====

Huffman coding is a lossless method for compressing and decompressing files.
This program follows the Huffman algorithm: it counts how often each character occurs in the input file and replaces frequent characters with shorter binary codewords.
The original file can be reproduced without losing a single bit.

Usage
=====
Compression:
```
./encode
```
An output file with the `.spd` extension will be produced.
Decompression:
```
./decode
```

File Structure
============================

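<table>
<tr><td colspan="2">N = total number of unique characters (1 byte)</td></tr>
<tr><td>Character [1 byte]</td><td>Binary codeword, string form [MAX bytes]</td></tr>
<tr><td>Character [1 byte]</td><td>Binary codeword, string form [MAX bytes]</td></tr>
<tr><td colspan="2">N times</td></tr>
<tr><td>p (1 byte)</td><td>p times 0's (p bits)</td></tr>
<tr><td colspan="2">DATA</td></tr>
</table>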
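
The padding rule in this layout can be sketched in C++ as follows. This is a minimal illustration rather than code from the repository: `paddingBits` and `packWithPadding` are hypothetical helpers, the most-significant-bit-first byte order is an assumption, and the sketch assumes that a payload which is already byte-aligned gets zero padding bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of the repository): number of zero bits
// needed to bring bitCount up to a whole number of bytes.
// Assumption: an already aligned payload gets 0 padding bits.
inline unsigned paddingBits(std::size_t bitCount) {
    return static_cast<unsigned>((8 - bitCount % 8) % 8);
}

// Hypothetical helper: places the padding zeros first, then the payload bits
// (given as a string of '0'/'1'), and packs the result into bytes.
// Assumption: bits are packed most significant bit first.
inline std::vector<std::uint8_t> packWithPadding(const std::string& bits) {
    for (char c : bits) assert(c == '0' || c == '1');
    const std::string padded = std::string(paddingBits(bits.size()), '0') + bits;
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < padded.size(); i += 8) {
        std::uint8_t byte = 0;
        for (std::size_t j = 0; j < 8; ++j)
            byte = static_cast<std::uint8_t>((byte << 1) | (padded[i + j] - '0'));
        out.push_back(byte);
    }
    return out;
}
```

For instance, the 12-bit payload `110100011101` needs 4 padding bits, and under these assumptions `packWithPadding("110100011101")` yields the two bytes `0x0D 0x1D`.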

p = padding added so that the data fits in a whole number of bytes. E.g., a file of 4 bytes + 3 bits must be padded by 5 bits to make it 5 bytes.

Example
----------------------------
Text: aabcbaab

| Content                           | Comment                               |
|-----------------------------------|---------------------------------------|
|3                                  | N=3 (a,b,c)                           |
|a "1"                              | character and corresponding code "1"  |
|b "01"                             | character and corresponding code "01" |
|c "00"                             | character and corresponding code "00" |
|4                                  | Padding count                         |
|[0000]                             | Padding 4 zeroes                      |
|[1] [1] [01] [00] [01] [1] [1] [01]| Actual data, code in place of char    |

Algorithm
============================
0. **(Pass 1)** Read input file
0. Create sorted linked list of characters from file, as per character frequency
    ```
    for each character ch from file

        if( ch available in linked list at node p) then
        {
            p.freq++;
            sort linked list as per node's freq;
        }
        else
            add new node at beginning of linked list with frequency=1;
    ```
0. Construct Huffman tree from linked list
    0. Create new node q, join the two least-frequent nodes to its left and right
    0. Insert created node q into ascending list
    0. Repeat i & ii till only one node remains, i.e., the ROOT of the h-tree
0. Traverse the tree in preorder and mark each node with its codeword. Simultaneously recreate the linked list of leaf nodes.
0. Write mapping table (character to codeword) to output file.
0. **(Pass 2)** Read input file.
0. Write codeword in place of each character in input file to output file

        for each character ch from input file
            write corresponding codeword into o/p file (lookup in mapping table OR linked list)
0. End

Contributing
============

Please feel free to submit issues and pull requests. I appreciate bug reports.
Testing on different platforms is especially appreciated.
I only tested on Linux.

License
=======
[MIT](https://opensource.org/licenses/MIT)


Development
===========

To do:
* Support for binary files such as JPEG and MP3
* Run a scan to group repeating bit patterns, not single bits.
* Unicode support
* Move entire codebase to Python, use a neural network to compress files.

--------------------------------------------------------------------------------
/source/Decode:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sspeedy99/File-Compression/8346e57242d5d9b8756f8336a60369d57b506a52/source/Decode

--------------------------------------------------------------------------------
/source/Decoding.cpp:
--------------------------------------------------------------------------------
#include <stdio.h>   // printf, fopen, fread
#include <stdlib.h>  // malloc
#include <string.h>  // strlen, strcpy, strcat
#include "compression.h"

using namespace std;

codeTable *codelist;

int n;
char *decodeBuffer(char buffer);
char *int2string(int n);
int match(char a[],char b[],int limit);
int fileError(FILE *fp);

int main(int argc, char** argv)
{
    FILE *fp,*outfile;
    char buffer;
    char *decoded;
    int i;

    if(argc<=2)
    {
        printf("***Huffman Decoding***\n");
        if(argc==2)
        {
            // No output name given: append the decompressed extension to the input name
            argv[2]=(char *)malloc(sizeof(char)*(strlen(argv[1])+strlen(decompressed_extension)+1));
            strcpy(argv[2],argv[1]);
            strcat(argv[2],decompressed_extension);
            argc++;
        }
        else
            return 0;
    }
    if((fp=fopen(argv[1],"rb"))==NULL)
    {
        printf("[!]Input file cannot be opened.\n");
        return -1;
    }

    printf("\n[Reading File Header]");
    if(fread(&buffer,sizeof(unsigned char),1,fp)==0) return (fileError(fp));
    N=buffer; // No.
    // of structures (mapping table records) to read
    if(N==0)
        n=256;
    else
        n=N;
    printf("\nDetected: %u different characters.",n);

    //allocate memory for mapping table
    codelist=(codeTable *)malloc(sizeof(codeTable)*n);

    printf("\nReading character to Codeword Mapping Table");
    if(fread(codelist,sizeof(codeTable),n,fp)==0) return (fileError(fp));
    //Read mapping table
    /*
    printf("\n");
    for(i=0;i=0;i--)
    {
        andd=1<
#include
#include
#include
#include
#include "compression.h"

//Internal and Leaf nodes
#define LEAF 0
#define INTERNAL 1

typedef struct node
{
    char x;
    int freq;
    char* code;
    int type;
    struct node* next;
    struct node* left;
    struct node* right;

} node;

//Head and root nodes in the linked list.
node *HEAD, *ROOT;


void printll();
void makeTree();
void genCode(node *p,char* code);
void insert(node *p,node *m);
void addSymbol(char c);
void writeHeader(FILE *f);
void writeBit(int b,FILE *f);
void writeCode(char ch,FILE *f);
char *getCode(char ch);

node* newNode(char c)
{
    node *temp;
    temp = new node;
    temp->x = c;
    temp->type = LEAF;
    temp->freq = 1;
    temp->next = NULL;
    temp->left = NULL;
    temp->right = NULL;
    return temp;
}

int main(int argc, char** argv)
{
    FILE *fp, *fp1;
    char ch;
    int t;
    HEAD = NULL;
    ROOT = NULL;

    if(argc <= 2)
    {
        printf("\n***Automated File Compression***\n");
        if(argc == 2)
        {
            // No output name given: append the compressed extension to the input name
            printf("\nCreating new compressed file...............\n");
            argv[2]=(char *)malloc(sizeof(char)*(strlen(argv[1])+strlen(compressed_extension)+1));
            strcpy(argv[2],argv[1]);
            strcat(argv[2],compressed_extension);
            argc++;
        }
        else
            return 0; // no input file given
    }

    fp = fopen(argv[1],"rb");
    if(fp == NULL)
    {
        printf("Error, input file does not exist. Check the file name.\n");
        return -1;
    }

    printf("Initiating the compression sequence.................\n");
    printf("Reading input file %s\n",argv[1]);
    while(fread(&ch,sizeof(char),1,fp)!=0)
        addSymbol(ch);
    fclose(fp);

    printf("Constructing Huffman-Tree....................\n");
    makeTree();
    printf("Assigning codewords..........................\n");
    //Pre-order traversal of the Huffman tree.
    genCode(ROOT,"\0");

    printf("Compressing the file.........................\n");
    fp=fopen(argv[1],"rb"); //binary mode, same as the first pass
    if(fp==NULL)
    {
        printf("\n[!]Input file cannot be opened.\n");
        return -1;
    }
    fp1=fopen(argv[2],"wb");
    if(fp1==NULL)
    {
        printf("\n[!]Output file cannot be opened.\n");
        return -2;
    }

    printf("\nReading input file %s.......................",argv[1]);
    printf("\nWriting file %s........................",argv[2]);
    printf("\nWriting File Header..............................");
    writeHeader(fp1);
    printf("\nWriting compressed content............................");
    //writing corresponding codes into the new file fp1
    while(fread(&ch,sizeof(char),1,fp)!=0)
        writeCode(ch,fp1);
    fclose(fp);
    fclose(fp1);

    printf("\n***Done***\n");
    return 0;

}


void writeHeader(FILE *f)
{
    //mapping of codes to actual words
    codeTable record;
    node *p;
    int temp =0, i=0;
    p = HEAD;
    //Determine the number of unique symbols and the padding bits
    while(p!=NULL)
    {
        temp+=(strlen(p->code)) * (p->freq); //temp holds the total bit count modulo 8
        //code[MAX] also has to hold the terminating '\0'
        if(strlen(p->code) >= MAX) printf("\n[!] Codewords are longer than usual."); //TODO: Solve this case
        temp%=8;
        i++;
        p=p->next;
    }

    if(i==256)
        N=0; //if 256 diff bit combinations exist, then alias 256 as 0
    else
        N=i;

    fwrite(&N,sizeof(unsigned char),1,f); //read these many structures while reading
    printf("\nN=%u",i);


    p=HEAD;
    //Start from the HEAD and write each character with its corresponding code
    while(p!=NULL)
    {
        record.x=p->x;
        strcpy(record.code,p->code);
        fwrite(&record,sizeof(codeTable),1,f);
        p=p->next;
    }

    padding=8-(char)temp; //int to char & padding = 8-bitsExtra
    fwrite(&padding,sizeof(char),1,f);
    printf("\nPadding=%d",padding);
    //do actual padding
    for(i=0;ix==ch)
            return p->code;
        p=p->next;
    }
    return NULL; //not found
}

//Inserting a node according to its freq in the linked list
void insert(node *p,node *m)
{ // insert p in list as per its freq., start from m to right,
  // we can't place a node smaller than m since we don't have a ptr to the node left of m
    if(m->next==NULL)
    { m->next=p; return;}
    while(m->next->freq < p->freq)
    { m=m->next;
      if(m->next==NULL)
      { m->next=p; return; }
    }
    p->next=m->next;
    m->next=p;
}

//Adding the symbols to the linked list
void addSymbol(char c)
{// Insert symbol into linked list if it is new, otherwise freq++
    node *p,*q,*m;
    int t;

    if(HEAD==NULL)
    { HEAD=newNode(c);
      return;
    }
    p=HEAD; q=NULL;
    if(p->x==c) //item found in HEAD
    {
        p->freq+=1;
        if(p->next==NULL)
            return;
        if(p->freq > p->next->freq)
        {
            HEAD=p->next;
            p->next=NULL;
            insert(p,HEAD);
        }
        return;
    }

    while(p->next!=NULL && p->x!=c)
    {
        q=p; p=p->next;
    }

    if(p->x==c)
    {
        p->freq+=1;
        if(p->next==NULL)
            return;
        if(p->freq > p->next->freq)
        {
            m=p->next;
            q->next=p->next;
            p->next=NULL;
            insert(p,HEAD);
        }
    }
    else //p->next==NULL, whole list traversed and c not found: insert it at the beginning
    {
        q=newNode(c);
        q->next=HEAD; //first because freq is minimum
        HEAD=q;
    }
}

//Generating huffman tree
void makeTree()
{
    node *p,*q=NULL; //q stays NULL if the list is empty
    p=HEAD;
    while(p!=NULL)
    {
        q=newNode('@');
        q->type=INTERNAL; //internal node
        q->left=p; //join left subtree/node
        q->freq=p->freq;
        if(p->next!=NULL)
        {
            p=p->next;
            q->right=p; //join right subtree/node
            q->freq+=p->freq;
        }
        p=p->next; //consider next node from list
        if(p==NULL) //list ends
            break;
        //insert new subtree rooted at q into list starting from p
        //if q smaller than p
        if(q->freq <= p->freq)
        {//place it before p
            q->next=p;
            p=q;
        }
        else
            insert(q,p); //find appropriate position
    }//while
    ROOT=q; //q created at last iteration is ROOT of h-tree
}

//Generating Huffman codes of the characters
void genCode(node *p,char* code)
{
    char *lcode,*rcode;
    static node *s;
    static int flag;
    if(p!=NULL)
    {
        //rebuild the linked list of leaf nodes as it was
        if(p->type==LEAF) //leaf node
        {   if(flag==0) //first leaf node
            {flag=1; HEAD=p;}
            else //other leaf nodes
            { s->next=p;} //relinking LL
            p->next=NULL;
            s=p;
        }

        //assign code
        p->code=code; //assign code to current node
        // printf("[%c|%d|%s|%d]",p->x,p->freq,p->code,p->type);
        lcode=(char *)malloc(strlen(code)+2);
        rcode=(char *)malloc(strlen(code)+2);
        sprintf(lcode,"%s0",code);
        sprintf(rcode,"%s1",code);
        //recursive DFS
        genCode(p->left,lcode); //left child has 0
        // appended to current node's code
        genCode(p->right,rcode); //right child has 1 appended
    }
}


--------------------------------------------------------------------------------
/source/compression.h:
--------------------------------------------------------------------------------
/*
Author : ssp3edy
Date : 2019-June-06
*/

#define MAX 16
// Padding ensures that the encoded data fits in a whole number of bytes,
// i.e. 4 bytes + 3 bits are padded to 5 bytes.
char padding;
unsigned char N;


// Code table entry for every character in the file
typedef struct codeTable
{
    char x;
    char code[MAX];
} codeTable;

char compressed_extension[] = ".spd";
char decompressed_extension[] = ".txt";



--------------------------------------------------------------------------------
/source/huffman:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sspeedy99/File-Compression/8346e57242d5d9b8756f8336a60369d57b506a52/source/huffman

--------------------------------------------------------------------------------
/source/huffman_test.cpp:
--------------------------------------------------------------------------------
// C++ program for Huffman Coding
#include <iostream>
#include <queue>
#include <string>
#include <vector>
using namespace std;

// A Huffman tree node
struct MinHeapNode {

    // One of the input characters
    char data;

    // Frequency of the character
    unsigned freq;

    // Left and right child
    MinHeapNode *left, *right;

    MinHeapNode(char data, unsigned freq)

    {

        left = right = NULL;
        this->data = data;
        this->freq = freq;
    }
};

// For comparison of
// two heap nodes (needed in min heap)
struct compare {

    bool operator()(MinHeapNode* l, MinHeapNode* r)

    {
        return (l->freq > r->freq);
    }
};

// Prints
// huffman codes from
// the root of Huffman Tree.
void printCodes(struct MinHeapNode* root, string str)
{

    if (!root)
        return;

    if (root->data != '$')
        cout << root->data << ": " << str << "\n";

    printCodes(root->left, str + "0");
    printCodes(root->right, str + "1");
}

// The main function that builds a Huffman Tree and
// prints codes by traversing the built Huffman Tree
void HuffmanCodes(vector<char> data, vector<int> freq, int size)
{
    struct MinHeapNode *left, *right, *top;

    // Create a min heap & insert all characters of data[]
    priority_queue<MinHeapNode*, vector<MinHeapNode*>, compare> minHeap;

    for (int i = 0; i < size; ++i)
        minHeap.push(new MinHeapNode(data[i], freq[i]));
    // Iterate until the size of the heap becomes 1
    while (minHeap.size() != 1) {

        // Extract the two minimum
        // freq items from min heap
        left = minHeap.top();
        minHeap.pop();

        right = minHeap.top();
        minHeap.pop();

        // Create a new internal node with
        // frequency equal to the sum of the
        // two nodes' frequencies. Make the
        // two extracted nodes the left and right children
        // of this new node. Add this node
        // to the min heap. '$' is a special value
        // for internal nodes, not used
        top = new MinHeapNode('$', left->freq + right->freq);

        top->left = left;
        top->right = right;

        minHeap.push(top);
    }

    // Print Huffman codes using
    // the Huffman tree built above
    printCodes(minHeap.top(), "");
}

// Driver program to test above functions
int main()
{

    vector<char> arr = { 'a', 'b', 'c', 'd', 'e', 'f' };
    vector<int> freq = { 5, 9, 12, 13, 16, 45 };

    int size = arr.size();

    HuffmanCodes(arr, freq, size);

    return 0;
}

--------------------------------------------------------------------------------
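
Decoding relies on the prefix property of Huffman codes: no codeword is a prefix of another, so bits can be consumed one at a time until they match a table entry. The block below is a minimal sketch of that idea and is not the repository's Decoding.cpp. `decodeBits` is a hypothetical helper, the table is assumed to map codeword strings to characters, and the input is assumed to be a string of '0'/'1' characters with the padding bits already removed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not part of the repository): decodes a string of
// '0'/'1' characters using a codeword -> character table.
// Assumption: the table is prefix-free, as Huffman codes are.
inline std::string decodeBits(const std::map<std::string, char>& table,
                              const std::string& bits) {
    std::string out;
    std::string current;  // bits read since the last emitted character
    for (char bit : bits) {
        assert(bit == '0' || bit == '1');
        current.push_back(bit);
        auto it = table.find(current);
        if (it != table.end()) {
            out.push_back(it->second);  // complete codeword: emit its character
            current.clear();
        }
    }
    // Bits left in `current` form an incomplete codeword and are dropped.
    return out;
}
```

With the README example table (a = "1", b = "01", c = "00"), `decodeBits` turns the payload `110100011101` back into `aabcbaab`.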