├── README.md
├── files
│   ├── original.txt
│   ├── test.dot
│   ├── test.jpg
│   └── test.pdf
└── src
    ├── Huffman.java
    └── Test.java


/README.md:
--------------------------------------------------------------------------------
Huffman Coding (Greedy Algorithms) in Java
========================

## Introduction
This repository shares my project for the "Data Structures and Algorithms in Java" class.
In this project, I:

* Implemented Huffman coding in Java
* Implemented a function that automatically generates a .dot file so that Graphviz can visualize the Huffman tree
* Tested the code with Unix's words file (/usr/share/dict/words)
* Measured the resulting data compression rate


## Key techniques used
* Greedy Algorithms
* HashMap
* PriorityQueue (Min Heap)
* Nested Class
* Recursion


## Requirements
Graphviz is required to visualize the Huffman tree.
Please download it [here](http://www.graphviz.org).


## How to test

#### 1. Prepare your own text file

Create a text file named "original.txt" under /files with any contents.
In this demo, I used Unix's words file, which contains 235,886 dictionary words (2,493,109 characters in total).

```
$ cd files
$ cat /usr/share/dict/words > original.txt
```


#### 2. Run the test code
"Test.java" automatically loads original.txt, tests encoding and decoding the text, and shows the compression result.
The actual output will look something like this:

```
----- Test.java START -----
* Loading the file...DONE
* Builiding Huffman Tree and Code Tables...
DONE

============= Word Frequency =============
\n occurs 235886 times
- occurs 2 times
A occurs 2559 times
B occurs 1395 times
C occurs 2496 times
D occurs 950 times
E occurs 918 times
F occurs 478 times
G occurs 1018 times
H occurs 1138 times
I occurs 496 times
J occurs 484 times
K occurs 546 times
L occurs 1073 times
M occurs 1907 times
N occurs 682 times
O occurs 630 times
P occurs 2291 times
Q occurs 77 times
R occurs 716 times
S occurs 2403 times
T occurs 1577 times
U occurs 208 times
V occurs 361 times
W occurs 337 times
X occurs 92 times
Y occurs 139 times
Z occurs 230 times
a occurs 196995 times
b occurs 39038 times
c occurs 100944 times
d occurs 67241 times
e occurs 234413 times
f occurs 23687 times
g occurs 46076 times
h occurs 63218 times
i occurs 200536 times
j occurs 2683 times
k occurs 15612 times
l occurs 129390 times
m occurs 68773 times
n occurs 158061 times
o occurs 170062 times
p occurs 75872 times
q occurs 3657 times
r occurs 160269 times
s occurs 137139 times
t occurs 151254 times
u occurs 87145 times
v occurs 19816 times
w occurs 13527 times
x occurs 6840 times
y occurs 51542 times
z occurs 8230 times

========== Huffman Code for each character =============
\n: 001
-: 0110110111010100
A: 0110110110
B: 01101101111
C: 0110110101
D: 01101100000
E: 01101000001
F: 011011000010
G: 01101100010
H: 01101100101
I: 011011000110
J: 011011000011
K: 011011000111
L: 01101100100
M: 0110100001
N: 111100100100
O: 011011011100
P: 0110110011
Q: 0110110111010101
R: 111100100101
S: 0110110100
T: 11110010011
U: 0110100000010
V:
011010000000
W: 0110110111011
X: 011011011101011
Y: 01101101110100
Z: 0110100000011
a: 1101
b: 101010
c: 11001
d: 01001
e: 000
f: 1111000
g: 101011
h: 01000
i: 1110
j: 1111001000
k: 0110101
l: 11111
m: 01100
n: 1000
o: 1011
p: 10100
q: 011010001
r: 1001
s: 0101
t: 0111
u: 11000
v: 0110111
w: 11110011
x: 111100101
y: 111101
z: 01101001

* Encoding the text... DONE
* Decoding the encoded text... DONE

========== RESULT ==========
Original string cost = 17451763 bits
Encoded string cost = 10872876 bits
% reduction = 37.697549525512116

----- Test DONE -----


```

> The program will automatically create "test.dot" for visualizing the tree.



#### 3. Visualize the tree with Graphviz

Run the command below to create test.pdf, which visualizes the Huffman tree built for the test.

```
$ dot -Tpdf test.dot -o test.pdf
```

It will create a PDF file like the one below.

![Huffman Tree](files/test.jpg)

Each circle represents a tree node with the following properties.
* Unique ID of a node
* Weight of a node
* Character (leaf nodes only)




--------------------------------------------------------------------------------
/files/test.dot:
--------------------------------------------------------------------------------
## Command to generate pdf: dot -Tpdf test.dot -o test.pdf
digraph g {
 "107\n2493109" -> "105\n1027467" [color=red, label=0]
 "105\n1027467" -> "101\n470299" [color=red, label=0]
 "101\n470299" -> "33\n234413\n e" [color=red, label=0]
 "101\n470299" -> "1\n235886\n /n" [color=blue, label=1]
 "105\n1027467" -> "102\n557168" [color=blue, label=1]
 "102\n557168" -> "95\n267598" [color=red, label=0]
 "95\n267598" -> "90\n130459" [color=red, label=0]
 "90\n130459" -> "36\n63218\n h" [color=red, label=0]
 "90\n130459" -> "32\n67241\n d" [color=blue, label=1]
 "95\n267598" -> "47\n137139\n s" [color=blue, label=1]
 "102\n557168" -> "96\n289570" [color=blue, label=1]
 "96\n289570" -> "91\n138316" [color=red, label=0]
 "91\n138316" -> "41\n68773\n m" [color=red, label=0]
 "91\n138316" -> "87\n69543" [color=blue, label=1]
 "87\n69543" -> "84\n31123" [color=red, label=0]
 "84\n31123" -> "81\n15511" [color=red, label=0]
 "81\n15511" -> "77\n7281" [color=red, label=0]
 "77\n7281" -> "71\n3624" [color=red, label=0]
 "71\n3624" -> "65\n1717" [color=red, label=0]
 "65\n1717" -> "60\n799" [color=red, label=0]
 "60\n799" -> "24\n361\n V" [color=red, label=0]
 "60\n799" -> "58\n438" [color=blue, label=1]
 "58\n438" -> "23\n208\n U" [color=red, label=0]
 "58\n438" -> "28\n230\n Z" [color=blue, label=1]
 "65\n1717" -> "7\n918\n E" [color=blue, label=1]
 "71\n3624" -> "15\n1907\n M" [color=blue, label=1]
 "77\n7281" -> "45\n3657\n q" [color=blue, label=1]
 "81\n15511" -> "54\n8230\n z" [color=blue, label=1]
 "84\n31123" -> "39\n15612\n k" [color=blue, label=1]
 "87\n69543"
 -> "85\n38420" [color=blue, label=1]
 "85\n38420" -> "82\n18604" [color=red, label=0]
 "82\n18604" -> "78\n8474" [color=red, label=0]
 "78\n8474" -> "72\n3972" [color=red, label=0]
 "72\n3972" -> "66\n1912" [color=red, label=0]
 "66\n1912" -> "6\n950\n D" [color=red, label=0]
 "66\n1912" -> "61\n962" [color=blue, label=1]
 "61\n962" -> "8\n478\n F" [color=red, label=0]
 "61\n962" -> "12\n484\n J" [color=blue, label=1]
 "72\n3972" -> "67\n2060" [color=blue, label=1]
 "67\n2060" -> "9\n1018\n G" [color=red, label=0]
 "67\n2060" -> "62\n1042" [color=blue, label=1]
 "62\n1042" -> "11\n496\n I" [color=red, label=0]
 "62\n1042" -> "13\n546\n K" [color=blue, label=1]
 "78\n8474" -> "73\n4502" [color=blue, label=1]
 "73\n4502" -> "68\n2211" [color=red, label=0]
 "68\n2211" -> "14\n1073\n L" [color=red, label=0]
 "68\n2211" -> "10\n1138\n H" [color=blue, label=1]
 "73\n4502" -> "18\n2291\n P" [color=blue, label=1]
 "82\n18604" -> "79\n10130" [color=blue, label=1]
 "79\n10130" -> "74\n4899" [color=red, label=0]
 "74\n4899" -> "21\n2403\n S" [color=red, label=0]
 "74\n4899" -> "5\n2496\n C" [color=blue, label=1]
 "79\n10130" -> "75\n5231" [color=blue, label=1]
 "75\n5231" -> "3\n2559\n A" [color=red, label=0]
 "75\n5231" -> "69\n2672" [color=blue, label=1]
 "69\n2672" -> "63\n1277" [color=red, label=0]
 "63\n1277" -> "17\n630\n O" [color=red, label=0]
 "63\n1277" -> "59\n647" [color=blue, label=1]
 "59\n647" -> "57\n310" [color=red, label=0]
 "57\n310" -> "27\n139\n Y" [color=red, label=0]
 "57\n310" -> "56\n171" [color=blue, label=1]
 "56\n171" -> "55\n79" [color=red, label=0]
 "55\n79" -> "2\n2\n -" [color=red, label=0]
 "55\n79" -> "19\n77\n Q" [color=blue, label=1]
 "56\n171" -> "26\n92\n X" [color=blue, label=1]
 "59\n647" -> "25\n337\n W" [color=blue, label=1]
 "69\n2672" -> "4\n1395\n B" [color=blue, label=1]
 "85\n38420" -> "50\n19816\n v" [color=blue, label=1]
 "96\n289570" -> "48\n151254\n t" [color=blue, label=1]
 "107\n2493109" -> "106\n1465642" [color=blue, label=1]
 "106\n1465642" -> "103\n649378" [color=red, label=0]
 "103\n649378" -> "97\n318330" [color=red, label=0]
 "97\n318330" -> "42\n158061\n n" [color=red, label=0]
 "97\n318330" -> "46\n160269\n r" [color=blue, label=1]
 "103\n649378" -> "98\n331048" [color=blue, label=1]
 "98\n331048" -> "92\n160986" [color=red, label=0]
 "92\n160986" -> "44\n75872\n p" [color=red, label=0]
 "92\n160986" -> "88\n85114" [color=blue, label=1]
 "88\n85114" -> "30\n39038\n b" [color=red, label=0]
 "88\n85114" -> "35\n46076\n g" [color=blue, label=1]
 "98\n331048" -> "43\n170062\n o" [color=blue, label=1]
 "106\n1465642" -> "104\n816264" [color=blue, label=1]
 "104\n816264" -> "99\n385084" [color=red, label=0]
 "99\n385084" -> "93\n188089" [color=red, label=0]
 "93\n188089" -> "49\n87145\n u" [color=red, label=0]
 "93\n188089" -> "31\n100944\n c" [color=blue, label=1]
 "99\n385084" -> "29\n196995\n a" [color=blue, label=1]
 "104\n816264" -> "100\n431180" [color=blue, label=1]
 "100\n431180" -> "37\n200536\n i" [color=red, label=0]
 "100\n431180" -> "94\n230644" [color=blue, label=1]
 "94\n230644" -> "89\n101254" [color=red, label=0]
 "89\n101254" -> "86\n49712" [color=red, label=0]
 "86\n49712" -> "34\n23687\n f" [color=red, label=0]
 "86\n49712" -> "83\n26025" [color=blue, label=1]
 "83\n26025" -> "80\n12498" [color=red, label=0]
 "80\n12498" -> "76\n5658" [color=red, label=0]
 "76\n5658" -> "38\n2683\n j" [color=red, label=0]
 "76\n5658" -> "70\n2975" [color=blue, label=1]
 "70\n2975" -> "64\n1398" [color=red, label=0]
 "64\n1398" -> "16\n682\n N" [color=red, label=0]
 "64\n1398" -> "20\n716\n R" [color=blue, label=1]
 "70\n2975" -> "22\n1577\n T" [color=blue, label=1]
 "80\n12498"
 -> "52\n6840\n x" [color=blue, label=1]
 "83\n26025" -> "51\n13527\n w" [color=blue, label=1]
 "89\n101254" -> "53\n51542\n y" [color=blue, label=1]
 "94\n230644" -> "40\n129390\n l" [color=blue, label=1]
}

--------------------------------------------------------------------------------
/files/test.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yugokato/Huffman-Coding-In-Java/cc3d968e72f973f5bf9df560d8a11d1ef81d20f8/files/test.jpg
--------------------------------------------------------------------------------
/files/test.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yugokato/Huffman-Coding-In-Java/cc3d968e72f973f5bf9df560d8a11d1ef81d20f8/files/test.pdf
--------------------------------------------------------------------------------
/src/Huffman.java:
--------------------------------------------------------------------------------
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Comparator;

public class Huffman {
    private String orgStr, encodedStr, decodedStr;
    public HashMap<Character, Integer> hmapWC;   // for occurrence count
    public HashMap<Character, String> hmapCode;  // for code(character/code)
    public HashMap<String, Character> hmapCodeR; // for code(code/character)
    private PriorityQueue<node> pq;  // for MinHeap
    private int counter;  // Unique id assigned to each node
    private int treeSize;  // # of total nodes in the tree
    private node root;

    // Inner class
    private class node{
        int uid, weight;
        char ch;
        node left, right;

        // Constructor for class node
        private node(Character ch, Integer weight, node left, node right){
            uid = ++counter;
            this.weight = weight;
            this.ch =
            ch;
            this.left = left;
            this.right = right;
        }
    }

    // Constructor for class Huffman
    public Huffman(String orgStr, boolean show, String dotfilename){
        this.counter = 0;
        this.treeSize = 0;
        this.orgStr = orgStr;
        hmapWC = new HashMap<Character, Integer>();
        hmapCode = new HashMap<Character, String>();
        hmapCodeR = new HashMap<String, Character>();
        pq = new PriorityQueue<node>(1, new Comparator<node>() {
            @Override
            public int compare(node n1, node n2) {
                if (n1.weight < n2.weight)
                    return -1;
                else if (n1.weight > n2.weight)
                    return 1;
                return 0;
            }
        });

        countWord();  // STEP 1: Count the frequency of each character
        buildTree();  // STEP 2: Build Huffman Tree
        writeDot(dotfilename);  // STEP 3: Write .dot file to visualize the tree with Graphviz software
        buildCodeTable();  // STEP 4: Build Huffman Code Table
    }

    private void buildCodeTable(){
        String code = "";
        node n = root;
        buildCodeRecursion(n, code);  // Recursion
    }

    private void buildCodeRecursion(node n, String code){
        if (n != null){
            if (! isLeaf(n)){  // n = internal node
                buildCodeRecursion(n.left, code + '0');
                buildCodeRecursion(n.right, code + '1');
            }
            else{  // n = Leaf node
                hmapCode.put(n.ch, code);  // for {character:code}
                hmapCodeR.put(code, n.ch);  // for {code:character}
            }
        }
    }

    private void writeDot(String fname){
        if (treeSize > 1){
            node n = root;
            try (PrintWriter o = new PrintWriter(new BufferedWriter (new FileWriter(fname)))){
                o.println("## Command to generate pdf: dot -Tpdf test.dot -o test.pdf");
                o.println("digraph g {");
                dotWriteRecursion(n, o);  // Recursion
                o.println("}");
            }
            catch (IOException e){
                System.out.println(e);
            }
        }
    }

    private void dotWriteRecursion(node n, PrintWriter o){
        if (!
        isLeaf(n)){
            if (n.left != null){  // has left kid
                String t = "";
                char c = n.left.ch;
                if (c != '\0' && c != ' ' && c != '"' && c != '\n')  // regular characters
                    t = "\\n " + c;
                else if (c == ' ')
                    t = "\\n blank";
                else if (c == '"')  // escape "
                    t = "\\n \\\"";
                else if (c == '\n')
                    t = "\\n /n";
                o.println(" \"" + n.uid + "\\n" + n.weight + "\" -> \"" + n.left.uid + "\\n" + n.left.weight + t + "\" [color=red, label=0]");
                dotWriteRecursion(n.left, o);
            }
            if (n.right != null){  // has right kid
                String t = "";
                char c = n.right.ch;
                if (c != '\0' && c != ' ' && c != '"' && c != '\n')  // regular characters
                    t = "\\n " + c;
                else if (c == ' ')
                    t = "\\n blank";
                else if (c == '"')  // escape "
                    t = "\\n \\\"";
                else if (c == '\n')
                    t = "\\n /n";
                o.println(" \"" + n.uid + "\\n" + n.weight + "\" -> \"" + n.right.uid + "\\n" + n.right.weight + t + "\" [color=blue, label=1]");
                dotWriteRecursion(n.right, o);
            }
        }
    }

    private void buildTree(){
        buildMinHeap();  // Set all leaf nodes into MinHeap
        node left, right;
        while (! pq.isEmpty()){
            left = pq.poll(); treeSize++;
            if (pq.peek() != null){
                right = pq.poll(); treeSize++;
                root = new node('\0', left.weight + right.weight, left, right);
            }
            else{  // only left child. right=null
                root = new node('\0', left.weight, left, null);
            }

            if (pq.peek() != null){
                pq.offer(root);
            }
            else{  // = Top root. Finished building the tree.
                treeSize++;
                break;
            }
        }
    }

    private void buildMinHeap(){
        for (Map.Entry<Character, Integer> entry: hmapWC.entrySet()){
            Character ch = entry.getKey();
            Integer weight = entry.getValue();
            node n = new node(ch, weight, null, null);
            pq.offer(n);
        }
    }

    private void countWord(){
        Character ch;
        Integer weight;
        for (int i=0; i
        // ... (the rest of Huffman.java and the first lines of src/Test.java are missing from this listing)
            for (Map.Entry<Character, Integer> entry: h.hmapWC.entrySet()){
                String key = entry.getKey().toString();
                int val = entry.getValue();
                if (key.equals("\n"))
                    key = "\\n";
                System.out.println(key + " occurs " + val + " times");
            }

            System.out.println("\n========== Huffman Code for each character =============");
            for (Map.Entry<Character, String> entry: h.hmapCode.entrySet()){
                String key = entry.getKey().toString();
                String val = entry.getValue();
                if (key.equals("\n"))
                    key = "\\n";
                System.out.println(key + ": " + val);
            }
            System.out.println();
        }

        System.out.print("* Encoding the text...");
        String e = h.encode();
        System.out.println(" DONE");

        System.out.print("* Decoding the encoded text...");
        String d = h.decode();
        myassert(orgStr.equals(d)) ;  // Check that the original text and the decoded text are exactly the same
        System.out.println(" DONE");

        double sl = orgStr.length() * 7 ;
        double el = e.length();
        System.out.println("\n========== RESULT ==========");
        System.out.println("Original string cost = " + (int)sl + " bits") ;
        System.out.println("Encoded string cost = " + (int)el + " bits") ;
        double r = ((el - sl)/sl) * 100 ;
        System.out.println("% reduction = " + (-r)) ;
    }

    public static String readFile(String fname){
        StringBuilder sb = new StringBuilder();
        File filename = new File(fname);
        try (BufferedReader in = new BufferedReader(new FileReader(filename))){
            String line = in.readLine();
            while (line != null){
                sb.append(line + "\n");
                line = in.readLine();
            }
        }
        catch (IOException e){
            System.out.println(e);
        }
        return sb.toString();
    }

    public static void myassert(boolean x) {
        if (!x) {
            throw new IllegalArgumentException("Assert fail") ;
        }
    }

    public static void testbed(){
        boolean show = true ;
        String orgFile = "files/original.txt";
        String dotFile = "files/test.dot";

        System.out.print("* Loading the file...");
        String orgString = readFile(orgFile);
        System.out.println("DONE");

        testHuffman(orgString, show, dotFile);
    }
    public static void main(String[] args) {
        System.out.println("----- Test.java START -----");
        testbed();
        System.out.println("\n----- Test DONE ----- ");
    }

}
--------------------------------------------------------------------------------
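
`Test.java` calls `h.encode()` and `h.decode()`, but their bodies do not appear in the listing above. The following is a minimal, self-contained sketch of how such an encode/decode round trip might work with a character-to-code table and its reverse. It is not the repository's implementation: the class name `HuffmanSketch`, its methods, and the sample string are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-alone sketch; names are illustrative, not taken from the repository.
public class HuffmanSketch {
    // Tree node: leaves carry a character, internal nodes only carry a weight.
    private static final class Node {
        final char ch;
        final int weight;
        final Node left, right;

        Node(char ch, int weight, Node left, Node right) {
            this.ch = ch;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    private final Map<Character, String> codes = new HashMap<>(); // character -> code
    private final Map<String, Character> chars = new HashMap<>(); // code -> character

    public HuffmanSketch(String text) {
        // Count how often each character occurs.
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        // Min-heap ordered by weight.
        PriorityQueue<Node> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            pq.offer(new Node(e.getKey(), e.getValue(), null, null));
        }
        // Greedy step: repeatedly merge the two lightest nodes.
        while (pq.size() > 1) {
            Node l = pq.poll();
            Node r = pq.poll();
            pq.offer(new Node('\0', l.weight + r.weight, l, r));
        }
        Node root = pq.poll();
        if (root != null) {
            // A text with a single distinct character still needs a one-bit code.
            assign(root, root.isLeaf() ? "0" : "");
        }
    }

    // Walk the tree: left edges append '0', right edges append '1'.
    private void assign(Node n, String code) {
        if (n.isLeaf()) {
            codes.put(n.ch, code);
            chars.put(code, n.ch);
            return;
        }
        assign(n.left, code + '0');
        assign(n.right, code + '1');
    }

    // Concatenate the code of every character (text must use only known characters).
    public String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(codes.get(c));
        }
        return sb.toString();
    }

    // Codes are prefix-free, so the first match while scanning bits is the right one.
    public String decode(String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder cur = new StringBuilder();
        for (char b : bits.toCharArray()) {
            cur.append(b);
            Character c = chars.get(cur.toString());
            if (c != null) {
                out.append(c);
                cur.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "abracadabra";
        HuffmanSketch h = new HuffmanSketch(text);
        String bits = h.encode(text);
        System.out.println(bits.length());                // prints 23
        System.out.println(h.decode(bits).equals(text));  // prints true
    }
}
```

With the 7-bits-per-character baseline that `Test.java` uses, the 11-character sample would cost 77 bits unencoded, against 23 bits here. Representing the bit stream as a `String` of '0' and '1' characters keeps the sketch easy to inspect but is not a compact storage format.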