├── README.md
├── hw4-task1-data.tsv
├── result.txt
└── tree.py

/README.md:
--------------------------------------------------------------------------------
Part a) How you implemented the initial tree (Section A) and why you chose your approaches?

For implementing the decision tree, we used the ID3 (Iterative Dichotomiser 3) heuristic.

Training phase - building the decision tree:
1. The ID3 algorithm begins with the full training set and the complete set of attributes at the root node.
2. On each iteration, it goes through every unused attribute of the remaining set and calculates the entropy (equivalently, the information gain) of a split on that attribute.
3. It then selects the attribute with the smallest entropy (largest information gain).
4. The data is split by the selected attribute to produce subsets, and the algorithm recurses on each subset, considering only attributes that have never been selected before.

Testing phase:
At runtime, we use the trained decision tree to classify new, unseen test cases by walking down the tree according to the attribute values of the test case until we reach a terminal node, which tells us the class the test case belongs to.


I chose this approach for the following reasons:

1. It is a greedy approach: on each iteration it selects the best attribute to split the dataset on.
2. It runs quite fast on discrete data (within 3 to 4 minutes). On continuous data, however, it takes around 30-40 minutes.
3. Accuracy comes out to around 84%-85% when discrete splitting and random shuffling are used with this algorithm.


--------------------------------------------------------------------------------------------------------------------------------------------------------


Part b) Accuracy result of your initial implementation (with cross validation)?

For this section, I used continuous splitting of the data. I tested the decision tree both with and without random shuffling in Section A.
My results are as follows:

1. ID3 algorithm with continuous splitting (non-random) = 0.8037 or 80.37%
2. ID3 algorithm with continuous splitting (random shuffling) = 0.8129 or 81.29%


--------------------------------------------------------------------------------------------------------------------------------------------------------


Part c) Explain improvements that you made (Section C) and why you think it works better (or worse)?

I made the following improvements (a small sketch of both follows this list):

1. Used a discrete splitting strategy instead of a continuous splitting strategy.
Reason -- ID3 is harder to use on continuous data. If the values of an attribute are continuous, there are many more places to split the data on that attribute, and searching for the best split point can be time consuming. We therefore split the attributes "age" and "fnlwgt" around their means into two classes each. This gave a major improvement in both running time and accuracy.

2. Used random shuffling of the data.
Reason -- ID3 does not guarantee an optimal solution; it can get stuck in local optima. Shuffling the data before each cross-validation fold reduces the probability that the algorithm gets stuck in a local optimum. This gave a slight increase in accuracy.
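
The snippet below is a minimal, illustrative sketch of these two improvements in Python, assuming the column layout used by tree.py (column 0 = age, column 2 = fnlwgt, tab-separated hw4-task1-data.tsv with no header row); the helper name load_and_discretize is only for illustration. It is not the exact code of tree.py, which hard-codes the two split thresholds (37 and 178302) instead of computing the means at run time.

import csv
import random

def load_and_discretize(path="hw4-task1-data.tsv"):
    # Read the tab-separated records (assumes no header row)
    with open(path) as tsv:
        rows = [line for line in csv.reader(tsv, delimiter="\t")]

    # Improvement 1: discrete splitting -- turn the continuous attributes
    # "age" (column 0) and "fnlwgt" (column 2) into binary attributes by
    # splitting them around their means.
    for col in (0, 2):
        mean = sum(float(r[col]) for r in rows) / len(rows)
        for r in rows:
            r[col] = '1' if float(r[col]) > mean else '0'

    # Improvement 2: randomness -- shuffle the records so that the
    # cross-validation folds are not biased by the original file order.
    random.shuffle(rows)
    return rows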

--------------------------------------------------------------------------------------------------------------------------------------------------------


Part d) Accuracy results for your improvements?

For this section, I used discrete splitting of the data along with the other improvements mentioned above. I tested the decision tree both with and without random shuffling.

My results are as follows:

1. ID3 algorithm with discrete splitting (non-random) = 0.8480 or 84.80%
2. ID3 algorithm with discrete splitting (random shuffling) = 0.8512 or 85.12%

Note: The average accuracy for the ID3 algorithm with discrete splitting (random shuffling) can vary a little between runs because the code uses random shuffling. I have reported the best value I observed over multiple runs of the program.


--------------------------------------------------------------------------------
/result.txt:
--------------------------------------------------------------------------------
accuracy: 0.8502
--------------------------------------------------------------------------------
/tree.py:
--------------------------------------------------------------------------------
import csv
import math
import random

# Implement your decision tree below
# Used the ID3 algorithm to implement the decision tree


# Class used for learning and building the decision tree from the given training set
class DecisionTree():
    tree = {}

    def learn(self, training_set, attributes, target):
        self.tree = build_tree(training_set, attributes, target)


# Node class used while classifying a test instance with the tree that was built earlier
class Node():
    value = ""
    children = []

    def __init__(self, val, dictionary):
        self.value = val
        if isinstance(dictionary, dict):
            self.children = list(dictionary.keys())


# Majority function: returns the class that has the most entries in the given data set
def majorClass(attributes, data, target):

    freq = {}
    index = attributes.index(target)

    # Count how often each class label occurs
    for record in data:
        if record[index] in freq:
            freq[record[index]] += 1
        else:
            freq[record[index]] = 1

    major = ""
    maxCount = 0

    for key in freq:
        if freq[key] > maxCount:
            maxCount = freq[key]
            major = key

    return major


# Calculates the entropy of the data with respect to the target attribute
def entropy(attributes, data, targetAttr):

    freq = {}
    dataEntropy = 0.0

    i = attributes.index(targetAttr)

    # Count how often each value of the target attribute occurs
    for entry in data:
        if entry[i] in freq:
            freq[entry[i]] += 1.0
        else:
            freq[entry[i]] = 1.0

    # Sum -p * log2(p) over all values of the target attribute
    for count in freq.values():
        dataEntropy += (-count / len(data)) * math.log(count / len(data), 2)

    return dataEntropy
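
# Illustrative worked example (comment only, not used by the program): for a
# subset with 9 records labeled ">50K" and 5 labeled "<=50K",
#     entropy = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ~= 0.940
# A pure subset (all records in one class) has entropy 0; for two classes the
# maximum is 1 at a 50/50 split. info_gain below rewards attributes whose
# splits push the subsets towards entropy 0.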

# Calculates the information gain (reduction in entropy) obtained by splitting the data on a particular attribute
def info_gain(attributes, data, attr, targetAttr):

    freq = {}
    subsetEntropy = 0.0
    i = attributes.index(attr)

    # Count how often each value of the chosen attribute occurs
    for entry in data:
        if entry[i] in freq:
            freq[entry[i]] += 1.0
        else:
            freq[entry[i]] = 1.0

    # Weighted sum of the entropies of the subsets induced by each attribute value
    for val in freq.keys():
        valProb = freq[val] / sum(freq.values())
        dataSubset = [entry for entry in data if entry[i] == val]
        subsetEntropy += valProb * entropy(attributes, dataSubset, targetAttr)

    return entropy(attributes, data, targetAttr) - subsetEntropy


# Chooses the attribute with the maximum information gain among the remaining attributes
def attr_choose(data, attributes, target):

    best = attributes[0]
    maxGain = 0

    for attr in attributes:
        if attr == target:
            continue
        newGain = info_gain(attributes, data, attr, target)
        if newGain > maxGain:
            maxGain = newGain
            best = attr

    return best


# Returns the unique values of the given attribute in the given data
def get_values(data, attributes, attr):

    index = attributes.index(attr)
    values = []

    for entry in data:
        if entry[index] not in values:
            values.append(entry[index])

    return values


# Returns all rows of the data where the chosen "best" attribute has the value "val",
# with the "best" column removed from each row
def get_data(data, attributes, best, val):

    new_data = []
    index = attributes.index(best)

    for entry in data:
        if entry[index] == val:
            newEntry = [entry[i] for i in range(len(entry)) if i != index]
            new_data.append(newEntry)

    return new_data


# Builds the decision tree from the given data, attributes and target attribute, and
# returns it as nested dictionaries keyed by attribute name and then attribute value.
def build_tree(data, attributes, target):

    data = data[:]
    vals = [record[attributes.index(target)] for record in data]
    default = majorClass(attributes, data, target)

    # Stop when there is no data or no attribute left to split on
    if not data or (len(attributes) - 1) <= 0:
        return default
    # Stop when all remaining records belong to the same class
    elif vals.count(vals[0]) == len(vals):
        return vals[0]
    else:
        best = attr_choose(data, attributes, target)
        tree = {best: {}}

        # Recurse on the subset of the data for each value of the best attribute
        for val in get_values(data, attributes, best):
            new_data = get_data(data, attributes, best, val)
            newAttr = attributes[:]
            newAttr.remove(best)
            subtree = build_tree(new_data, newAttr, target)
            tree[best][val] = subtree

        return tree
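
# Illustrative shape of the dictionary returned by build_tree (the attribute
# names come from the attribute list below; the values and labels are made up):
#     {'education': {'Bachelors': {'sex': {'Male': '>50K', 'Female': '<=50K'}},
#                    'HS-grad': '<=50K'}}
# Internal keys are attribute names, the next level of keys are that attribute's
# values, and leaves are class labels. run_decision_tree below walks this
# structure node by node to classify each test record.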

# Runs the decision tree algorithm: parses the data-set file, runs 10-fold
# cross-validation, classifies each test instance and computes the average accuracy.
# Improvements used:
#   1. Discrete splitting for the attributes "age" and "fnlwgt"
#   2. Randomness: random shuffle of the data before each cross-validation fold
def run_decision_tree():

    data = []

    with open("hw4-task1-data.tsv") as tsv:
        for line in csv.reader(tsv, delimiter="\t"):

            # Discretize "age" (column 0) into two classes around the split threshold
            if int(line[0]) > 37:
                line[0] = '1'
            else:
                line[0] = '0'

            # Discretize "fnlwgt" (column 2) into two classes around the split threshold
            if int(line[2]) > 178302:
                line[2] = '1'
            else:
                line[2] = '0'

            data.append(tuple(line))

    print("Number of records: %d" % len(data))

    attributes = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']
    target = attributes[-1]

    K = 10
    acc = []
    for k in range(K):
        # Shuffle the data so that every fold sees a random split
        random.shuffle(data)
        training_set = [x for i, x in enumerate(data) if i % K != k]
        test_set = [x for i, x in enumerate(data) if i % K == k]
        tree = DecisionTree()
        tree.learn(training_set, attributes, target)
        results = []

        # Classify each test instance by walking down the tree
        for entry in test_set:
            tempDict = tree.tree.copy()
            result = ""
            while isinstance(tempDict, dict):
                root_attr = list(tempDict.keys())[0]
                root = Node(root_attr, tempDict[root_attr])
                tempDict = tempDict[root_attr]
                index = attributes.index(root.value)
                value = entry[index]
                if value in tempDict:
                    child = Node(value, tempDict[value])
                    result = tempDict[value]
                    tempDict = tempDict[value]
                else:
                    # The tree has no branch for this attribute value
                    result = "Null"
                    break
            if result != "Null":
                results.append(result == entry[-1])

        accuracy = float(results.count(True)) / float(len(results))
        acc.append(accuracy)

    avg_acc = sum(acc) / len(acc)
    print("Average accuracy: %.4f" % avg_acc)

    # Writing results to a file (DO NOT CHANGE)
    f = open("result.txt", "w")
    f.write("accuracy: %.4f" % avg_acc)
    f.close()


if __name__ == "__main__":
    run_decision_tree()
--------------------------------------------------------------------------------