├── README.md └── Random Forests .ipynb /README.md: -------------------------------------------------------------------------------- 1 | # random_forests 2 | This is the code for "Random Forests - The Math of Intelligence (Week 7)" by Siraj Raval on YouTube 3 | 4 | ## Overview 5 | 6 | This is the code for [this](https://youtu.be/QHOazyP-YlM) video on YouTube by Siraj Raval as part of The Math of Intelligence series. This is a lesson on Random Forests, which are collections of decision trees, useful for both classification and regression problems. You can find a relevant dataset [here](http://archive.ics.uci.edu/ml/datasets/banknote+authentication) 7 | 8 | 9 | ## Dependencies 10 | 11 | * numpy 12 | 13 | Install missing dependencies with [pip](https://pip.pypa.io/en/stable/) 14 | 15 | ## Usage 16 | 17 | Just run `jupyter notebook` in a terminal and the notebook will open in your browser 18 | 19 | Install jupyter [here](http://jupyter.readthedocs.io/en/latest/install.html). 20 | 21 | ## Credits 22 | 23 | The credits for this code go to [rushter](https://github.com/rushter). I've merely created a wrapper to get people started. 24 | -------------------------------------------------------------------------------- /Random Forests .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Random Forests - The Math of Intelligence (Week 7)\n", 8 | "\n", 9 | "\n", 10 | "## Our demo\n", 11 | "\n", 12 | "![alt text](https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/assets/blogimages/creditdecisiontree.png \"Logo Title Text 1\")\n", 13 | "\n", 14 | "We're going to learn about a machine learning model called a Random Forest. The task is to assess someone's credit risk using their financial history. This is useful for insurance companies.\n", 15 | "\n", 16 | "Dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)\n", 17 | "\n", 18 | "This dataset classifies people described by a set of attributes as good or bad credit risks.\n", 19 | "\n", 20 | "\n", 21 | "## What is a Random Forest?\n", 22 | "\n", 23 | "### Based on decision trees\n", 24 | "![alt text](https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png \"Logo Title Text 1\")\n", 25 | "\n", 26 | "Classification and Regression Trees, or CART for short, is a term introduced by Leo Breiman to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems.\n", 27 | "\n", 28 | "The aim at each stage is to associate specific targets (i.e., desired output values) with specific values of a particular variable. The result is a decision tree in which each path identifies a combination of values associated with a particular prediction.\n", 29 | "\n", 30 | "Each non-leaf node in this tree is basically a decision maker. These nodes are called decision nodes. Each node carries out a specific test to determine where to go next. Depending on the outcome, you either go to the left branch or the right branch of this node. We keep doing this until we reach a leaf node. If we are constructing a classifier, each leaf node represents a class. Let’s say you are trying to determine whether or not it’s going to rain tomorrow based on three factors available today — temperature, pressure, and wind. 
So a typical decision tree would look like this:\n", 31 | "\n", 32 | "![alt text](https://prateekvjoshi.files.wordpress.com/2016/03/2-tree.png \"Logo Title Text 1\")\n", 33 | "\n", 34 | "But how do we construct the optimal tree? What attribute should be at the root node? How do we decide the thresholds? \n", 35 | "\n", 36 | "We use the Gini index as our cost function to evaluate splits in the dataset. We minimize it.\n", 37 | "\n", 38 | "![alt text](http://i.imgur.com/IijgHbt.png \"Logo Title Text 1\")\n", 39 | "\n", 40 | "A split in the dataset involves one input attribute and one value for that attribute. It can be used to divide training patterns into two groups of rows.\n", 41 | "\n", 42 | "A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst case split that results in 50/50 classes in each group results in a Gini score of 1.0 (for a 2 class problem).\n", 43 | "\n", 44 | "![alt text](https://image.slidesharecdn.com/decisiontree-151015165353-lva1-app6892/95/classification-using-decision-tree-41-638.jpg?cb=1444928106 \"Logo Title Text 1\")\n", 45 | "\n", 46 | "We compute this score for every candidate split (each value of each attribute), keep the split with the lowest score, \n", 47 | "and divide the data accordingly into the two branches of our binary tree. \n", 48 | "We repeat this process recursively on each resulting group. \n", 49 | "\n", 50 | "![alt text](https://image.slidesharecdn.com/decisiontree-151015165353-lva1-app6892/95/classification-using-decision-tree-14-638.jpg?cb=1444928106\n", 51 | " \"Logo Title Text 1\")\n", 52 | " \n", 53 | "Using decision trees, we can build a random forest.\n", 54 | " \n", 55 | "One problem that might occur with a single big (deep) DT is that it can overfit. That is, the DT can “memorize” the training set the way a person might memorize an eye chart.\n", 56 | "\n", 57 | "The point of RF is to prevent overfitting. It does this by building smaller (shallow) trees, each trained on a random sample of the rows and a random subset of the features, and then combining their predictions.\n", 58 | "\n", 59 | "The downside of RF is that it can be slow when run as a single process, but it can be parallelized.\n", 60 | "\n", 61 | "### Majority Vote\n", 62 | "![alt text](https://i.ytimg.com/vi/ajTc5y3OqSQ/hqdefault.jpg \"Logo Title Text 1\")\n", 63 | "\n", 64 | "### Subset of data\n", 65 | "![alt text](https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/assets/blogimages/sparkmlrandomforest.png \"Logo Title Text 1\")\n", 66 | "\n", 67 | "### Nodes represent feature splits\n", 68 | "![alt text](https://qph.ec.quoracdn.net/main-qimg-b17755d2e0ffb326d8c39b7f3e07e03b-c \"Logo Title Text 1\")\n", 69 | "\n", 70 | "\n", 71 | "## Other good examples?\n", 72 | "\n", 73 | "We can treat stock price prediction as a regression problem (time series) or as a classification problem.\n", 74 | "\n", 75 | "#### Stock Price Classification (2 examples)\n", 76 | "\n", 77 | "https://github.com/qiulx026/stock-predictor-with-machine-learning-methods\n", 78 | "\n", 79 | "https://github.com/joy13/RandomForest\n", 80 | "\n", 81 | "Ignore the forest part for a moment; even a single tree can do regression. Each leaf holds a prediction value, which for regression is a number rather than a class. 
Given an input feature vector, you simply walk the tree as you'd do for a classification problem, and the resulting value in the leaf node is the prediction." 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "# Import Dependencies" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 25, 94 | "metadata": { 95 | "collapsed": false 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "# Random Forest Algorithm\n", 100 | "#This module implements pseudo-random number generators for various distributions.\n", 101 | "#seeding the generator makes our results reproducible (good for debugging)\n", 102 | "from random import seed\n", 103 | "#Return a randomly selected element from range(start, stop, step). \n", 104 | "from random import randrange\n", 105 | "#read CSV file (dataset)\n", 106 | "from csv import reader\n", 107 | "#square root function\n", 108 | "from math import sqrt" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "# Data Loading Helper Functions" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 26, 121 | "metadata": { 122 | "collapsed": false 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "# Load a CSV file\n", 127 | "def load_csv(filename):\n", 128 | " #init the dataset as a list\n", 129 | "\tdataset = list()\n", 130 | " #open it as a readable file\n", 131 | "\twith open(filename, 'r') as file:\n", 132 | " #init the csv reader\n", 133 | "\t\tcsv_reader = reader(file)\n", 134 | " #for every row in the dataset\n", 135 | "\t\tfor row in csv_reader:\n", 136 | "\t\t\tif not row:\n", 137 | "\t\t\t\tcontinue\n", 138 | " #add that row as an element in our dataset list (2D Matrix of values)\n", 139 | "\t\t\tdataset.append(row)\n", 140 | " #return in-memory data matrix\n", 141 | "\treturn dataset\n", 142 | " \n", 143 | "# Convert string column to float\n", 144 | "def str_column_to_float(dataset, column):\n", 145 | " #iterate through all the rows in our data matrix\n", 146 | "\tfor row in dataset:\n", 147 | " #for the given column index, convert all values in that column to floats\n", 148 | "\t\trow[column] = float(row[column].strip())\n", 149 | " \n", 150 | "# Convert string column to integer\n", 151 | "def str_column_to_int(dataset, column):\n", 152 | " #store a given column \n", 153 | " class_values = [row[column] for row in dataset]\n", 154 | " #create an unordered collection with no duplicates, only unique values\n", 155 | " unique = set(class_values)\n", 156 | " #init a lookup table\n", 157 | " lookup = dict()\n", 158 | " #for each element in the column\n", 159 | " for i, value in enumerate(unique):\n", 160 | " #add it to our lookup table\n", 161 | " lookup[value] = i\n", 162 | " #use the lookup table to replace each string in the column with its integer code\n", 163 | " for row in dataset:\n", 164 | " row[column] = lookup[row[column]]\n", 165 | " #return the lookup table\n", 166 | " return lookup" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "# Decision Tree Algorithm Helper Functions" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 27, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "# Split a dataset into k folds\n", 185 | "# the original sample is randomly partitioned into k equal-sized subsamples. 
\n", 186 | "#Of the k subsamples, a single subsample is retained as the validation data \n", 187 | "#for testing the model, and the remaining k − 1 subsamples are used as training data. \n", 188 | "#The cross-validation process is then repeated k times (the folds),\n", 189 | "#with each of the k subsamples used exactly once as the validation data.\n", 190 | "def cross_validation_split(dataset, n_folds):\n", 191 | "\tdataset_split = list()\n", 192 | "\tdataset_copy = list(dataset)\n", 193 | "\tfold_size = int(len(dataset) / n_folds)\n", 194 | "\tfor i in range(n_folds):\n", 195 | "\t\tfold = list()\n", 196 | "\t\twhile len(fold) < fold_size:\n", 197 | "\t\t\tindex = randrange(len(dataset_copy))\n", 198 | "\t\t\tfold.append(dataset_copy.pop(index))\n", 199 | "\t\tdataset_split.append(fold)\n", 200 | "\treturn dataset_split\n", 201 | "\n", 202 | "# Split a dataset based on an attribute and an attribute value\n", 203 | "def test_split(index, value, dataset):\n", 204 | " #init 2 empty lists for storing split dataubsets\n", 205 | "\tleft, right = list(), list()\n", 206 | " #for every row\n", 207 | "\tfor row in dataset:\n", 208 | " #if the value at that row is less than the given value\n", 209 | "\t\tif row[index] < value:\n", 210 | " #add it to list 1\n", 211 | "\t\t\tleft.append(row)\n", 212 | "\t\telse:\n", 213 | " #else add it list 2 \n", 214 | "\t\t\tright.append(row)\n", 215 | " #return both lists\n", 216 | "\treturn left, right\n", 217 | " \n", 218 | "# Calculate accuracy percentage\n", 219 | "def accuracy_metric(actual, predicted):\n", 220 | " #how many correct predictions?\n", 221 | "\tcorrect = 0\n", 222 | " #for each actual label\n", 223 | "\tfor i in range(len(actual)):\n", 224 | " #if actual matches predicted label\n", 225 | "\t\tif actual[i] == predicted[i]:\n", 226 | " #add 1 to the correct iterator\n", 227 | "\t\t\tcorrect += 1\n", 228 | " #return percentage of predictions that were correct\n", 229 | "\treturn correct / float(len(actual)) * 100.0\n", 230 | " \n", 231 | "# Evaluate an algorithm using a cross validation split\n", 232 | "def evaluate_algorithm(dataset, algorithm, n_folds, *args):\n", 233 | " #folds are the subsamples used to train and validate model\n", 234 | "\tfolds = cross_validation_split(dataset, n_folds)\n", 235 | "\tscores = list()\n", 236 | " #for each subsample\n", 237 | "\tfor fold in folds:\n", 238 | " #create a copy of the data\n", 239 | "\t\ttrain_set = list(folds)\n", 240 | " #remove the given subsample\n", 241 | "\t\ttrain_set.remove(fold)\n", 242 | "\t\ttrain_set = sum(train_set, [])\n", 243 | " #init a test set\n", 244 | "\t\ttest_set = list()\n", 245 | " #add each row in a given subsample to the test set\n", 246 | "\t\tfor row in fold:\n", 247 | "\t\t\trow_copy = list(row)\n", 248 | "\t\t\ttest_set.append(row_copy)\n", 249 | "\t\t\trow_copy[-1] = None\n", 250 | " #get predicted labls\n", 251 | "\t\tpredicted = algorithm(train_set, test_set, *args)\n", 252 | " #get actual labels\n", 253 | "\t\tactual = [row[-1] for row in fold]\n", 254 | " #compare accuracy\n", 255 | "\t\taccuracy = accuracy_metric(actual, predicted)\n", 256 | " #add it to scores list, for each fold\n", 257 | "\t\tscores.append(accuracy)\n", 258 | " #return all accuracy scores\n", 259 | "\treturn scores\n", 260 | " \n", 261 | " \n", 262 | "# Calculate the Gini index for a split dataset\n", 263 | "## this is the name of the cost function used to evaluate splits in the dataset.\n", 264 | "# this is a measure of how often a randomly chosen element from the set \n", 265 | "#would be 
incorrectly labeled if it was randomly labeled according to the distribution\n", 266 | "#of labels in the subset. Can be computed by summing the probability\n", 267 | "#of an item with label i being chosen times the probability \n", 268 | "#of a mistake in categorizing that item. \n", 269 | "#It reaches its minimum (zero) when all cases in the node \n", 270 | "#fall into a single target category.\n", 271 | "#A split in the dataset involves one input attribute and one value for that attribute. \n", 272 | "#It can be used to divide training patterns into two groups of rows.\n", 273 | "#A Gini score gives an idea of how good a split is by how mixed the classes \n", 274 | "#are in the two groups created by the split. A perfect separation results in \n", 275 | "#a Gini score of 0, whereas the worst case split that results in 50/50 classes \n", 276 | "#in each group results in a Gini score of 1.0 (for a 2 class problem).\n", 277 | "#We first need to calculate the proportion of classes in each group.\n", 278 | "def gini_index(groups, class_values):\n", 279 | "\tgini = 0.0\n", 280 | " #for each class\n", 281 | "\tfor class_value in class_values:\n", 282 | " #for each group created by the split\n", 283 | "\t\tfor group in groups:\n", 284 | "\t\t\tsize = len(group)\n", 285 | "\t\t\tif size == 0:\n", 286 | "\t\t\t\tcontinue\n", 287 | " #proportion of rows in this group that belong to this class\n", 288 | "\t\t\tproportion = [row[-1] for row in group].count(class_value) / float(size)\n", 289 | " # sum all p * (1 - p) values; this is the Gini index\n", 290 | "\t\t\tgini += (proportion * (1.0 - proportion))\n", 291 | "\treturn gini\n", 292 | " \n", 293 | "# Select the best split point for a dataset\n", 294 | "#This is an exhaustive and greedy algorithm\n", 295 | "def get_split(dataset, n_features):\n", 296 | " ##Given a dataset, we must check every value on each attribute as a candidate split, \n", 297 | " #evaluate the cost of the split and find the best possible split we could make.\n", 298 | "\tclass_values = list(set(row[-1] for row in dataset))\n", 299 | "\tb_index, b_value, b_score, b_groups = 999, 999, 999, None\n", 300 | "\tfeatures = list()\n", 301 | "\twhile len(features) < n_features:\n", 302 | "\t\tindex = randrange(len(dataset[0])-1)\n", 303 | "\t\tif index not in features:\n", 304 | "\t\t\tfeatures.append(index)\n", 305 | "\tfor index in features:\n", 306 | "\t\tfor row in dataset:\n", 307 | " ##When selecting the best split and using it as a new node for the tree \n", 308 | " #we will store the index of the chosen attribute, the value of that attribute \n", 309 | " #by which to split and the two groups of data split by the chosen split point.\n", 310 | " ##Each group of data is its own small dataset of just those rows assigned to the \n", 311 | " #left or right group by the splitting process. You can imagine how we might split \n", 312 | " #each group again, recursively as we build out our decision tree.\n", 313 | "\t\t\tgroups = test_split(index, row[index], dataset)\n", 314 | "\t\t\tgini = gini_index(groups, class_values)\n", 315 | "\t\t\tif gini < b_score:\n", 316 | "\t\t\t\tb_index, b_value, b_score, b_groups = index, row[index], gini, groups\n", 317 | " ##Once the best split is found, we can use it as a node in our decision tree.\n", 318 | " ##We will use a dictionary to represent a node in the decision tree as \n", 319 | " #we can store data by name. 
\n", 320 | "\treturn {'index':b_index, 'value':b_value, 'groups':b_groups}\n", 321 | " \n", 322 | "# Create a terminal node value\n", 323 | "\n", 324 | "def to_terminal(group):\n", 325 | " #select a class value for a group of rows. \n", 326 | "\toutcomes = [row[-1] for row in group]\n", 327 | " #returns the most common output value in a list of rows.\n", 328 | "\treturn max(set(outcomes), key=outcomes.count)\n", 329 | " \n", 330 | "#Create child splits for a node or make terminal\n", 331 | "#Building a decision tree involves calling the above developed get_split() function over \n", 332 | "#and over again on the groups created for each node.\n", 333 | "#New nodes added to an existing node are called child nodes. \n", 334 | "#A node may have zero children (a terminal node), one child (one side makes a prediction directly) \n", 335 | "#or two child nodes. We will refer to the child nodes as left and right in the dictionary representation \n", 336 | "#of a given node.\n", 337 | "#Once a node is created, we can create child nodes recursively on each group of data from \n", 338 | "#the split by calling the same function again.\n", 339 | "def split(node, max_depth, min_size, n_features, depth):\n", 340 | " #Firstly, the two groups of data split by the node are extracted for use and \n", 341 | " #deleted from the node. As we work on these groups the node no longer requires access to these data.\n", 342 | "\tleft, right = node['groups']\n", 343 | "\tdel(node['groups'])\n", 344 | " \n", 345 | " #Next, we check if either left or right group of rows is empty and if so we create \n", 346 | " #a terminal node using what records we do have.\n", 347 | "\t# check for a no split\n", 348 | "\tif not left or not right:\n", 349 | "\t\tnode['left'] = node['right'] = to_terminal(left + right)\n", 350 | "\t\treturn\n", 351 | " #We then check if we have reached our maximum depth and if so we create a terminal node.\n", 352 | "\t# check for max depth\n", 353 | "\tif depth >= max_depth:\n", 354 | "\t\tnode['left'], node['right'] = to_terminal(left), to_terminal(right)\n", 355 | "\t\treturn\n", 356 | " #We then process the left child, creating a terminal node if the group of rows is too small, \n", 357 | " #otherwise creating and adding the left node in a depth first fashion until the bottom of \n", 358 | " #the tree is reached on this branch.\n", 359 | "\t# process left child\n", 360 | "\tif len(left) <= min_size:\n", 361 | "\t\tnode['left'] = to_terminal(left)\n", 362 | "\telse:\n", 363 | "\t\tnode['left'] = get_split(left, n_features)\n", 364 | "\t\tsplit(node['left'], max_depth, min_size, n_features, depth+1)\n", 365 | "\t# process right child\n", 366 | " #The right side is then processed in the same manner, \n", 367 | " #as we rise back up the constructed tree to the root.\n", 368 | "\tif len(right) <= min_size:\n", 369 | "\t\tnode['right'] = to_terminal(right)\n", 370 | "\telse:\n", 371 | "\t\tnode['right'] = get_split(right, n_features)\n", 372 | "\t\tsplit(node['right'], max_depth, min_size, n_features, depth+1)\n", 373 | " \n", 374 | "#Build a decision tree\n", 375 | "def build_tree(train, max_depth, min_size, n_features):\n", 376 | " #Building the tree involves creating the root node and \n", 377 | "\troot = get_split(train, n_features)\n", 378 | " #calling the split() function that then calls itself recursively to build out the whole tree.\n", 379 | "\tsplit(root, max_depth, min_size, n_features, 1)\n", 380 | "\treturn root\n", 381 | " \n", 382 | "# Make a prediction with a decision tree\n", 383 | "def 
predict(node, row):\n", 384 | " #Making predictions with a decision tree involves navigating the \n", 385 | " #tree with the specifically provided row of data.\n", 386 | " #Again, we can implement this using a recursive function, where the same prediction routine is \n", 387 | " #called again with the left or the right child nodes, depending on how the split affects the provided data.\n", 388 | " #We must check if a child node is either a terminal value to be returned as the prediction\n", 389 | " #, or if it is a dictionary node containing another level of the tree to be considered.\n", 390 | "\tif row[node['index']] < node['value']:\n", 391 | "\t\tif isinstance(node['left'], dict):\n", 392 | "\t\t\treturn predict(node['left'], row)\n", 393 | "\t\telse:\n", 394 | "\t\t\treturn node['left']\n", 395 | "\telse:\n", 396 | "\t\tif isinstance(node['right'], dict):\n", 397 | "\t\t\treturn predict(node['right'], row)\n", 398 | "\t\telse:\n", 399 | "\t\t\treturn node['right']\n", 400 | " \n", 401 | "# Create a random subsample from the dataset with replacement\n", 402 | "def subsample(dataset, ratio):\n", 403 | "\tsample = list()\n", 404 | "\tn_sample = round(len(dataset) * ratio)\n", 405 | "\twhile len(sample) < n_sample:\n", 406 | "\t\tindex = randrange(len(dataset))\n", 407 | "\t\tsample.append(dataset[index])\n", 408 | "\treturn sample" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "# Main Code " 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 28, 421 | "metadata": { 422 | "collapsed": false 423 | }, 424 | "outputs": [ 425 | { 426 | "name": "stdout", 427 | "output_type": "stream", 428 | "text": [ 429 | "Trees: 1\n", 430 | "Scores: [48.78048780487805, 60.97560975609756, 63.41463414634146, 58.536585365853654, 43.90243902439025]\n", 431 | "Mean Accuracy: 55.122%\n", 432 | "Trees: 5\n", 433 | "Scores: [53.65853658536586, 58.536585365853654, 60.97560975609756, 70.73170731707317, 63.41463414634146]\n", 434 | "Mean Accuracy: 61.463%\n", 435 | "Trees: 10\n", 436 | "Scores: [73.17073170731707, 63.41463414634146, 60.97560975609756, 53.65853658536586, 58.536585365853654]\n", 437 | "Mean Accuracy: 61.951%\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "# Make a prediction with a list of bagged trees\n", 443 | "#responsible for making a prediction with each decision tree and \n", 444 | "#combining the predictions into a single return value. 
\n", 445 | "#This is achieved by selecting the most common prediction \n", 446 | "#from the list of predictions made by the bagged trees.\n", 447 | "def bagging_predict(trees, row):\n", 448 | "\tpredictions = [predict(tree, row) for tree in trees]\n", 449 | "\treturn max(set(predictions), key=predictions.count)\n", 450 | " \n", 451 | "# Random Forest Algorithm\n", 452 | "#esponsible for creating the samples of the training dataset, training a decision tree on each,\n", 453 | "#then making predictions on the test dataset using the list of bagged trees.\n", 454 | "def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):\n", 455 | "\ttrees = list()\n", 456 | "\tfor i in range(n_trees):\n", 457 | "\t\tsample = subsample(train, sample_size)\n", 458 | "\t\ttree = build_tree(sample, max_depth, min_size, n_features)\n", 459 | "\t\ttrees.append(tree)\n", 460 | "\tpredictions = [bagging_predict(trees, row) for row in test]\n", 461 | "\treturn(predictions)\n", 462 | " \n", 463 | "# Test the random forest algorithm\n", 464 | "seed(1)\n", 465 | "# load and prepare data\n", 466 | "filename = 'sonar.all-data.csv'\n", 467 | "dataset = load_csv(filename)\n", 468 | "# convert string attributes to integers\n", 469 | "for i in range(0, len(dataset[0])-1):\n", 470 | "\tstr_column_to_float(dataset, i)\n", 471 | "# convert class column to integers\n", 472 | "str_column_to_int(dataset, len(dataset[0])-1)\n", 473 | "# evaluate algorithm\n", 474 | "n_folds = 5\n", 475 | "max_depth = 10\n", 476 | "min_size = 1\n", 477 | "sample_size = 1.0\n", 478 | "n_features = int(sqrt(len(dataset[0])-1))\n", 479 | "for n_trees in [1, 5, 10]:\n", 480 | "\tscores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)\n", 481 | "\tprint('Trees: %d' % n_trees)\n", 482 | "\tprint('Scores: %s' % scores)\n", 483 | "\tprint('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": { 490 | "collapsed": true 491 | }, 492 | "outputs": [], 493 | "source": [] 494 | } 495 | ], 496 | "metadata": { 497 | "kernelspec": { 498 | "display_name": "Python 2", 499 | "language": "python", 500 | "name": "python2" 501 | }, 502 | "language_info": { 503 | "codemirror_mode": { 504 | "name": "ipython", 505 | "version": 2 506 | }, 507 | "file_extension": ".py", 508 | "mimetype": "text/x-python", 509 | "name": "python", 510 | "nbconvert_exporter": "python", 511 | "pygments_lexer": "ipython2", 512 | "version": "2.7.12" 513 | } 514 | }, 515 | "nbformat": 4, 516 | "nbformat_minor": 2 517 | } 518 | --------------------------------------------------------------------------------