├── .gitignore
├── README.md
├── RandomForest.sln
├── RandomForest.vcxproj
├── RandomForest.vcxproj.filters
├── doc
│   ├── kaggle.png
│   ├── multi.png
│   ├── report.tex
│   └── single.png
├── premake5.lua
└── src
    ├── Config.cpp
    ├── Config.h
    ├── DecisionTree.cpp
    ├── DecisionTree.h
    ├── RandomForest.cpp
    ├── RandomForest.h
    ├── RandomForest.vcxproj
    ├── RandomForest.vcxproj.filters
    ├── RandomForest.vcxproj.user
    └── main.cpp
/.gitignore:
--------------------------------------------------------------------------------
1 | *.swp
2 | *.opensdf
3 | *.sdf
4 | *.suo
5 | *.exe
6 | *.ilk
7 | *.pdb
8 | obj
9 | data
10 | *.make
11 | Makefile
12 |
13 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Parallel Random Forest
2 |
3 | A parallel random forest implementation in C++ (with OpenMP).
4 |
5 | ## Dependencies
6 |
7 | * C++11 support (smart pointers, range-based for loops, lambdas, etc.)
8 | * OpenMP (available in most compilers, including VC++/g++/clang++)
9 |
10 | Tested under the following environments:
11 |
12 | 1. Windows 8.1 + VS2013
13 | 2. Windows 8.1 + g++ (GCC) 4.8.1 (mingw32)
14 | 3. Ubuntu 14.04 + g++ (GCC) 4.8.2
15 | 4. Ubuntu 14.04 + clang++ 3.5.2
16 |
17 | ## Predefined parameters
18 |
19 | * Number of features: 617
20 | * Number of labels: 26
21 | * Minimum node size: 2
22 |
23 | Defined in [Config.h](src/Config.h).
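
As a rough sketch (only the values above are confirmed; apart from `MIN_NODE_SIZE`, which is named in the report, the macro names here are illustrative, not necessarily the ones used in `src/Config.h`), the definitions take a form like:

```cpp
// Illustrative sketch; see src/Config.h for the actual names and values.
#define NUM_FEATURES 617   // features per record (name is an assumption)
#define NUM_LABELS 26      // labels range over [0, 25] (name is an assumption)
#define MIN_NODE_SIZE 2    // minimum node size before splitting stops
```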
24 |
25 | ## Directory structure
26 |
27 | ```
28 | .
29 | ├── premake5.lua (premake build script)
30 | ├── RandomForest.vcxproj.filters, RandomForest.vcxproj, RandomForest.sln (VS2013 project files)
31 | ├── RandomForest.exe (executable built with VS2013)
32 | ├── README.md (you are reading this)
33 | ├── doc
34 | │   └── report.pdf
35 | ├── data (dataset and output)
36 | │   ├── train.csv (the training set)
37 | │   ├── 1000.csv (subset of the training set for validation)
38 | │   ├── test.csv (the test set)
39 | │   └── submit.csv (the output)
40 | └── src (source code)
41 |     ├── Config.h, Config.cpp (configuration, common headers)
42 |     ├── DecisionTree.h, DecisionTree.cpp (decision tree implementation)
43 |     ├── RandomForest.h, RandomForest.cpp (random forest implementation)
44 |     └── main.cpp (the entry file)
45 | ```
46 |
47 | ## About the executable
48 |
49 | The executable is built for Windows with VS2013, so it needs some DLLs that come with VS2013. To run the program in other environments, you need to build it from source.
50 |
51 | ### Generate output for Kaggle submission
52 |
53 | 1. Put the training set in `data/train.csv` and the test set in `data/test.csv` (the header line will be ignored)
54 | 2. Run `RandomForest` (`./RandomForest` on Linux). You can pass in an optional number of trees, e.g. `RandomForest 1000` will generate 1000 trees.
55 | 3. The results will be saved to `data/submit.csv`
56 |
57 | ### Validate against the training set
58 |
59 | Note: you need to uncomment the `VALIDATE` flag in `src/Config.h` and rebuild the executable.
60 |
61 | 1. Put the training set in `data/train.csv` and the validation set in `data/1000.csv` (the header line will be ignored)
62 | 2. Run `RandomForest` (`./RandomForest` on Linux). You can pass in an optional number of trees, e.g. `RandomForest 1000` will generate 1000 trees.
63 | 3. The results will be saved to `data/submit.csv`
64 |
65 | ## Build
66 |
67 | On Windows, you can build it with VS2013 (or possibly an older version of VS), or with GNU Make (Win32 port) and MinGW. Note that this program needs C++11 support, so an old compiler might not be able to build it.
68 |
69 | On Linux, you can build it with Make and g++ or clang++.
70 |
71 | ### Windows with VS2013
72 |
73 | Open `RandomForest.sln` with VS and build the `RandomForest` target. Make sure it's in Release mode, or the resulting executable will be much slower. The executable will appear under the project directory, named `RandomForest.exe`.
74 |
75 | ### Windows with Make+MinGW / older VS
76 |
77 | Download premake5 from [here](https://premake.github.io/download.html#v5), extract the executable from the archive (e.g. `premake5.exe`), and add its path to your `PATH` environment variable. Then open cmd and run `premake5 --help` to see which project files you can generate. The premake script `premake5.lua` generates the proper project files.
78 |
79 | For example, to generate the project files for VS2012, simply run `premake5 vs2012` in the project directory, then open `RandomForest.sln` with your VS and build the `RandomForest` target. The executable will appear under the project directory, named `RandomForest.exe`.
80 |
81 | WARNING: make sure your compiler has sufficient C++11 support, or the build may fail.
82 |
83 | ### Linux with Make and g++/clang++
84 |
85 | Download premake5 from [here](https://premake.github.io/download.html#v5), extract the executable from the archive (e.g. `premake5`), and add its path to your `PATH` environment variable (e.g. extract it to `/usr/local/bin` with root permission so you don't have to touch `PATH`). To generate the project files for Make, simply run `premake5 gmake`. To build with g++, just run `make`. To build with clang, run `make config=clang` (but first make sure you have a `clang++` symlink pointing to your clang++ executable). The executable will appear under the project directory, named `RandomForest`.
86 |
87 | WARNING: make sure your compiler has sufficient C++11 support, or the build may fail.
88 |
89 | ## About
90 |
91 | * [Github repository](https://github.com/joyeecheung/parallel-random-forest)
92 | * Author: Qiuyi Zhang
93 | * Jul. 2015
94 |
--------------------------------------------------------------------------------
/RandomForest.sln:
--------------------------------------------------------------------------------
1 |
2 | Microsoft Visual Studio Solution File, Format Version 12.00
3 | # Visual Studio 2013
4 | VisualStudioVersion = 12.0.31101.0
5 | MinimumVisualStudioVersion = 10.0.40219.1
6 | Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "RandomForest", "RandomForest.vcxproj", "{7B24B04F-0456-46EF-BC9A-5E9B2203A71F}"
7 | EndProject
8 | Global
9 | GlobalSection(SolutionConfigurationPlatforms) = preSolution
10 | Debug|Win32 = Debug|Win32
11 | Release|Win32 = Release|Win32
12 | EndGlobalSection
13 | GlobalSection(ProjectConfigurationPlatforms) = postSolution
14 | {7B24B04F-0456-46EF-BC9A-5E9B2203A71F}.Debug|Win32.ActiveCfg = Debug|Win32
15 | {7B24B04F-0456-46EF-BC9A-5E9B2203A71F}.Debug|Win32.Build.0 = Debug|Win32
16 | {7B24B04F-0456-46EF-BC9A-5E9B2203A71F}.Release|Win32.ActiveCfg = Release|Win32
17 | {7B24B04F-0456-46EF-BC9A-5E9B2203A71F}.Release|Win32.Build.0 = Release|Win32
18 | EndGlobalSection
19 | GlobalSection(SolutionProperties) = preSolution
20 | HideSolutionNode = FALSE
21 | EndGlobalSection
22 | EndGlobal
23 |
--------------------------------------------------------------------------------
/RandomForest.vcxproj:
--------------------------------------------------------------------------------
(XML tags were lost in extraction. The surviving values indicate a standard VS2013 C++ project file: a Win32 console Application (ProjectGuid {7B24B04F-0456-46EF-BC9A-5E9B2203A71F}, keyword Win32Proj, platform toolset v120) with Debug|Win32 and Release|Win32 configurations. Output goes to $(SolutionDir)\ and intermediate files to $(SolutionDir)\obj\. The compiler settings include the preprocessor definitions WIN32;_DEBUG;_CONSOLE (Debug) and WIN32;NDEBUG;_CONSOLE (Release), runtime libraries MultiThreadedDebugDLL (Debug) and MultiThreadedDLL (Release), warning level Level3, ProgramDatabase debug information, and optimization disabled in Debug. The linker targets MachineX86 with the Console subsystem.)
--------------------------------------------------------------------------------
/RandomForest.vcxproj.filters:
--------------------------------------------------------------------------------
(XML tags were lost in extraction. The surviving values are the standard Visual Studio filter definitions: Source Files ({4FC737F1-C7A5-4376-A066-2A32D752A2FF}; cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx), Header Files ({93995380-89BD-4b04-88EB-625FBE52EBFB}; h;hh;hpp;hxx;hm;inl;inc;xsd), and Resource Files ({67DA6AB6-F800-4c08-8B7A-83BB121AAD01}; rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav), with four items filed under Source Files, three under Header Files, and two under Resource Files.)
--------------------------------------------------------------------------------
/doc/kaggle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/joyeecheung/parallel-random-forest/7c3eb937b24904ea3835ea321e3cd045e280979d/doc/kaggle.png
--------------------------------------------------------------------------------
/doc/multi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/joyeecheung/parallel-random-forest/7c3eb937b24904ea3835ea321e3cd045e280979d/doc/multi.png
--------------------------------------------------------------------------------
/doc/report.tex:
--------------------------------------------------------------------------------
1 | \documentclass{article}
2 | %\usepackage[a4paper,top=0.75in, bottom=0.75in, left=1in, right=1in,footskip=0.2in]{geometry}
3 | \usepackage{fullpage}
4 | %-----------------Hyperlink Packages--------------------
5 | \usepackage{hyperref}
6 | \hypersetup{
7 | colorlinks = true,
8 | citecolor = black,
9 | linkcolor = black,
10 | urlcolor = black
11 | }
12 | %-----------------Figure Packages--------------------
13 | \usepackage{graphicx} % For figures
14 | %\usepackage{epsfig} % for postscript graphics files
15 | %------------------Math Packages------------------------
16 | \usepackage{amssymb,amsmath}
17 | \usepackage{textcomp}
18 | \usepackage{mdwmath}
19 | \usepackage{mdwtab}
20 | \usepackage{eqparbox}
21 | %------------------Table Packages-----------------------
22 | \usepackage{rotating} % Used to rotate tables
23 | \usepackage{array} % Fixed column widths for tables
24 | %-----------------Algorithm Packages--------------------
25 | \usepackage{listings} % Source code
26 | \usepackage{algorithm} % Pseudo Code
27 | \usepackage{algpseudocode}
28 | %---------------------------------------------------------
29 |
30 | %opening
31 |
32 | \begin{document}
33 |
34 | \title{
35 | Data Mining Course Project \\
36 | Parallel Random Forest
37 | }
38 | \author {
39 | Computer Application Class 2, 12330402\\
40 | Qiuyi Zhang (\href{mailto:joyeec9h3@gmail.com}{joyeec9h3@gmail.com})
41 | }
42 | \date{\today}
43 |
44 | \maketitle
45 | \tableofcontents
46 |
47 |
48 | \section{Problem Description}
49 |
50 | The dataset contains 6238 records for training and 1559 records for testing. Each record has 617 features and a label in $[0, 25]$. Given this dataset, the goal is to train a random forest to predict the labels for the test set as precisely as possible.
51 |
52 | \section{Algorithms}
53 |
54 | For this project I implemented a random forest (with parallelism support) consisting of C4.5-style decision trees built with Gini impurity.
55 |
56 | \subsection{Decision Tree}
57 |
58 | \subsubsection{Building Decision Trees}
59 |
60 | The algorithm for building the decision tree is described in Algorithm~\ref{alg:dt}.
61 |
62 | For simplicity, the trees are built recursively.
63 |
64 | \begin{algorithm}[H]
65 | \centering
66 | \caption{Decision tree}
67 | \label{alg:dt}
68 | \begin{algorithmic}[1]
69 | \Function{DecisionTree.fit}{$data$, $attributes$}
70 | \If{$data$ is empty}
71 | \State \Return an empty node
72 | \EndIf
73 |
74 | \State $gain_{best}$ = 0.0, $attr_{best}$ = 0, $value_{best}$ = 0.0
75 | \State $set1_{best}$ = $\{\}$, $set2_{best}$ = $\{\}$
76 |
77 | \If{$data.size > MIN\_NODE\_SIZE$}
78 | \For{each attribute $a$ in $attributes$}
79 | \State Sort $data$ by the values of $a$; let $threshold$ be the value at the midpoint of the sorted $data$
80 | \State Divide $data$ by $threshold$ into $set1$, $set2$
81 | \State Calculate the information gain of splitting $data$ into $set1$ and $set2$
82 | \If{The new information gain $> gain_{best}$}
83 | \State Update $gain_{best}$, $attr_{best}$, $value_{best}$, $set1_{best}$, $set2_{best}$
84 | \EndIf
85 | \EndFor
86 | \EndIf
87 | \If{$attributes$ is not empty and $gain_{best} \neq 0.0$}
88 | \State remove $attr_{best}$ from $attributes$
89 | \State $this.left =$ \Call{DecisionTree.fit}{$set1_{best}$, $attributes$}
90 | \State $this.right =$ \Call{DecisionTree.fit}{$set2_{best}$, $attributes$}
91 | \State $this.attr = attr_{best}$
92 | \State $this.threshold = value_{best}$
93 | \State $this.count = data.size$
94 | \State \Return $this$
95 | \Else
96 | \State Create a counter for labels in $data$
97 | \State $this.leaf = counter$
98 | \State \Return $this$
99 | \EndIf
100 | \EndFunction
101 | \end{algorithmic}
102 | \end{algorithm}
103 |
104 | \begin{description}
105 | \item Remark 1. \hfill \\
106 | Suppose $i \in \{1, 2, ..., m\}$, and let $f_i$ be the fraction of items labeled with value $i$ in the set. The \textit{Gini impurity} of the set is then defined as:
107 |
108 | $$I_{G}(f) = \sum_{i=1}^{m} f_i (1-f_i) = \sum_{i=1}^{m} (f_i - {f_i}^2) = \sum_{i=1}^m f_i - \sum_{i=1}^{m} {f_i}^2 = 1 - \sum^{m}_{i=1} {f_i}^{2}$$
109 |
110 |
111 | \item Remark 2.\hfill \\
112 | When a data set $D$ with goal attribute $V$ is split into subsets $D_k$ using attribute $A$, each with $n_k$ records, the \textit{information gain} is defined as:
113 |
114 | $$
115 | Gain(A) = H(V, D) - Remainder(A)
116 | $$
117 |
118 | where
119 |
120 | $$
121 | Remainder(A) = \sum_k P(n_k) H(V, D_k)
122 | $$
123 |
124 | $P(n_k)$ is the proportion of $D_k$ in $D$, i.e. $P(n_k) = n_k / |D|$.
125 | \end{description}
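
For concreteness, here is a minimal C++ sketch of the Gini impurity above, assuming the label counts of a set have already been tallied (this mirrors the formula, not necessarily the actual implementation):

\begin{lstlisting}[language=C++]
#include <vector>

// Gini impurity I_G(f) = 1 - sum_i f_i^2, computed from label counts.
double giniImpurity(const std::vector<int> &counts) {
    int total = 0;
    for (int c : counts) total += c;
    if (total == 0) return 0.0;
    double sumSquares = 0.0;
    for (int c : counts) {
        double f = static_cast<double>(c) / total;
        sumSquares += f * f;
    }
    return 1.0 - sumSquares;
}
\end{lstlisting}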
126 |
127 | \subsubsection{Using Decision Trees for Prediction}
128 |
129 | The algorithm for predicting a label from an observation of the features is shown in Algorithm~\ref{alg:predictdt}. It is also recursive, and therefore rather intuitive.
130 |
131 | \begin{algorithm}[H]
132 | \centering
133 | \caption{Prediction using Decision Trees}
134 | \label{alg:predictdt}
135 | \begin{algorithmic}[1]
136 | \Function{DecisionTree.predict}{$this$, $observation$}
137 | \If{$this.leaf$ is a counter}
138 | \State\Return The most frequent item in $this.leaf$
139 | \EndIf
140 |
141 | \State $value$ = $observation[this.attr]$
142 | \If{$value < this.threshold$}
143 | \State \Return \Call{DecisionTree.predict}{$this.left$, $observation$}
144 | \Else
145 | \State \Return \Call{DecisionTree.predict}{$this.right$, $observation$}
146 | \EndIf
147 | \EndFunction
148 | \end{algorithmic}
149 | \end{algorithm}
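
A minimal C++ rendition of this recursion might look as follows (a sketch: the \texttt{Node} fields mirror the pseudocode, not necessarily the actual implementation):

\begin{lstlisting}[language=C++]
#include <map>
#include <memory>
#include <vector>

// Sketch of a tree node following the pseudocode's fields.
struct Node {
    std::map<int, int> leaf;   // label -> count; non-empty only at leaves
    int attr = 0;              // attribute to test
    double threshold = 0.0;    // split threshold
    std::unique_ptr<Node> left, right;
};

// Descend the tree until a leaf counter is reached, then return the
// most frequent label in that counter.
int predict(const Node &node, const std::vector<double> &observation) {
    if (!node.leaf.empty()) {
        int best = -1, bestCount = -1;
        for (const auto &kv : node.leaf)
            if (kv.second > bestCount) { best = kv.first; bestCount = kv.second; }
        return best;
    }
    if (observation[node.attr] < node.threshold)
        return predict(*node.left, observation);
    return predict(*node.right, observation);
}
\end{lstlisting}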
150 |
151 | \subsection{Random Forest}
152 |
153 | \subsubsection{Building Random Forest}
154 |
155 | The algorithm for building a random forest out of a set of decision trees is described in Algorithm~\ref{alg:rf}. Notice that the records are sampled without replacement, while the features are sampled with replacement.
156 |
157 | \begin{algorithm}[H]
158 | \centering
159 | \caption{Building a Random Forest}
160 | \label{alg:rf}
161 | \begin{algorithmic}[1]
162 | \Function{RandomForest.fit}{$data$, $numTrees$, $numFeatures$, $numSampleCoeff$}
163 | \State Sample $numSampleCoeff \times data.size$ records (without replacement) from $data$ as the $bootstrap$
164 | \For{each $tree$ in this forest}
165 | \State Sample $numFeatures$ attributes (with replacement) from the attribute set as $attributes$
166 | \State $tree$ = \Call{$DecisionTree.fit$}{$bootstrap$, $attributes$}
167 | \EndFor
168 | \EndFunction
169 | \end{algorithmic}
170 | \end{algorithm}
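
The two sampling steps can be sketched with the C++11 random-number facilities (function names here are illustrative, not the project's actual helpers):

\begin{lstlisting}[language=C++]
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Sample k record indices without replacement: shuffle, take a prefix.
std::vector<int> sampleWithoutReplacement(int n, int k, std::mt19937 &rng) {
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(k);
    return idx;
}

// Sample k attribute indices with replacement: independent uniform draws.
std::vector<int> sampleWithReplacement(int n, int k, std::mt19937 &rng) {
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::vector<int> idx(k);
    for (int &i : idx) i = pick(rng);
    return idx;
}
\end{lstlisting}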
171 |
172 | \subsubsection{Using the Random Forest for Prediction}
173 |
174 | The algorithm for classifying observations is described in Algorithm~\ref{alg:predictrf}. It returns the most popular label among the predictions produced by the trees.
175 |
176 | \begin{algorithm}[H]
177 | \centering
178 | \caption{Prediction using a Random Forest}
179 | \label{alg:predictrf}
180 | \begin{algorithmic}[1]
181 | \Function{RandomForest.predict}{$observations$}
182 | \State $result = []$
183 |
184 | \For{each $row$ with index $i$ in $observations$}
185 | \State Create a new $counter$
186 | \For {each $tree$ in the forest}
187 | \State $counter$.add(\Call{$tree$.predict}{$row$})
188 | \EndFor
189 | \State $result[i]$ = the most frequent item in $counter$
190 | \EndFor
191 |
192 | \State \Return $result$
193 | \EndFunction
194 | \end{algorithmic}
195 | \end{algorithm}
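
The vote counting can be a simple array-based counter (a sketch, assuming labels in $[0, 25]$ as in this dataset):

\begin{lstlisting}[language=C++]
#include <vector>

// Majority vote over the per-tree predictions for one observation.
int majorityVote(const std::vector<int> &votes, int numLabels) {
    std::vector<int> counter(numLabels, 0);
    for (int label : votes) ++counter[label];
    int best = 0;
    for (int i = 1; i < numLabels; ++i)
        if (counter[i] > counter[best]) best = i;
    return best;
}
\end{lstlisting}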
196 |
197 | \section{Implementation}
198 |
199 | \subsection{Parallelism}
200 |
201 | The random forest can be easily parallelized, since the training and classification done in each tree are completely independent. There are many options for a parallel implementation: multi-threading, MPI, MapReduce (Hadoop), Spark, etc. Because the computational resources I have are limited, I chose threads to parallelize the implementation. Using OpenMP, which is built into most compilers nowadays, it's fairly trivial to parallelize the for loops in the random forest (a few simple \texttt{\#pragma} directives suffice).
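
As an illustration (a sketch with placeholder types, not the project's actual interfaces), the per-tree training loop can be parallelized with a single directive:

\begin{lstlisting}[language=C++]
#include <vector>

// Sketch: the loop body is a placeholder for per-tree training. Each
// iteration is independent, so one OpenMP directive distributes the
// trees across the available threads.
template <typename Tree, typename Data, typename AttrSampler>
void fitForest(std::vector<Tree> &trees, const Data &bootstrap,
               AttrSampler sampleAttributes) {
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(trees.size()); ++i)
        trees[i].fit(bootstrap, sampleAttributes());
}
\end{lstlisting}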
202 |
203 | On a machine with an Intel i7-4710HQ @ 2.5GHz (4 cores, 8 hardware threads) and 8 GB RAM, the program (built with VS2013) takes about 170s to build a random forest with 1000 trees when multi-threading is disabled (see Figure~\ref{fig:single}), and about 40s when it is enabled (see Figure~\ref{fig:multi}). In the first case only 25\% of the CPU is utilized, while in the second case 100\% is utilized.
204 |
205 | \begin{figure}[]
206 | \centering
207 | \includegraphics[width=\linewidth]{single.png}
208 | \caption{Performance analysis of the implementation without multi-threading}
209 | \label{fig:single}
210 | \end{figure}
211 |
212 | \begin{figure}[]
213 | \centering
214 | \includegraphics[width=\linewidth]{multi.png}
215 | \caption{Performance analysis of the implementation with multi-threading}
216 | \label{fig:multi}
217 | \end{figure}
218 |
219 | \subsection{Cross-platform}
220 |
221 | This implementation was originally developed with Visual Studio 2013. By using Premake, it can also be built with MinGW under Windows, or with g++/clang++ under Linux. See README.md for instructions on how to build it on other platforms.
222 |
223 | \section{Experiment Results}
224 |
225 | For applications of random forests, there are some critical parameters that need to be tuned:
226 |
227 | \begin{enumerate}
228 | \item \textbf{Number of trees in the forest}. Generally speaking, the more trees in the forest, the better the predictions will be.
229 | \item \textbf{Number of sampled features}. A rule of thumb is $\log_2 n$, where $n$ is the number of features in the dataset (see the example after this list).
230 | \item \textbf{Minimum size of tree nodes}. The minimum number of samples required to split a node.
231 | \item \textbf{Size of the bootstrap}. Generally speaking, if the size of the bootstrap is the same as the size of the training set (which means some rows could be chosen multiple times while some would be left out), the random forest will typically provide near-optimal performance.
232 | \item And more...
233 | \end{enumerate}
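
For example, with this dataset's $n = 617$ features, the rule of thumb in item 2 gives $\log_2 617 \approx 9.3$, i.e. roughly $9$ features sampled per tree.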
234 |
235 | For this project, I used 1000 trees in the forest, $\log_2 n$ features, set the minimum size of tree nodes to 2, and used a bootstrap the same size as the training set. This configuration achieves $96.144\%$ precision on the Kaggle platform; see Figure~\ref{fig:kaggle}.
236 |
237 | \begin{figure}[]
238 | \centering
239 | \includegraphics[width=\linewidth]{kaggle.png}
240 | \caption{Score on Kaggle}
241 | \label{fig:kaggle}
242 | \end{figure}
243 |
244 | \end{document}
245 |
--------------------------------------------------------------------------------
/doc/single.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/joyeecheung/parallel-random-forest/7c3eb937b24904ea3835ea321e3cd045e280979d/doc/single.png
--------------------------------------------------------------------------------
/premake5.lua:
--------------------------------------------------------------------------------
1 | -- premake5.lua
2 | solution "RandomForest"
3 | configurations { "Debug", "Release", "clang" }
4 |
5 | -- A project defines one build target
6 | project "RandomForest"
7 | kind "ConsoleApp"
8 | language "C++"
9 | files { "src/*.cpp" }
10 | includedirs { "src" }
11 |
12 | configuration { "gmake", "-std=c++11" }
13 | buildoptions { "-fopenmp" }
14 | links { "gomp" }
15 |
16 | configuration { "vs*" }
17 | buildoptions { "/openmp" }
18 |
19 | configuration "Debug"
20 | defines { "DEBUG" } -- -DDEBUG
21 | flags { "Symbols" }
22 |
23 | configuration "Release"
24 | defines { "NDEBUG" } -- -NDEBUG
25 | flags { "Optimize" }
26 |
27 | configuration "clang"
28 | toolset "clang"
29 |
--------------------------------------------------------------------------------
/src/Config.cpp:
--------------------------------------------------------------------------------
1 | #include "Config.h"
2 |
--------------------------------------------------------------------------------
/src/Config.h:
--------------------------------------------------------------------------------
1 | #ifndef __CONFIG__
2 | #define __CONFIG__
3 |
4 | #define JDEBUG
5 | //#define VALIDATE
6 | //#define DEBUG_TREE
7 | #define DEBUG_FOREST
8 |
9 | #include
10 | #include
11 | #include
12 | #include
13 | #include