├── .gitignore
├── msleep_ggplot2.csv
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
*.pyo
.ipynb_checkpoints/
--------------------------------------------------------------------------------
/msleep_ggplot2.csv:
--------------------------------------------------------------------------------
name,genus,vore,order,conservation,sleep_total,sleep_rem,sleep_cycle,awake,brainwt,bodywt
Cheetah,Acinonyx,carni,Carnivora,lc,12.1,NA,NA,11.9,NA,50
Owl monkey,Aotus,omni,Primates,NA,17,1.8,NA,7,0.0155,0.48
Mountain beaver,Aplodontia,herbi,Rodentia,nt,14.4,2.4,NA,9.6,NA,1.35
Greater short-tailed shrew,Blarina,omni,Soricomorpha,lc,14.9,2.3,0.133333333,9.1,0.00029,0.019
Cow,Bos,herbi,Artiodactyla,domesticated,4,0.7,0.666666667,20,0.423,600
Three-toed sloth,Bradypus,herbi,Pilosa,NA,14.4,2.2,0.766666667,9.6,NA,3.85
Northern fur seal,Callorhinus,carni,Carnivora,vu,8.7,1.4,0.383333333,15.3,NA,20.49
Vesper mouse,Calomys,NA,Rodentia,NA,7,NA,NA,17,NA,0.045
Dog,Canis,carni,Carnivora,domesticated,10.1,2.9,0.333333333,13.9,0.07,14
Roe deer,Capreolus,herbi,Artiodactyla,lc,3,NA,NA,21,0.0982,14.8
Goat,Capri,herbi,Artiodactyla,lc,5.3,0.6,NA,18.7,0.115,33.5
Guinea pig,Cavis,herbi,Rodentia,domesticated,9.4,0.8,0.216666667,14.6,0.0055,0.728
Grivet,Cercopithecus,omni,Primates,lc,10,0.7,NA,14,NA,4.75
Chinchilla,Chinchilla,herbi,Rodentia,domesticated,12.5,1.5,0.116666667,11.5,0.0064,0.42
Star-nosed mole,Condylura,omni,Soricomorpha,lc,10.3,2.2,NA,13.7,0.001,0.06
African giant pouched rat,Cricetomys,omni,Rodentia,NA,8.3,2,NA,15.7,0.0066,1
Lesser short-tailed shrew,Cryptotis,omni,Soricomorpha,lc,9.1,1.4,0.15,14.9,0.00014,0.005
Long-nosed armadillo,Dasypus,carni,Cingulata,lc,17.4,3.1,0.383333333,6.6,0.0108,3.5
Tree hyrax,Dendrohyrax,herbi,Hyracoidea,lc,5.3,0.5,NA,18.7,0.0123,2.95
North American Opossum,Didelphis,omni,Didelphimorphia,lc,18,4.9,0.333333333,6,0.0063,1.7
Asian elephant,Elephas,herbi,Proboscidea,en,3.9,NA,NA,20.1,4.603,2547
Big brown bat,Eptesicus,insecti,Chiroptera,lc,19.7,3.9,0.116666667,4.3,3e-04,0.023
Horse,Equus,herbi,Perissodactyla,domesticated,2.9,0.6,1,21.1,0.655,521
Donkey,Equus,herbi,Perissodactyla,domesticated,3.1,0.4,NA,20.9,0.419,187
European hedgehog,Erinaceus,omni,Erinaceomorpha,lc,10.1,3.5,0.283333333,13.9,0.0035,0.77
Patas monkey,Erythrocebus,omni,Primates,lc,10.9,1.1,NA,13.1,0.115,10
Western american chipmunk,Eutamias,herbi,Rodentia,NA,14.9,NA,NA,9.1,NA,0.071
Domestic cat,Felis,carni,Carnivora,domesticated,12.5,3.2,0.416666667,11.5,0.0256,3.3
Galago,Galago,omni,Primates,NA,9.8,1.1,0.55,14.2,0.005,0.2
Giraffe,Giraffa,herbi,Artiodactyla,cd,1.9,0.4,NA,22.1,NA,899.995
Pilot whale,Globicephalus,carni,Cetacea,cd,2.7,0.1,NA,21.35,NA,800
Gray seal,Haliochoerus,carni,Carnivora,lc,6.2,1.5,NA,17.8,0.325,85
Gray hyrax,Heterohyrax,herbi,Hyracoidea,lc,6.3,0.6,NA,17.7,0.01227,2.625
Human,Homo,omni,Primates,NA,8,1.9,1.5,16,1.32,62
Mongoose lemur,Lemur,herbi,Primates,vu,9.5,0.9,NA,14.5,NA,1.67
African elephant,Loxodonta,herbi,Proboscidea,vu,3.3,NA,NA,20.7,5.712,6654
Thick-tailed opposum,Lutreolina,carni,Didelphimorphia,lc,19.4,6.6,NA,4.6,NA,0.37
Macaque,Macaca,omni,Primates,NA,10.1,1.2,0.75,13.9,0.179,6.8
Mongolian gerbil,Meriones,herbi,Rodentia,lc,14.2,1.9,NA,9.8,NA,0.053
Golden hamster,Mesocricetus,herbi,Rodentia,en,14.3,3.1,0.2,9.7,0.001,0.12
Vole ,Microtus,herbi,Rodentia,NA,12.8,NA,NA,11.2,NA,0.035
House mouse,Mus,herbi,Rodentia,nt,12.5,1.4,0.183333333,11.5,4e-04,0.022
Little brown bat,Myotis,insecti,Chiroptera,NA,19.9,2,0.2,4.1,0.00025,0.01
Round-tailed muskrat,Neofiber,herbi,Rodentia,nt,14.6,NA,NA,9.4,NA,0.266
Slow loris,Nyctibeus,carni,Primates,NA,11,NA,NA,13,0.0125,1.4
Degu,Octodon,herbi,Rodentia,lc,7.7,0.9,NA,16.3,NA,0.21
Northern grasshopper mouse,Onychomys,carni,Rodentia,lc,14.5,NA,NA,9.5,NA,0.028
Rabbit,Oryctolagus,herbi,Lagomorpha,domesticated,8.4,0.9,0.416666667,15.6,0.0121,2.5
Sheep,Ovis,herbi,Artiodactyla,domesticated,3.8,0.6,NA,20.2,0.175,55.5
Chimpanzee,Pan,omni,Primates,NA,9.7,1.4,1.416666667,14.3,0.44,52.2
Tiger,Panthera,carni,Carnivora,en,15.8,NA,NA,8.2,NA,162.564
Jaguar,Panthera,carni,Carnivora,nt,10.4,NA,NA,13.6,0.157,100
Lion,Panthera,carni,Carnivora,vu,13.5,NA,NA,10.5,NA,161.499
Baboon,Papio,omni,Primates,NA,9.4,1,0.666666667,14.6,0.18,25.235
Desert hedgehog,Paraechinus,NA,Erinaceomorpha,lc,10.3,2.7,NA,13.7,0.0024,0.55
Potto,Perodicticus,omni,Primates,lc,11,NA,NA,13,NA,1.1
Deer mouse,Peromyscus,NA,Rodentia,NA,11.5,NA,NA,12.5,NA,0.021
Phalanger,Phalanger,NA,Diprotodontia,NA,13.7,1.8,NA,10.3,0.0114,1.62
Caspian seal,Phoca,carni,Carnivora,vu,3.5,0.4,NA,20.5,NA,86
Common porpoise,Phocoena,carni,Cetacea,vu,5.6,NA,NA,18.45,NA,53.18
Potoroo,Potorous,herbi,Diprotodontia,NA,11.1,1.5,NA,12.9,NA,1.1
Giant armadillo,Priodontes,insecti,Cingulata,en,18.1,6.1,NA,5.9,0.081,60
Rock hyrax,Procavia,NA,Hyracoidea,lc,5.4,0.5,NA,18.6,0.021,3.6
Laboratory rat,Rattus,herbi,Rodentia,lc,13,2.4,0.183333333,11,0.0019,0.32
African striped mouse,Rhabdomys,omni,Rodentia,NA,8.7,NA,NA,15.3,NA,0.044
Squirrel monkey,Saimiri,omni,Primates,NA,9.6,1.4,NA,14.4,0.02,0.743
Eastern american mole,Scalopus,insecti,Soricomorpha,lc,8.4,2.1,0.166666667,15.6,0.0012,0.075
Cotton rat,Sigmodon,herbi,Rodentia,NA,11.3,1.1,0.15,12.7,0.00118,0.148
Mole rat,Spalax,NA,Rodentia,NA,10.6,2.4,NA,13.4,0.003,0.122
Arctic ground squirrel,Spermophilus,herbi,Rodentia,lc,16.6,NA,NA,7.4,0.0057,0.92
Thirteen-lined ground squirrel,Spermophilus,herbi,Rodentia,lc,13.8,3.4,0.216666667,10.2,0.004,0.101
Golden-mantled ground squirrel,Spermophilus,herbi,Rodentia,lc,15.9,3,NA,8.1,NA,0.205
Musk shrew,Suncus,NA,Soricomorpha,NA,12.8,2,0.183333333,11.2,0.00033,0.048
Pig,Sus,omni,Artiodactyla,domesticated,9.1,2.4,0.5,14.9,0.18,86.25
Short-nosed echidna,Tachyglossus,insecti,Monotremata,NA,8.6,NA,NA,15.4,0.025,4.5
Eastern american chipmunk,Tamias,herbi,Rodentia,NA,15.8,NA,NA,8.2,NA,0.112
Brazilian tapir,Tapirus,herbi,Perissodactyla,vu,4.4,1,0.9,19.6,0.169,207.501
Tenrec,Tenrec,omni,Afrosoricida,NA,15.6,2.3,NA,8.4,0.0026,0.9
Tree shrew,Tupaia,omni,Scandentia,NA,8.9,2.6,0.233333333,15.1,0.0025,0.104
Bottle-nosed dolphin,Tursiops,carni,Cetacea,NA,5.2,NA,NA,18.8,NA,173.33
Genet,Genetta,carni,Carnivora,NA,6.3,1.3,NA,17.7,0.0175,2
Arctic fox,Vulpes,carni,Carnivora,NA,12.5,NA,NA,11.5,0.0445,3.38
Red fox,Vulpes,carni,Carnivora,NA,9.8,2.4,0.35,14.2,0.0504,4.23
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Intro to Data Science with R Programming
==================
Brought to you by [Lesley Cordero](http://www.columbia.edu/~lc2958) and [ADI](https://adicu.com)

## Table of Contents

- [0.0 Setup](#00-setup)
    + [0.1 R and R Studio](#01-r-and-r-studio)
    + [0.2 Packages](#02-packages)
    + [0.3 Virtual Environment](#03-virtual-environment)
- [1.0 Background](#10-background)
    + [1.1 What is Data Science?](#11-what-is-data-science)
    + [1.2 Is data science the same as machine learning?](#12-is-data-science-the-same-as-machine-learning)
    + [1.3 Why is Data Science important?](#13-why-is-data-science-important)
    + [1.4 Machine Learning](#14-machine-learning)
    + [1.5 Data](#15-data)
    + [1.6 Overfitting vs Underfitting](#16-overfitting-vs-underfitting)
    + [1.7 Glossary](#17-glossary)
        * [1.7.1 Factors](#171-factors)
        * [1.7.2 Corpus](#172-corpus)
        * [1.7.3 Bias](#173-bias)
        * [1.7.4 Variance](#174-variance)
- [2.0 Data Preparation](#20-data-preparation)
    + [2.1 dplyr](#21-dplyr)
- [3.0 Exploratory Analysis](#30-exploratory-analysis)
- [4.0 Data Visualization](#40-data-visualization)
- [5.0 Machine Learning & Prediction](#50-machine-learning--prediction)
    + [5.1 Random Forests](#51-random-forests)
    + [5.2 Natural Language Processing](#52-natural-language-processing)
        * [5.2.1 ANLP](#521-anlp)
    + [5.3 k-Means Clustering](#53-k-means-clustering)
- [6.0 Final Exercise](#60-final-exercise)
- [7.0 Final Words](#70-final-words)
    + [7.1 Resources](#71-resources)
    + [7.2 Mini Courses](#72-mini-courses)


## 0.0 Setup

This guide was written in R 3.2.3.


### 0.1 R and R Studio

Download [R](https://www.r-project.org/) and [R Studio](https://www.rstudio.com/products/rstudio/download/).


### 0.2 Packages

Next, to install the R packages, cd into your workspace and enter the following, very simple, command into your bash:

```
R
```

This will start a session in R! From here, you can install any needed packages. For the sake of this tutorial, enter the following into your R session:

```
install.packages("ggvis")
install.packages("gmodels")
install.packages("RCurl")
install.packages("tm")
install.packages("caTools")
install.packages("ggplot2")
install.packages("RFinfer")
install.packages("dplyr")
install.packages("lubridate")
install.packages("compare")
install.packages("downloader")
```
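Equivalently, since `install.packages()` also accepts a character vector, you can install everything in a single call:

``` R
# Install all of this tutorial's packages in one call
install.packages(c("ggvis", "gmodels", "RCurl", "tm", "caTools", "ggplot2",
                   "RFinfer", "dplyr", "lubridate", "compare", "downloader"))
```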

### 0.3 Virtual Environment

If you'd like to work in a virtual environment, you can set it up as follows:
```
pip3 install virtualenv
virtualenv your_env
```
And then launch it with:
```
source your_env/bin/activate
```

To execute the visualizations in matplotlib, do the following:

```
cd ~/.matplotlib
vim matplotlibrc
```
And then write `backend: TkAgg` in the file. Now you should be set up with your virtual environment!

Cool, now we're ready to start!


## 1.0 Background

Before we head into an actual data science problem demo, let's go over some vital background information.

### 1.1 What is Data Science?

Data Science is the application of statistical and mathematical methods to problems involving sets of data. In other words, it's taking techniques developed in the areas of statistics and math and using them to learn from some sort of data source.

#### 1.1.1 What do you mean by data?

Data is essentially anything that can be recorded or transcribed - numerical, text, images, sounds, anything!

#### 1.1.2 What background do you need to work on a data science problem?

It depends entirely on what you're working on, but generally speaking, you should be comfortable with probability, statistics, and some linear algebra.

### 1.2 Is data science the same as machine learning?

Well, no. They do have overlap, but they are not the same! Whereas machine learning involves lots of theoretical components we won't worry about here, data science takes those methods and applies them to the real world. Studying the theoretical components can still be very useful to your understanding of data science, however!

### 1.3 Why is Data Science important?

Data Science has so much potential! By using data in creative and innovative ways, we can gain a lot of insight into the world, whether that be in economics, biology, sociology, or math - for any topic you can think of, data science has a role.

### 1.4 Machine Learning

Generally speaking, Machine Learning can be split into three types of learning: supervised, unsupervised, and reinforcement learning.

#### 1.4.1 Supervised Learning

These algorithms work with a target or outcome variable (the dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of supervised learning: regression, decision trees, random forests, KNN, logistic regression, etc.


#### 1.4.2 Unsupervised Learning

With these algorithms, we do not have any target or outcome variable to predict or estimate. Unsupervised learning is used for clustering a population into different groups, which is widely applied to segmenting customers for specific interventions. Examples of unsupervised learning: the Apriori algorithm, k-means.


#### 1.4.3 Reinforcement Learning

Using these algorithms, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually by trial and error, learning from past experience and trying to capture the best possible knowledge to make accurate decisions. Example of reinforcement learning: Markov Decision Processes.


### 1.5 Data

As a data scientist, knowing the different forms data takes is highly important.

#### 1.5.1 Training vs Test Data

When it comes time to train your classifier or model, you're going to need to split your data into testing and training data.

Typically, the majority of your data goes towards training, while only 10-25% is held out for testing. It's important that there is no overlap between the two. If the sets overlap, or you use your training data for testing, your accuracy estimates will be wrong: any classifier that's tested on the data it was trained on will obviously do very well, since it has already observed those examples, so the accuracy will be high, but misleadingly so.
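Here's a minimal sketch of such a split in base R (the built-in `iris` data frame stands in for whatever dataset you're working with):

``` R
set.seed(123)                                  # make the split reproducible
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of row indices for training
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]                    # held-out 20% - no overlap
```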

#### 1.5.2 Open Data

What's open data, you ask? Simple: it's data that's freely available for anyone to use! You can get at it through APIs, downloadable files, or web scraping.

You might be wondering where this data comes from - well, it can come from a variety of sources, but some common ones include large tech companies like Facebook, Google, and Instagram. Others include large institutions, like the US government! Beyond that, you can find tons of data from all sorts of organizations and individuals.

### 1.6 Overfitting vs Underfitting

In section 1.5.1, we saw why testing on your training data produces misleading accuracy; overfitting is the closely related modeling problem. Overfitting refers to creating a model that doesn't generalize beyond your training data. In other words, if your model overfits your data, it has learned your data too well - it has essentially memorized it. This might not seem like a problem at first, but a model that has just "memorized" your data is one that's going to perform poorly on new, unobserved data.

Underfitting, on the other hand, is when your model is too simple to capture the structure in your data. This model will also perform poorly on new, unobserved data. Underfitting usually means we should increase the number of considered features, which will expand the hypothesis space.


### 1.7 Glossary

#### 1.7.1 Factors

Factors in R are stored as a vector of integer values with a corresponding set of character values (the levels) to use when the factor is displayed.
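A tiny illustration of that internal representation:

``` R
colors <- factor(c("red", "blue", "red"))
levels(colors)       # "blue" "red"  - the character values, sorted alphabetically
as.integer(colors)   # 2 1 2        - the integer codes pointing into the levels
```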

#### 1.7.2 Corpus

A corpus (plural: corpora) is a collection of written texts that serves as our dataset.

#### 1.7.3 Bias

In machine learning, bias is the tendency for a learner to consistently learn the same wrong thing.

#### 1.7.4 Variance

Variance is the error from sensitivity to small fluctuations in the training set. High variance can cause overfitting, since it causes a classifier to model the random noise in the training data rather than the intended outputs.

## 2.0 Data Preparation

### 2.1 dplyr

dplyr allows us to transform and summarize tabular data with rows and columns. It contains a set of functions that perform common data manipulation operations like filtering rows, selecting specific columns, re-ordering rows, adding new columns, and summarizing data.

First we begin by loading in the needed packages:
``` R
library(dplyr)
library(downloader)
```

Using the data available in [this](https://github.com/lesley2958/data-science-r/blob/master/msleep_ggplot2.csv) repo, we'll load the data into R:

``` R
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
filename <- "msleep_ggplot2.csv"
if (!file.exists(filename)) download(url, filename)
msleep <- read.csv("msleep_ggplot2.csv")
head(msleep)
```

#### 2.1.1 select()

To demonstrate how the `select()` function works, we select the name and sleep_total columns.

``` R
sleepData <- select(msleep, name, sleep_total)
head(sleepData)
```

To select all the columns except a specific column, you can use the subtraction sign:

``` R
head(select(msleep, -name))
```

You can also select a range of columns with a colon:

``` R
head(select(msleep, name:order))
```

#### 2.1.2 filter()

Using the `filter()` function in dplyr, we can select rows that meet a certain criterion, such as in the following:

``` R
filter(msleep, sleep_total >= 16)
```
Here, we filter out the animals whose sleep total is less than 16 hours. If you want to expand the criteria, you can:

```R
filter(msleep, sleep_total >= 16, bodywt >= 1)
```

#### 2.1.3 Functions

dplyr's core verbs are listed below; a sketch combining them follows the list.

- `arrange()`: re-order or arrange rows
- `filter()`: filter rows
- `group_by()`: allows for group operations in the “split-apply-combine” concept
- `mutate()`: create new columns
- `select()`: select columns
- `summarise()`: summarise values
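Here's a quick sketch chaining the verbs we haven't demonstrated yet on the same msleep data, using dplyr's `%>%` pipe (the column choices are just for illustration):

``` R
msleep %>%
  mutate(rem_proportion = sleep_rem / sleep_total) %>%      # new column
  group_by(vore) %>%                                        # split by diet type
  summarise(avg_sleep = mean(sleep_total, na.rm = TRUE),    # summarise each group
            avg_rem   = mean(rem_proportion, na.rm = TRUE)) %>%
  arrange(desc(avg_sleep))                                  # re-order the result
```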

## 3.0 Exploratory Analysis

### 3.1 summary()
R gives you the opportunity to go more in-depth with the summary() function. For numeric columns, this gives you the minimum value, first quartile, median, mean, third quartile, and maximum value - here for the iris dataset:

``` R
summary(iris)
```

### 3.2 xda

xda contains tools to perform initial exploratory analysis on any input dataset. It includes custom functions for plotting the data as well as performing different kinds of analyses such as univariate, bivariate, and multivariate investigation - the typical first step of any predictive modeling pipeline. This is a great package to start with on any dataset because it gives you a good sense of the data before you jump into building predictive models.


### 3.3 preprosim

[preprosim](https://mran.revolutionanalytics.com/web/packages/preprosim/vignettes/preprosim.html) helps to add contaminations (noise, missing values, outliers, low variance, irrelevant features, class swap (inconsistency), class imbalance, and decrease in data volume) to data, and then evaluate the simulated data sets for classification accuracy.


## 4.0 Data Visualization


### 4.1 ggvis

ggvis allows you to make scatterplots, as with the following:

``` R
library(ggvis)
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
```

### 4.2 heatmaply

[heatmaply](https://mran.revolutionanalytics.com/package/heatmaply/) produces interactive heatmaps.

This code snippet shows the correlation structure of the variables in the mtcars dataset:

``` R
library(heatmaply)
heatmaply(cor(mtcars),
          k_col = 2, k_row = 2,
          limits = c(-1,1)) %>%
  layout(margin = list(l = 40, b = 40))
```

## 5.0 Machine Learning & Prediction


### 5.1 Random Forests

Random forest is a great choice for nearly any prediction problem, even non-linear ones. It belongs to a larger class of machine learning algorithms called ensemble methods.


#### 5.1.1 RFinfer

RFinfer provides functions that use the infinitesimal jackknife to generate predictions and prediction variances from random forest models.

Now we'll go through an exercise involving RFinfer. First, we'll load the needed packages and the example data included with R (specifically, the New York Air Quality Measurements dataset):

``` R
library(randomForest)   # provides randomForest(), used to fit the model below
library(RFinfer)
library(ggplot2)
data('airquality')
```

Because calls to random forest do not allow missing data, we omit incomplete cases of the data as well as high Ozone outliers:

``` R
d.aq <- na.omit(airquality)
d.aq <- d.aq[d.aq$Ozone < 100, ]
```

Now we finally train the random forest model:
``` R
rf <- randomForest(Ozone ~ ., data=d.aq, keep.inbag=T)
```

Here, we grab the prediction variances for the training data along with the 95% confidence intervals:

``` R
rf.preds <- rfPredVar(rf, rf.data=d.aq, CI=TRUE)
str(rf.preds)
```

Then we get:
```
## 'data.frame':    104 obs. of  4 variables:
##  $ pred       : num  37.2 29.5 19.8 21.9 23.9 ...
##  $ pred.ij.var: num  -1.29 15.7 20.31 4.94 3.78 ...
##  $ l.ci       : num  39.71 -1.25 -19.97 12.25 16.47 ...
##  $ u.ci       : num  34.7 60.3 59.6 31.6 31.3 ...
```

Next, we'll plot the predictions with their 95% confidence intervals against the actual values:

``` R
ggplot(rf.preds,aes(d.aq$Ozone,pred)) +
  geom_abline(intercept=0,slope=1,lty=2, color='#999999') +
  geom_point() +
  geom_errorbar(aes(ymin=l.ci,ymax=u.ci,height=0.15)) +
  xlab('Actual') + ylab('Predicted') +
  theme_bw()
```

Here, we can see that the random forest is generally less confident about its inaccurate predictions, which we visualize by plotting the prediction variance as a function of the prediction error:

``` R
qplot(d.aq$Ozone - rf.preds$pred, rf.preds$pred.ij.var,
      xlab='prediction error', ylab='prediction variance') + theme_bw()
```

### 5.2 Natural Language Processing

#### 5.2.1 ANLP

[ANLP](https://mran.revolutionanalytics.com/web/packages/ANLP/vignettes/ANLP_Documentation.html) provides functions for building text prediction models. It contains functions for cleaning text data, building N-grams, and more.
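To make the N-gram idea concrete, here's a bigram count in plain base R - note this is just an illustration of what an N-gram table is, not ANLP's actual API:

``` R
# Count bigrams (pairs of adjacent words) in a toy string
text    <- "the cat sat on the mat and the cat slept"
tokens  <- strsplit(tolower(text), "\\s+")[[1]]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
sort(table(bigrams), decreasing = TRUE)   # "the cat" occurs twice
```

A text prediction model uses exactly these counts: given the previous word(s), it proposes the most frequent continuation.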

### 5.3 k-Means Clustering

k-means clustering is an unsupervised learning algorithm that clusters data based on similarity. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

- Reassign data points to the cluster whose centroid is closest
- Calculate the new centroid of each cluster

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the Euclidean distances between the data points and their respective cluster centroids.
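As a minimal sketch, here's base R's built-in `kmeans()` on the numeric columns of iris (the dataset and the choice of k = 3 are just for illustration):

``` R
set.seed(20)                             # the initial assignment is random
km <- kmeans(iris[, 1:4], centers = 3)   # ask for k = 3 clusters
km$centers                               # the final cluster centroids
table(km$cluster, iris$Species)          # compare clusters to the actual species
```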

## 6.0 Final Exercise

For this final exercise, we'll be implementing a sentiment analysis classifier. Sentiment analysis involves building a system to collect and determine the emotional tone behind words. This is important because it allows you to gain an understanding of the attitudes, opinions, and emotions of the people in your data.

At a high level, sentiment analysis combines natural language processing and artificial intelligence: we take the actual text, transform it into a format a machine can read, and use statistics to determine the sentiment.

For the model portion of this exercise, we'll use a linear model, since it lets us define the output variable as a linear combination of the input variables.

For this tutorial, we'll be using the following packages:

``` R
library(RCurl)
library(tm)
library(caTools)
```

### 6.1 Data Preparation

Here, we're just loading the data from the URLs. Although the R function read.csv can work with URLs, it doesn't necessarily handle https, so we use the RCurl package to make sure our links can be downloaded:

``` R
test_data_url <- "https://dl.dropboxusercontent.com/u/8082731/datasets/UMICH-SI650/testdata.txt"
train_data_url <- "https://dl.dropboxusercontent.com/u/8082731/datasets/UMICH-SI650/training.txt"

test_data_file <- getURL(test_data_url)
train_data_file <- getURL(train_data_url)

train_data_df <- read.csv(
    text = train_data_file,
    sep = '\t',
    header = FALSE,
    quote = "",
    stringsAsFactors = FALSE,
    col.names = c("Sentiment", "Text"))
test_data_df <- read.csv(
    text = test_data_file,
    sep = '\t',
    header = FALSE,
    quote = "",
    stringsAsFactors = FALSE,
    col.names = c("Text"))
```

Here, we convert Sentiment to a factor:

``` R
train_data_df$Sentiment <- as.factor(train_data_df$Sentiment)
```

In R we will use the tm package for text mining, so we'll use it to create a corpus - essentially a collection of content and metadata objects. We use all the data (train and test together) so the corpus contains every possible word:

``` R
corpus <- Corpus(VectorSource(c(train_data_df$Text, test_data_df$Text)))
```

We want our data to be as clean as possible before sending it out for training, so we apply techniques like removing punctuation, stop words, and extra white space to make the text as consistent as possible:

``` R
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
```

Next, we need a DocumentTermMatrix object for the corpus. A document-term matrix contains a column for each distinct word in the whole corpus and a row for each document; a given cell holds the frequency of that term in that document.

``` R
dtm <- DocumentTermMatrix(corpus)
```

We can take a glimpse of how it looks:
```
<<DocumentTermMatrix (documents: ..., terms: ...)>>
Non-/sparse entries: 401380/0
Sparsity           : 0%
Maximal term length: 20
Weighting          : term frequency (tf)
```

Now we want to convert this matrix into a data frame that we can use to train a classifier:

``` R
important_words_df <- as.data.frame(as.matrix(dtm))
colnames(important_words_df) <- make.names(colnames(important_words_df))
```

Here, we split the document-term rows back into their original training and test portions, bind them onto the corresponding data frames, and then drop the original text field:

``` R
important_words_train_df <- head(important_words_df, nrow(train_data_df))
important_words_test_df <- tail(important_words_df, nrow(test_data_df))

train_data_words_df <- cbind(train_data_df, important_words_train_df)
test_data_words_df <- cbind(test_data_df, important_words_test_df)

train_data_words_df$Text <- NULL
test_data_words_df$Text <- NULL
```

In order to obtain our evaluation set, we split our training data using sample.split from the caTools package:

``` R
set.seed(1234)
spl <- sample.split(train_data_words_df$Sentiment, .85)
```

Now we use `spl` to split our data into train and test sets:

``` R
eval_train_data_df <- train_data_words_df[spl==T,]
eval_test_data_df <- train_data_words_df[spl==F,]
```


### 6.2 Data Analysis

Building a linear model in R requires only one function call, `glm`, so we use that to create our classifier. We set the family parameter to binomial to indicate that we want logistic regression:

``` R
log_model <- glm(Sentiment~., data=eval_train_data_df, family=binomial)
```

And as always, we now use our model on the test data:

``` R
log_pred <- predict(log_model, newdata=eval_test_data_df, type="response")
```

Using this table, we can calculate accuracy from the predicted probabilities, with 0.5 as the decision threshold:

``` R
table(eval_test_data_df$Sentiment, log_pred>.5)
```

The diagonal cells of this table are the correct predictions, so we get:

```
(453 + 590) / nrow(eval_test_data_df)
```
```
0.9811853
```
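Equivalently, without hard-coding the cell counts, you can sum the diagonal of the table:

``` R
tbl <- table(eval_test_data_df$Sentiment, log_pred > .5)
sum(diag(tbl)) / sum(tbl)   # correct predictions / all predictions
```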

This is a very good accuracy. It seems that our bag-of-words approach works nicely with this particular problem.


## 7.0 Final Words

This was a brief overview of Data Science and its different components. Obviously there is more to each component we went through, but this tutorial should have given you an idea of what a data science problem looks like.

### 7.1 Resources

- [The Art of R Programming](https://www.dropbox.com/s/cr7mg2h20yzvbq3/The_Art_Of_R_Programming.pdf?dl=0)
- [R Bloggers](https://www.r-bloggers.com/)
- [kdnuggets](http://www.kdnuggets.com/)


### 7.2 Mini Courses

Learn about courses [here](http://www.byteacademy.co/all-courses/data-science-mini-courses/).

- [Python 101: Data Science Prep](https://www.eventbrite.com/e/python-101-data-science-prep-tickets-30980459388)
- [Intro to Data Science & Stats with R](https://www.eventbrite.com/e/data-sci-109-intro-to-data-science-statistics-using-r-tickets-30908877284)
- [Data Acquisition Using Python & R](https://www.eventbrite.com/e/data-sci-203-data-acquisition-using-python-r-tickets-30980705123)
- [Data Visualization with Python](https://www.eventbrite.com/e/data-sci-201-data-visualization-with-python-tickets-30980827489)
- [Fundamentals of Machine Learning and Regression Analysis](https://www.eventbrite.com/e/data-sci-209-fundamentals-of-machine-learning-and-regression-analysis-tickets-30980917759)
- [Natural Language Processing with Data Science](https://www.eventbrite.com/e/data-sci-210-natural-language-processing-with-data-science-tickets-30981006023)
- [Machine Learning with Data Science](https://www.eventbrite.com/e/data-sci-309-machine-learning-with-data-science-tickets-30981154467)
- [Databases & Big Data](https://www.eventbrite.com/e/data-sci-303-databases-big-data-tickets-30981182551)
- [Deep Learning with Data Science](https://www.eventbrite.com/e/data-sci-403-deep-learning-with-data-science-tickets-30981221668)
- [Data Sci 500: Projects](https://www.eventbrite.com/e/data-sci-500-projects-tickets-30981330995)

--------------------------------------------------------------------------------