├── .gitignore
├── msleep_ggplot2.csv
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
*.pyo
.ipynb_checkpoints/
--------------------------------------------------------------------------------
/msleep_ggplot2.csv:
--------------------------------------------------------------------------------
name,genus,vore,order,conservation,sleep_total,sleep_rem,sleep_cycle,awake,brainwt,bodywt
Cheetah,Acinonyx,carni,Carnivora,lc,12.1,NA,NA,11.9,NA,50
Owl monkey,Aotus,omni,Primates,NA,17,1.8,NA,7,0.0155,0.48
Mountain beaver,Aplodontia,herbi,Rodentia,nt,14.4,2.4,NA,9.6,NA,1.35
Greater short-tailed shrew,Blarina,omni,Soricomorpha,lc,14.9,2.3,0.133333333,9.1,0.00029,0.019
Cow,Bos,herbi,Artiodactyla,domesticated,4,0.7,0.666666667,20,0.423,600
Three-toed sloth,Bradypus,herbi,Pilosa,NA,14.4,2.2,0.766666667,9.6,NA,3.85
Northern fur seal,Callorhinus,carni,Carnivora,vu,8.7,1.4,0.383333333,15.3,NA,20.49
Vesper mouse,Calomys,NA,Rodentia,NA,7,NA,NA,17,NA,0.045
Dog,Canis,carni,Carnivora,domesticated,10.1,2.9,0.333333333,13.9,0.07,14
Roe deer,Capreolus,herbi,Artiodactyla,lc,3,NA,NA,21,0.0982,14.8
Goat,Capri,herbi,Artiodactyla,lc,5.3,0.6,NA,18.7,0.115,33.5
Guinea pig,Cavis,herbi,Rodentia,domesticated,9.4,0.8,0.216666667,14.6,0.0055,0.728
Grivet,Cercopithecus,omni,Primates,lc,10,0.7,NA,14,NA,4.75
Chinchilla,Chinchilla,herbi,Rodentia,domesticated,12.5,1.5,0.116666667,11.5,0.0064,0.42
Star-nosed mole,Condylura,omni,Soricomorpha,lc,10.3,2.2,NA,13.7,0.001,0.06
African giant pouched rat,Cricetomys,omni,Rodentia,NA,8.3,2,NA,15.7,0.0066,1
Lesser short-tailed shrew,Cryptotis,omni,Soricomorpha,lc,9.1,1.4,0.15,14.9,0.00014,0.005
Long-nosed armadillo,Dasypus,carni,Cingulata,lc,17.4,3.1,0.383333333,6.6,0.0108,3.5
Tree hyrax,Dendrohyrax,herbi,Hyracoidea,lc,5.3,0.5,NA,18.7,0.0123,2.95
North American Opossum,Didelphis,omni,Didelphimorphia,lc,18,4.9,0.333333333,6,0.0063,1.7
Asian elephant,Elephas,herbi,Proboscidea,en,3.9,NA,NA,20.1,4.603,2547
Big brown bat,Eptesicus,insecti,Chiroptera,lc,19.7,3.9,0.116666667,4.3,3e-04,0.023
Horse,Equus,herbi,Perissodactyla,domesticated,2.9,0.6,1,21.1,0.655,521
Donkey,Equus,herbi,Perissodactyla,domesticated,3.1,0.4,NA,20.9,0.419,187
European hedgehog,Erinaceus,omni,Erinaceomorpha,lc,10.1,3.5,0.283333333,13.9,0.0035,0.77
Patas monkey,Erythrocebus,omni,Primates,lc,10.9,1.1,NA,13.1,0.115,10
Western american chipmunk,Eutamias,herbi,Rodentia,NA,14.9,NA,NA,9.1,NA,0.071
Domestic cat,Felis,carni,Carnivora,domesticated,12.5,3.2,0.416666667,11.5,0.0256,3.3
Galago,Galago,omni,Primates,NA,9.8,1.1,0.55,14.2,0.005,0.2
Giraffe,Giraffa,herbi,Artiodactyla,cd,1.9,0.4,NA,22.1,NA,899.995
Pilot whale,Globicephalus,carni,Cetacea,cd,2.7,0.1,NA,21.35,NA,800
Gray seal,Haliochoerus,carni,Carnivora,lc,6.2,1.5,NA,17.8,0.325,85
Gray hyrax,Heterohyrax,herbi,Hyracoidea,lc,6.3,0.6,NA,17.7,0.01227,2.625
Human,Homo,omni,Primates,NA,8,1.9,1.5,16,1.32,62
Mongoose lemur,Lemur,herbi,Primates,vu,9.5,0.9,NA,14.5,NA,1.67
African elephant,Loxodonta,herbi,Proboscidea,vu,3.3,NA,NA,20.7,5.712,6654
Thick-tailed opposum,Lutreolina,carni,Didelphimorphia,lc,19.4,6.6,NA,4.6,NA,0.37
Macaque,Macaca,omni,Primates,NA,10.1,1.2,0.75,13.9,0.179,6.8
Mongolian gerbil,Meriones,herbi,Rodentia,lc,14.2,1.9,NA,9.8,NA,0.053
Golden hamster,Mesocricetus,herbi,Rodentia,en,14.3,3.1,0.2,9.7,0.001,0.12
Vole ,Microtus,herbi,Rodentia,NA,12.8,NA,NA,11.2,NA,0.035
House mouse,Mus,herbi,Rodentia,nt,12.5,1.4,0.183333333,11.5,4e-04,0.022
Little brown bat,Myotis,insecti,Chiroptera,NA,19.9,2,0.2,4.1,0.00025,0.01
Round-tailed muskrat,Neofiber,herbi,Rodentia,nt,14.6,NA,NA,9.4,NA,0.266
Slow loris,Nyctibeus,carni,Primates,NA,11,NA,NA,13,0.0125,1.4
Degu,Octodon,herbi,Rodentia,lc,7.7,0.9,NA,16.3,NA,0.21
Northern grasshopper mouse,Onychomys,carni,Rodentia,lc,14.5,NA,NA,9.5,NA,0.028
Rabbit,Oryctolagus,herbi,Lagomorpha,domesticated,8.4,0.9,0.416666667,15.6,0.0121,2.5
Sheep,Ovis,herbi,Artiodactyla,domesticated,3.8,0.6,NA,20.2,0.175,55.5
Chimpanzee,Pan,omni,Primates,NA,9.7,1.4,1.416666667,14.3,0.44,52.2
Tiger,Panthera,carni,Carnivora,en,15.8,NA,NA,8.2,NA,162.564
Jaguar,Panthera,carni,Carnivora,nt,10.4,NA,NA,13.6,0.157,100
Lion,Panthera,carni,Carnivora,vu,13.5,NA,NA,10.5,NA,161.499
Baboon,Papio,omni,Primates,NA,9.4,1,0.666666667,14.6,0.18,25.235
Desert hedgehog,Paraechinus,NA,Erinaceomorpha,lc,10.3,2.7,NA,13.7,0.0024,0.55
Potto,Perodicticus,omni,Primates,lc,11,NA,NA,13,NA,1.1
Deer mouse,Peromyscus,NA,Rodentia,NA,11.5,NA,NA,12.5,NA,0.021
Phalanger,Phalanger,NA,Diprotodontia,NA,13.7,1.8,NA,10.3,0.0114,1.62
Caspian seal,Phoca,carni,Carnivora,vu,3.5,0.4,NA,20.5,NA,86
Common porpoise,Phocoena,carni,Cetacea,vu,5.6,NA,NA,18.45,NA,53.18
Potoroo,Potorous,herbi,Diprotodontia,NA,11.1,1.5,NA,12.9,NA,1.1
Giant armadillo,Priodontes,insecti,Cingulata,en,18.1,6.1,NA,5.9,0.081,60
Rock hyrax,Procavia,NA,Hyracoidea,lc,5.4,0.5,NA,18.6,0.021,3.6
Laboratory rat,Rattus,herbi,Rodentia,lc,13,2.4,0.183333333,11,0.0019,0.32
African striped mouse,Rhabdomys,omni,Rodentia,NA,8.7,NA,NA,15.3,NA,0.044
Squirrel monkey,Saimiri,omni,Primates,NA,9.6,1.4,NA,14.4,0.02,0.743
Eastern american mole,Scalopus,insecti,Soricomorpha,lc,8.4,2.1,0.166666667,15.6,0.0012,0.075
Cotton rat,Sigmodon,herbi,Rodentia,NA,11.3,1.1,0.15,12.7,0.00118,0.148
Mole rat,Spalax,NA,Rodentia,NA,10.6,2.4,NA,13.4,0.003,0.122
Arctic ground squirrel,Spermophilus,herbi,Rodentia,lc,16.6,NA,NA,7.4,0.0057,0.92
Thirteen-lined ground squirrel,Spermophilus,herbi,Rodentia,lc,13.8,3.4,0.216666667,10.2,0.004,0.101
Golden-mantled ground squirrel,Spermophilus,herbi,Rodentia,lc,15.9,3,NA,8.1,NA,0.205
Musk shrew,Suncus,NA,Soricomorpha,NA,12.8,2,0.183333333,11.2,0.00033,0.048
Pig,Sus,omni,Artiodactyla,domesticated,9.1,2.4,0.5,14.9,0.18,86.25
Short-nosed echidna,Tachyglossus,insecti,Monotremata,NA,8.6,NA,NA,15.4,0.025,4.5
Eastern american chipmunk,Tamias,herbi,Rodentia,NA,15.8,NA,NA,8.2,NA,0.112
Brazilian tapir,Tapirus,herbi,Perissodactyla,vu,4.4,1,0.9,19.6,0.169,207.501
Tenrec,Tenrec,omni,Afrosoricida,NA,15.6,2.3,NA,8.4,0.0026,0.9
Tree shrew,Tupaia,omni,Scandentia,NA,8.9,2.6,0.233333333,15.1,0.0025,0.104
Bottle-nosed dolphin,Tursiops,carni,Cetacea,NA,5.2,NA,NA,18.8,NA,173.33
Genet,Genetta,carni,Carnivora,NA,6.3,1.3,NA,17.7,0.0175,2
Arctic fox,Vulpes,carni,Carnivora,NA,12.5,NA,NA,11.5,0.0445,3.38
Red fox,Vulpes,carni,Carnivora,NA,9.8,2.4,0.35,14.2,0.0504,4.23
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Intro to Data Science with R Programming
==================
Brought to you by [Lesley Cordero](http://www.columbia.edu/~lc2958) and [ADI](https://adicu.com)

## Table of Contents

- [0.0 Setup](#00-setup)
    + [0.1 R and R Studio](#01-r-and-r-studio)
    + [0.2 Packages](#02-packages)
    + [0.3 Virtual Environment](#03-virtual-environment)
- [1.0 Background](#10-background)
    + [1.1 What is Data Science?](#11-what-is-data-science)
    + [1.2 Is data science the same as machine learning?](#12-is-data-science-the-same-as-machine-learning)
    + [1.3 Why is Data Science important?](#13-why-is-data-science-important)
    + [1.4 Machine Learning](#14-machine-learning)
    + [1.5 Data](#15-data)
    + [1.6 Overfitting vs Underfitting](#16-overfitting-vs-underfitting)
    + [1.7 Glossary](#17-glossary)
        * [1.7.1 Factors](#171-factors)
        * [1.7.2 Corpus](#172-corpus)
        * [1.7.3 Bias](#173-bias)
        * [1.7.4 Variance](#174-variance)
- [2.0 Data Preparation](#20-data-preparation)
    + [2.1 dplyr](#21-dplyr)
- [3.0 Exploratory Analysis](#30-exploratory-analysis)
- [4.0 Data Visualization](#40-data-visualization)
- [5.0 Machine Learning & Prediction](#50-machine-learning--prediction)
    + [5.1 Random Forests](#51-random-forests)
    + [5.2 Natural Language Processing](#52-natural-language-processing)
        * [5.2.1 ANLP](#521-anlp)
    + [5.3 k-Means Clustering](#53-k-means-clustering)
- [6.0 Final Exercise](#60-final-exercise)
- [7.0 Final Words](#70-final-words)
    + [7.1 Resources](#71-resources)
    + [7.2 Mini Courses](#72-mini-courses)


## 0.0 Setup

This guide was written in R 3.2.3.


### 0.1 R and R Studio

Download [R](https://www.r-project.org/) and [R Studio](https://www.rstudio.com/products/rstudio/download/).


### 0.2 Packages

Next, to install the R packages, cd into your workspace and enter the following, very simple, command into your bash:

```
R
```

This will start a session in R! From here, you can install any needed packages. For the sake of this tutorial, enter the following into your R session:

```
install.packages("ggvis")
install.packages("gmodels")
install.packages("RCurl")
install.packages("tm")
install.packages("caTools")
install.packages("ggplot2")
install.packages("RFinfer")
install.packages("dplyr")
install.packages("lubridate")
install.packages("compare")
install.packages("downloader")
```
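Equivalently, since `install.packages()` also accepts a character vector, you can install everything in a single call:

``` R
# Install all of this tutorial's packages in one call
install.packages(c("ggvis", "gmodels", "RCurl", "tm", "caTools", "ggplot2",
                   "RFinfer", "dplyr", "lubridate", "compare", "downloader"))
```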

### 0.3 Virtual Environment

If you'd like to work in a virtual environment, you can set it up as follows:
```
pip3 install virtualenv
virtualenv your_env
```
And then launch it with:
```
source your_env/bin/activate
```

To execute the visualizations in matplotlib, do the following:

```
cd ~/.matplotlib
vim matplotlibrc
```
And then write `backend: TkAgg` in the file. Now you should be set up with your virtual environment!

Cool, now we're ready to start!


## 1.0 Background

Before we head into an actual data science problem demo, let's go over some vital background information.

### 1.1 What is Data Science?

Data Science is the application of statistical and mathematical methods to problems involving sets of data. In other words, it's taking techniques developed in the areas of statistics and math and using them to learn from some sort of data source.

#### 1.1.1 What do you mean by data?

Data is essentially anything that can be recorded or transcribed - numerical, text, images, sounds, anything!

#### 1.1.2 What background do you need to work on a data science problem?

It depends entirely on what you're working on, but generally speaking, you should be comfortable with probability, statistics, and some linear algebra.

### 1.2 Is data science the same as machine learning?

Well, no. They do have overlap, but they are not the same! Whereas machine learning involves lots of theoretical components we won't worry about here, data science takes those methods and applies them to the real world. Studying the theoretical components can still be very useful to your understanding of data science, however!

### 1.3 Why is Data Science important?

Data Science has so much potential! By using data in creative and innovative ways, we can gain a lot of insight into the world, whether that be in economics, biology, sociology, or math - for any topic you can think of, data science has a role.

### 1.4 Machine Learning

Generally speaking, Machine Learning can be split into three types of learning: supervised, unsupervised, and reinforcement learning.

#### 1.4.1 Supervised Learning

These algorithms work with a target or outcome variable (the dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of supervised learning: regression, decision trees, random forests, KNN, logistic regression, etc.


#### 1.4.2 Unsupervised Learning

With these algorithms, we do not have any target or outcome variable to predict or estimate. Unsupervised learning is used for clustering a population into different groups, which is widely applied to segmenting customers for specific interventions. Examples of unsupervised learning: the Apriori algorithm, k-means.


#### 1.4.3 Reinforcement Learning

Using these algorithms, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually by trial and error, learning from past experience and trying to capture the best possible knowledge to make accurate decisions. Example of reinforcement learning: Markov Decision Processes.


### 1.5 Data

As a data scientist, knowing the different forms data takes is highly important.

#### 1.5.1 Training vs Test Data

When it comes time to train your classifier or model, you're going to need to split your data into testing and training data.

Typically, the majority of your data goes towards training, while only 10-25% is held out for testing. It's important that there is no overlap between the two. If the sets overlap, or you use your training data for testing, your accuracy estimates will be wrong: any classifier that's tested on the data it was trained on will obviously do very well, since it has already observed those examples, so the accuracy will be high, but misleadingly so.
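Here's a minimal sketch of such a split in base R (the built-in `iris` data frame stands in for whatever dataset you're working with):

``` R
set.seed(123)                                  # make the split reproducible
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of row indices for training
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]                    # held-out 20% - no overlap
```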

#### 1.5.2 Open Data

What's open data, you ask? Simple: it's data that's freely available for anyone to use! You can get at it through APIs, downloadable files, or web scraping.

You might be wondering where this data comes from - well, it can come from a variety of sources, but some common ones include large tech companies like Facebook, Google, and Instagram. Others include large institutions, like the US government! Beyond that, you can find tons of data from all sorts of organizations and individuals.

### 1.6 Overfitting vs Underfitting

In section 1.5.1, we saw why testing on your training data produces misleading accuracy; overfitting is the closely related modeling problem. Overfitting refers to creating a model that doesn't generalize beyond your training data. In other words, if your model overfits your data, it has learned your data too well - it has essentially memorized it. This might not seem like a problem at first, but a model that has just "memorized" your data is one that's going to perform poorly on new, unobserved data.

Underfitting, on the other hand, is when your model is too simple to capture the structure in your data. This model will also perform poorly on new, unobserved data. Underfitting usually means we should increase the number of considered features, which will expand the hypothesis space.


### 1.7 Glossary

#### 1.7.1 Factors

Factors in R are stored as a vector of integer values with a corresponding set of character values (the levels) to use when the factor is displayed.
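A tiny illustration of that internal representation:

``` R
colors <- factor(c("red", "blue", "red"))
levels(colors)       # "blue" "red"  - the character values, sorted alphabetically
as.integer(colors)   # 2 1 2        - the integer codes pointing into the levels
```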

#### 1.7.2 Corpus

A corpus (plural: corpora) is a collection of written texts that serves as our dataset.

#### 1.7.3 Bias

In machine learning, bias is the tendency for a learner to consistently learn the same wrong thing.

#### 1.7.4 Variance

Variance is the error from sensitivity to small fluctuations in the training set. High variance can cause overfitting, since it causes a classifier to model the random noise in the training data rather than the intended outputs.

## 2.0 Data Preparation

### 2.1 dplyr

dplyr allows us to transform and summarize tabular data with rows and columns. It contains a set of functions that perform common data manipulation operations like filtering rows, selecting specific columns, re-ordering rows, adding new columns, and summarizing data.

First we begin by loading in the needed packages:
``` R
library(dplyr)
library(downloader)
```

Using the data available in [this](https://github.com/lesley2958/data-science-r/blob/master/msleep_ggplot2.csv) repo, we'll load the data into R:

``` R
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
filename <- "msleep_ggplot2.csv"
if (!file.exists(filename)) download(url, filename)
msleep <- read.csv("msleep_ggplot2.csv")
head(msleep)
```

#### 2.1.1 select()

To demonstrate how the `select()` function works, we select the name and sleep_total columns.

``` R
sleepData <- select(msleep, name, sleep_total)
head(sleepData)
```

To select all the columns except a specific column, you can use the subtraction sign:

``` R
head(select(msleep, -name))
```

You can also select a range of columns with a colon:

``` R
head(select(msleep, name:order))
```

#### 2.1.2 filter()

Using the `filter()` function in dplyr, we can select rows that meet a certain criterion, such as in the following:

``` R
filter(msleep, sleep_total >= 16)
```
Here, we filter out the animals whose sleep total is less than 16 hours. If you want to expand the criteria, you can:

```R
filter(msleep, sleep_total >= 16, bodywt >= 1)
```

#### 2.1.3 Functions

dplyr's core verbs are listed below; a sketch combining them follows the list.

- `arrange()`: re-order or arrange rows
- `filter()`: filter rows
- `group_by()`: allows for group operations in the “split-apply-combine” concept
- `mutate()`: create new columns
- `select()`: select columns
- `summarise()`: summarise values
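Here's a quick sketch chaining the verbs we haven't demonstrated yet on the same msleep data, using dplyr's `%>%` pipe (the column choices are just for illustration):

``` R
msleep %>%
  mutate(rem_proportion = sleep_rem / sleep_total) %>%      # new column
  group_by(vore) %>%                                        # split by diet type
  summarise(avg_sleep = mean(sleep_total, na.rm = TRUE),    # summarise each group
            avg_rem   = mean(rem_proportion, na.rm = TRUE)) %>%
  arrange(desc(avg_sleep))                                  # re-order the result
```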

## 3.0 Exploratory Analysis

### 3.1 summary()
R gives you the opportunity to go more in-depth with the summary() function. For numeric columns, this gives you the minimum value, first quartile, median, mean, third quartile, and maximum value - here for the iris dataset:

``` R
summary(iris)
```

### 3.2 xda

xda contains tools to perform initial exploratory analysis on any input dataset. It includes custom functions for plotting the data as well as performing different kinds of analyses such as univariate, bivariate, and multivariate investigation - the typical first step of any predictive modeling pipeline. This is a great package to start with on any dataset because it gives you a good sense of the data before you jump into building predictive models.


### 3.3 preprosim

[preprosim](https://mran.revolutionanalytics.com/web/packages/preprosim/vignettes/preprosim.html) helps to add contaminations (noise, missing values, outliers, low variance, irrelevant features, class swap (inconsistency), class imbalance, and decrease in data volume) to data, and then evaluate the simulated data sets for classification accuracy.


## 4.0 Data Visualization


### 4.1 ggvis

ggvis allows you to make scatterplots, as with the following:

``` R
library(ggvis)
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
```

### 4.2 heatmaply

[heatmaply](https://mran.revolutionanalytics.com/package/heatmaply/) produces interactive heatmaps.

This code snippet shows the correlation structure of the variables in the mtcars dataset:

``` R
library(heatmaply)
heatmaply(cor(mtcars),
          k_col = 2, k_row = 2,
          limits = c(-1,1)) %>%
  layout(margin = list(l = 40, b = 40))
```

## 5.0 Machine Learning & Prediction


### 5.1 Random Forests

Random forest is a great choice for nearly any prediction problem, even non-linear ones. It belongs to a larger class of machine learning algorithms called ensemble methods.


#### 5.1.1 RFinfer

RFinfer provides functions that use the infinitesimal jackknife to generate predictions and prediction variances from random forest models.

Now we'll go through an exercise involving RFinfer. First, we'll load the needed packages and the example data included with R (specifically, the New York Air Quality Measurements dataset):

``` R
library(randomForest)   # provides randomForest(), used to fit the model below
library(RFinfer)
library(ggplot2)
data('airquality')
```

Because calls to random forest do not allow missing data, we omit incomplete cases of the data as well as high Ozone outliers:

``` R
d.aq <- na.omit(airquality)
d.aq <- d.aq[d.aq$Ozone < 100, ]
```

Now we finally train the random forest model:
``` R
rf <- randomForest(Ozone ~ ., data=d.aq, keep.inbag=T)
```

Here, we grab the prediction variances for the training data along with the 95% confidence intervals:

``` R
rf.preds <- rfPredVar(rf, rf.data=d.aq, CI=TRUE)
str(rf.preds)
```

Then we get:
```
## 'data.frame':    104 obs. of  4 variables:
##  $ pred       : num  37.2 29.5 19.8 21.9 23.9 ...
##  $ pred.ij.var: num  -1.29 15.7 20.31 4.94 3.78 ...
##  $ l.ci       : num  39.71 -1.25 -19.97 12.25 16.47 ...
##  $ u.ci       : num  34.7 60.3 59.6 31.6 31.3 ...
```

Next, we'll plot the predictions with their 95% confidence intervals against the actual values:

``` R
ggplot(rf.preds,aes(d.aq$Ozone,pred)) +
  geom_abline(intercept=0,slope=1,lty=2, color='#999999') +
  geom_point() +
  geom_errorbar(aes(ymin=l.ci,ymax=u.ci,height=0.15)) +
  xlab('Actual') + ylab('Predicted') +
  theme_bw()
```

Here, we can see that the random forest is generally less confident about its inaccurate predictions, which we visualize by plotting the prediction variance as a function of the prediction error:

``` R
qplot(d.aq$Ozone - rf.preds$pred, rf.preds$pred.ij.var,
      xlab='prediction error', ylab='prediction variance') + theme_bw()
```

### 5.2 Natural Language Processing

#### 5.2.1 ANLP

[ANLP](https://mran.revolutionanalytics.com/web/packages/ANLP/vignettes/ANLP_Documentation.html) provides functions for building text prediction models. It contains functions for cleaning text data, building N-grams, and more.
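To make the N-gram idea concrete, here's a bigram count in plain base R - note this is just an illustration of what an N-gram table is, not ANLP's actual API:

``` R
# Count bigrams (pairs of adjacent words) in a toy string
text    <- "the cat sat on the mat and the cat slept"
tokens  <- strsplit(tolower(text), "\\s+")[[1]]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
sort(table(bigrams), decreasing = TRUE)   # "the cat" occurs twice
```

A text prediction model uses exactly these counts: given the previous word(s), it proposes the most frequent continuation.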

### 5.3 k-Means Clustering

k-means clustering is an unsupervised learning algorithm that clusters data based on similarity. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

- Reassign data points to the cluster whose centroid is closest
- Calculate the new centroid of each cluster

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the Euclidean distances between the data points and their respective cluster centroids.
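As a minimal sketch, here's base R's built-in `kmeans()` on the numeric columns of iris (the dataset and the choice of k = 3 are just for illustration):

``` R
set.seed(20)                             # the initial assignment is random
km <- kmeans(iris[, 1:4], centers = 3)   # ask for k = 3 clusters
km$centers                               # the final cluster centroids
table(km$cluster, iris$Species)          # compare clusters to the actual species
```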

## 6.0 Final Exercise

For this final exercise, we'll be implementing a sentiment analysis classifier. Sentiment analysis involves building a system to collect and determine the emotional tone behind words. This is important because it allows you to gain an understanding of the attitudes, opinions, and emotions of the people in your data.

At a high level, sentiment analysis combines natural language processing and artificial intelligence: we take the actual text, transform it into a format a machine can read, and use statistics to determine the sentiment.

For the model portion of this exercise, we'll use a linear model, since it lets us define the output variable as a linear combination of the input variables.

For this tutorial, we'll be using the following packages:

``` R
library(RCurl)
library(tm)
library(caTools)
```

### 6.1 Data Preparation

Here, we're just loading the data from the URLs. Although the R function read.csv can work with URLs, it doesn't necessarily handle https, so we use the RCurl package to make sure our links can be downloaded:

``` R
test_data_url <- "https://dl.dropboxusercontent.com/u/8082731/datasets/UMICH-SI650/testdata.txt"
train_data_url <- "https://dl.dropboxusercontent.com/u/8082731/datasets/UMICH-SI650/training.txt"

test_data_file <- getURL(test_data_url)
train_data_file <- getURL(train_data_url)

train_data_df <- read.csv(
    text = train_data_file,
    sep = '\t',
    header = FALSE,
    quote = "",
    stringsAsFactors = FALSE,
    col.names = c("Sentiment", "Text"))
test_data_df <- read.csv(
    text = test_data_file,
    sep = '\t',
    header = FALSE,
    quote = "",
    stringsAsFactors = FALSE,
    col.names = c("Text"))
```

Here, we convert Sentiment to a factor:

``` R
train_data_df$Sentiment <- as.factor(train_data_df$Sentiment)
```

In R we will use the tm package for text mining, so we'll use it to create a corpus - essentially a collection of content and metadata objects. We use all the data (train and test together) so the corpus contains every possible word:

``` R
corpus <- Corpus(VectorSource(c(train_data_df$Text, test_data_df$Text)))
```

We want our data to be as clean as possible before sending it out for training, so we apply techniques like removing punctuation, stop words, and extra white space to make the text as consistent as possible:

``` R
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
```

Next, we need a DocumentTermMatrix object for the corpus. A document-term matrix contains a column for each distinct word in the whole corpus and a row for each document; a given cell holds the frequency of that term in that document.

``` R
dtm <- DocumentTermMatrix(corpus)
```

We can take a glimpse of how it looks:
```
<<DocumentTermMatrix (documents: ..., terms: ...)>>
Non-/sparse entries: 401380/0
Sparsity           : 0%
Maximal term length: 20
Weighting          : term frequency (tf)
```

Now we want to convert this matrix into a data frame that we can use to train a classifier:

``` R
important_words_df <- as.data.frame(as.matrix(dtm))
colnames(important_words_df) <- make.names(colnames(important_words_df))
```

Here, we split the document-term rows back into their original training and test portions, bind them onto the corresponding data frames, and then drop the original text field:

``` R
important_words_train_df <- head(important_words_df, nrow(train_data_df))
important_words_test_df <- tail(important_words_df, nrow(test_data_df))

train_data_words_df <- cbind(train_data_df, important_words_train_df)
test_data_words_df <- cbind(test_data_df, important_words_test_df)

train_data_words_df$Text <- NULL
test_data_words_df$Text <- NULL
```

In order to obtain our evaluation set, we split our training data using sample.split from the caTools package:

``` R
set.seed(1234)
spl <- sample.split(train_data_words_df$Sentiment, .85)
```

Now we use `spl` to split our data into train and test sets:

``` R
eval_train_data_df <- train_data_words_df[spl==T,]
eval_test_data_df <- train_data_words_df[spl==F,]
```


### 6.2 Data Analysis

Building a linear model in R requires only one function call, `glm`, so we use that to create our classifier. We set the family parameter to binomial to indicate that we want logistic regression:

``` R
log_model <- glm(Sentiment~., data=eval_train_data_df, family=binomial)
```

And as always, we now use our model on the test data:

``` R
log_pred <- predict(log_model, newdata=eval_test_data_df, type="response")
```

Using this table, we can calculate accuracy from the predicted probabilities, with 0.5 as the decision threshold:

``` R
table(eval_test_data_df$Sentiment, log_pred>.5)
```

The diagonal cells of this table are the correct predictions, so we get:

```
(453 + 590) / nrow(eval_test_data_df)
```
```
0.9811853
```
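Equivalently, without hard-coding the cell counts, you can sum the diagonal of the table:

``` R
tbl <- table(eval_test_data_df$Sentiment, log_pred > .5)
sum(diag(tbl)) / sum(tbl)   # correct predictions / all predictions
```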

This is a very good accuracy. It seems that our bag-of-words approach works nicely with this particular problem.


## 7.0 Final Words

This was a brief overview of Data Science and its different components. Obviously there is more to each component we went through, but this tutorial should have given you an idea of what a data science problem looks like.

### 7.1 Resources

- [The Art of R Programming](https://www.dropbox.com/s/cr7mg2h20yzvbq3/The_Art_Of_R_Programming.pdf?dl=0)
- [R Bloggers](https://www.r-bloggers.com/)
- [kdnuggets](http://www.kdnuggets.com/)


### 7.2 Mini Courses

Learn about courses [here](http://www.byteacademy.co/all-courses/data-science-mini-courses/).

- [Python 101: Data Science Prep](https://www.eventbrite.com/e/python-101-data-science-prep-tickets-30980459388)
- [Intro to Data Science & Stats with R](https://www.eventbrite.com/e/data-sci-109-intro-to-data-science-statistics-using-r-tickets-30908877284)
- [Data Acquisition Using Python & R](https://www.eventbrite.com/e/data-sci-203-data-acquisition-using-python-r-tickets-30980705123)
- [Data Visualization with Python](https://www.eventbrite.com/e/data-sci-201-data-visualization-with-python-tickets-30980827489)
- [Fundamentals of Machine Learning and Regression Analysis](https://www.eventbrite.com/e/data-sci-209-fundamentals-of-machine-learning-and-regression-analysis-tickets-30980917759)
- [Natural Language Processing with Data Science](https://www.eventbrite.com/e/data-sci-210-natural-language-processing-with-data-science-tickets-30981006023)
- [Machine Learning with Data Science](https://www.eventbrite.com/e/data-sci-309-machine-learning-with-data-science-tickets-30981154467)
- [Databases & Big Data](https://www.eventbrite.com/e/data-sci-303-databases-big-data-tickets-30981182551)
- [Deep Learning with Data Science](https://www.eventbrite.com/e/data-sci-403-deep-learning-with-data-science-tickets-30981221668)
- [Data Sci 500: Projects](https://www.eventbrite.com/e/data-sci-500-projects-tickets-30981330995)

--------------------------------------------------------------------------------