├── Chapter01 └── ch 1 codebase June 6.R ├── Chapter02 └── Ch 2 Codebase June 6.R ├── Chapter03 └── ch 3 Codebase.R ├── Chapter04 ├── Ch 4 Codebase.R └── Consumer_Complaints (1).zip ├── Chapter05 └── ch 5 codebase as of May 31.R ├── Chapter06 └── ch 6 codebase June 6.R ├── Chapter07 └── ch 7 codebase.R ├── Chapter08,09 └── B05033_08_09_Codebase.html ├── Chapter10 ├── B05033_10_Regression_Codebase.html └── B05033_10_kmeans_Codebase.html ├── Chapter11 └── B05033_11_Codebase.html ├── LICENSE └── README.md /Chapter01/ch 1 codebase June 6.R: -------------------------------------------------------------------------------- 1 | setwd("C:/PracticalPredictiveAnalytics/R") 2 | getwd() 3 | #[1] "C:/PracticalPredictiveAnalytics/R" 4 | # 5 | # 6 | # I suggest using getwd and setwd liberally, especially if you are working on multiple projects, and want to avoid reading or writing the wrong files. 7 | # The Source Window 8 | # The Source window is where all of your R Code appears. It is also where you will be probably spending most of your time. You can have several script windows open, all at once. 9 | # Creating a new Script 10 | # To create a new script select "File/New File/R Script" from the top navigation bar. A new blank script window will appear with the name "Untitled1". 11 | # Once it appears you can start entering code! 12 | # 13 | # B05033_01_05.png 14 | # Our First Predictive Model 15 | # Now that all of the preliminaries are out of the way, it is time to jump into starting to code our first simple predictive model. There will be 2 scripts written to accomplish this. 16 | # Our first R script is not a predictive model (yet), but it is a preliminary program which will view and plot some data. The data set we will use is already built into the R package system, and is not necessary to load externally. For quickly illustrating techniques, I will sometimes use sample data contained within specific R packages themselves in order to demonstrate ideas, rather than pulling data in from an external file. In this case our data will be pulled from the "datasets" package, which is loaded by default at startup. 17 | # . Paste the following code into the "Untitled1" scripts that was just created. Don't worry about what each line means yet. I will cover the specific lines after the code is executed. 18 | require(graphics) 19 | data(women) 20 | head(women) 21 | View(women) 22 | plot(women$height,women$weight) 23 | # 24 | # . Within the code pane, you will see a menu bar right beneath the "Untitled1" tab. It should look something like this: 25 | # 26 | # 27 | # 28 | # . To execute the code, Click the "Source" Icon. The display should then change to the diagram below: 29 | # 30 | # 31 | # B05033_01_06.png 32 | # Notice from the picture below that three things have changed: 33 | # 1. Output has been written to the console pane 34 | # 2. The View pane has popped up which contains a two column table. 35 | # 3. Additionally, a plot will appear in the Plot pane 36 | # Code Description 37 | # 38 | # Here are some more details on what the code has accomplished. 39 | # 40 | # Line 1 of the code contains the function "require", which is just a way of saying that R needs a specific package to run. In this case require(graphics) specifies that the graphics package is needed for the analysis, and it will load it into memory. If it is not available, you will get an error message. 
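# (Strictly speaking, require() returns FALSE with a warning when a package cannot be
# found, whereas library() stops with an error. A minimal sketch of the difference --
# the package name "notARealPackage" is deliberately fictitious:)
if (!require(graphics)) {
  message("graphics could not be loaded")   # require() returns TRUE/FALSE
}
# library(notARealPackage)   # library() would stop with an error rather than a warning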
However, "graphics" is a base package and should be available 41 | # 42 | # Line 2 of the code loads the "Women" data object into memory using the data(women) function. 43 | # Lines 3-5 of the code display the raw data in three different ways: 44 | # 45 | # 1. View(women) - This will visually display the dataframe. Although this is part of the actual R script, viewing a dataframe is a very common task, and is often issued directly as a command via the R Console. As you can see in the figure above, the "Women" data frame has 15 rows, and 2 columns named height and weight. 46 | # 47 | # 2. plot(women$height, women$weight) - This uses the native R plot function which plots the values of the two variables against each other. It is usually the first step one does to begin to understand the relationship between 2 variables. As you can see the relationship is very linear. 48 | # 49 | # 3. head(women) - This displays the first N rows of the women data frame to the console. If you want no more than a certain number of rows, add that as a 2nd argument of the function. E.g. Head(women,99) will display UP TO 99 rows in the console. The tail() function works similarly, but displays the last rows of data. 50 | # 51 | # Saving the script 52 | # To save this script, navigate to the top navigation menu bar and select "File/Save". When the file selector appears navigate to the PracticalPredictiveAnalytics/R folder that was created, and name the file "Chapter1_DataSource". Then select "Save". 53 | # 54 | # Your 2nd script 55 | # Our 2nd R script is a simple two variable regression model which predicts women's height based upon weight. 56 | # Begin by creating another Rscript by selecting "File/New File/R Script" from the top navigation bar. 57 | # If you create new scripts via "File/New File/R Script" often enough you might get "Click Fatigue" (uses 3 clicks), so you can also save a click by selecting the icon in the top left with the + sign. 58 | # 59 | # 60 | # 61 | # Whichever way you choose , a new blank script window will appear with the name "Untitled2" 62 | # Now Paste the following code into the new script window 63 | require(graphics) 64 | data(women) 65 | lm_output <- lm(women$height ~ women$weight) 66 | summary(lm_output) 67 | prediction <- predict(lm_output) 68 | error <- women$height-prediction 69 | plot(women$height,error) 70 | 71 | # Press the "Source" icon to run the entire code. The display will change to something similar to what is displayed below: 72 | # 73 | # 74 | # B05033_01_07.png 75 | # 76 | # Code Description 77 | # Here are the notes and explanations for the script code that you have just ran: 78 | # Line 3 - lm() function: This function runs a simple linear regression using the lm() function. This function will run a linear regression which predicts woman's height based upon the value of their weight. In statistical parlance, you will be 'regressing' height on weight. The line of code which accomplishes this is: 79 | # 80 | # lm_output <- lm(women$height ~ women$weight) 81 | # 82 | # There are two operators, and one function that you will become very familiar with when running Predictive Models in R. 83 | # 84 | # 1. The ~ operator (also called the tilde) is an expression that is used when specifying a formula in R. In a predictive context, is a shorthand way for separating what you want to predict, with what you are using to predict. 
What you are predicting (the dependent or Target variable) is usually on the left side of the formula, and the predictors (independent variables, features) are on the right side. In the formula, I have specified them explicitly by using the data frame name together with the column name, i.e. women$height (what you are predicting) and women$weight (predictor) 85 | # 86 | # 2. The <- operator (also called assignment) says assign whatever function operators are on the right side to whatever object is on the left side. This will always create or replace a new object that you can further display or manipulate. In this case we will be creating a new object called lm_output, which is created using the function lm(), which creates a Linear model based on the formula contained within the parentheses. 87 | # 88 | # 89 | # Note that the execution of this line does not produce any displayed output. You can see if the line was executed by checking the console. If there is any problem with running the line (or any line for that matter) you will see an error message in the console. 90 | # Line 4 - summary(lm_output): The summary function is a generic R function which works on many different types of R objects. In this case, summary() will summarize the results of the linear regression output. 91 | # The results will appear in the Console window as pictured in the figure above. Just to keep thing a little bit simpler for now, I will just show the first few lines of the output, and underline what you should be looking at. Do not be discouraged by the amount of output produced. 92 | # First look at the lines marked (Intercept), and women$weight which appear under the Coefficients line in the console. 93 | # Coefficients: 94 | # Estimate Std. Error t value Pr(>|t|) 95 | # (Intercept) 25.723456 1.043746 24.64 2.68e-12 *** 96 | # women$weight 0.287249 0.007588 37.85 1.09e-14 *** 97 | # 98 | # The Estimate Column illustrates the linear regression formula needed to derive height from weight. We can actually use these numbers along with a calculator to determine the prediction ourselves. For our example the output tells us that we should perform the following steps for ALL of the observations in our dataframe in order to obtain the prediction for height. We will obviously not want to do all of the observations (R will do that via the predict() function below), but we will illustrate the calculation for 1 data point. 99 | # . Take the weight value for each observation. Let's take the weight of the first woman which is 115 lbs. 100 | # . Then,multiply weight by 0.2872 . That is the number that is listed under Estimate for womens$weight. Multiplying 115 lbs. by 0.2872 yield 33.028 101 | # . Then add 25.7235 which is the estimate of the (intercept) row. That will yield a prediction of 58.75 inches. 102 | # If you do not have a calculator handy, the calculation is easily done in 'calculator' mode via the R Console, by typing the following: 103 | # 104 | # The Predict function 105 | # To predict the value for all of the values we will use a function called predict(). This function reads each input (independent) variable and then predicts a target (dependent) variable based on the linear regression equation. In the code we have assigned the output of this function to a new object named "prediction". 106 | # Switch over to the console area, and type "prediction", then Enter, to see the predicted values for the 15 women. The following should appear in the console. 
107 | 108 | prediction 109 | length(error) 110 | #[1] 15 111 | # 112 | # In all of the above cases, the counts all compute as 15, so all is good. 113 | # If we want to see the raw data, predictions, and the prediction errors for all of the data, we can use the cbind() function (Column bind) to concatenate all three of those values, and display as a simple table. 114 | # At the console enter the follow cbind command. 115 | cbind(height=women$height,PredictedHeight=prediction,ErrorInPrediction=error) 116 | height PredictedHeight ErrorInPrediction 117 | # 1 58 58.75712 -0.75711680 118 | # 2 59 59.33162 -0.33161526 119 | # 3 60 60.19336 -0.19336294 120 | # 4 61 61.05511 -0.05511062 121 | # 5 62 61.91686 0.08314170 122 | # 6 63 62.77861 0.22139402 123 | # 7 64 63.64035 0.35964634 124 | # 8 65 64.50210 0.49789866 125 | # 9 66 65.65110 0.34890175 126 | # 10 67 66.51285 0.48715407 127 | # 11 68 67.66184 0.33815716 128 | # 12 69 68.81084 0.18916026 129 | # 13 70 69.95984 0.04016335 130 | # 14 71 71.39608 -0.39608278 131 | # 15 72 72.83233 -0.83232892 132 | # 133 | # 134 | # From the output above, we can see that there are a total 15 predictions. If you compare the ErrorInPrediction with the error plot shown above, you can see that for this very simple model, the prediction errors are much larger for extreme values in height (shaded values). 135 | # Just to verify that we have one for each of our original observations we will use the nrow() function to count the number of rows. 136 | # At the command prompt in the console area, enter the command: 137 | nrow(women) 138 | # 139 | # The following should appear: 140 | # 141 | # >nrow(women) 142 | # [1] 15 143 | # 144 | # Line 7 - plot(women$height,error) :This plots the predicted height vs. the errors. It shows how much the prediction was 'off' from the original value. You can see that the errors show a non-random pattern. 145 | # 146 | # After you are done, save the file using "File/File Save", navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_LinearRegression 147 | # R packages 148 | # An R package extends the functionality of basic R. Base R, by itself, is very capable, and you can do an incredible amount of analytics without adding any additional packages. However adding a package may be beneficial if it adds a functionality which does not exist in base R, improves or builds upon an existing functionality, or just makes something that you can already do easier. 149 | # For example, there are no built in packages in base R which enable you to perform certain types of machine learning (such as Random Forests). As a result, you need to search for an add on package which performs this functionality. Fortunately you are covered. There are many packages available which implement this algorithm. 150 | # Bear in mind that there are always new packages coming out. I tend to favor packages which have been on CRAN for a long time and have large user base. When installing something new, I will try to reference the results against other packages which do similar things. Speed is another reason to consider adopting a new package. 151 | # 152 | # 153 | # The Stargazer Package 154 | # 155 | # For an example of a package which can just make life easier, first let's consider the output produced by running a summary function on the regression results, as we did above. You can run it again if you wish. 
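# (A small aside, using base R only: if you just want the coefficient table rather than
# the full summary printout, you can extract it directly before re-running summary().)
coef(summary(lm_output))   # just the Estimate / Std. Error / t value / Pr(>|t|) columns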
156 | # 157 | summary(lm_output) 158 | # 159 | # The amount of statistical information output by the summary function can be overwhelming to the initiated. This is not only related to the amount of output, but the formatting. That is why I did not show the entire output in the above example. 160 | # One way to make output easier to look at is to first reduce the amount of output that is presented, and then reformat it so it is easier on the eyes. 161 | # To accomplish this, we can utilize a package called "stargazer", which will reformat the large volume of output produced by summary and simplify the presentations. Stargazer excels at reformatting the output of many regression models, and displaying the results as HTML, PDF, Latex, or as simple formatted text. By default, it will show you the most important statistical output for various models, and you can always specify the types of statistical output that you want to see. 162 | # To obtain more information on the stargazer package you can first go to CRAN, and search for documentation about stargazer, and/or you can use the R help system: 163 | # IF you already have installed stargazer you can use the following command: 164 | # packageDescription("stargazer") 165 | # If you haven't installed the package, information about stargazer, (or other packages) can also be found using R specific internet searches: 166 | RSiteSearch("stargazer") 167 | # If you like searching for documentation within R, you can obtain more information about the R help system at:: 168 | # https://www.r-project.org/help.html 169 | # Installing Stargazer 170 | # Now, on to installing stargazer: 171 | # . First Create a new R script (File/New File/R Script). 172 | # . Enter the following lines and then select "Source" from the menu bar in the code pane, which will submit the entire script. 173 | # 174 | install.packages("stargazer") 175 | library(stargazer) 176 | stargazer(lm_output, title="Lm Regression on Height", type="text") 177 | # 178 | # After the script has been run, the following should appear in the Console: 179 | # 180 | # B05033_01_08.png 181 | # Code Description 182 | # Line 1: install.packages("stargazer") 183 | # The line above will install the package to the default package directory on your machine. If you will be rerunning this code again,, you can comment out this line, since the package will have already be installed in your R repository. 184 | # 185 | # Line 2: library(stargazer) 186 | # Installing a package does not make the package automatically available. You need to run a library (or require) function in order to actually load the stargazer package. 187 | # 188 | # Line 3: stargazer(lm_output, title="Lm Regression on Height", type="text") 189 | # 190 | # This part of the code will take the output object lm_output, that was created in the first script, condense the output, and write it out to the console in a simpler, more readable format. There are many other options in stargazer, which will format the output as HTML, or Latex. Please refer to the Reference manual at https://cran.r-project.org/web/packages/stargazer/index.html 191 | # 192 | # 193 | # The reformatted results will appear in the R Console. As you can see, the output written to the console is much cleaner and easier to read 194 | # Saving your work 195 | # After you are done, select "File/File Save" from the menu bar 196 | # Then navigate to the PracticalPredictiveAnalytics/Outputs folder that was created, and name it Chapter1_LinearRegressionOutput. Press Save. 
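# (If you would also like a copy of the reformatted table kept with your other outputs,
# stargazer can write HTML directly to a file. A sketch -- the file name below is only
# an example, not one used elsewhere in this chapter:)
library(stargazer)
stargazer(lm_output, title = "Lm Regression on Height", type = "html",
          out = "C:/PracticalPredictiveAnalytics/Outputs/Chapter1_LinearRegressionOutput.html")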
197 | # 198 | # Summary 199 | # In this chapter, we have learned a little about what predictive analytics is and how they can be used in various industries. We learned some things about data, and how they can be organized in projects. Finally, we installed RStudio, and ran a simple linear regression, and installed and ran our first package. We learned that it is always good practice to examine data after it has been loaded into memory, and a lot can be learned from simply displaying and plotting the data. 200 | # In the next chapter, we will discuss the overall predictive modeling process itself, introduce some key model packages using R, and provide some guidance on avoiding some predictive modeling pitfalls. 201 | # References 202 | # Computing and the Manhattan Project. (n.d.). Retrieved from http://www.atomicheritage.org/history/computing-and-manhattan-project 203 | # Gladwell, M. (2005). Blink :the power of thinking without thinking. New York: Little, Brown and Co.,. 204 | # Linda Miner et al. (2014). Practical Predictive Analytics and Decisioning Systems for Medicine. Elsevier. 205 | # Watson (Computer). (n.d.). Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Watson_(computer) 206 | # Weather Forecasting through the Ages. (n.d.). Retrieved from http://earthobservatory.nasa.gov/Features/WxForecasting/wx2.php 207 | -------------------------------------------------------------------------------- /Chapter02/Ch 2 Codebase June 6.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter02/Ch 2 Codebase June 6.R -------------------------------------------------------------------------------- /Chapter03/ch 3 Codebase.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter03/ch 3 Codebase.R -------------------------------------------------------------------------------- /Chapter04/Ch 4 Codebase.R: -------------------------------------------------------------------------------- 1 | rm(df) 2 | df <- read.table(text = 'Treatment Gender Age Duration Pain 3 | P F 68 1 0 4 | P M 66 26 1 5 | A F 71 12 0 6 | A M 71 17 1 7 | B F 66 12 0 8 | A F 64 17 0 9 | P M 70 1 1 10 | A F 64 30 0 11 | B F 78 1 0 12 | B M 75 30 1 13 | A M 70 12 0 14 | B M 70 1 0 15 | P M 78 12 1 16 | P M 66 4 1 17 | A M 78 15 1 18 | P F 72 27 0 19 | B F 65 7 0 20 | P M 67 17 1 21 | P F 67 1 1 22 | A F 74 1 0 23 | B M 74 16 0 24 | B F 67 28 0 25 | B F 72 50 0 26 | A F 63 27 0 27 | A M 62 42 0 28 | P M 74 4 0 29 | B M 66 19 0 30 | A M 70 28 0 31 | P M 83 1 1 32 | P M 77 29 1 33 | A F 69 12 0 34 | B M 67 23 0 35 | B M 77 1 1 36 | P F 65 29 0 37 | B M 75 21 1 38 | P F 70 13 1 39 | P F 68 27 1 40 | B M 70 22 0 41 | A M 67 10 0 42 | B M 80 21 1 43 | P F 67 30 0 44 | B F 77 16 0 45 | B F 76 9 1 46 | A F 69 18 1 47 | P F 64 1 1 48 | A F 72 25 0 49 | B M 59 29 0 50 | A M 69 1 0 51 | B F 69 42 0 52 | P F 79 20 1 53 | B F 65 14 0 54 | A M 76 25 1 55 | B F 69 24 0 56 | P M 60 26 1 57 | A F 67 11 0 58 | A M 75 6 1 59 | P M 68 11 1 60 | A M 65 15 0 61 | P F 72 11 1 62 | A F 69 3 0',header = TRUE) 63 | 64 | str(df) 65 | table(df$Pain) 66 | prop.table(table(df$Pain)) 67 | setwd("C:/PracticalPredictiveAnalytics/Data") 68 | save(df,file="pain_raw.Rda") 69 | 70 | 71 | PainGLM <- glm(Pain ~ Treatment + Gender + Age + Duration, data=df, 
family="binomial") 72 | 73 | summary(PainGLM) 74 | cbind(exp(coef(PainGLM)),confint(PainGLM)) 75 | 76 | quantile(residuals(PainGLM, type="deviance")) # residuals 77 | 78 | mean(residuals(PainGLM, type="deviance")) 79 | plot(residuals(PainGLM, type="deviance")) 80 | 81 | #install.packages("car") 82 | library(car) 83 | avPlots(PainGLM, id.method="mahal", id.n=2) 84 | 85 | df[c(26,27,28,25),] 86 | 87 | set.seed(1020) 88 | lottery <- data.frame( 89 | cbind(x=rnorm(n=1000000,1000000,100),y=rnorm(1000000,1000001,100) ) 90 | ) 91 | summary(lottery) 92 | 93 | t.test(lottery$x,lottery$y) 94 | 95 | 96 | set.seed(1020) 97 | lottery <- data.frame( 98 | cbind(x=rnorm(n=100000,1000000,100),y=rnorm(n=100000,1000001,100) ) 99 | ) 100 | summary(lottery) 101 | t.test(lottery$x,lottery$y) 102 | 103 | library(RcmdrMisc) 104 | 105 | stepwise(PainGLM, direction='forward/backward', criterion='AIC') 106 | 107 | 108 | library(effects) 109 | output <- glm(Pain ~ Treatment + Gender + Age + Duration + Age:Treatment 110 | ,data=df, family="binomial") 111 | summary(output) 112 | 113 | install.package(effects) 114 | library(effects) 115 | output <- glm(Pain ~ Treatment + Gender + Age + Duration + Age:Treatment , data=df, family="binomial") 116 | summary(output) 117 | 118 | plot(effect("Treatment*Age", output, xlevels=list(age=0:99)), 119 | ticks=list(at=c(.001, .005, .01, .05, seq(.1,.9,by=.2), .95, .99, .995))) 120 | 121 | install.packages("pscl") 122 | 123 | library(pscl) 124 | pR2(PainGLM) 125 | 126 | 127 | 128 | oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,2)) 129 | plot(PainGLM) 130 | par(oldpar) 131 | 132 | 133 | library(TeachingDemos) 134 | set.seed(1) 135 | if(interactive()) {vis.test(residuals(PainGLM, type="response"), vt.qqnorm, nrow=4, ncol=4, npage=1)} 136 | 137 | install.packages("ResourceSelection") 138 | library(ResourceSelection) 139 | hoslem.test(PainGLM$y,fitted(PainGLM)) 140 | str(hoslem$observed) 141 | hoslem <- hoslem.test(PainGLM$y,fitted(PainGLM)) 142 | cbind(hoslem$observed,hoslem$expected) 143 | 144 | dummy.vars <- model.matrix(df$Pain ~ df$Treatment + df$Gender + df$Age + df$Duration)[,-1] 145 | x <- as.matrix(data.frame(df$Duration,dummy.vars)) 146 | head(x) 147 | 148 | 149 | install.packages("glmnet") 150 | library(glmnet) 151 | options(scipen = 999) 152 | mod.result<-glmnet(x,y=as.factor(df$Pain),alpha=1,family='binomial') 153 | summary(mod.result$lambda) 154 | 155 | plot(mod.result,xvar="lambda") 156 | 157 | 158 | head(print(mod.result)) 159 | tail(print(mod.result)) 160 | 161 | 162 | coef(mod.result,s=0.1957000) # 4 of 5 coefficient are at 0, too much shrinkage 163 | coef(mod.result,s=0.0131800) # only 1 coefficient is set to 0 164 | 165 | coef(mod.result,s=0.0120100) # All coefficients are included 166 | 167 | coef(mod.result,s=0.01) 168 | 169 | 170 | statedf <- as.data.frame(state.x77) 171 | x <- data.frame(statedf,state.abb,state.region) 172 | summary(x) 173 | 174 | hist(x$Life.Exp) 175 | 176 | install.packages("rpart") 177 | install.packages("rattle") 178 | install.packages("rpart.plot") 179 | install.packages("RColorBrewer") 180 | 181 | library(rpart) 182 | library(rattle) 183 | library(rpart.plot) 184 | library(RColorBrewer) 185 | set.seed(1020) 186 | y1 <- rpart(Life.Exp ~ .,data=x,cp=.01) 187 | prp(y1, type=4, extra=1) 188 | 189 | 190 | print(y1) 191 | 192 | 193 | y2 <- rpart(Life.Exp ~ Population + Income + Illiteracy + 194 | Murder + 195 | HS.Grad + 196 | Frost + 197 | Area + 198 | state.region,method='anova',data=x 199 | ) 200 | 201 | prp(y2, type=4, extra=1) 202 | 203 | 
install.packages("partykit") 204 | library(partykit) 205 | y2 <- ctree(Life.Exp ~ .,data=x) 206 | y2 207 | plot(y2) 208 | 209 | PrunedTree <- prp(y1,type=4, extra=1,snip=TRUE)$obj 210 | 211 | prp(PrunedTree, type=4, extra=1) 212 | 213 | 214 | library(graphics) 215 | library(dplyr) 216 | 217 | x <- read.table("C:/PracticalPredictiveAnalytics/Data/CDNOW_master.txt", quote="\"", stringsAsFactors=FALSE) 218 | 219 | x$xd <- as.Date(as.character(x$V2), "%Y%m%d") 220 | x$diffdate <- as.integer(as.Date("1998-07-01") - x$xd) 221 | #rename the columns 222 | 223 | colnames(x) <- c("id","orig.date","units.bought","TotalPaid","purch.date","Days.since") 224 | str(x) 225 | summary(x) 226 | 227 | attach(x) 228 | #cluster on the RFM variables 229 | y <- subset(x, select = c(units.bought,TotalPaid,Days.since)) 230 | 231 | #always set seed before clustering 232 | 233 | set.seed(1020) 234 | 235 | clust3 <- kmeans( y,3) 236 | clust3$betweenss/clust3$totss 237 | 238 | 239 | clust5 <- kmeans( y,5) 240 | clust5$betweenss/clust5$totss 241 | 242 | clust7 <- kmeans( y,7) 243 | clust7$betweenss/clust7$totss 244 | 245 | #elbow method 246 | set.seed(1020) 247 | # Compute and plot wss for k = 3 to k = 15 248 | df <- sapply(3:15,function(k){kmeans(y,k)$tot.withinss}) 249 | 250 | plot(3:15, df,type='b',xlab="# of clusters",ylab="Total Within Clusters SS") 251 | 252 | #cluster assignments 253 | clusters3 <- clust3$cluster 254 | clusters5 <- clust5$cluster 255 | clusters7 <- clust7$cluster 256 | 257 | #usually but now always the middle cluster is the average cluster 258 | par(mfrow=c(1,3)) 259 | hist(clusters3) 260 | hist(clusters5) 261 | hist(clusters7) 262 | 263 | #append the clusters the original data 264 | append.clust <- data.frame(x, clusters3,clusters5,clusters7) 265 | 266 | install.packages("cluster") 267 | library(cluster) 268 | set.seed(1020) 269 | sampleit <- append.clust[sample(nrow(append.clust), 100), ] 270 | str(sampleit) 271 | 272 | prcomp(append.clust[,c(3,4,6)], scale = TRUE) 273 | par(mfrow=c(1,3)) 274 | clusplot(sampleit[,c(3,4,6)], sampleit$clusters3, color=TRUE, shade=TRUE,labels=2, lines=0) 275 | clusplot(sampleit[,c(3,4,6)], sampleit$clusters5, color=TRUE, shade=TRUE,labels=2, lines=0) 276 | clusplot(sampleit[,c(3,4,6)], sampleit$clusters7, color=TRUE, shade=TRUE,labels=2, lines=0) 277 | 278 | head(append.clust) 279 | library(dplyr) 280 | attach(append.clust) 281 | append.clust %>% select(units.bought,TotalPaid,Days.since,clusters3) %>% 282 | group_by(clusters3) %>% 283 | summarise_each(funs(n(),mean)) 284 | append.clust %>% select(units.bought,TotalPaid,Days.since,clusters5) %>% 285 | group_by(clusters5) %>% 286 | summarise_each(funs(n(),mean)) 287 | append.clust %>% select(units.bought,TotalPaid,Days.since,clusters7) %>% 288 | group_by(clusters7) %>% 289 | summarise_each(funs(n(),mean)) 290 | 291 | 292 | require(graphics) 293 | setwd("C:/PracticalPredictiveAnalytics/Data") 294 | load("pain_raw.Rda") 295 | 296 | df2 <- subset(df, select=c(Age,Duration,Pain)) 297 | df2 <- scale(df2) 298 | head(df2) 299 | fit <- hclust(dist(df2), "average") 300 | 301 | groups <- cutree(fit, k=3) 302 | rect.hclust(fit, k=3, border="red") 303 | cluster1 <- df[c(35,52,10,30),] 304 | View(cluster1) 305 | 306 | cluster2 <- df[c(23,49,25,47),] 307 | 308 | View(cluster2) 309 | 310 | cluster3 <- df[c(20,26,38,53),] 311 | View(cluster3) 312 | 313 | 314 | #generate a non-linear circle of point 315 | 316 | radius <- 2 317 | t2 <- data.frame(x=radius * cos(seq(0,6,length = 20)),y = radius * sin(seq(0, 6, length = 20))) 318 | 
names(t2) <- c("Latitude","High.Low.Temp") 319 | plot(t2$Latitude,t2$High.Low.Temp) 320 | 321 | # create a new variable and plot it against on the original points 322 | 323 | t2$z = (t2$Latitude^2*t2$High.Low.Temp^2) 324 | 325 | plot(t2$High.Low.Temp,t2$z) 326 | 327 | 328 | install.packages("e1071") 329 | install.packages("RTextTools") 330 | 331 | library(e1071) 332 | library(RTextTools) 333 | data <- read.csv("C:/PracticalPredictiveAnalytics/Data/Consumer_Complaints.csv", sep=",") 334 | 335 | 336 | 337 | data <- subset(data, select=c(Issue,Consumer.complaint.narrative)) 338 | data.samp <- subset(data[1:50,], select=c(Issue,Consumer.complaint.narrative)) 339 | str(data) 340 | View(data.samp) 341 | 342 | 343 | # Create the document term matrix 344 | dtMatrix <- create_matrix(data.samp["Consumer.complaint.narrative"],minDocFreq = 1, removeNumbers=TRUE, 345 | minWordLength=4,removeStopwords=TRUE,removePunctuation=TRUE,stemWords = FALSE) 346 | 347 | 348 | xx = as.data.frame( t(as.matrix( dtMatrix )) ) 349 | head(xx) 350 | 351 | 352 | dtMatrix <- create_matrix(data["Consumer.complaint.narrative"],minDocFreq = 1, removeNumbers=TRUE, 353 | minWordLength=4,removeStopwords=TRUE,removePunctuation=TRUE,stemWords = FALSE) 354 | 355 | 356 | freq <- colSums(as.matrix(dtMatrix)) 357 | length(freq) 358 | head(freq) 359 | freq.df <- as.data.frame(freq) 360 | View(freq.df) 361 | 362 | container <- create_container(dtMatrix, data$Issue, trainSize=1:500,virgin=FALSE) 363 | str(container) 364 | 365 | 366 | # train a SVM Model 367 | model <- train_model(container, "SVM", kernel="linear", cost=1) 368 | str(model) 369 | head(model) 370 | summary(model) 371 | 372 | 373 | predictionData <- data$Consumer.complaint.narrative[501:1000] 374 | 375 | # create a prediction document term matrix 376 | predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix) 377 | 378 | # create the corresponding container 379 | 380 | plength = length(predictionData); 381 | predictionContainer <- create_container(predMatrix, labels=rep(0,plength), testSize=1:plength, virgin=FALSE) 382 | 383 | 384 | # predict 385 | results <- classify_model(predictionContainer, model) 386 | head(results) 387 | aggregate(results$SVM_PROB, by=list(results$SVM_LABEL), FUN=mean, na.rm=TRUE) 388 | 389 | 390 | -------------------------------------------------------------------------------- /Chapter04/Consumer_Complaints (1).zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter04/Consumer_Complaints (1).zip -------------------------------------------------------------------------------- /Chapter05/ch 5 codebase as of May 31.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter05/ch 5 codebase as of May 31.R -------------------------------------------------------------------------------- /Chapter06/ch 6 codebase June 6.R: -------------------------------------------------------------------------------- 1 | # 6 2 | # Using Market Basket Analysis as a Recommender Engine 3 | # "It's not wise to violate the rules until you know how to observe them." - T.S. Eliot 4 | # What is Market Basket Analysis? 5 | # If you have "survived" the last chapter, you will now be introduced to the world of Market Basket Analysis. 
Market Basket Analysis (also sometimes called affinity analysis) is a predictive analytics technique used heavily in the retail industry to identify "baskets" of items that are purchased together. The typical use case is the supermarket shopping cart, in which a shopper purchases an assortment of items such as milk, bread, and cheese, and the algorithm predicts how purchasing certain items together will affect the purchase of other items. It is one of the methods retailers use to decide when to start sending you coupons and emails for things that you didn't know you needed! 
6 | # One of the most quoted examples of MBA is the relationship between diapers and beer: 
7 | # "One super market chain discovered in its analysis that customers that bought diapers often bought beer as well, have put the diapers close to beer coolers, and their sales increased dramatically" - http://en.wikipedia.org/wiki/Market_basket 
8 | # However, it is not restricted to the retail industry. Market Basket Analysis (or MBA) can also be used in the insurance industry to look at the various products that an insured currently has, such as car and home coverage, and suggest other possible products such as life, disability, or investment products. 
9 | # MBA is generally considered an unsupervised learning algorithm, in that target variables are usually not specified. However, as you will see later, it is possible to refine the association rules so that specific items can be specified as target variables. 
10 | # MBA is also considered a type of recommender engine, in which purchases of a set of items imply the purchases of others. Certainly MBA and other recommender engines can share the same types of input data. However, MBA was developed before the advent of collaborative filtering techniques as pioneered by Amazon; collaborative filtering is more suggestive of the integration of collected web data, while MBA is more associated with the RFID and bar coding technologies found in scanners. In both cases, suggesting future purchases based on past purchases is the goal. 
11 | # Terminology 
12 | # Critical to the understanding of MBA are the concepts of support, confidence, and lift. These are the measures which evaluate the 'goodness of fit' of a set of association rules. You will also learn some specific definitions that are used in MBA, such as consequence, antecedent, and itemsets. 
13 | # To introduce these concepts, we will first illustrate these terms through a very simplistic example. We will use only 10 transactions (transactions 10 through 19) contained in the "Groceries" transaction file, which is part of the package "arules". 
14 | # To see an example of a market basket, run the following code. Examine the output produced from the 'inspect' function, which prints the items in each market basket. 
15 | rm(list = ls()) 16 | library(arules) 17 | #library(datasets) 18 | 19 | data(Groceries) 20 | str(Groceries) 21 | #RW changed added 22 | summary(Groceries) 23 | 24 | inspect(Groceries[10:19]) 25 | # > items 26 | # > 1 {whole milk, 27 | # > cereals} 28 | # > 2 {tropical fruit, 29 | # > other vegetables, 30 | # > white bread, 31 | # > bottled water, 32 | # > chocolate} 33 | # > 3 {citrus fruit, 34 | # > tropical fruit, 35 | # > whole milk, 36 | # > butter, 37 | # > curd, 38 | # > yogurt, 39 | # > flour, 40 | # > bottled water, 41 | # > dishes} 42 | # > 4 {beef} 43 | # > 5 {frankfurter, 44 | # > rolls/buns, 45 | # > soda} 46 | # > 6 {chicken, 47 | # > tropical fruit} 48 | # > 7 {butter, 49 | # > sugar, 50 | # > fruit/vegetable juice, 51 | # > newspapers} 52 | # > 8 {fruit/vegetable juice} 53 | # > 9 {packaged fruit/vegetables} 54 | # > 10 {chocolate} 55 | # The sample Market Basket 56 | # Each transaction numbered 1-10 listed above represents a basket of items purchased by a shopper. hese are typically all items that are associated with a particular transaction, or invoice. Each basket is enclosed within braces {}, and is referred to as an itemset. An itemset is a group of items that occur together. 57 | # Market Basket algorithms construct rules in the form of 58 | # Itemset{x1,x2,x3 ...} --> Itemset{y1,y2,y3.}. 59 | # 60 | # This notation states that buyers who have purchased items on the left hand side of the formula (lhs) have a propensity to purchase items on the right hand side (rhs). The association is stated using the ??? symbol which can be interpreted as "implies" 61 | # The lhs of the notation is also known as the antecedent, and the rhs is known as the consequence. If nothing appears on either the left hand side or right hand side there is no specific association rule for those items, however it also means that those items has appeared in the basket. 62 | # Association rule algorithms 63 | # Without an association rule algorithm, you are left with the computationally very expensive task of generating all possible pairs of itemsets, and then trying to prune through the best ones yourself. Associate rule algorithms help with filtering this. 64 | # The most popular algorithm for MBA is the apriori algorithm which is contained within the arules package. (The other popular algorithm is the eclat algorithm). 65 | # Running apriori is fairly simple. We will demonstrate this using our demo 10 transaction itemset that we just printed. 66 | # The apriori algorithm is based upon the principle that if a particular itemset is frequent, then all of its subsets must also be frequent. That principle it itself is helpful for reducing the number of itemsets which need to be evaluated, since it only needs to look at the largest items sets first, and then be able to filter down. 67 | # 1. First, some housekeeping. Fix the number of printable digits to 2. 68 | options(digits = 2) 69 | 70 | # 2. Next run the apriori algorithm. However only run it on rows 10 through 19 of the Groceries data set. We want to keep the results very small. Ignore warning messages for now. 
71 | rules <- apriori(Groceries[10:19], parameter = list(supp = 0.1, conf = 0.6)) 72 | summary(rules) 73 | # > Apriori 74 | # > 75 | # > Parameter specification: 76 | # > confidence minval smax arem aval originalSupport support minlen maxlen 77 | # > 0.6 0.1 1 none FALSE TRUE 0.1 1 10 78 | # > target ext 79 | # > rules FALSE 80 | # > 81 | # > Algorithmic control: 82 | # > filter tree heap memopt load sort verbose 83 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 84 | # > 85 | # > Absolute minimum support count: 1 86 | # > Warning in apriori(Groceries[10:19], parameter = list(supp = 0.1, conf = 0.6)): You chose a very low absolute support count of 1. You might run out of memory! Increase minimum support. 87 | # > set item appearances ...[0 item(s)] done [0.00s]. 88 | # > set transactions ...[22 item(s), 10 transaction(s)] done [0.00s]. 89 | # > sorting and recoding items ... [22 item(s)] done [0.00s]. 90 | # > creating transaction tree ... done [0.00s]. 91 | # > checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.00s]. 92 | # > writing ... [2351 rule(s)] done [0.00s]. 93 | # > creating S4 object ... done [0.00s]. 94 | # 3. Sort the rules by support, which is one of the important evaluation metrics. 95 | rules <- sort(rules, by = "support", decreasing = TRUE) # 'high-confidence' rules. 96 | 97 | # 4. Look at the first 5 rules. Observe that it has a support of .2 98 | inspect(head(rules, 5)) 99 | # > lhs rhs support confidence lift 100 | # > 63 {bottled water} => {tropical fruit} 0.2 1.00 3.3 101 | # > 64 {tropical fruit} => {bottled water} 0.2 0.67 3.3 102 | # > 1 {cereals} => {whole milk} 0.1 1.00 5.0 103 | # > 2 {chicken} => {tropical fruit} 0.1 1.00 3.3 104 | # > 3 {soda} => {rolls/buns} 0.1 1.00 10.0 105 | # Antecedents and Descendants 106 | # The rules shown above are expressed as an implication between the antecedent (left hand side) and the Consequence (right hand side) 107 | # The first rule above, in bold, describes about customers who buy bottle water as also buying tropical fruit. The 3rd rule says that customers who buy cereals have a tendency to buy whole milk. 108 | # Evaluating the accuracy of a rule 109 | # Three main metrics have been developed which measure the importance, or accuracy of an association rule: Support, Confidence, and Lift 110 | # Support 111 | # Support measures how frequently the items occur together. Imagine having a shopping cart in which there can be a very large number of combinations of items. Some items which occur rarely could be excluded from the analysis. When an item occurs frequently you will have more confidence in the association among the items, since it will be a more popular item. Often your analysis will be centered around items with high support. 112 | # Calculating Support 113 | # Calculating support is simple. You take the proportion of the number of times that the items in the rule appear in the basket divided by the number of itemsets. 114 | # . We can see that for rule number 1, both bottled water and Tropical fruit appear together in 2 of the itemsets, therefore the support for that rule is 2/10 or 20%. 115 | # . In the above examples, rules number 1 and 2 have the highest support since bottled water appears times out of 10 Tropical fruit 3 times. Bottled water appears two times out of the three time for 2/3 116 | # Confidence 117 | # Confidence is the conditional probability that the event on the right hand side (consequence) will occur, given that the items on the left hand has occurred (antecedent). 
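# (In symbols -- a standard definition, not arules-specific notation:
#    confidence(X => Y) = support(X and Y) / support(X)
# For rule 64 above, support({tropical fruit, bottled water}) = 0.2 and
# support({tropical fruit}) = 0.3, giving a confidence of 0.2/0.3 = 0.67.)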
This is computed by counting the number of occurrences. 118 | # . For example, if we take a closer look at rule # 64, {tropical fruit} => {bottled water}, we can see that tropical fruit occurs in 3 separate itemsets, itemset 2,3 and 6. Therefore the denominator of the formula is 3. 119 | # . Of those 3 itemsets, bottled water occurs in 2 of them (itemsets 2 and 3 only, but NOT 6). So the confidence is 2/3 or 67%.Note that the confidence for the reverse itemset {bottled water} => {tropical fruit} is higher, since every time bottled water is purchased, tropical fruit is purchased as well. You can easily verify that by inspecting and counting the elements by hand. 120 | # Lift 121 | # Lift is determined by dividing the confidence just calculated, by the independent probability of the consequent. In some respects, Lift is a better measure than support or confidence by itself since it incorporates features of both. 122 | # . To calculate the lift for rule 64, we only need to determine the unconditional probability of the consequence. Since bottled water appears 2 out of 10 times (20%) as the consequence, we divide .67 by .20 to yield 3.4, which is the lift for rule 64. 123 | # . When evaluating the lift metric, use 1 as a baseline lift measure, since a lift of 1 implies no relationship between the antecedent and the consequence. 124 | # Preparing the raw data file for analysis. 125 | # Now that we have had a short introduction to the association rules algorithm, we will illustrate applying association rules to a more meaningful example. 126 | # We will be using the Online Retail dataset which can be obtained from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Online+Retail. 127 | # As described by the source, the data is: 128 | # ". A transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers". 129 | # Reading the Transaction file: 130 | # We will input the data using the read.csv function. 131 | # Set stringsAsFactors to FALSE since we will be manipulating the variables as character strings later. 132 | # The code illustrates the capture.output function. The capture.output function will save the metadata for the raw input file. This is done because we want to track changes done to the input, and we want to capture the contents of the same dataframe at different points in time. That will enable us to save the values of metadata and compare them points. 133 | # We can use the file.show function to directly examine the input file if needed. This is sometimes needed if you find that there are errors in the input. It has been commented out in the code, but you are encouraged to try it out yourself. 134 | # The knitr library will be used mostly for the purposes of display output via html via the kable function. If you wish, you can replace kable function calls with head or print functions. 
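# (Before the real input code below, a tiny illustration of that before/after metadata
# comparison idea, using a throw-away data frame -- tmp.df exists only for this sketch:)
tmp.df <- data.frame(a = 1:3, b = letters[1:3])
meta.before <- capture.output(str(tmp.df))
tmp.df$b <- NULL                      # simulate a change to the data
meta.after <- capture.output(str(tmp.df))
setdiff(meta.before, meta.after)      # the lines of metadata that changed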
135 | library(sqldf) 136 | library(knitr) 137 | 138 | setwd("C:/PracticalPredictiveAnalytics/Data") 139 | options(stringsAsFactors = F) 140 | 141 | OnlineRetail <- read.csv("Online Retail.csv", strip.white = TRUE) 142 | setwd("C:/PracticalPredictiveAnalytics/Outputs") 143 | 144 | # Look at the first few records 145 | head(OnlineRetail) 146 | # > InvoiceNo StockCode Description Quantity 147 | # > 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 148 | # > 2 536365 71053 WHITE METAL LANTERN 6 149 | # > 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 150 | # > 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 151 | # > 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 152 | # > 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 153 | # > InvoiceDate UnitPrice CustomerID Country 154 | # > 1 12/1/2010 8:26 2.5 17850 United Kingdom 155 | # > 2 12/1/2010 8:26 3.4 17850 United Kingdom 156 | # > 3 12/1/2010 8:26 2.8 17850 United Kingdom 157 | # > 4 12/1/2010 8:26 3.4 17850 United Kingdom 158 | # > 5 12/1/2010 8:26 3.4 17850 United Kingdom 159 | # > 6 12/1/2010 8:26 7.6 17850 United Kingdom 160 | # file.show('C:/Users/randy/Downloads/Online Retail.csv') NOT RUN 161 | 162 | # Save it in case we need to look at the metadata later on. 163 | 164 | OnlineRetail.Metadata <- capture.output(str(OnlineRetail)) 165 | 166 | # print it now. We can see that the capture.output contains the str function, and that there are 541,909 observations 167 | 168 | 169 | cat(OnlineRetail.Metadata, sep = "\n") 170 | # > 'data.frame': 541909 obs. of 8 variables: 171 | # > $ InvoiceNo : chr "536365" "536365" "536365" "536365" ... 172 | # > $ StockCode : chr "85123A" "71053" "84406B" "84029G" ... 173 | # > $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ... 174 | # > $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ... 175 | # > $ InvoiceDate: chr "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ... 176 | # > $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ... 177 | # > $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ... 178 | # > $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ... 179 | # change stringsasfactors back to TRUE 180 | 181 | options(stringsAsFactors = T) 182 | 183 | 184 | # Analysing the Input file 185 | # After reading in the file, the nrow function, shows that transaction file containing 541909 rows. We can use our handy View function to peruse the contents. Alternatively, you can use the kable function from the knitr library to display a simple tabular display of the dataframe in the console, as indicated below. 186 | # . Note that if you are also using the Rmarkdown package, the output can be formatted to appear as an HTML table in the markdown file. Otherwise it will appear as plain ASCII text. 187 | # . We can also look at the distribution plots of InvoiceDate, although it will need to be transformed to date format, and sorted first. 188 | # # View(OnlineRetail) 189 | nrow(OnlineRetail) 190 | # > [1] 541909 191 | library(knitr) 192 | 193 | InvoiceDate <- gsub(" .*$", "", OnlineRetail$InvoiceDate) 194 | InvoiceDate <- (as.Date(InvoiceDate, format = "%m/%d/%Y")) 195 | InvoiceDate <- sort(InvoiceDate, decreasing = FALSE) 196 | 197 | # . We can see from the head and tail commands, as well as the plots, that the Invoices encompass the period from 12/1/2011 through 12/9/2011. 
We can also see several spikes in orders around the December holiday season. 198 | #RW This was changed!! 199 | kable(head(as.data.frame(InvoiceDate))) 200 | 201 | # 2010-12-01 202 | # 2010-12-01 203 | # 2010-12-01 204 | # 2010-12-01 205 | # 2010-12-01 206 | # 2010-12-01 207 | # 208 | #RW This was changed!! 209 | #kable(tail(InvoiceDate)) 210 | kable(tail(as.data.frame(InvoiceDate))) 211 | 212 | # 2011-12-09 213 | # 2011-12-09 214 | # 2011-12-09 215 | # 2011-12-09 216 | # 2011-12-09 217 | # 2011-12-09 218 | # 219 | # Plotting the dates 220 | par(las = 2) 221 | barplot(table(InvoiceDate), cex.lab = 0.5, cex.main = 0.5, cex.names = 0.5, 222 | col = c("blue")) 223 | # Insert Image B05033_06_01.png 224 | # Scrubbing and Cleaning the data 225 | # Here comes the cleaning part! Unstructured data such as text usually needs to be scrubbed and cleaned before it is in a form that is suitable for analysis. Fortunately, there are several R functions available which help you do that. 226 | # Removing Unneeded character spaces 227 | # We can start by removing leading and trailing blanks from each of the product description, since they add no value to the analysis and just take up extra space. The trimws is a handy function to accomplish this, since it removes both leading and trailing spaces. The nchar function will count the number of bytes in a character string, so we can run this function on OnlineRetail$Description before and after performing the string trim to see how much space we will actually be saving. 228 | sum(nchar(OnlineRetail$Description)) 229 | # > [1] 14284888 230 | OnlineRetail$Description <- trimws(OnlineRetail$Description) 231 | sum(nchar(OnlineRetail$Description)) 232 | # > [1] 14283213 233 | # We see that trimws reduced the size of the Description. Had the number increased after using the function, that would be a clue to check your code! 234 | # Simplifying the Descriptions 235 | # Market basket analysis using color choice analysis is an interesting topic in itself. However, for this analysis, we will remove the colors in order to simplify some of the descriptions. We can use the gsub function to remove some of the specific colors which appear as part of the product description. 236 | #CHANGED NEW CODE 237 | #OnlineRetail.save <- OnlineRetail 238 | #OnlineRetail <- OnlineRetail.save 239 | #View(OnlineRetail.save) 240 | gsub_multiple <- function(from, to, x) { 241 | updated <- x 242 | for (i in 1:length(from)) { 243 | updated <- gsub(from[i], to[i], updated) 244 | } 245 | return(updated) 246 | } 247 | 248 | 249 | OnlineRetail$Description <- gsub_multiple(c("RED","PINK","GREEN","SMALL","MEDIUM","LARGE","JUMBO","STRAWBERRY"), rep("",8), OnlineRetail$Description) 250 | 251 | 252 | View(OnlineRetail) 253 | 254 | 255 | 256 | #RW now not needed? 
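# (An equivalent alternative to the gsub_multiple() helper above is a single alternation
# pattern -- a sketch, equivalent in effect for these fixed words:)
OnlineRetail$Description <- gsub("RED|PINK|GREEN|SMALL|MEDIUM|LARGE|JUMBO|STRAWBERRY",
                                 "", OnlineRetail$Description)
# (The individual gsub() calls below were the earlier, word-by-word version of the same clean-up.)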
257 | 258 | # OnlineRetail$Description2 <- gsub("RED", "", saveit$Description) 259 | # OnlineRetail$Description2 <- gsub("PINK", "", OnlineRetail$Description2) 260 | # OnlineRetail$Description2 <- gsub("GREEN", "", OnlineRetail$Description2) 261 | # OnlineRetail$Description2 <- gsub("SMALL", "", OnlineRetail$Description2) 262 | # OnlineRetail$Description2 <- gsub("MEDIUM", "", OnlineRetail$Description2) 263 | # OnlineRetail$Description2 <- gsub("LARGE", "", OnlineRetail$Description2) 264 | # OnlineRetail$Description2 <- gsub("JUMBO", "", OnlineRetail$Description2) 265 | # OnlineRetail$Description2 <- gsub("PINK", "", OnlineRetail$Description2) 266 | # OnlineRetail$Description2 <- gsub("STRAWBERRY", "", OnlineRetail$Description2) 267 | # OnlineRetail$changed <- ifelse(OnlineRetail$Description==OnlineRetail$Description2,"1","0") 268 | # all.equal(OnlineRetail$Description,OnlineRetail$Description2) 269 | # View(OnlineRetail[87500:875021,]) 270 | View(OnlineRetail) 271 | # Removing colors automatically 272 | # If we want to remove colors automatically, we can also do that as well. The colors() function returns a list of colors which are used in the current palette. We can then perform a little code manipulation in conjunction with the gsub function that we just used, to replace all of the specified colors from OnlineRetail$Description with blanks 273 | # We will also use the kable function, which is contained within the knitr package in order to produce simple html tables of the results. 274 | # compute the length of the field before changes 275 | before <- sum(nchar(OnlineRetail$Description)) 276 | # get the unique colors returned from the colors function, and remove any 277 | # digits found at the end of the string 278 | 279 | # get the unique colors 280 | col2 <- unique(gsub("[0-9]+", "", colors(TRUE))) 281 | 282 | # Filter out any colors with a length > 7. This is just done to reduce the 283 | # #number of colors to the 'popular' ones. We may miss a couple, but we can 284 | # #filter out later. 285 | 286 | for (i in 1:length(col2)) { 287 | col2[i] <- ifelse(nchar(col2[i]) > 7, "", col2[i]) 288 | } 289 | 290 | col2 <- unique(col2) 291 | 292 | 293 | cat("Unique Colors\n") 294 | # > Unique Colors 295 | #RW Changed 296 | kable(head(data.frame(col2), 10)) 297 | 298 | # white 299 | # azure beige bisque black blue brown coral cyan 300 | # Cleaning up the colors 301 | # Clean up the colors a little more by capitalizing all of the colors and inserting a delimiter, and then pass the result to gsub 302 | #RW changed 303 | col <- toupper(paste0(col2, collapse = "|")) 304 | cat("Pass to gsub\n", head(col, 9)) 305 | # > Pass to gsub 306 | # > WHITE||AZURE|BEIGE|BISQUE|BLACK|BLUE|BROWN|CORAL|CYAN|DARKRED|DIMGRAY|GOLD|GRAY|GREEN|HOTPINK|IVORY|KHAKI|LINEN|MAGENTA|MAROON|NAVY|OLDLACE|ORANGE|ORCHID|PERU|PINK|PLUM|PURPLE|RED|SALMON|SIENNA|SKYBLUE|SNOW|TAN|THISTLE|TOMATO|VIOLET|WHEAT|YELLOW 307 | # 308 | # To replace the colors in the dataframe with blanks: 309 | # 310 | OnlineRetail$Description <- gsub(col, "", OnlineRetail$Description) 311 | 312 | # Check the length to see how much was removed. As before, print the character count before and after, to insure that the Description is reduced in size. 313 | after <- sum(nchar(OnlineRetail$Description)) 314 | print(before) 315 | # > [1] 13682222 316 | print(after) 317 | # > [1] 13341097 318 | # Verify that there are no more colors by inspection. 
We will look at the first 5, although you will probably want to look at more, all from different parts of the data set. We will leave "Cream" in for now, but we would remove that as well if we felt it would not help in the analysis. 319 | #RW changed 320 | kable(data.frame(OnlineRetail$Description[1:5])) 321 | # HANGING HEART T-LIGHT HOLDER 322 | # METAL LANTERN 323 | # CREAM CUPID HEARTS COAT HANGER 324 | # KNITTED UNION FLAG HOT WATER BOTTLE 325 | # WOOLLY HOTTIE HEART. 326 | # Filtering out single item transactions 327 | # Since we will want to have a 'basket' of items to perform some association rules on, we will want to filter out the transactions that only have 1 item per invoice. (That might be useful for a separate analysis of customer who only purchased one item). 328 | # . Let's use sqldf to first all of the single item transactions, and then we will create a separate dataframe consisting of the number of items per customer invoice. 329 | 330 | library(sqldf) 331 | 332 | #. Construct a query: How many distinct invoices were there? 333 | # 334 | # sqldf("select count(distinct InvoiceNo) from OnlineRetail") 335 | # # > Loading required package: tcltk 336 | # # > count(distinct InvoiceNo) 337 | # # > 1 25900 338 | # # 339 | # # . How many invoices contain only single transactions? 340 | 341 | single.trans <- sqldf("select InvoiceNo, count(*) as itemcount from OnlineRetail group by InvoiceNo having count(*)==1") 342 | 343 | # . Add them up. This shows us that there are not a lot of single transaction items. 344 | 345 | sum(single.trans$itemcount) 346 | # > [1] 5841 347 | # . SQL Query: How many have multiple transactions? 348 | 349 | x2 <- sqldf("select InvoiceNo, count(*) as itemcount from OnlineRetail group by InvoiceNo having count(*) > 1") 350 | 351 | sum(x2$itemcount) 352 | # > [1] 536068 353 | # 354 | # . Show a tabulation of the number of items per invoice to verify that they all have at least 2 items. 355 | 356 | kable(head(x2)) 357 | # 358 | # 359 | # 360 | # 536365 7 361 | # 536366 2 362 | # 536367 12 363 | # 536368 4 364 | # 536370 20 365 | # 536372 2 366 | # InvoiceNo itemcount 367 | # 536365 7 368 | # 536366 2 369 | # 536367 12 370 | # 536368 4 371 | # 536370 20 372 | # 536372 2 373 | # 374 | # Looking at the distributions 375 | # Now we can take a look at the distribution of the number of items. We can see that there is an average of 27.94 items. This will be a large enough assortment of items to do a meaningful analysis. 376 | mean(x2$itemcount) 377 | # > [1] 27 378 | # 379 | # We can also plot a histogram. The histogram shows a definite spike at a lower item number that we know is not equal to 1 which we have filtered out. We can see what that count is by using the min() function 380 | # 381 | hist(x2$itemcount, breaks = 500, xlim = c(0, 50)) 382 | 383 | # 384 | # 385 | # Insert Image: B05033_06_02.extension 386 | # 387 | # 388 | min(x2$itemcount) 389 | # > [1] 2 390 | # Merging the results back into the original data 391 | # We will want to hold on to the number of total items for each invoice on the original data frame. Merge the number of items contained in each invoice back to the original transactions, with the merge function, using Invoicenum as the key. 392 | # . If you count the number of distinct invoices before and after the merge, you can see that the invoice count is lower than prior to the merge. 
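# (merge() performs an inner join by default, so the single-item invoices that were
# excluded from x2 drop out of the merged result: 25900 - 5841 = 20059 distinct
# invoices, which the counts below confirm.)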
393 | nrow(OnlineRetail)
394 | # > [1] 541909
395 | sqldf("select count(distinct InvoiceNo) from OnlineRetail")
396 | # > count(distinct InvoiceNo)
397 | # > 1 25900
398 | tmp <- merge(OnlineRetail, x2, by = "InvoiceNo")
399 | nrow(tmp)
400 | # > [1] 536068
401 | sqldf("select count(distinct InvoiceNo) from tmp")
402 | # > count(distinct InvoiceNo)
403 | # > 1 20059
404 | # # we can see that the merge filtered out the single-item invoices
405 | #
406 | OnlineRetail <- merge(OnlineRetail, x2, by = "InvoiceNo")
407 | #
408 | # . Print the OnlineRetail table along with the merged itemcount
409 | #
410 | kable(OnlineRetail[1:5, ], padding = 0)
411 | #
412 | # InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country itemcount
414 | # 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.8 17850 United Kingdom 7
415 | # 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 12/1/2010 8:26 7.6 17850 United Kingdom 7
416 | # 536365 85123A HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.5 17850 United Kingdom 7
417 | # 536365 84029E WOOLLY HOTTIE HEART. 6 12/1/2010 8:26 3.4 17850 United Kingdom 7
418 | # 536365 71053 METAL LANTERN 6 12/1/2010 8:26 3.4 17850 United Kingdom 7
419 | #
470 | # Compressing the descriptions using CamelCase
471 | # For long descriptions, it is sometimes beneficial to compress the descriptions into CamelCase to improve readability. This is especially valuable when the descriptions are used as labels on x or y axes.
472 | # CamelCase is a convention which some programmers use for writing compound words: spaces are removed and each word begins with a capital letter. It is also a way of conserving space.
473 | # To accomplish this, we can write a small function called .simpleCap which performs this conversion. To illustrate how it works, we will pass it a two-element character vector c("A certain good book", "A very easy book") and observe the results.
474 | # Custom function to map to camelcase
475 | # This simple example maps the two-element character vector c("A certain good book", "A very easy book") to CamelCase. The vector is mapped to 2 new elements:
476 | # [1] "ACertainGoodBook", and [2] "AVeryEasyBook"
477 |
478 | # change descriptions to camelcase maybe append to itemnumber for uniqueness
479 | .simpleCap <- function(x) {
480 | # s <- strsplit(x, ' ')[[1]]
481 | s <- strsplit(tolower(x), " ")[[1]]
482 |
483 | aa <- paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = " ")
484 | gsub(" ", "", aa, fixed = TRUE)
485 |
486 | }
487 |
488 | a <- c("A certain good book", "A very easy book")
489 | a4 <- gsub(" ", "", .simpleCap(a), fixed = TRUE)
490 | a4
491 | # > [1] "ACertainGoodBook"
492 | lapply(a, .simpleCap)
493 | # > [[1]]
494 | # > [1] "ACertainGoodBook"
495 | # >
496 | # > [[2]]
497 | # > [1] "AVeryEasyBook"
498 | #
499 | # Let's use the .simpleCap function to create a new version of Description from our OnlineRetail dataset, and call it Desc2, which removes the blanks and capitalizes the first letter of each word.
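# (Note: lapply, used in the next step, returns a list column. If a plain
# character vector is preferred for Desc2, a hedged alternative using the same
# .simpleCap function would be:
# OnlineRetail$Desc2 <- vapply(as.character(OnlineRetail$Description),
#                              .simpleCap, FUN.VALUE = character(1), USE.NAMES = FALSE)
# The lapply version is kept below to stay consistent with the rest of the chapter.)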
500 | OnlineRetail$Desc2 <- lapply(as.character(OnlineRetail$Description), .simpleCap)
501 |
502 | kable(OnlineRetail[1:5, c(3, 10)], padding = 0)
503 | # Description Desc2
504 | # CREAM CUPID HEARTS COAT HANGER CreamCupidHeartsCoatHanger
505 | # SET 7 BABUSHKA NESTING BOXES Set7BabushkaNestingBoxes
506 | # HANGING HEART T-LIGHT HOLDER HangingHeartT-lightHolder
507 | # WOOLLY HOTTIE HEART. WoollyHottieHeart.
508 | # METAL LANTERN MetalLantern
509 | # Extracting the 'last word'
510 | # Often the first and last words of product descriptions contain useful information, and sometimes you can use a single word or phrase in place of the original longer description. This may not always be the case, but it is worth trying. In order to extract the last word from the descriptions, we can use the word function from the stringr package.
511 | library(stringr)
512 | OnlineRetail$lastword <- word(OnlineRetail$Description, -1) #supply -1 to extract the last word
513 | OnlineRetail$Description <- trimws(OnlineRetail$Description, "l")
514 | OnlineRetail$firstword <- word(OnlineRetail$Description, 1)
515 | # use head(OnlineRetail) if you are not using R Markdown
516 |
517 | kable(OnlineRetail[1:5, c(3, 10:12)], padding = 0)
533 | # Description Desc2 lastword firstword
534 | # CREAM CUPID HEARTS COAT HANGER CreamCupidHeartsCoatHanger HANGER CREAM
536 | # SET 7 BABUSHKA NESTING BOXES Set7BabushkaNestingBoxes BOXES SET
537 | # HANGING HEART T-LIGHT HOLDER HangingHeartT-lightHolder HOLDER HANGING
538 | # WOOLLY HOTTIE HEART. WoollyHottieHeart. HEART. WOOLLY
539 | # METAL LANTERN MetalLantern LANTERN METAL
540 | #
541 | # In order to see whether this 'lastword' mapping makes any sense, we will sort the results so that we can see the most frequently occurring ending words. We could ultimately use this information to create product subcategories, such as CASES, BAGS, SIGNS, etc.
542 | kable(head(as.data.frame(sort(table(OnlineRetail$lastword[]), decreasing = TRUE)),
543 | 10))
544 |
545 | sort(table(OnlineRetail$lastword[]), decreasing = TRUE)
546 | # 36640
547 | # DESIGN 25557
548 | # HOLDER 13528
549 | # RETROSPOT 13013
550 | # BOX 12939
551 | # SIGN 12210
552 | # CASES 10888
553 | # BAG 9723
554 | # SET 9056
555 | # CHRISTMAS 7868
556 | #
557 | #
558 | nrow(OnlineRetail)
559 | # > [1] 536068
560 | # The lastword description looks usable, so we will keep it as part of the analytics dataset.
561 | # Creating the Test and Training Datasets
562 | # Now that we are finished with our transformations, we will create the training and test data frames. We will perform a 50/50 split between training and test.
563 | # # Take a sample of full vector
564 | #
565 | nrow(OnlineRetail)
566 | # > [1] 536068
567 | pctx <- round(0.5 * nrow(OnlineRetail))
568 | set.seed(1)
569 |
570 | # randomize rows
571 |
572 | df <- OnlineRetail[sample(nrow(OnlineRetail)), ]
573 | rows <- nrow(df)
574 | OnlineRetail <- df[1:pctx, ] #training set
575 | OnlineRetail.test <- df[(pctx + 1):rows, ] #test set
576 | rm(df)
577 |
578 |
579 | # Display the number of rows in the training and test datasets.
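# (One caveat on the row-level split above: sampling individual rows can place
# rows from the same invoice in both the training and test sets. If whole
# invoices should stay together, a hedged sketch of an invoice-level split,
# with full.df standing in for the data frame before it was split, would be:
# invs <- unique(full.df$InvoiceNo)
# train.invs <- sample(invs, round(0.5 * length(invs)))
# train.df <- full.df[full.df$InvoiceNo %in% train.invs, ]
# test.df <- full.df[!full.df$InvoiceNo %in% train.invs, ]
# where train.df and test.df are illustrative names.)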
580 | 581 | nrow(OnlineRetail) 582 | # > [1] 268034 583 | nrow(OnlineRetail.test) 584 | # > [1] 268034 585 | # Saving the results 586 | # It is a good idea to periodically save your data frames, so that you can pick up your analysis from various checkpoints. 587 | # In this example I will first sort them both by InvoiceNo, and then save the test and train data sets to disk, where I can always load them back into memory as needed. 588 | #RW changed 589 | 590 | OnlineRetail <- OnlineRetail[order(OnlineRetail$InvoiceNo), ] 591 | OnlineRetail.test <- OnlineRetail.test[order(OnlineRetail.test$InvoiceNo), ] 592 | # 593 | # # save(OnlineRetail,file='OnlineRetail.full.Rda') 594 | # # save(OnlineRetail.test,file='OnlineRetail.test.Rda') 595 | # 596 | # # load('OnlineRetail.full.Rda') load('OnlineRetail.test.Rda') 597 | # 598 | # nrow(OnlineRetail) 599 | # > [1] 268034 600 | nrow(OnlineRetail.test) 601 | # > [1] 268034 602 | nrow(OnlineRetail) 603 | # > [1] 268034 604 | 605 | # At this point we have prepared our analytics data sets and are ready to move on to the actual analysis. 606 | # If you wish, you can save the entire workspace to disk as follows: 607 | save.image(file = "ch6 part 1.Rdata") 608 | # Loading the Analytics file 609 | # If you are still in a session in which OnlineRetail is still in memory, you are OK! However, if you are picking up where we left off, you will need to load the data that we saved in the last session. Start by setting the working directory and then loading the OnlineRetail training dataset. 610 | #RW temporarily disabling loads and saves and rm's 611 | #rm(list = ls()) 612 | #setwd("C:/PracticalPredictiveAnalytics/Outputs") 613 | #load("ch6 part 1.Rdata") 614 | # works for small data OnlineRetail <- OnlineRetail[1:10000,] 615 | cat(nrow(OnlineRetail), "rows loaded\n") 616 | # > 268034 rows loaded 617 | # The cat function in the previous step should reflect the number of rows in the training data set which is 268034 618 | # Determining the Consequent Rules 619 | # We have seen in the data prep stage that there are a large number of itemsets generated for each invoice. To begin to demonstrate the algorithm, we will extract one representative word from each product description, and use that word as the Consequent (or rhs) to build some association rules. We have already saved the first and last words from each product description. We would examine those words more closely and see if we can filter them to result in a manageable set of transactions. 620 | # Let's first preview the frequency of the first and last word of the descriptions in descending order. That should give us a clue as to what the popular products are. 621 | library(arules) 622 | # > Loading required package: Matrix 623 | # > 624 | # > Attaching package: 'arules' 625 | # > The following objects are masked from 'package:base': 626 | # > 627 | # > abbreviate, write 628 | library(arulesViz) 629 | # > Loading required package: grid 630 | # Print the popular "first words" of the description. 
We will do that by sorting the frequency of "first word" in descending order: 631 | # 632 | kable(head(as.data.frame(sort(table(OnlineRetail$firstword[]), decreasing = TRUE)), 10)) 633 | sort(table(OnlineRetail$firstword[]), decreasing = TRUE) 634 | # SET 17381 635 | # BAG 8720 636 | # LUNCH 7692 637 | # RETROSPOT 7155 638 | # PACK 6861 639 | # VINTAGE 6204 640 | # HEART 4799 641 | # HANGING 4457 642 | # DOORMAT 4175 643 | # REGENCY 3452 644 | # 645 | # Similarly, print the popular "last words" of the description 646 | 647 | kable(head(as.data.frame(sort(table(OnlineRetail$lastword[]), decreasing = TRUE)), 648 | 10)) 649 | sort(table(OnlineRetail$lastword[]), decreasing = TRUE) 650 | # 18376 651 | # DESIGN 12713 652 | # HOLDER 6792 653 | # BOX 6528 654 | # RETROSPOT 6517 655 | # SIGN 6184 656 | # CASES 5465 657 | # BAG 4826 658 | # SET 4418 659 | # CHRISTMAS 3963 660 | # Looking at the popular terms above show that many transactions concern the purchases of Boxes, Cases, Signs, Bags, etc. 661 | # Replacing Missing Values 662 | # For lastword, we see that there are some blank values, so we will replace any blank values with any value found in firstword. 663 | 664 | # replace blank values in lastword, with first word. 665 | 666 | OnlineRetail$lastword <- ifelse(OnlineRetail$lastword == "", OnlineRetail$firstword, 667 | OnlineRetail$lastword) 668 | 669 | 670 | # After we are done with this, we will take another look at the frequencies and observe that the blank values have disappeared. 671 | 672 | head(as.data.frame(sort(table(OnlineRetail$lastword[]), decreasing = TRUE)), 673 | 10) 674 | sort(table(OnlineRetail$lastword[]), decreasing = TRUE) 675 | # > DESIGN 12713 676 | # > HOLDER 6792 677 | # > RETROSPOT 6574 678 | # > BOX 6528 679 | # > SIGN 6184 680 | # > BAG 5761 681 | # > CASES 5465 682 | # > SET 4418 683 | # > HEART 4027 684 | # > CHRISTMAS 4005 685 | # Making the final subset 686 | # Based upon these frequencies, we will filter the data to only include a subset of the top categories. We will exclude some of the terms which do not apply to the physical product, such as "DESIGN", "SET", and any associated colors. 687 | # Testing OnlineRetail2 <- OnlineRetail 688 | OnlineRetail2 <- subset(OnlineRetail, lastword %in% c("BAG", "CASES", "HOLDER", 689 | "BOX", "SIGN", "CHRISTMAS", "BOTTLE", "BUNTING", "MUG", "BOWL", "CANDLES", 690 | "COVER", "HEART", "MUG", "BOWL")) 691 | 692 | # Run the table function again on the results to see the new frequencies. 693 | 694 | head(as.data.frame(sort(table(OnlineRetail2$lastword[]), decreasing = TRUE)), 695 | 10) 696 | sort(table(OnlineRetail2$lastword[]), decreasing = TRUE) 697 | # > HOLDER 6792 698 | # > BOX 6528 699 | # > SIGN 6184 700 | # > BAG 5761 701 | # > CASES 5465 702 | # > HEART 4027 703 | # > CHRISTMAS 4005 704 | # > BOTTLE 3795 705 | # > BUNTING 3066 706 | # > MUG 2900 707 | # 708 | # Use the nrow function to see how much of the data was filtered from the original. 709 | 710 | cat(nrow(OnlineRetail), "Original before subsetting\n") 711 | # > 268034 Original before subsetting 712 | cat(nrow(OnlineRetail2), "After Subsetting\n") 713 | # > 55609 After Subsetting 714 | # Creating the Market Basket Transaction file 715 | # We are almost there! There is extra step that we need to do in order to prepare our data for market basket analysis. 716 | # The association rules package requires that the data be in transaction format. Transactions can either be specified in two different formats: 717 | # 1. One transaction per itemset with an identifier. 
This shows the entire basket on one line, just as we saw with the Groceries data.
718 | # 2. One single item per line, with an identifier.
719 | # Additionally, you can create the actual transaction file in two different ways, by either:
720 | # 1. Physically writing a transactions file
721 | # 2. Coercing a dataframe to transaction format.
722 | # For smaller amounts of data, coercing the dataframe to a transaction file is simpler, but for large transaction files, writing the transaction file first is preferable, since such files can be appended to incrementally from large operational transaction systems. We will illustrate both ways.
723 | # Method 1 - Coercing a dataframe to a transaction file.
724 | # Now we are ready to coerce the data frame. We will create a temporary data frame containing just the transaction id (InvoiceNo), and the descriptor (lastword).
725 | # First we will verify the column names and numbers for these two variables. We can see that they correspond to columns 1 and 11 of the dataframe by first running a colnames function on OnlineRetail2.
726 | colnames(OnlineRetail2)
727 | # > [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
728 | # > [6] "UnitPrice" "CustomerID" "Country" "itemcount" "Desc2"
729 | # > [11] "lastword" "firstword"
730 | #
731 | # As a double check, display the first 5 rows, specifying the indices found above.
732 | kable(head(OnlineRetail2[, c(1, 11)], 5))
733 | # InvoiceNo lastword
734 | # 6 536365 HOLDER
735 | # 45 536370 BOX
736 | # 39 536370 BOX
737 | # 57 536373 HOLDER
738 | # 59 536373 BOTTLE
739 | # Next we will create a temporary data frame from just these two columns, give them names, and eliminate duplicates.
740 | #
741 | # First, create the dataframe with only two columns, named TransactionID and Items.
742 | tmp <- data.frame(OnlineRetail2[, 1], OnlineRetail2[, 11])
743 | names(tmp)[1] <- "TransactionID"
744 | names(tmp)[2] <- "Items"
745 |
746 | tmp <- unique(tmp)
747 | nrow(tmp)
748 | # > [1] 33182
749 | # Verify the results
750 | kable(head(tmp))
751 | # TransactionID Items
752 | # 1 536365 HOLDER
753 | # 2 536370 BOX
754 | # 4 536373 HOLDER
755 | # 5 536373 BOTTLE
756 | # 7 536373 MUG
757 | # 9 536375 HOLDER
758 | # We will now use the split function to group the descriptions (lastword, which is column 2) by InvoiceNo (column 1). The "as" coercion is the critical step, as it converts the results of the split to transaction form.
759 |
760 | trans4 <- as(split(tmp[, 2], tmp[, 1]), "transactions")
761 | # Inspecting the transaction file
762 | # Once the data has been coerced to transaction form, we can use the inspect function to examine the data. Note: when inspecting transaction files, you do not use the normal print or head functions directly, since the objects are in sparse format. You use the "inspect" function instead.
763 | # If you happen to have the tm package loaded (which we will use later), you MUST preface inspect with arules::inspect, since there is also an inspect function in the tm package which serves a different purpose.
764 | # If you run an inspect command on the first 5 records, you can see that the data is in 'basket' format, i.e. each invoice shows the itemset, delimited by {}, that is associated with that invoice.
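# Before inspecting individual transactions, a quick check (using the
# itemLabels function from arules on the trans4 object created above) lists the
# distinct items that made it into the transaction object:
itemLabels(trans4)
# This should return the 13 'lastword' categories kept earlier (BAG, BOTTLE, BOX, and so on).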
765 | arules::inspect(trans4[1:5]) 766 | # > items transactionID 767 | # > 1 {HOLDER} 536365 768 | # > 2 {BOX} 536370 769 | # > 3 {BOTTLE,HOLDER,MUG} 536373 770 | # > 4 {BOTTLE,HOLDER} 536375 771 | # > 5 {HOLDER} 536376 772 | # Another way of displaying transactions, other than using inspect, is to first coerce it to a matrix and then display the items as boolean values. If you have many items you will need to subset the column vectors so that they fit on the screen. 773 | as(trans4, "matrix")[1:5, 1:5] 774 | # > BAG BOTTLE BOWL BOX BUNTING 775 | # > 536365 FALSE FALSE FALSE FALSE FALSE 776 | # > 536370 FALSE FALSE FALSE TRUE FALSE 777 | # > 536373 FALSE TRUE FALSE FALSE FALSE 778 | # > 536375 FALSE TRUE FALSE FALSE FALSE 779 | # > 536376 FALSE FALSE FALSE FALSE FALSE 780 | as(trans4, "matrix")[1:5, 6:ncol(trans4)] 781 | # > CANDLES CASES CHRISTMAS COVER HEART HOLDER MUG SIGN 782 | # > 536365 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 783 | # > 536370 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 784 | # > 536373 FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE 785 | # > 536375 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 786 | # > 536376 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 787 | # Obtaining the topN purchased items 788 | # Even before we run any association rules we can obtain counts of the topN items, as shown in the first plot. 789 | # If we are looking for items with a specified support level, we can include that as a parameter to the function as well. 790 | # The second plot below is another way of plotting the items, which specifies support as a filter, and it indicates higher support for "Bags","Boxes","Cases", and "Holders" 791 | par(mfrow = c(1, 2), bg = "white", col = c("blue")) 792 | 793 | itemFrequencyPlot(trans4, topN = 10, type = "absolute", cex.names = 0.7) 794 | itemFrequencyPlot(trans4, support = 0.2, cex.names = 0.75) 795 | 796 | # Insert Image: B05033_06_03.extension 797 | # 798 | # 799 | # Reset the graphics parameters 800 | dev.off() 801 | # > null device 802 | # > 1 803 | # Finding the association rules 804 | # As shown earlier, the association rules are run using the apriori function. The apriori function has several filtering parameters which are used to control the number of rules which are produced. In our example we will specify the minimum support and confidence threshold that a rule needs to pass in order to be considered. 805 | # The number that you pass to aprior depends upon how you want to look at the rules. It can be an initial screening, or it can be a 'deeper dive', after you have performed several passes. But generally, If we want many rules we can decrease the support and confidence parameters. If we want to focus on items which appear frequently we raise the support threshold. If we want to concentrate on higher quality and more 'accurate' rules, we would raise the confidence level. 806 | # These numbers are not absolutes. They are numbers that you can tweak in order to limit the number of rules that you will look at, versus their quality. 807 | # The minlen=2 parameter is often specified in order to guarantee that there are itemsets included in the left hand side of the rule. 
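# For reference, the standard definitions of the three rule metrics used below are:
#   support(X => Y)    = fraction of transactions containing both X and Y
#   confidence(X => Y) = support(X => Y) / support(X), an estimate of P(Y | X)
#   lift(X => Y)       = confidence(X => Y) / support(Y); lift > 1 means X and Y
#                        appear together more often than expected if they were independent.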
808 | rulesx <- apriori(trans4, parameter = list(minlen = 2, support = 0.02, confidence = 0.01)) 809 | # > Apriori 810 | # > 811 | # > Parameter specification: 812 | # > confidence minval smax arem aval originalSupport support minlen maxlen 813 | # > 0.01 0.1 1 none FALSE TRUE 0.02 2 10 814 | # > target ext 815 | # > rules FALSE 816 | # > 817 | # > Algorithmic control: 818 | # > filter tree heap memopt load sort verbose 819 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 820 | # > 821 | # > Absolute minimum support count: 272 822 | # > 823 | # > set item appearances ...[0 item(s)] done [0.00s]. 824 | # > set transactions ...[13 item(s), 13617 transaction(s)] done [0.00s]. 825 | # > sorting and recoding items ... [13 item(s)] done [0.00s]. 826 | # > creating transaction tree ... done [0.00s]. 827 | # > checking subsets of size 1 2 3 4 5 done [0.00s]. 828 | # > writing ... [829 rule(s)] done [0.00s]. 829 | # > creating S4 object ... done [0.00s]. 830 | # The output from the apriori algorithm will tell us how many rules were generated. If you also perform a str function on rulesx, that will show you all of the sublists that are contained within the rulesx object, which can get complex. Ordinarily the output is sufficient for analyzing the rules, but these sublists can be used programmatically to print the number of rules generated if you vary the paramaters in a loop. For our example, printing the second column of one of the sublists (rulesx@lhs@data@Dim[2]) shows that here have been 829 rules generated. This is a manageable number of rules to look at. 831 | str(rulesx) 832 | # > Formal class 'rules' [package "arules"] with 4 slots 833 | # > ..@ lhs :Formal class 'itemMatrix' [package "arules"] with 3 slots 834 | # > .. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots 835 | # > .. .. .. .. ..@ i : int [1:1752] 8 5 8 7 8 11 8 2 8 4 ... 836 | # > .. .. .. .. ..@ p : int [1:830] 0 1 2 3 4 5 6 7 8 9 ... 837 | # > .. .. .. .. ..@ Dim : int [1:2] 13 829 838 | # > .. .. .. .. ..@ Dimnames:List of 2 839 | # > .. .. .. .. .. ..$ : NULL 840 | # > .. .. .. .. .. ..$ : NULL 841 | # > .. .. .. .. ..@ factors : list() 842 | # > .. .. ..@ itemInfo :'data.frame': 13 obs. of 1 variable: 843 | # > .. .. .. ..$ labels: chr [1:13] "BAG" "BOTTLE" "BOWL" "BOX" ... 844 | # > .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables 845 | # > ..@ rhs :Formal class 'itemMatrix' [package "arules"] with 3 slots 846 | # > .. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots 847 | # > .. .. .. .. ..@ i : int [1:829] 5 8 7 8 11 8 2 8 4 8 ... 848 | # > .. .. .. .. ..@ p : int [1:830] 0 1 2 3 4 5 6 7 8 9 ... 849 | # > .. .. .. .. ..@ Dim : int [1:2] 13 829 850 | # > .. .. .. .. ..@ Dimnames:List of 2 851 | # > .. .. .. .. .. ..$ : NULL 852 | # > .. .. .. .. .. ..$ : NULL 853 | # > .. .. .. .. ..@ factors : list() 854 | # > .. .. ..@ itemInfo :'data.frame': 13 obs. of 1 variable: 855 | # > .. .. .. ..$ labels: chr [1:13] "BAG" "BOTTLE" "BOWL" "BOX" ... 856 | # > .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables 857 | # > ..@ quality:'data.frame': 829 obs. of 3 variables: 858 | # > .. ..$ support : num [1:829] 0.0239 0.0239 0.0206 0.0206 0.0245 ... 859 | # > .. ..$ confidence: num [1:829] 0.278 0.21 0.24 0.129 0.285 ... 860 | # > .. ..$ lift : num [1:829] 2.43 2.43 1.5 1.5 2.03 ... 861 | # > ..@ info :List of 4 862 | # > .. ..$ data : symbol trans4 863 | # > .. ..$ ntransactions: int 13617 864 | # > .. ..$ support : num 0.02 865 | # > .. 
..$ confidence : num 0.01 866 | rulesx@lhs@data@Dim[2] 867 | # > [1] 829 868 | # Examining the rules summary 869 | # The rule length distribution indicates the number of itemsets which appear in both the left and right side of the association. The most frequent number of item is 3, which either means that the purchase of 2 items together, implies purchasing a single item, or conversely, purchasing a single item implies purchasing two additional item. 870 | # Examining the rules quality and observe the highest support 871 | # The quality measures tell you something about the distribution of support, confidence, and lift. The quality function will give the support, list, and confidence for each of the rules. You can also sort the rules by each of these important measures and observe which rules have the highest measures. 872 | # We can see from the low support distribution that there are no particular itemsets which occur more than others. If you inspect the rules sorted by the support level, you will see the highest support level at .046 Notice that this agrees with the Max. support level given in the summary. The top 3 support levels are for those customers who purchase Holders, Boxes, or signs. 873 | summary(rulesx) 874 | # > set of 829 rules 875 | # > 876 | # > rule length distribution (lhs + rhs):sizes 877 | # > 2 3 4 5 878 | # > 156 438 220 15 879 | # > 880 | # > Min. 1st Qu. Median Mean 3rd Qu. Max. 881 | # > 2.000 3.000 3.000 3.113 4.000 5.000 882 | # > 883 | # > summary of quality measures: 884 | # > support confidence lift 885 | # > Min. :0.02005 Min. :0.1293 Min. :1.000 886 | # > 1st Qu.:0.02210 1st Qu.:0.3672 1st Qu.:1.948 887 | # > Median :0.02512 Median :0.5110 Median :2.393 888 | # > Mean :0.02987 Mean :0.5068 Mean :2.444 889 | # > 3rd Qu.:0.03202 3rd Qu.:0.6533 3rd Qu.:2.899 890 | # > Max. :0.09885 Max. :0.8571 Max. :5.419 891 | # > 892 | # > mining info: 893 | # > data ntransactions support confidence 894 | # > trans4 13617 0.02 0.01 895 | head(quality(rulesx)) #also look at the quality measures for each of the rules 896 | # > support confidence lift 897 | # > 1 0.02394066 0.2779199 2.432156 898 | # > 2 0.02394066 0.2095116 2.432156 899 | # > 3 0.02063597 0.2395567 1.500480 900 | # > 4 0.02063597 0.1292548 1.500480 901 | # > 5 0.02452816 0.2847400 2.026819 902 | # > 6 0.02452816 0.1745949 2.026819 903 | tmp <- as.data.frame(inspect(head(sort(rulesx, by = "support"), 10))) 904 | # > lhs rhs support confidence lift 905 | # > 155 {BOX} => {HOLDER} 0.09884703 0.3417111 1.063075 906 | # > 156 {HOLDER} => {BOX} 0.09884703 0.3075166 1.063075 907 | # > 135 {HEART} => {HOLDER} 0.09620328 0.4768839 1.483602 908 | # > 136 {HOLDER} => {HEART} 0.09620328 0.2992918 1.483602 909 | # > 151 {BAG} => {BOX} 0.09304546 0.3859275 1.334139 910 | # > 152 {BOX} => {BAG} 0.09304546 0.3216552 1.334139 911 | # > 149 {SIGN} => {HOLDER} 0.09069545 0.4203540 1.307736 912 | # > 150 {HOLDER} => {SIGN} 0.09069545 0.2821567 1.307736 913 | # > 141 {CASES} => {BOX} 0.08511420 0.4048201 1.399451 914 | # > 142 {BOX} => {CASES} 0.08511420 0.2942371 1.399451 915 | # Confidence and Lift measures. 916 | # Similar to above, sort the rules and examine the highest confidence and lift measures. 
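# (Aside: length() on a rules object returns the same rule count as the slot
# access shown above, and is simpler to use programmatically.)
length(rulesx)
# > [1] 829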
917 | tmp <- as.data.frame(arules::inspect(head(sort(rulesx, by = "confidence"), 10))) 918 | # > lhs rhs support confidence lift 919 | # > 815 {BAG,BOWL,BOX,SIGN} => {HOLDER} 0.02070941 0.8571429 2.666601 920 | # > 631 {BAG,BOTTLE,BOWL} => {BOX} 0.02181097 0.8510029 2.941890 921 | # > 643 {BOWL,HEART,SIGN} => {HOLDER} 0.02151722 0.8492754 2.642125 922 | # > 820 {BAG,BOX,HEART,SIGN} => {HOLDER} 0.02137035 0.8434783 2.624090 923 | # > 632 {BOTTLE,BOWL,BOX} => {BAG} 0.02181097 0.8413598 3.489734 924 | # > 816 {BAG,BOWL,HOLDER,SIGN} => {BOX} 0.02070941 0.8392857 2.901385 925 | # > 817 {BOWL,BOX,HOLDER,SIGN} => {BAG} 0.02070941 0.8392857 3.481131 926 | # > 707 {BOTTLE,HEART,SIGN} => {HOLDER} 0.02144378 0.8366762 2.602929 927 | # > 651 {BAG,BOWL,HEART} => {HOLDER} 0.02115003 0.8347826 2.597038 928 | # > 719 {BOTTLE,CASES,SIGN} => {BAG} 0.02063597 0.8289086 3.438089 929 | tmp <- as.data.frame(arules::inspect(head(sort(rulesx, by = "lift"), 10))) 930 | # > lhs rhs support confidence lift 931 | # > 602 {BAG,HOLDER,SIGN} => {COVER} 0.02012191 0.4667802 5.418710 932 | # > 598 {BAG,BOX,SIGN} => {COVER} 0.02019534 0.4661017 5.410833 933 | # > 819 {BAG,BOX,HOLDER,SIGN} => {BOWL} 0.02070941 0.6698337 5.254105 934 | # > 610 {BAG,BOX,HOLDER} => {COVER} 0.02070941 0.4483307 5.204534 935 | # > 606 {BOX,HOLDER,SIGN} => {COVER} 0.02004847 0.4333333 5.030435 936 | # > 634 {BAG,BOTTLE,BOX} => {BOWL} 0.02181097 0.6279070 4.925236 937 | # > 638 {BAG,HEART,SIGN} => {BOWL} 0.02034222 0.6141907 4.817647 938 | # > 662 {BAG,CASES,SIGN} => {BOWL} 0.02144378 0.6134454 4.811801 939 | # > 159 {BAG,BOWL} => {COVER} 0.02092972 0.4025424 4.672992 940 | # > 670 {CASES,HOLDER,SIGN} => {BOWL} 0.02078284 0.5811088 4.558156 941 | # Filtering a large number of rules 942 | # Once we have the rules built, we can use special subsetting functions to filter items from itemsets on either the left(lhs) or right(rhs) side of the association rule. This is valuable if you are looking for particular items within the itemsets. 943 | # Use the %in% operator to perform an exact match, or %pin% to perform a partial match. 944 | # # to see what 'Christmas' purchases imply. 
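# For comparison, an exact item match (as opposed to the partial %pin% match
# used below) would look like this; exact.rules is just an illustrative name:
# exact.rules <- subset(rulesx, subset = lhs %in% "CHRISTMAS")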
945 | 946 | lhs.rules <- subset(rulesx, subset = lhs %pin% "CHRISTMAS") 947 | lhs.rules 948 | # > set of 44 rules 949 | inspect(lhs.rules) 950 | # > lhs rhs support confidence lift 951 | # > 4 {CHRISTMAS} => {COVER} 0.02063597 0.1292548 1.500480 952 | # > 26 {CHRISTMAS} => {CANDLES} 0.02702504 0.1692732 1.481358 953 | # > 47 {CHRISTMAS} => {MUG} 0.02379379 0.1490340 1.060845 954 | # > 49 {CHRISTMAS} => {BOWL} 0.02680473 0.1678933 1.316937 955 | # > 51 {CHRISTMAS} => {BUNTING} 0.03018286 0.1890524 1.108191 956 | # > 53 {CHRISTMAS} => {BOTTLE} 0.04553132 0.2851886 1.797876 957 | # > 55 {CHRISTMAS} => {HEART} 0.03708600 0.2322907 1.151475 958 | # > 57 {CHRISTMAS} => {CASES} 0.04751414 0.2976081 1.415484 959 | # > 59 {CHRISTMAS} => {SIGN} 0.03547037 0.2221711 1.029715 960 | # > 61 {CHRISTMAS} => {BAG} 0.04663289 0.2920883 1.211504 961 | # > 63 {CHRISTMAS} => {BOX} 0.05500477 0.3445262 1.191016 962 | # > 65 {CHRISTMAS} => {HOLDER} 0.05133289 0.3215271 1.000282 963 | # > 253 {BOTTLE,CHRISTMAS} => {CASES} 0.02210472 0.4854839 2.309058 964 | # > 254 {CASES,CHRISTMAS} => {BOTTLE} 0.02210472 0.4652241 2.932850 965 | # > 256 {BOTTLE,CHRISTMAS} => {BAG} 0.02283910 0.5016129 2.080555 966 | # > 257 {BAG,CHRISTMAS} => {BOTTLE} 0.02283910 0.4897638 3.087552 967 | # > 259 {BOTTLE,CHRISTMAS} => {BOX} 0.02489535 0.5467742 1.890181 968 | # > 260 {BOX,CHRISTMAS} => {BOTTLE} 0.02489535 0.4526035 2.853288 969 | # > 262 {BOTTLE,CHRISTMAS} => {HOLDER} 0.02504223 0.5500000 1.711069 970 | # > 263 {CHRISTMAS,HOLDER} => {BOTTLE} 0.02504223 0.4878398 3.075423 971 | # > 265 {CHRISTMAS,HEART} => {BAG} 0.02004847 0.5405941 2.242239 972 | # > 266 {BAG,CHRISTMAS} => {HEART} 0.02004847 0.4299213 2.131139 973 | # > 268 {CHRISTMAS,HEART} => {BOX} 0.02203128 0.5940594 2.053645 974 | # > 269 {BOX,CHRISTMAS} => {HEART} 0.02203128 0.4005340 1.985465 975 | # > 271 {CHRISTMAS,HEART} => {HOLDER} 0.02247191 0.6059406 1.885102 976 | # > 272 {CHRISTMAS,HOLDER} => {HEART} 0.02247191 0.4377682 2.170036 977 | # > 274 {CASES,CHRISTMAS} => {BAG} 0.02291254 0.4822257 2.000142 978 | # > 275 {BAG,CHRISTMAS} => {CASES} 0.02291254 0.4913386 2.336904 979 | # > 277 {CASES,CHRISTMAS} => {BOX} 0.02548285 0.5363215 1.854047 980 | # > 278 {BOX,CHRISTMAS} => {CASES} 0.02548285 0.4632844 2.203473 981 | # > 280 {CASES,CHRISTMAS} => {HOLDER} 0.02283910 0.4806801 1.495412 982 | # > 281 {CHRISTMAS,HOLDER} => {CASES} 0.02283910 0.4449213 2.116135 983 | # > 283 {CHRISTMAS,SIGN} => {BAG} 0.02137035 0.6024845 2.498943 984 | # > 284 {BAG,CHRISTMAS} => {SIGN} 0.02137035 0.4582677 2.123973 985 | # > 286 {CHRISTMAS,SIGN} => {BOX} 0.02342660 0.6604555 2.283174 986 | # > 287 {BOX,CHRISTMAS} => {SIGN} 0.02342660 0.4259012 1.973961 987 | # > 289 {CHRISTMAS,SIGN} => {HOLDER} 0.02276566 0.6418219 1.996731 988 | # > 290 {CHRISTMAS,HOLDER} => {SIGN} 0.02276566 0.4434907 2.055484 989 | # > 292 {BAG,CHRISTMAS} => {BOX} 0.02687817 0.5763780 1.992521 990 | # > 293 {BOX,CHRISTMAS} => {BAG} 0.02687817 0.4886515 2.026795 991 | # > 295 {BAG,CHRISTMAS} => {HOLDER} 0.02460160 0.5275591 1.641255 992 | # > 296 {CHRISTMAS,HOLDER} => {BAG} 0.02460160 0.4792561 1.987825 993 | # > 298 {BOX,CHRISTMAS} => {HOLDER} 0.02812661 0.5113485 1.590823 994 | # > 299 {CHRISTMAS,HOLDER} => {BOX} 0.02812661 0.5479256 1.894162 995 | # # what purchases yielded Candles? 
996 | 997 | rhs.rules <- subset(rulesx, subset = rhs %pin% c("CANDLES")) 998 | rhs.rules 999 | # > set of 26 rules 1000 | # tmp <- as.data.frame(arules::inspect(head(sort(rhs.rules, by = "support"), 10))) 1001 | # > lhs rhs support confidence lift 1002 | # > 46 {HOLDER} => {CANDLES} 0.04920320 0.1530729 1.339584 1003 | # > 44 {BOX} => {CANDLES} 0.04714695 0.1629855 1.426333 1004 | # > 38 {CASES} => {CANDLES} 0.03958287 0.1882641 1.647552 1005 | # > 42 {BAG} => {CANDLES} 0.03789381 0.1571733 1.375469 1006 | # > 36 {HEART} => {CANDLES} 0.03473599 0.1721878 1.506865 1007 | # > 40 {SIGN} => {CANDLES} 0.03429537 0.1589517 1.391031 1008 | # > 34 {BOTTLE} => {CANDLES} 0.02908130 0.1833333 1.604402 1009 | # > 30 {BOWL} => {CANDLES} 0.02805317 0.2200461 1.925686 1010 | # > 252 {BOX,HOLDER} => {CANDLES} 0.02768598 0.2800892 2.451140 1011 | # > 32 {BUNTING} => {CANDLES} 0.02739223 0.1605682 1.405178 1012 | inspect(head(sort(rhs.rules, by = "confidence"))) 1013 | # > lhs rhs support confidence lift 1014 | # > 219 {BAG,HEART} => {CANDLES} 0.02056253 0.3517588 3.078342 1015 | # > 213 {BOWL,BOX} => {CANDLES} 0.02019534 0.3467844 3.034809 1016 | # > 216 {BOTTLE,HOLDER} => {CANDLES} 0.02026878 0.3183391 2.785876 1017 | # > 222 {BOX,HEART} => {CANDLES} 0.02247191 0.3122449 2.732544 1018 | # > 237 {BAG,SIGN} => {CANDLES} 0.02232503 0.3111566 2.723020 1019 | # > 234 {CASES,HOLDER} => {CANDLES} 0.02232503 0.3095723 2.709156 1020 | # 1021 | # The plot function in arules is also very flexible. You can look at various scatterplots across the various metrics, or even count and group a small number of rules and show the metrics as bubbles. 1022 | 1023 | plot(rhs.rules, method = "scatterplot") 1024 | 1025 | # Insert Image: B05033_06_04.extension 1026 | # 1027 | # 1028 | plot(rhs.rules, method = "grouped") 1029 | 1030 | # Insert Image: B05033_06_05.extension 1031 | # 1032 | # Generating Many Rules 1033 | # If you wish to generate as many rules as possible, set support and confidence to a very low number. 1034 | 1035 | many_rules <- apriori(trans4, parameter = list(minlen = 1, support = 0.01, confidence = 0.01)) 1036 | # > Apriori 1037 | # > 1038 | # > Parameter specification: 1039 | # > confidence minval smax arem aval originalSupport support minlen maxlen 1040 | # > 0.01 0.1 1 none FALSE TRUE 0.01 1 10 1041 | # > target ext 1042 | # > rules FALSE 1043 | # > 1044 | # > Algorithmic control: 1045 | # > filter tree heap memopt load sort verbose 1046 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 1047 | # > 1048 | # > Absolute minimum support count: 136 1049 | # > 1050 | # > set item appearances ...[0 item(s)] done [0.00s]. 1051 | # > set transactions ...[13 item(s), 13617 transaction(s)] done [0.00s]. 1052 | # > sorting and recoding items ... [13 item(s)] done [0.00s]. 1053 | # > creating transaction tree ... done [0.00s]. 1054 | # > checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s]. 1055 | # > writing ... [8898 rule(s)] done [0.00s]. 1056 | # > creating S4 object ... done [0.00s]. 1057 | many_rules 1058 | # > set of 8898 rules 1059 | 1060 | # Plotting many rules 1061 | # Plots are especially helpful for scenarios in which there are many rules generated, and you need to filter on specific support and confidence ranges. 1062 | # 1063 | # Here is a plot which shows 2 of the 3 metrics, along the x and y axis, and the 3rd metric (lift, support, or confidence) as shading. As the plot below suggests, there is a cluster of rules with high lift (>8), high confidence (> .6), but all with low support. 
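# That high-lift, high-confidence cluster can also be extracted directly by
# filtering on the quality columns (a hedged sketch; high.rules is an
# illustrative name and the thresholds are the ones mentioned above):
high.rules <- subset(many_rules, subset = lift > 8 & confidence > 0.6)
high.rules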
1064 | 1065 | sel <- plot(many_rules, measure = c("support", "confidence"), shading = "lift", 1066 | interactive = FALSE) 1067 | 1068 | # Insert Image: B05033_06_06.extension 1069 | # Method 2 - Creating a physical transactions file 1070 | # Now that you know have to run association rules using the "coerce to dataframe" method, we will now illustrate the "write to file" method. 1071 | # . In the write to file method, each item is written to a separate line, along with the identifying key, which in our case is the InvoiceId. 1072 | # . The advantage to the write to file method is that very large data files can be accumulated separately, and them combined together if needed. 1073 | # . You can use the file.show function to display the contents of the file that will be input to the association rules algorithm. 1074 | 1075 | setwd("C:/PracticalPredictiveAnalytics/Outputs") 1076 | #RW is this needed? 1077 | #load("OnlineRetail.full.Rda") 1078 | # OnlineRetail <- OnlineRetail[1:100,] 1079 | nrow(OnlineRetail) 1080 | # > [1] 268034 1081 | head(OnlineRetail) 1082 | # > InvoiceNo StockCode Description Quantity 1083 | # > 5 536365 71053 METAL LANTERN 6 1084 | # > 6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 1085 | # > 2 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 1086 | # > 4 536365 84029E WOOLLY HOTTIE HEART. 6 1087 | # > 1 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 1088 | # > 8 536366 22632 HAND WARMER POLKA DOT 6 1089 | # > InvoiceDate UnitPrice CustomerID Country itemcount 1090 | # > 5 12/1/2010 8:26 3.39 17850 United Kingdom 7 1091 | # > 6 12/1/2010 8:26 4.25 17850 United Kingdom 7 1092 | # > 2 12/1/2010 8:26 7.65 17850 United Kingdom 7 1093 | # > 4 12/1/2010 8:26 3.39 17850 United Kingdom 7 1094 | # > 1 12/1/2010 8:26 2.75 17850 United Kingdom 7 1095 | # > 8 12/1/2010 8:28 1.85 17850 United Kingdom 2 1096 | # > Desc2 lastword firstword 1097 | # > 5 MetalLantern LANTERN METAL 1098 | # > 6 GlassStarFrostedT-lightHolder HOLDER GLASS 1099 | # > 2 Set7BabushkaNestingBoxes BOXES SET 1100 | # > 4 WoollyHottieHeart. HEART. WOOLLY 1101 | # > 1 CreamCupidHeartsCoatHanger HANGER CREAM 1102 | # > 8 HandWarmerPolkaDot DOT HAND 1103 | # concatenate the Invoice Number to the Description separated by a delimiter 1104 | data2 <- paste(OnlineRetail$InvoiceNo, OnlineRetail$Desc2, sep = "!") 1105 | 1106 | # eliminate duplicates 1107 | data2 <- unique(data2) 1108 | # 1109 | 1110 | write(data2, file = "demo_single") 1111 | # file.show('demo_single') 1112 | # Reading the transaction file back in 1113 | # Use the read.transaction to read the delimited file back into memory formatted as a transaction file. This will have the same results as coercing the dataframe to a transaction file as we did earlier. 1114 | # The difference is that the transactions are formatted one transaction per line, as specified by the format='single' option. We also specify where the transactionid and item desriptions are via the cols option. It is possible to have multiple descriptors and transaction id's, and identify them with the cols options. The sep keyword designates the delimiter, which in this case is the ! character. There is also a remove duplicate transactions option, which is a logical value which determines whether or not you want to eliminate duplicates. 1115 | # The returned object, trans, is an itemMatrix. You can type 'trans' to see the dimension, or run the dim(trans) to see the dimensions. This will tell you how many transactions the itemMatrix is based upon. 
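# For reference, each line of the 'demo_single' file has the form InvoiceNo!Item,
# so the first invoice shown earlier would contribute lines such as:
# 536365!MetalLantern
# 536365!GlassStarFrostedT-lightHolder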
1116 | # As before, to view the items in the trans object, use the inspect function. 1117 | library(arules) 1118 | library(arulesViz) 1119 | setwd("C:/PracticalPredictiveAnalytics/Outputs") 1120 | 1121 | # file.show('demo_single') 1122 | trans <- read.transactions("demo_single", format = "single", sep = "!", cols = c(1, 1123 | 2), rm.duplicates = FALSE, quote = "") 1124 | trans 1125 | # > transactions in sparse format with 1126 | # > 19403 transactions (rows) and 1127 | # > 3462 items (columns) 1128 | dim(trans) 1129 | # > [1] 19403 3462 1130 | inspect(trans[1:5]) 1131 | # > items transactionID 1132 | # > 1 {CreamCupidHeartsCoatHanger, 1133 | # > GlassStarFrostedT-lightHolder, 1134 | # > MetalLantern, 1135 | # > Set7BabushkaNestingBoxes, 1136 | # > WoollyHottieHeart.} 536365 1137 | # > 2 {HandWarmerPolkaDot} 536366 1138 | # > 3 {FeltcraftPrincessCharlotteDoll, 1139 | # > KnittedMugCosy, 1140 | # > LoveBuildingBlockWord, 1141 | # > Poppy'sPlayhouseKitchen} 536367 1142 | # > 4 {CoatRackParisFashion} 536368 1143 | # > 5 {CircusParadeLunchBox, 1144 | # > LunchBoxILoveLondon, 1145 | # > MiniJigsawSpaceboy, 1146 | # > PandaAndBunniesStickerSheet, 1147 | # > Postage, 1148 | # > RoundSnackBoxesSetOf4Woodland, 1149 | # > SpaceboyLunchBox, 1150 | # > ToadstoolLedNightLight} 536370 1151 | # Take a look at some of the frequently purchases items. Use the itemFrequencePlot function to see a simple barchart of the top item purchases. 1152 | dim(trans) 1153 | # > [1] 19403 3462 1154 | # look up any item in labels to see if it is there. 1155 | 1156 | itemFrequencyPlot(trans, topN = 10, cex.names = 1) 1157 | 1158 | # Insert Image: B05033_06_07.extension 1159 | # 1160 | # The itemLabels function lists all of the labels associated with the itemset. Since the top ranked item has an unusual abbreviation in it (T-light), you could check to see if there are other items which have that term in it. To accomplish this, use the grep function. 1161 | 1162 | result <- grep("T-light", itemLabels(trans), value = TRUE) 1163 | str(result) 1164 | # > chr [1:96] "6ChocolateLoveHeartT-lights" ... 1165 | head(result) 1166 | # > [1] "6ChocolateLoveHeartT-lights" "AgedGlassSilverT-lightHolder" 1167 | # > [3] "AntiqueSilverT-lightGlass" "AssortedColourT-lightHolder" 1168 | # > [5] "BeadedChandelierT-lightHolder" "BonneJamJarT-lightHolder" 1169 | # Apply the rules engine again. We will use a small support and confidence level to generate many rules. 1170 | rules1 <- apriori(trans, parameter = list(minlen = 1, support = 0.001, confidence = 0.001)) 1171 | # > Apriori 1172 | # > 1173 | # > Parameter specification: 1174 | # > confidence minval smax arem aval originalSupport support minlen maxlen 1175 | # > 0.001 0.1 1 none FALSE TRUE 0.001 1 10 1176 | # > target ext 1177 | # > rules FALSE 1178 | # > 1179 | # > Algorithmic control: 1180 | # > filter tree heap memopt load sort verbose 1181 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 1182 | # > 1183 | # > Absolute minimum support count: 19 1184 | # > 1185 | # > set item appearances ...[0 item(s)] done [0.00s]. 1186 | # > set transactions ...[3462 item(s), 19403 transaction(s)] done [0.03s]. 1187 | # > sorting and recoding items ... [2209 item(s)] done [0.01s]. 1188 | # > creating transaction tree ... done [0.01s]. 1189 | # > checking subsets of size 1 2 3 4 done [0.20s]. 1190 | # > writing ... [63121 rule(s)] done [0.02s]. 1191 | # > creating S4 object ... done [0.02s]. 
1192 | rules1 1193 | # > set of 63121 rules 1194 | # Sort the rules by the 3 measures support, confidence, lift to get an idea of some of the more valuable rules. Sort by confidence,support, and lift to look at the highest scoring rules in each category. 1195 | tmp <- as.data.frame(inspect(tail(sort(rules1, by = "lift")))) 1196 | # > lhs rhs 1197 | # > 38860 {AssortedColourBirdOrnament} => {StorageBagSuki} 1198 | # > 38861 {StorageBagSuki} => {AssortedColourBirdOrnament} 1199 | # > 38893 {BagRetrospot} => {AssortedColourBirdOrnament} 1200 | # > 38892 {AssortedColourBirdOrnament} => {BagRetrospot} 1201 | # > 11539 {AlarmClockBakelike} => {RexCash+carryShopper} 1202 | # > 11538 {RexCash+carryShopper} => {AlarmClockBakelike} 1203 | # > support confidence lift 1204 | # > 38860 0.001030768 0.02724796 0.8976097 1205 | # > 38861 0.001030768 0.03395586 0.8976097 1206 | # > 38893 0.001700768 0.03122044 0.8252999 1207 | # > 38892 0.001700768 0.04495913 0.8252999 1208 | # > 11539 0.001082307 0.01621622 0.7183636 1209 | # > 11538 0.001082307 0.04794521 0.7183636 1210 | tmp <- as.data.frame(inspect(head(sort(rules1, by = "support")))) 1211 | # > lhs rhs support confidence lift 1212 | # > 2207 {} => {HangingHeartT-lightHolder} 0.07256610 0.07256610 1 1213 | # > 2208 {} => {HeartOfWicker} 0.07004072 0.07004072 1 1214 | # > 2209 {} => {AlarmClockBakelike} 0.06674226 0.06674226 1 1215 | # > 2205 {} => {BagRetrospot} 0.05447611 0.05447611 1 1216 | # > 2201 {} => {RegencyCakesd3Tier} 0.05251765 0.05251765 1 1217 | # > 2190 {} => {PartyBunting} 0.04184920 0.04184920 1 1218 | tmp <- as.data.frame(inspect(head(sort(rules1, by = "confidence")))) 1219 | # > lhs rhs support confidence lift 1220 | # > 1 {PolkadotCup, 1221 | # > RetrospotCharlotteBag, 1222 | # > SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.001082307 1.0000000 14.98301 1223 | # > 2 {AlarmClockBakelike, 1224 | # > Charlie+lolaHotWaterBottle, 1225 | # > ChristmasGinghamTree} => {BabushkaNotebook} 0.001185384 0.9200000 48.24530 1226 | # > 3 {ChristmasHangingStarWithBell, 1227 | # > RegencyTeacupAndSaucer} => {AlarmClockBakelike} 0.001133845 0.9166667 13.73443 1228 | # > 4 {PolkadotBowl, 1229 | # > RetrospotCharlotteBag, 1230 | # > SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.001133845 0.9166667 13.73443 1231 | # > 5 {AlarmClockBakelikeChocolate, 1232 | # > PolkadotCup} => {AlarmClockBakelike} 0.001030768 0.9090909 13.62092 1233 | # > 6 {BabushkaNotebook, 1234 | # > Charlie+lolaHotWaterBottle, 1235 | # > HeartMeasuringSpoons} => {AlarmClockBakelike} 0.001339999 0.8965517 13.43304 1236 | # You can also coerce the rules to a dataframe and use kable to print the first 10 rows, or subset as you choose. 
1237 | rules1 <- sort(rules1, by = "confidence") 1238 | rules1.df <- as(rules1, "data.frame") 1239 | cat("using kable to print rules") 1240 | # > using kable to print rules 1241 | library(knitr) 1242 | kable(rules1.df[1:10, ]) 1243 | # 1244 | # Rules support confidence lift 1245 | # 62966 {PolkadotCup,RetrospotCharlotteBag,SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.0010823 1.0000000 14.98301 1246 | # 62971 {AlarmClockBakelike,Charlie+lolaHotWaterBottle,ChristmasGinghamTree} => {BabushkaNotebook} 0.0011854 0.9200000 48.24530 1247 | # 51467 {ChristmasHangingStarWithBell,RegencyTeacupAndSaucer} => {AlarmClockBakelike} 0.0011338 0.9166667 13.73443 1248 | # 62982 {PolkadotBowl,RetrospotCharlotteBag,SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.0011338 0.9166667 13.73443 1249 | # 51338 {AlarmClockBakelikeChocolate,PolkadotCup} => {AlarmClockBakelike} 0.0010308 0.9090909 13.62092 1250 | # 63058 {BabushkaNotebook,Charlie+lolaHotWaterBottle,HeartMeasuringSpoons} => {AlarmClockBakelike} 0.0013400 0.8965517 13.43304 1251 | # 62970 {BabushkaNotebook,Charlie+lolaHotWaterBottle,ChristmasGinghamTree} => {AlarmClockBakelike} 0.0011854 0.8846154 13.25420 1252 | # 62972 {AlarmClockBakelike,BabushkaNotebook,ChristmasGinghamTree} => {Charlie+lolaHotWaterBottle} 0.0011854 0.8846154 76.62586 1253 | # 51347 {AlarmClockBakelikeChocolate,DinerWallClock} => {AlarmClockBakelike} 0.0011338 0.8800000 13.18505 1254 | # 51356 {AlarmClockBakelikeChocolate,BoxOf24CocktailParasols} => {AlarmClockBakelike} 0.0011338 0.8800000 13.18505 1255 | # Plotting the rules 1256 | # The default plot of the rules will give you a scatterplot of all of the rules showing support on the x-axis and confidence on the y-axis. We can see from the density that confidence can vary from high to low with most of the density occurring at the .5 level. Support tends to be low, and the highest support level attains ~.07. 1257 | plot(rules1) 1258 | 1259 | # Insert Image: B05033_06_08.extension 1260 | # Creating subsets of the rules 1261 | # As we did before, we can look at some of the subsets by parsing left or right side. 1262 | # For example, we might be interested in seeing what items yielded purchasing Chocolate things. 1263 | # . Subset the rules set using the %pin% operator (Partial match), and look for any transactions where "Chocolate" appears in the right hand side. 1264 | 1265 | purchased.this <- "Chocolate" 1266 | 1267 | lhs.rules <- subset(rules1, subset = rhs %pin% purchased.this) 1268 | 1269 | # . Printing lhs.rules shows that there are 487 of them. 1270 | # 1271 | print(lhs.rules) 1272 | # > set of 487 rules 1273 | # 1274 | # . Sort them by lift, inspect them, and plot the first 15 as a graph. 1275 | # 1276 | lhs.rules <- sort(lhs.rules, by = "lift") 1277 | 1278 | inspect(head(sort(lhs.rules, by = "lift"))) 1279 | # > lhs rhs support confidence lift 1280 | # > 1 {CakeTowelSpots} => {CakeTowelChocolateSpots} 0.001185384 0.2911392 89.66626 1281 | # > 2 {BiscuitsBowlLight, 1282 | # > DollyMixDesignBowl} => {ChocolatesBowl} 0.001030768 0.4444444 68.98844 1283 | # > 3 {BakingMouldHeartChocolate} => {BakingMouldHeartMilkChocolate} 0.001030768 0.2941176 57.64409 1284 | # > 4 {BakingMouldHeartMilkChocolate} => {BakingMouldHeartChocolate} 0.001030768 0.2020202 57.64409 1285 | # > 5 {BiscuitsBowlLight} => {ChocolatesBowl} 0.001700768 0.3586957 55.67817 1286 | # > 6 {MarshmallowsBowl} => {ChocolatesBowl} 0.002422306 0.2397959 37.22208 1287 | # 1288 | # . 
The directional graph is a good way to illustrate which purchases influence other purchases once you have narrowed the itemsets down to a small number.
1289 | # . The DollyMixDesignBowl and MarshmallowsBowl appear as larger and darker bubbles, indicating that they are better predictors for a "Chocolate" purchase relative to support, confidence and lift.
1290 | plot(lhs.rules[1:15], method = "graph", control = list(type = "items", cex = 0.5))
1291 |
1292 | # Insert Image: B05033_06_09.extension
1293 | #
1294 | # . Finally, if you wish to save your workspace, use the save.image command.
1295 | save.image(file = "ch6 part 2.Rdata")
1296 |
1297 | # Text Clustering
1298 | # In the previous sections we used the lastword technique for categorizing the types of purchases by simple keywords. We could also use more sophisticated techniques, such as word clustering, to try to identify which types of purchasing clusters occur, and then use that to subset the association rules. To illustrate text clustering on our OnlineRetail dataset we will first need to load the training and test dataframes that we previously saved. Also, issue a set.seed command, since we will be doing some sampling later on.
1299 | #setwd("C:/Users/randy/Desktop/ch6")
1300 | # load the training data
1301 | #load("OnlineRetail.full.Rda")
1302 | set.seed(1)
1303 |
1304 | # We previously demonstrated some text mining examples using a package called RTextTools. Another popular text mining package is tm. tm has been around for a long time, and it will be useful to know how this package works. tm requires that all text data be converted to a corpus first. That can be done using the VCorpus function. We can use vector input, since we already have the data available in an existing data frame, and there is no need to read in additional external data.
1305 | library(tm)
1306 | # > Loading required package: NLP
1307 | attach(OnlineRetail)
1308 | nrow(OnlineRetail)
1309 | # > [1] 268034
1310 | corp <- VCorpus(VectorSource(OnlineRetail$Description))
1311 |
1312 | # Displaying the corp object shows you some information about the metadata
1313 |
1314 | head(corp)
1315 | # > <>
1316 | # > Metadata: corpus specific: 0, document level (indexed): 0
1317 | # > Content: documents: 6
1318 | # Converting to a Document Term Matrix
1319 | # Once we have a corpus, we can proceed to convert it to a document term matrix (DTM). When building a DTM, care must be given to limiting the amount of data and the number of resulting terms that are processed. If not parameterized correctly, it can take a very long time to run. Parameterization is accomplished via the control options. We will remove any stopwords, punctuation and numbers. Additionally, we will only include terms with a minimum word length of 4.
1320 |
1321 | library(tm)
1322 | dtm <- DocumentTermMatrix(corp, control = list(removePunctuation = TRUE, wordLengths = c(4,
1323 | 999), stopwords = TRUE, removeNumbers = TRUE, stemming = FALSE, bounds = list(global = c(5,
1324 | Inf))))
1325 | save.image(file = "OnlineRetail-dtm.Rdata")
1326 |
1327 |
1328 | # We can begin to look at the data by using the inspect function.
Note: this is different from the inspect function in arules, and if you have the arules package loaded, you will want to preface this inspect with tm::inspect 1329 | #RW put a save file here since it is the longest running part 1330 | 1331 | inspect(dtm[1:10, 1:10]) 1332 | # > <> 1333 | # > Non-/sparse entries: 0/100 1334 | # > Sparsity : 100% 1335 | # > Maximal term length: 8 1336 | # > Weighting : term frequency (tf) 1337 | # > 1338 | # > Terms 1339 | # > Docs abstract acapulco account acrylic address adult advent afghan aged 1340 | # > 1 0 0 0 0 0 0 0 0 0 1341 | # > 2 0 0 0 0 0 0 0 0 0 1342 | # > 3 0 0 0 0 0 0 0 0 0 1343 | # > 4 0 0 0 0 0 0 0 0 0 1344 | # > 5 0 0 0 0 0 0 0 0 0 1345 | # > 6 0 0 0 0 0 0 0 0 0 1346 | # > 7 0 0 0 0 0 0 0 0 0 1347 | # > 8 0 0 0 0 0 0 0 0 0 1348 | # > 9 0 0 0 0 0 0 0 0 0 1349 | # > 10 0 0 0 0 0 0 0 0 0 1350 | # > Terms 1351 | # > Docs ahoy 1352 | # > 1 0 1353 | # > 2 0 1354 | # > 3 0 1355 | # > 4 0 1356 | # > 5 0 1357 | # > 6 0 1358 | # > 7 0 1359 | # > 8 0 1360 | # > 9 0 1361 | # > 10 0 1362 | # After the DTM has been created, we can look at the metadata that has been producted by issuing a print(dtm) command. We can see the number of documents and terms by looking at the first line. 1363 | print(dtm) 1364 | # > <> 1365 | # > Non-/sparse entries: 826898/448130052 1366 | # > Sparsity : 100% 1367 | # > Maximal term length: 20 1368 | # > Weighting : term frequency (tf) 1369 | # Remove Sparse Terms 1370 | # Most TDM's are initially filled with a lot of empty space. That is because every word in a corpus is indexed, and there are many words which occur so infrequently, that they do not matter analytically. Removing sparse terms is a method in which we can reduce the number of terms to a manageable size, and also save space at the same time. 1371 | # The removeSparseTerms function will reduce the number of terms in the Description from 268034 to 62. 1372 | dtms <- removeSparseTerms(dtm, 0.99) 1373 | dim(dtms) 1374 | # > [1] 268034 62 1375 | 1376 | # As an alternative to inspect, we can also view it in matrix form 1377 | 1378 | View(as.matrix(dtms)) 1379 | 1380 | 1381 | 1382 | # Insert Image: B05033_06_10.extension 1383 | # Finding Frequent Terms 1384 | # The tm package has a useful function called findFreqTerms which is useful to find the frequency of the popular terms used. The second argument to the function restricts the results to terms which have a minimum frequency specified. We can also compute the occurrences by summing up the 1's and 0's for each term in the TDM. Then we can sort the list and display the highest and lowest frequency occurrences. 1385 | # 1386 | data.frame(findFreqTerms(dtms, 10000, Inf)) 1387 | # > findFreqTerms.dtms..10000..Inf. 
1388 | # > 1 cake 1389 | # > 2 christmas 1390 | # > 3 design 1391 | # > 4 heart 1392 | # > 5 metal 1393 | # > 6 retrospot 1394 | # > 7 vintage 1395 | freq <- colSums(as.matrix(dtms)) 1396 | # there are xx terms 1397 | length(freq) 1398 | # > [1] 62 1399 | ord <- order(freq) 1400 | # look at the top and bottom number of terms 1401 | freq[head(ord, 12)] 1402 | # > union skull zinc bird wood wall birthday 1403 | # > 2752 2770 2837 2974 2993 3042 3069 1404 | # > colour charlotte star antique silver 1405 | # > 3089 3114 3121 3155 3175 1406 | freq[tail(ord, 10)] 1407 | # > hanging sign lunch metal cake christmas design 1408 | # > 8437 8580 9107 10478 10623 12534 14884 1409 | # > vintage retrospot heart 1410 | # > 16755 17445 19520 1411 | # 1412 | # For presentation purposes, a barplot is also useful for displaying the relative frequencies. 1413 | # 1414 | barplot(freq[tail(ord, 10)], cex.names = 0.75, col = c("blue")) 1415 | 1416 | # B05033_06_11.pngWe could also do a little code manipulation to only display the topN most frequent term. 1417 | 1418 | dtmx <- dtms[, names(tail(sort(colSums(as.matrix(dtms))), 12))] 1419 | inspect(dtmx[1:10, ]) 1420 | # > <> 1421 | # > Non-/sparse entries: 3/117 1422 | # > Sparsity : 98% 1423 | # > Maximal term length: 9 1424 | # > Weighting : term frequency (tf) 1425 | # > 1426 | # > Terms 1427 | # > Docs pack holder hanging sign lunch metal cake christmas design vintage 1428 | # > 1 0 0 0 0 0 1 0 0 0 0 1429 | # > 2 0 1 0 0 0 0 0 0 0 0 1430 | # > 3 0 0 0 0 0 0 0 0 0 0 1431 | # > 4 0 0 0 0 0 0 0 0 0 0 1432 | # > 5 0 0 0 0 0 0 0 0 0 0 1433 | # > 6 0 0 0 0 0 0 0 0 0 0 1434 | # > 7 0 0 0 0 0 0 0 0 0 0 1435 | # > 8 0 0 0 0 0 0 0 0 0 0 1436 | # > 9 0 0 0 0 0 0 0 0 0 0 1437 | # > 10 0 0 0 0 0 0 0 0 0 0 1438 | # > Terms 1439 | # > Docs retrospot heart 1440 | # > 1 0 0 1441 | # > 2 0 0 1442 | # > 3 0 0 1443 | # > 4 0 1 1444 | # > 5 0 0 1445 | # > 6 0 0 1446 | # > 7 0 0 1447 | # > 8 0 0 1448 | # > 9 0 0 1449 | # > 10 0 0 1450 | # Kmeans clustering of terms 1451 | # Now we can cluster the Term Document Matrix using Kmeans. We will specify that 5 clusters be generated. 1452 | 1453 | kmeans5 <- kmeans(dtms, 5) 1454 | 1455 | # Once kmeans is done, we will append the cluster number to the original data, and then create 5 subsets based upon the cluster. 1456 | 1457 | kw_with_cluster <- as.data.frame(cbind(OnlineRetail, Cluster = kmeans5$cluster)) 1458 | 1459 | # subset the five clusters 1460 | cluster1 <- subset(kw_with_cluster, subset = Cluster == 1) 1461 | cluster2 <- subset(kw_with_cluster, subset = Cluster == 2) 1462 | cluster3 <- subset(kw_with_cluster, subset = Cluster == 3) 1463 | cluster4 <- subset(kw_with_cluster, subset = Cluster == 4) 1464 | cluster5 <- subset(kw_with_cluster, subset = Cluster == 5) 1465 | # Examining Cluster 1 1466 | # 1467 | # Print out a sample of the data. 1468 | 1469 | head(cluster1[10:13]) 1470 | # Desc2 lastword firstword Cluster 1471 | # 50 VintageBillboardLove/hateMug MUG VINTAGE 1 1472 | # 86 BagVintagePaisley PAISLEY BAG 1 1473 | # 113 ShopperVintagePaisley PAISLEY SHOPPER 1 1474 | # 145 ShopperVintagePaisley PAISLEY SHOPPER 1 1475 | # 200 VintageHeadsAndTailsCardGame GAME VINTAGE 1 1476 | # 210 PaperChainKitVintageChristmas CHRISTMAS PAPER 1 1477 | # 1478 | # Table the frequencies and print out the most popular terms in the clusters. Observe that many of these items have to do with Chrismas and Paisley items, which seem to occur together. 
1479 | # 1480 | # tail(sort(table(cluster1$lastword)), 10) 1481 | # MUG GAME BUNTING CARDS DESIGN LEAF 1482 | # 427 431 456 482 535 717 911 1483 | # DOILY PAISLEY CHRISTMAS 1484 | # 1073 1699 1844 1485 | # 1486 | # Examining Cluster 2 1487 | # 1488 | # After tabling Cluster 2 below, it looks like this cluster has something to do with "Hanging Holders" 1489 | head(cluster2[10:13]) 1490 | # Desc2 lastword firstword Cluster 1491 | # 6 GlassStarFrostedT-lightHolder HOLDER GLASS 2 1492 | # 57 HangingHeartT-lightHolder HOLDER HANGING 2 1493 | # 62 GlassStarFrostedT-lightHolder HOLDER GLASS 2 1494 | # 70 HangingHeartT-lightHolder HOLDER HANGING 2 1495 | # 81 HangingHeartT-lightHolder HOLDER HANGING 2 1496 | # 156 ColourGlassT-lightHolderHanging HANGING COLOUR 2 1497 | # 1498 | 1499 | tail(sort(table(cluster2$lastword)), 10) 1500 | # ANTIQUE LANTERN HLDR DECORATION GLASS T-LIGHT HANGING HEART HOLDER 1501 | # 167 181 226 260 319 361 531 639 668 6792 1502 | # 1503 | # Examining Cluster 3 1504 | # 1505 | # Cluster 3 may have to do with customers purchasing sets of design boxes. 1506 | 1507 | head(cluster3[10:13]) 1508 | # Desc2 lastword firstword Cluster 1509 | # 5 MetalLantern LANTERN METAL 3 1510 | # 2 Set7BabushkaNestingBoxes BOXES SET 3 1511 | # 4 WoollyHottieHeart. HEART. WOOLLY 3 1512 | # 1 CreamCupidHeartsCoatHanger HANGER CREAM 3 1513 | # 8 HandWarmerPolkaDot DOT HAND 3 1514 | # 10 FeltcraftPrincessCharlotteDoll DOLL FELTCRAFT 3 1515 | 1516 | tail(sort(table(cluster3$lastword)),10) 1517 | head(cluster4[10:13]) 1518 | # Desc2 lastword firstword Cluster 1519 | # 100 LunchBoxWithCutleryRetrospot RETROSPOT LUNCH 4 1520 | # 102 PackOf72RetrospotCakeCases CASES PACK 4 1521 | # 84 60TeatimeFairyCakeCases CASES 60 4 1522 | # 94 3PieceRetrospotCutlerySet SET 3 4 1523 | # 91 LunchBagRetrospot RETROSPOT LUNCH 4 1524 | # 127 RetrospotMilkJug JUG RETROSPOT 4 1525 | 1526 | tail(sort(table(cluster4$lastword)), 10) 1527 | # APRON NAPKINS BANK PC TINS DESIGN 1528 | # 486 511 512 514 531 664 1203 1529 | # BAG CASES RETROSPOT 1530 | # 1395 4318 6485 1531 | # Examining Cluster 5 1532 | # 1533 | # Finally, Cluster 5 seems to be concerned with the purchases of Bottles. Possible having to do with perfumes, elixir's and tonics. 1534 | 1535 | head(cluster5) 1536 | # > head(cluster5[10:13]) 1537 | # Desc2 lastword firstword Cluster 1538 | # 59 KnittedUnionFlagHotWaterBottle BOTTLE KNITTED 5 1539 | # 68 KnittedUnionFlagHotWaterBottle BOTTLE KNITTED 5 1540 | # 181 AssortedBottleTopMagnets MAGNETS ASSORTED 5 1541 | # 206 EnglishRoseHotWaterBottle BOTTLE ENGLISH 5 1542 | # 229 RetrospotHeartHotWaterBottle BOTTLE RETROSPOT 5 1543 | # 261 HotWaterBottleTeaAndSympathy SYMPATHY HOT 5 1544 | # 1545 | 1546 | tail(sort(table(cluster5$lastword)), 10) 1547 | 1548 | # BABUSHKA MAGNETS TONIC ELIXIR PERFUME OPENER SYMPATHY POORLY CALM BOTTLE 1549 | # 87 95 108 112 132 149 318 345 418 3795 1550 | # Creating a DTM from the test data set, and calculating the new cluster assignments 1551 | # The goal in this exercise is to score the test dataset based upon the predict method for the training dataset. 1552 | # First, let's read in our test data set and create a document term matrix. 
In this example we will use the create_matrix function from RTextTools which can create a TDM first without having a separate step of creating a corpus 1553 | 1554 | #load("OnlineRetail.test.Rda") 1555 | library(RTextTools) 1556 | # > Loading required package: SparseM 1557 | # > 1558 | # > Attaching package: 'SparseM' 1559 | # > The following object is masked from 'package:base': 1560 | # > 1561 | # > backsolve 1562 | dtMatrixTest <- create_matrix(OnlineRetail.test$Description, minDocFreq = 1, 1563 | removeNumbers = TRUE, minWordLength = 4, removeStopwords = TRUE, removePunctuation = TRUE, 1564 | stemWords = FALSE, weighting = weightTf) 1565 | 1566 | # As we did before for the training dataset, remove the sparse terms. 1567 | dtMatrixTest <- removeSparseTerms(dtMatrixTest, 0.99) 1568 | dtMatrixTest 1569 | # > <> 1570 | # > Non-/sparse entries: 349663/16000411 1571 | # > Sparsity : 98% 1572 | # > Maximal term length: 10 1573 | # > Weighting : term frequency (tf) 1574 | # kmeans function does not have a prediction method. However we can use the flexclust package which does. Since the prediction method can take a long time to run, we will illustrate it only on a sample number of rows and columns. In order to compare the test and training results, they also need to have the same number of columns. Reload the test and training data we have saved, and create the TDM's from scratch, based upon the sample.size 1575 | 1576 | set.seed(1) 1577 | sample.size <- 10000 1578 | max.cols <- 10 1579 | 1580 | library("flexclust") 1581 | #load("OnlineRetail.full.Rda") 1582 | 1583 | 1584 | # Create the sample 1585 | OnlineRetail <- OnlineRetail[1:sample.size, ] 1586 | require(tm) 1587 | library(RTextTools) 1588 | # Create the DTM for the training data 1589 | 1590 | dtMatrix <- create_matrix(OnlineRetail$Description, minDocFreq = 1, removeNumbers = TRUE, 1591 | minWordLength = 4, removeStopwords = TRUE, removePunctuation = TRUE, stemWords = FALSE, 1592 | weighting = weightTf) 1593 | # Check the dimensions of the data. We can see that there are 1300 terms. Virtually all of them are space terms so we will remove them from the matrix. 1594 | 1595 | dim(dtMatrix) 1596 | # > [1] 10000 1300 1597 | 1598 | dtMatrix <- removeSparseTerms(dtMatrix, 0.99) 1599 | 1600 | # Removing the sparse terms reduces the number of columns from 1300 to 62! 1601 | 1602 | dim(dtMatrix) 1603 | # > [1] 10000 62 1604 | # We will only keep same number of terms (max.cols) in test and train. 1605 | 1606 | dtMatrix <- dtMatrix[, 1:max.cols] 1607 | 1608 | # repeat kmeans using the kcca function. 
Clusters=5 1609 | 1610 | clust1 = kcca(dtMatrix, k = 5, kccaFamily("kmeans")) 1611 | clust1 1612 | # > kcca object of family 'kmeans' 1613 | # > 1614 | # > call: 1615 | # > kcca(x = dtMatrix, k = 5, family = kccaFamily("kmeans")) 1616 | # > 1617 | # > cluster sizes: 1618 | # > 1619 | # > 1 2 3 4 5 1620 | # > 360 120 152 387 8981 1621 | # 1622 | # 1623 | # Print the number of products categorized in each cluster 1624 | 1625 | table(clust1@cluster) 1626 | # > 1627 | # > 1 2 3 4 5 1628 | # > 360 120 152 387 8981 1629 | # 1630 | # Merge the clusters with the training data, and show the cluster assigned to each 1631 | 1632 | kw_with_cluster2 <- as.data.frame(cbind(OnlineRetail, Cluster = clust1@cluster)) 1633 | 1634 | head(kw_with_cluster2) 1635 | # > InvoiceNo StockCode Description Quantity 1636 | # > 5 536365 71053 METAL LANTERN 6 1637 | # > 6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 1638 | # > 2 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 1639 | # > 4 536365 84029E WOOLLY HOTTIE HEART. 6 1640 | # > 1 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 1641 | # > 8 536366 22632 HAND WARMER POLKA DOT 6 1642 | # > InvoiceDate UnitPrice CustomerID Country itemcount 1643 | # > 5 12/1/2010 8:26 3.39 17850 United Kingdom 7 1644 | # > 6 12/1/2010 8:26 4.25 17850 United Kingdom 7 1645 | # > 2 12/1/2010 8:26 7.65 17850 United Kingdom 7 1646 | # > 4 12/1/2010 8:26 3.39 17850 United Kingdom 7 1647 | # > 1 12/1/2010 8:26 2.75 17850 United Kingdom 7 1648 | # > 8 12/1/2010 8:28 1.85 17850 United Kingdom 2 1649 | # > Desc2 lastword firstword Cluster 1650 | # > 5 MetalLantern LANTERN METAL 5 1651 | # > 6 GlassStarFrostedT-lightHolder HOLDER GLASS 5 1652 | # > 2 Set7BabushkaNestingBoxes BOXES SET 5 1653 | # > 4 WoollyHottieHeart. HEART. WOOLLY 5 1654 | # > 1 CreamCupidHeartsCoatHanger HANGER CREAM 5 1655 | # > 8 HandWarmerPolkaDot DOT HAND 5 1656 | # Run the predict method on the training set. We will eventually apply it to the test data 1657 | pred_train <- predict(clust1) 1658 | 1659 | # Load the test data set, take an identical sample size as was taken for the training data, and repeat the procedure starting with creating the term document matrix on the sample. 
1660 | # 1661 | #load("OnlineRetail.test.Rda") 1662 | OnlineRetail.test <- OnlineRetail.test[1:sample.size, ] 1663 | dtMatrix.test <- create_matrix(OnlineRetail.test$Description, minDocFreq = 1, 1664 | removeNumbers = TRUE, minWordLength = 4, removeStopwords = TRUE, removePunctuation = TRUE, 1665 | stemWords = FALSE, weighting = weightTf) 1666 | # Remove sparse terms 1667 | dtMatrix.test <- removeSparseTerms(dtMatrix.test, 0.99) 1668 | 1669 | # reduced to 61 terms 1670 | dim(dtMatrix.test) 1671 | # > [1] 10000 61 1672 | # Take the first max.col terms 1673 | dtMatrix.test <- dtMatrix.test[, 1:max.cols] 1674 | 1675 | dtMatrix.test 1676 | # > <> 1677 | # > Non-/sparse entries: 2072/97928 1678 | # > Sparsity : 98% 1679 | # > Maximal term length: 8 1680 | # > Weighting : term frequency (tf) 1681 | # Verify that the test and training data have the same number of dimensions 1682 | dim(dtMatrix) 1683 | # > [1] 10000 10 1684 | dim(dtMatrix.test) 1685 | # > [1] 10000 10 1686 | # Run prediction function on training data, and apply it to the test data 1687 | 1688 | 1689 | pred_test <- predict(clust1, newdata = dtMatrix.test) 1690 | 1691 | # Table the cluster assignments for the test dat 1692 | 1693 | table(pred_test) 1694 | # > pred_test 1695 | # > 1 2 3 4 5 1696 | # > 171 113 201 146 9369 1697 | # 1698 | # Finally, merge the clusters with the test data, and show the cluster categories assigned to each. For this demonstration, display two transactions for each cluster 1699 | kw_with_cluster2_score <- as.data.frame(cbind(OnlineRetail.test, Cluster=pred_test)) 1700 | head(kw_with_cluster2_score) 1701 | clust1.score=head(subset(kw_with_cluster2_score,Cluster==1),2) 1702 | clust2.score=head(subset(kw_with_cluster2_score,Cluster==2),2) 1703 | clust3.score=head(subset(kw_with_cluster2_score,Cluster==3),2) 1704 | clust4.score=head(subset(kw_with_cluster2_score,Cluster==4),2) 1705 | clust5.score=head(subset(kw_with_cluster2_score,Cluster==5),2) 1706 | head(clust1.score[,10:13]) 1707 | head(clust2.score[,10:13]) 1708 | head(clust3.score[,10:13]) 1709 | head(clust4.score[,10:13]) 1710 | head(clust5.score[,10:13]) 1711 | 1712 | # > head(clust1.score[,10:13]) 1713 | # Desc2 lastword firstword Cluster 1714 | # 89 PackOf60PaisleyCakeCases CASES PACK 1 1715 | # 96 PackOf60DinosaurCakeCases CASES PACK 1 1716 | # > head(clust2.score[,10:13]) 1717 | # Desc2 lastword firstword Cluster 1718 | # 61 WoodenFrameAntique WOODEN 2 1719 | # 140 AntiqueGlassDressingTablePot POT ANTIQUE 2 1720 | # > head(clust3.score[,10:13]) 1721 | # Desc2 lastword firstword Cluster 1722 | # 148 3TierCakeTinAndCream CREAM 3 3 1723 | # 143 3TierCakeTinAndCream CREAM 3 3 1724 | # > head(clust4.score[,10:13]) 1725 | # Desc2 lastword firstword Cluster 1726 | # 126 ZincWillieWinkieCandleStick STICK ZINC 4 1727 | # 488 LoveBirdCandle CANDLE LOVE 4 1728 | # > head(clust5.score[,10:13]) 1729 | # Desc2 lastword firstword Cluster 1730 | # 3 HangingHeartT-lightHolder HOLDER HANGING 5 1731 | # 7 KnittedUnionFlagHotWaterBottle BOTTLE KNITTED 5 1732 | # 1733 | # Running the aprior algorithm on the clusters 1734 | # Circling back to the arules algorithm, we can use the predicted clusters which were generated, instead of "lastword" in order to develop some rules. 1735 | # . We will use the "coerce to dataframe" method to generate the transaction file as previously generated 1736 | # . Create a rules_clust algorithm which builds association rules based upon the itemset of clusters {1,2,3,4,5} 1737 | # . Inspect some of the generated rules by lift. 
1738 | library(arules) 1739 | colnames(kw_with_cluster2_score) 1740 | kable(head(kw_with_cluster2_score[,c(1,13)],5)) 1741 | tmp <- data.frame(kw_with_cluster2_score[,1],kw_with_cluster2_score[,13]) 1742 | names(tmp) [1] <- "TransactionID" 1743 | names(tmp) [2] <- "Items" 1744 | tmp <- unique(tmp) 1745 | trans4 <- as(split(tmp[,2], tmp[,1]), "transactions") 1746 | rules_clust <- apriori(trans4,parameter = list(minlen=2,support = 0.02,confidence = 0.01)) 1747 | summary(rules_clust) 1748 | #RW new arules::inspect 1749 | tmp <- as.data.frame(arules::inspect( head(sort(rules_clust, by="lift"),10))) 1750 | 1751 | # > tmp <- as.data.frame(inspect( head(sort(rules_clust, by="lift"),10) ) ) 1752 | # lhs rhs support confidence lift 1753 | # 22 {2,5} => {4} 0.03065693 0.3088235 3.022059 1754 | # 1 {2} => {4} 0.03065693 0.3043478 2.978261 1755 | # 2 {4} => {2} 0.03065693 0.3000000 2.978261 1756 | # 23 {4,5} => {2} 0.03065693 0.3000000 2.978261 1757 | # 32 {1,5} => {4} 0.02773723 0.2087912 2.043171 1758 | # 9 {4} => {1} 0.02773723 0.2714286 2.020963 1759 | # 10 {1} => {4} 0.02773723 0.2065217 2.020963 1760 | # 31 {4,5} => {1} 0.02773723 0.2714286 2.020963 1761 | # 35 {3,5} => {4} 0.03357664 0.1965812 1.923687 1762 | # 11 {4} => {3} 0.03357664 0.3285714 1.891357 1763 | # Summarizing the metrics 1764 | # Running a summary on the rules_clust object indicates an average support of .05, and average confidence of .43 1765 | # This demonstrates that using clustering can be a viable way to develop association rules, and reduced resources and the number of dimensions at the same time. 1766 | # 1767 | # support confidence lift 1768 | # Min. :0.02044 Min. :0.09985 Min. :0.989 1769 | # 1st Qu.:0.02664 1st Qu.:0.19816 1st Qu.:1.006 1770 | # Median :0.03066 Median :0.27143 Median :1.526 1771 | # Mean :0.05040 Mean :0.43040 Mean :1.608 1772 | # 3rd Qu.:0.04234 3rd Qu.:0.81954 3rd Qu.:1.891 1773 | # Max. :0.17080 Max. :1.00000 Max. :3.022 1774 | # 1775 | # 1776 | # Summary 1777 | 1778 | -------------------------------------------------------------------------------- /Chapter07/ch 7 codebase.R: -------------------------------------------------------------------------------- 1 | # Ch 7. Exploring Health Care Enrollment Data as a Time Series 2 | # "I have seen the future and it is very much like the present, only longer." 3 | # --Kehlog Albran, The Profit 4 | # Time Series Data 5 | # Time Series data is usually a set of ordered data collected over equally spaced intervals. Time series data occurs in most business and scientific disciplines, and the data is closely tied to the concept of forecasting, which uses previously measured data points to predict future data points based upon a specific statistical model. 6 | # Time series data differs from the kind of data that we have been looking at previously, because it is a set of ordered data points, it can contain components such as trend, seasonality, and autocorrelation, which have no meaning in other kinds of analysis. 7 | # Usually time series data is collected in equally spaced intervals, such as days, weeks, quarters, or years, but that is not always the case. Measurement of events such as natural disasters are a prime example. In some cases you can transform uneven data into equally spaced data, Often data needs to be prepped in order to have it conform to a certain model. In other cases, you can use specialized techniques such as Croston's method for forecasting intermittent, or unexpected demand for goods and services, in certain cases. 
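# As a brief aside (not part of the original example), Croston's method mentioned above is available
# via the croston() function in the forecast package. A minimal sketch follows; the demand series is
# made up purely for illustration -- an intermittent series with many zero-demand periods.
library(forecast)
intermittent.demand <- ts(c(0, 2, 0, 0, 1, 0, 3, 0, 0, 0, 2, 0))
# croston() smooths the non-zero demand sizes and the intervals between them separately,
# and combines the two into a flat demand-rate forecast for the next h periods
croston(intermittent.demand, h = 3)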
8 | # Exploring Time Series data
9 | # Many time series studies start out by exploring one data metric which has been measured across equally spaced time intervals. From a data science perspective, we could be interested in identifying various segments of a time series which could be exhibiting interesting trends, cycles, or seasonal patterns. So, we always begin a time series analysis by looking at the data graphically, and producing aggregate measures, before proceeding with the modeling.
10 | # Health Insurance Coverage Data Set
11 | # We will start by reading in a data set which contains Health Care enrollment data over a period for several categories. This data has been sourced from "Table HIB-2. Health Insurance Coverage Status and Type of Coverage All Persons by Age and Sex: 1999 to 2012" and it is available from the Census Bureau website at:
12 | # http://www.census.gov/data/tables/time-series/demo/health-insurance/historical-series/hib.html
13 | # This table shows the number of people covered by government and private insurance, as well as the number of people not covered.
14 | # This table has several embedded time series across all of the 14 years represented. 14 data points would not be considered an extremely long time series; however, we will use this data to demonstrate how we can comb through many time series at once. Since it IS small, it will be easy enough to verify the results via visual inspection and printing subsets of the data. As you become familiar with the methodology, it will enable you to expand to larger, more complex datasets with more data points, in order to isolate the most significant trends.
15 | # Housekeeping
16 | # As we have done in other chapters, we will first clear the workspace, and set our working directory. Obviously, change the setwd to the path that you are using to store your files.
17 | rm(list = ls())
18 | #RW new
19 | setwd("C:/PracticalPredictiveAnalytics/Data")
20 |
21 | # Read the data in
22 | # Next we will read in a few rows of the file (using the nrow parameter), and then run a str() function on the input to see which variables are contained within the file. There are several metrics in the file related to the different types of enrollment. We will just concentrate on the total enrollment metrics, and not utilize some of the others, even though they probably produce separate insights.
23 | x <- read.csv("hihist2bedit.csv", nrow = 10)
24 | str(x)
25 | # > 'data.frame': 10 obs. of 13 variables:
26 | # > $ Year : Factor w/ 10 levels "2003","2004 (4)",..: 10 9 8 7 6 5 4 3 2 1
27 | # > $ Year.1 : int 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003
28 | # > $ Total.People: num 311116 308827 306553 304280 301483 ...
29 | # > $ Total : num 263165 260214 256603 255295 256702 ...
30 | # > $ pritotal : num 198812 197323 196147 196245 202626 ...
31 | # > $ priemp : num 170877 170102 169372 170762 177543 ...
32 | # > $ pridirect : num 30622 30244 30347 29098 28513 ...
33 | # > $ govtotal : num 101493 99497 95525 93245 87586 ...
34 | # > $ govmedicaid : num 50903 50835 48533 47847 42831 ...
35 | # > $ govmedicare : num 48884 46922 44906 43434 43031 ...
36 | # > $ govmilitary : num 13702 13712 12927 12414 11562 ...
37 | # > $ Not.Covered : num 47951 48613 49951 48985 44780 ...
38 | # > $ cat : Factor w/ 1 level "ALL AGES": 1 1 1 1 1 1 1 1 1 1
39 | #
40 | # Subsetting the columns
41 | # For this exercise, we will be using a restricted set of columns from the csv file.
We can either select the specific columns from the dataframe just read in (if we just read in the whole file), or reread the csv file using the colClasses parameter to only read the columns that are required. This method is often preferable when you are reading a large file. Here, colClasses will instruct read.csv to only retain the first four and the last two columns, and ignore the columns "pritotal" through "govmilitary".
42 | # After re-reading the file with a subset of the columns, we print a few records from the beginning and end of the file. We can do this using a combination of the rbind, head and tail functions. This will give us all of the columns we will be using for this chapter, except for some columns which we will derive in the next section.
43 | x <- read.csv("hihist2bedit.csv", colClasses = c(NA,
44 | NA, NA, NA, rep("NULL", 7)))
45 |
46 | rbind(head(x), tail(x))
47 | # > Year Year.1 Total.People Total Not.Covered
48 | # > 1 2012 2012 311116.15 263165.47 47950.6840
49 | # > 2 2011 2011 308827.25 260213.79 48613.4625
50 | # > 3 2010 (10) 2010 306553.20 256602.70 49950.5004
51 | # > 4 2009 2009 304279.92 255295.10 48984.8204
52 | # > 5 2008 2008 301482.82 256702.42 44780.4031
53 | # > 6 2007 2007 299105.71 255017.52 44088.1840
54 | # > 331 2004 (4) 2004 20062.67 19804.54 258.1313
55 | # > 332 2003 2003 19862.49 19615.92 246.5703
56 | # > 333 2002 2002 19705.99 19484.01 221.9879
57 | # > 334 2001 2001 19533.99 19354.19 179.8033
58 | # > 335 2000 (3) 2000 19450.52 19250.63 199.8871
59 | # > 336 1999 (2) 1999 19378.56 19189.17 189.3922
60 | # > cat
61 | # > 1 ALL AGES
62 | # > 2 ALL AGES
63 | # > 3 ALL AGES
64 | # > 4 ALL AGES
65 | # > 5 ALL AGES
66 | # > 6 ALL AGES
67 | # > 331 FEMALE 65 YEARS AND OVER
68 | # > 332 FEMALE 65 YEARS AND OVER
69 | # > 333 FEMALE 65 YEARS AND OVER
70 | # > 334 FEMALE 65 YEARS AND OVER
71 | # > 335 FEMALE 65 YEARS AND OVER
72 | # > 336 FEMALE 65 YEARS AND OVER
73 | # Description of the data
74 | # Year and Year.1 (columns 1 and 2)
75 | # Year is the year for which the annual enrollment figures are taken. You will notice that Year appears twice, in column 1 (as a factor) and then again in column 2 (integer). This is because the data has been previously preprocessed, and appears twice merely for convenience, since there are certain instances in which we will prefer to use a factor, and other instances in which we prefer to use an integer. The numbers in parentheses in Year refer to footnotes in the original data source. While you could always create integers from factors and vice versa in the code, this saves valuable processing time if certain transformations can be made available beforehand.
76 | # Total People (column 3)
77 | # Total People is the population size of the category. They may either be enrolled for health coverage (Total), or not (Not.Covered).
78 | # Total (Column 4)
79 | # Total is the number of people who were enrolled for health coverage for that year and in the category "cat".
80 | # Not.Covered (Column 5)
81 | # Not.Covered is the number of people NOT enrolled in the specified year and category.
82 | # Cat (Column 6)
83 | # Cat is the time series subset.
84 | # This column, along with year, defines the particular row. It defines the specific demographic data for enrollment for that year.
85 | # The "ALL AGES" category represents the entire population for the specified year. All of the other subsets in the file should roll up to this category when totaled together.
86 | # For example, the last category (printed as part of tail()) represents Females Over 65, which is a subset of the ALL AGES category.
87 | # Target Time series variable
88 | # The variable that we will look at initially is "Not.Covered". We will be interested in examining any possible enrollment trends using this variable. Since the population size will differ depending upon the category, we will calculate the percentage of people not covered in a given year by dividing the raw number corresponding to this variable by the total population for that category. This will give us a new variable named "Not.Covered.Pct". This will also standardize the metric across the different sized categories, large and small, and enable us to compare them.
89 | # After calculating the variable, we can print the first few records, and also print some summary statistics for this one variable.
90 | x$Not.Covered.Pct <- x$Not.Covered/x$Total.People
91 | head(x)
92 | # > Year Year.1 Total.People Total Not.Covered cat
93 | # > 1 2012 2012 311116.2 263165.5 47950.68 ALL AGES
94 | # > 2 2011 2011 308827.2 260213.8 48613.46 ALL AGES
95 | # > 3 2010 (10) 2010 306553.2 256602.7 49950.50 ALL AGES
96 | # > 4 2009 2009 304279.9 255295.1 48984.82 ALL AGES
97 | # > 5 2008 2008 301482.8 256702.4 44780.40 ALL AGES
98 | # > 6 2007 2007 299105.7 255017.5 44088.18 ALL AGES
99 | # > Not.Covered.Pct
100 | # > 1 0.1541247
101 | # > 2 0.1574131
102 | # > 3 0.1629424
103 | # > 4 0.1609860
104 | # > 5 0.1485338
105 | # > 6 0.1474000
106 | #
107 | summary(x$Not.Covered.Pct)
108 | #
109 | # Min. 1st Qu. Median Mean 3rd Qu. Max.
110 | # 0.009205 0.109400 0.145300 0.154200 0.210400 0.325200
111 | # #
112 | #
113 | # As is typical with a target variable, we also like to see basic plots showing the distribution of the variable.
114 | #
115 | #
116 | hist(x$Not.Covered.Pct)
117 | # Saving the data
118 | # This may be a good time to save the current data to a file, since we may want to read it back later, starting from the analysis stage. This will avoid having to reread the same file again, unless, of course, the source data changes.
119 | setwd("C:/PracticalPredictiveAnalytics/Outputs")
120 |
121 | save(x, file = "x.RData")
122 | # Determine all of the Subset Groups
123 | # Since we have only looked at parts of the file (via head or tail), we do not know how many categories there are and how they differ in terms of health care coverage. So we will start off by looking at some of the groupings.
124 | # In previous chapters we have used SQL and the aggregate function to group data. For this example we will use dplyr. One advantage of dplyr is that it can also be used with 'pipe' syntax, which allows the result of one function to be passed to the next function without intermediate assignments.
125 | library(dplyr)
126 | # >
127 | # > Attaching package: 'dplyr'
128 | # > The following objects are masked from 'package:stats':
129 | # >
130 | # > filter, lag
131 | # > The following objects are masked from 'package:base':
132 | # >
133 | # > intersect, setdiff, setequal, union
134 | # # str(x)
135 | #
136 | #
137 | # The by.cat object will show the average number insured, and the average total population, for each category. Remember, this data is also grouped by year; however, we just want to get a sense of what the averages are across all of the years for now. Since the arrange function will end up sorting the data by the Total population sizes, from largest to smallest, we can see that the numbers line up as expected.
138 | # 1. "ALL AGES" is the largest category, 139 | # 2. followed by "FEMALE ALL AGES", 140 | # 3. and then "MALE ALL AGES". 141 | # As a sanity check, If you add the totals for the latter 2 categories, you can see that they sum to the "ALL AGES" category. 142 | # From the console, we can see from the str(by.cat) function that there are 24 categories. 143 | by.cat <- x %>% select(cat, Total, Total.People) %>% group_by(cat) %>% summarise(Avg.Total.Insured = mean(Total), 144 | Avg.People = mean(Total.People)) %>% arrange(desc(Avg.People)) 145 | 146 | str(by.cat) 147 | # > Classes 'tbl_df', 'tbl' and 'data.frame': 24 obs. of 3 variables: 148 | # > $ cat : Factor w/ 24 levels "18 to 24 YEARS",..: 7 14 22 24 3 4 2 23 6 15 ... 149 | # > $ Avg.Total.Insured: num 251243 130201 121042 66200 34762 ... 150 | # > $ Avg.People : num 294700 150330 144371 73752 42433 ... 151 | by.cat 152 | # > Source: local data frame [24 x 3] 153 | # > 154 | # > cat Avg.Total.Insured Avg.People 155 | # > (fctr) (dbl) (dbl) 156 | # > 1 ALL AGES 251242.96 294700.47 157 | # > 2 FEMALE ALL AGES 130200.90 150329.73 158 | # > 3 MALE ALL AGES 121042.06 144370.74 159 | # > 4 UNDER 18 YEARS 66200.46 73752.50 160 | # > 5 35 to 44 YEARS 34761.74 42433.12 161 | # > 6 45 to 54 YEARS 35911.82 42100.20 162 | # > 7 25 to 34 YEARS 29973.91 39942.64 163 | # > 8 MALE UNDER 18 YEARS 33832.87 37700.70 164 | # > 9 65 YEARS AND OVER 36199.32 36722.61 165 | # > 10 FEMALE UNDER 18 YEARS 32367.59 36051.79 166 | # > .. ... ... ... 167 | # Merging the aggregate data back into the original data 168 | # Often, you will want to augment your original data with some of the calculated data as derived above. In these cases, you can merge the data back into the original data using a common key. Again, we will use dplyr to take the results just obtained (by.cat), and join it back to the original data (x), using the common key "cat". 169 | # We will be using a left_join just for an example, however we could have used a right join to obtain the same results, since by.cat was completely derived from 'x'. After joining the two dataframes, we will end up with a new dataframe named 'x2' 170 | # # Merge the summary measures back into the original data. Merge by cat. 171 | # 172 | #RW RM NOT NEEDED 173 | rm(x2) 174 | x2 <- by.cat %>% left_join(x, by = "cat") 175 | head(x2) 176 | # > Source: local data frame [6 x 9] 177 | # > 178 | # > cat Avg.Total.Insured Avg.People Year Year.1 Total.People 179 | # > (fctr) (dbl) (dbl) (fctr) (int) (dbl) 180 | # > 1 ALL AGES 251243 294700.5 2012 2012 311116.2 181 | # > 2 ALL AGES 251243 294700.5 2011 2011 308827.2 182 | # > 3 ALL AGES 251243 294700.5 2010 (10) 2010 306553.2 183 | # > 4 ALL AGES 251243 294700.5 2009 2009 304279.9 184 | # > 5 ALL AGES 251243 294700.5 2008 2008 301482.8 185 | # > 6 ALL AGES 251243 294700.5 2007 2007 299105.7 186 | # > Variables not shown: Total (dbl), Not.Covered (dbl), Not.Covered.Pct (dbl) 187 | # checking the time intervals 188 | # Earlier, we mentioned needing to have equally sized time intervals. Additionally, before we perform any time series analysis, we need to check for the number of 'non-missing' time intervals. So, Let's check the number of enrollment years for each category. 189 | # 190 | # Using dplyr, we can use summarize(n()) to count the number of entries for each category. 
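# As a quick cross-check (an aside, not part of the original script), base R's table() should report
# the same number of rows per category as the dplyr summary below.
table(x2$cat)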
191 | # # -- summarize and sort by the number of years -- updated to x2 192 | yr.count <- x2 %>% group_by(cat) %>% summarise(n = n()) %>% arrange(n) 193 | 194 | # - we can see that there are 14 years for all of the groups. That is good! 195 | print(yr.count, 10) 196 | # > Source: local data frame [24 x 2] 197 | # > 198 | # > cat n 199 | # > (fctr) (int) 200 | # > 1 18 to 24 YEARS 14 201 | # > 2 25 to 34 YEARS 14 202 | # > 3 35 to 44 YEARS 14 203 | # > 4 45 to 54 YEARS 14 204 | # > 5 55 to 64 YEARS 14 205 | # > 6 65 YEARS AND OVER 14 206 | # > 7 ALL AGES 14 207 | # > 8 FEMALE 18 to 24 YEARS 14 208 | # > 9 FEMALE 25 to 34 YEARS 14 209 | # > 10 FEMALE 35 to 44 YEARS 14 210 | # > .. ... ... 211 | # We can see from the above that every category has 14 years of data represented. 212 | # So we don't have to worry about having uniform time period for each subset. However, this is often not the case, and if you come across this, you may need to: 213 | # Impute data for years that are missing 214 | # Only use common time periods 215 | # Use specialized time series techniques to account for unequally spaced time series. 216 | # Picking out the top groups in terms of average population size 217 | # In many instances, we will only want to look at the top categories, especially when there are many demographical categories which have been subsetted. In this example there are only 24 categories, but in other examples there may be a much larger number of categories. 218 | # The dataframe x2 is already sorted by Avg.People. Since we know that there are 14 enrollment records for each category, we can get the top 10 categories based upon the highest base population by selecting the first 14*10 (or 140) rows. We will store this in a new dataframe 'x3' and save this to disk. 219 | # since we know each group has 14 years, extracting the top 10 groups is 220 | # easy to calculate 221 | 222 | x3 <- x2[1:(14 * 10), ] 223 | head(x3) 224 | # > Source: local data frame [6 x 9] 225 | # > 226 | # > cat Avg.Total.Insured Avg.People Year Year.1 Total.People 227 | # > (fctr) (dbl) (dbl) (fctr) (int) (dbl) 228 | # > 1 ALL AGES 251243 294700.5 2012 2012 311116.2 229 | # > 2 ALL AGES 251243 294700.5 2011 2011 308827.2 230 | # > 3 ALL AGES 251243 294700.5 2010 (10) 2010 306553.2 231 | # > 4 ALL AGES 251243 294700.5 2009 2009 304279.9 232 | # > 5 ALL AGES 251243 294700.5 2008 2008 301482.8 233 | # > 6 ALL AGES 251243 294700.5 2007 2007 299105.7 234 | # > Variables not shown: Total (dbl), Not.Covered (dbl), Not.Covered.Pct (dbl) 235 | # # 236 | #RW needed? 237 | save(x3, file = "x3.RData") 238 | # Plotting the data using Lattice 239 | # The Lattice package is a useful package to learn especially for analysts who like to work in formula notation (y~x). 240 | # In this example we will run a lattice plot in order to plot Not.Covered.Pct on the y axis, Year on the x axis, and produce separate plots by category. 241 | # The main call is specified by: 242 | # xyplot(Not.Covered.Pct ~ Year | cat, data = x3) 243 | # Since we are plotting the top ten groups, we can specify layout=c(5,2) to indicate we want to arrange the 10 plots in a 5*2 matrix. Not.Covered.Pct is to be arranged on the y axis (left side of the ~ sign), and Year is arranged along the x-axis (Right side of ~ sign). The bar (|) indicates that the data is to be plotted separately by each category. 
244 | library(lattice)
245 | x.tick.number <- 14
246 | at <- seq(1, nrow(x3), length.out = x.tick.number)
247 | labels <- round(seq(1999, 2012, length.out = x.tick.number))
248 |
249 | p <- xyplot(Not.Covered.Pct ~ Year | cat, data = x3, type = "l", main = list(label = "Enrollment by Categories",
250 | cex = 1), par.strip.text = list(cex = 0.5), scales = list(x = list(labels = labels),
251 | cex = 0.4, rot = 45), layout = c(5, 2))
252 |
253 | trellis.device()
254 |
255 | print(p)
256 |
257 | # Plotting the data using ggplot
258 | # If you like using ggplot, a similar set of graphs can be rendered using facets.
259 | require("ggplot2")
260 |
261 | .df <- data.frame(x = x3$Year.1, y = x3$Not.Covered.Pct, z = x3$cat, s = x3$cat)
262 | .df <- .df[order(.df$x), ]
263 | .plot <- ggplot(data = .df, aes(x = x, y = y, colour = z, shape = z)) + geom_point() +
264 | geom_line(size = 1) + scale_shape_manual(values = seq(0, 15)) + scale_y_continuous(expand = c(0.01,
265 | 0)) + facet_wrap(~s) + xlab("Year.1") + ylab("Not.Covered.Pct") + labs(colour = "cat",
266 | shape = "cat") + theme(panel.margin = unit(0.3, "lines"), legend.position = "none")
267 | print(.plot)
268 |
269 |
270 | # Sending output to an external File
271 | # One benefit of assigning plots to a plot object is that you can later send the plots to an external file, such as a pdf, view it externally, and even view it in your browser directly from R. For example, for the Lattice graphs example, you can use trellis.device and specify the output parameters, and then print the object. As we illustrated in an earlier chapter, you can use browseURL to open the PDF in your browser.
272 | # send to pdf
273 | #RW not needed
274 | #setwd("C:/Users/randy/Desktop/ch7")
275 | trellis.device(pdf, file = "x3.pdf", paper = "a4r", height = 8.3, width = 11.7,
276 | col = F)
277 |
278 | # --Not run
279 | # print(p)
280 | #
281 | #RW check this part
282 | dev.off()
283 | browseURL('x3.pdf')
284 |
285 | browseURL('file://c://PracticalPredictiveAnalytics//Outputs//x3.pdf')
286 | # Examining the output
287 | # If we examine our 'top' plots we can see that some groups seem to be trending more than others. For example, the under 18 age group shows an enrollment trend that is declining, while the 25-54 age groups are trending up.
288 | # Detecting linear trend
289 | # In a linear trend model, one constructs a linear regression least squares line by running an lm regression through the data points. These models are good for initially exploring trends visually. We can take advantage of the lm function, which is available in base R, in order to specifically calculate the slope of the trend line.
290 | # For example, we can compute the slope of the trend line for our first group ("ALL AGES"), and we can see that it has an upward trend.
291 | # After running the regression using the lm function, we can subsequently use the coef function to specifically extract the slope and the intercept from the lm model. Since this is a regression in which there is only one independent variable (Time), there will only be one slope coefficient.
292 | lm(Not.Covered.Pct ~ Year.1, data = x2[1:14, ]) 293 | # > 294 | # > Call: 295 | # > lm(formula = Not.Covered.Pct ~ Year.1, data = x2[1:14, ]) 296 | # > 297 | # > Coefficients: 298 | # > (Intercept) Year.1 299 | # > -4.102621 0.002119 300 | coef(lm(Not.Covered.Pct ~ Year.1, data = x2[1:14, ])) 301 | # > (Intercept) Year.1 302 | # > -4.102621443 0.002119058 303 | # Automating the regressions 304 | # Now that we have seen how we can run a single time series regression, we can move on to automating separate regressions and extracting the coefficients over all of the categories. 305 | # There are several ways to do this. One way is by using the do() function within the dplyr package. In the example below: 306 | # . The data is first grouped by category 307 | # . Then, a linear regression (lm function) is run for each category, with Year as the independent variable, and Not.Covered as the dependent variable. 308 | # . The coefficient is extracted from the model. The coefficient will act as a proxy for trend Finally, a dataframe of lists is create (fitted.models) , where the coefficients and intercepts are stored for each regression run on every category. The categories which have the highest positive coefficients exhibit the greatest increasing linear trend, while the declining trend is indicated by negative coefficients. 309 | library(dplyr) 310 | 311 | fitted_models = x2 %>% group_by(cat) %>% do(model = coef(lm(Not.Covered.Pct ~ 312 | Year.1, data = .))) 313 | # All of the generated models are now in the fitted_models object. 314 | # The kable function from knitr gives a simple output which displays the intercept as the first number and the coefficient in the "model" column. As a check, We can see that the coefficients in the "ALL AGES" model, are the identical to those derived in the previous section. 315 | # str(fitted_models) 316 | library(knitr) 317 | kable(fitted_models) 318 | # Cat Model 319 | # 18 to 24 YEARS -0.4061834427, 0.0003367988 320 | # 25 to 34 YEARS -11.375187597, 0.005796182 321 | # 35 to 44 YEARS -10.916822084, 0.005534037 322 | # 45 to 54 YEARS -11.544566448, 0.005829194 323 | # 55 to 64 YEARS -4.709612146, 0.002409908 324 | # 65 YEARS AND OVER -1.2562375095, 0.0006334125 325 | # ALL AGES -4.102621443, 0.002119058 326 | # FEMALE 18 to 24 YEARS -2.677300003, 0.001455388 327 | # FEMALE 25 to 34 YEARS -9.990978769, 0.005088009 328 | # FEMALE 35 to 44 YEARS -9.564724041, 0.004850188 329 | # FEMALE 45 to 54 YEARS -10.36336551, 0.00523537 330 | # FEMALE 55 to 64 YEARS -4.102774957, 0.002108343 331 | # FEMALE 65 YEARS AND OVER -1.3674510779, 0.0006887743 332 | # FEMALE ALL AGES -3.817483824, 0.001970059 333 | # FEMALE UNDER 18 YEARS 3.267593386, -0.001578328 334 | # MALE 18 to 24 YEARS 4.036127727, -0.001862991 335 | # MALE 25 to 34 YEARS -9.715950286, 0.004983621 336 | # MALE 35 to 44 YEARS -7.706624821, 0.003941543 337 | # MALE 45 to 54 YEARS -10.975387917, 0.005549255 338 | # MALE 55 to 64 YEARS -5.370380544, 0.002738269 339 | # MALE 65 YEARS AND OVER -0.4834523450, 0.0002479691 340 | # MALE ALL AGES -4.315036958, 0.002232003 341 | # MALE UNDER 18 YEARS 2.914343264, -0.001401998 342 | # UNDER 18 YEARS 3.086509947, -0.001487938 343 | # Ranking the coefficients 344 | # Now that we have the coefficients, we can begin to rank each of the categories by increasing trend. 
Since the results we have obtained so far are contained in embedded lists, which are a bit difficult to work with, we can perform some code manipulation to transform them into a regular data frame, with one row per category, consisting of the category name, coefficient, and coefficient rank.
345 | #RW remove
346 | library(dplyr)
347 | # extract the coefficients part from the model list, and then transpose the
348 | # data frame so that the coefficients appear one per row, rather than one per
349 | # column.
350 |
351 | xx <- as.data.frame(fitted_models$model)
352 | xx2 <- as.data.frame(t(xx[2, ]))
353 |
354 | # The output does not contain the category name, so we will merge it back
355 | # from the original data frame.
356 |
357 | xx4 <- cbind(xx2, as.data.frame(fitted_models))[, c(1, 2)] #only keep the first two columns
358 |
359 | # rank the coefficients from lowest to highest. Force the format of the rank
360 | # as length 2, with leading zeros
361 |
362 | tmp <- sprintf("%02d", rank(xx4[, 1]))
363 |
364 | # Finally prepend the rank to the actual category
365 | xx4$rankcat <- as.factor(paste(tmp, "-", as.character(xx4$cat)))
366 |
367 | # name the columns
368 | names(xx4) <- c("lm.coef", "cat", "coef.rank")
369 | # and View the results
370 | View(xx4)
371 | # As you can see, the columns now contain a neatly arranged representation of the coefficients, the category, and coef.rank, which was derived by ranking lm.coef from smallest to largest, and then prepending the rank order to the category.
372 | #
373 | # Merging scores back into the original dataframe
374 | # We will augment the original x2 dataframe with this new information by merging back by category, and then sorting the dataframe by the rank of the coefficient. This will allow us to use the rank as a proxy for trend.
375 | x2x <- x2 %>% left_join(xx4, by = "cat") %>% arrange(coef.rank, cat)
376 |
377 |
378 | # exclude some columns so as to fit on one page
379 | head(x2x[, c(-2, -3, -4, -8)])
380 | # > Source: local data frame [6 x 7]
381 | # >
382 | # > cat Year.1 Total.People Total Not.Covered.Pct
383 | # > (fctr) (int) (dbl) (dbl) (dbl)
384 | # > 1 MALE 18 to 24 YEARS 2012 15142.04 11091.86 0.2674787
385 | # > 2 MALE 18 to 24 YEARS 2011 15159.87 11028.75 0.2725034
386 | # > 3 MALE 18 to 24 YEARS 2010 14986.02 10646.88 0.2895460
387 | # > 4 MALE 18 to 24 YEARS 2010 14837.14 10109.82 0.3186139
388 | # > 5 MALE 18 to 24 YEARS 2008 14508.04 10021.66 0.3092339
389 | # > 6 MALE 18 to 24 YEARS 2007 14391.92 10230.61 0.2891425
390 | # > Variables not shown: lm.coef (dbl), coef.rank (fctr)
391 | # Plotting the data with the trend lines
392 | # Now that we have the trend coefficients, we will use ggplot to first plot enrollment for all of the 24 categories, and then create a 2nd set of plots which add the trend line based upon the linear coefficients we have just calculated.
393 | # Code Notes: Facet_wrap will order the plots by the value of variable 'z', which was assigned to the coefficient rank. Thus, we get to see the categories with declining enrollment first, ending with the categories having the highest trend in enrollment for the period 1999-2012.
394 | # Tip: I like to assign the variables that I will be changing to standard variable names, like x, y, and z, so that I can remember their usage (e.g. variable x is always the x variable, and y is always the y variable). But you can supply the variable names directly in the call to ggplot, or set up your own function to do the same thing, as sketched below.
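# To illustrate the last point in the tip above, here is a small helper function which wraps the
# repeated faceted-plot code. This is a hypothetical sketch, not part of the original script: the
# function name and arguments are invented, and aes_string() is used so that the column names can be
# passed as character strings.
library(ggplot2)
plot_by_group <- function(df, xvar, yvar, groupvar) {
    # scatter the y variable against the x variable, one panel per level of the grouping variable
    ggplot(data = df, aes_string(x = xvar, y = yvar)) +
        geom_point() +
        facet_wrap(as.formula(paste("~", groupvar))) +
        theme(legend.position = "none")
}
# A call equivalent in spirit to the plot built below would be:
# plot_by_group(x2x, "Year.1", "Not.Covered.Pct", "coef.rank")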
395 |
396 |
397 |
398 | library(ggplot2)
399 | .df <- data.frame(x = x2x$Year.1, y = x2x$Not.Covered.Pct, z = x2x$coef.rank,
400 | slope = x2x$lm.coef)
401 | .plot <- ggplot(data = .df, aes(x = x, y = y, colour = 1, shape = z)) + geom_point() +
402 | scale_shape_manual(values = seq(0, 24)) + scale_y_continuous(expand = c(0.1,
403 | 0)) + scale_x_continuous(expand = c(0.1, 1)) + facet_wrap(~z) + xlab("Year.1") +
404 | ylab("Not.Covered.Pct") + labs(colour = "cat", shape = "cat") + theme(panel.margin = unit(0.3,
405 | "points"), legend.position = "none") + theme(strip.text.x = element_text(size = 6))
406 | #RW?
407 | print(.plot)
408 |
409 | # As you can see, for the "ALL AGES" category, the overall Not Covered percentage is seen to be increasing over the time period until 2010 (when the Affordable Care Act was enacted), and then it begins to decrease. In contrast, the under 18 age group shows a decrease in the proportion of non-insured relative to the population size.
410 | # We can take a closer look at the top and bottom four categories, along with the "ALL AGES" category, to examine this more closely. This time we will add our own trend line using the geom_smooth parameter, which will add a linear regression trendline.
411 | # .df <- data.frame(x = x2x$Year.1, y = x2x$Not.Covered.Pct, z =
412 | # x2x$coef.rank, slope=x2x$lm.coef)
413 |
414 | # declining enrollment
415 | .df2 <- rbind(head(.df,(4*14)), tail(.df,(4*14)), .df[.df$z == "12 - ALL AGES", ])
416 |
417 | .plot2 <- ggplot(data = .df2, aes(x = x, y = y, colour = 1, shape = z)) + geom_point() +
418 | scale_shape_manual(values = seq(0, nrow(.df2))) + scale_y_continuous(expand = c(0.1,0)) +
419 | scale_x_continuous(expand = c(0.1, 1)) + facet_wrap(~z) + xlab("Year.1") +
420 | ylab("Not.Covered.Pct") + labs(colour = "cat", shape = "cat") + theme(panel.margin = unit(0.3,"points"),
421 | legend.position = "none") + geom_smooth(method = "lm", se = FALSE,
422 | colour = "red")
423 | print(.plot2)
424 |
425 | # Plotting all the categories on 1 graph.
426 | # Sometimes we like to plot all of the lines on one graph rather than having them as separate plots. To achieve this, we will alter the syntax a bit, so that the categories show up as stacked lines. Again we can see how the percentage of uninsured aligns across ages, with the under 18 group having the lowest uninsured rate, and the 25-54 group having the highest.
427 |
428 | library(ggplot2)
429 | ### plot all on one graph
430 |
431 | .df <- data.frame(x = x3$Year.1, y = x3$Not.Covered.Pct, z = x3$cat)
432 | rm(.df)
433 | rm(.plot)
434 | .df <- x3[order(x3$Year.1), ]
435 | # str(.df)
436 | .plot <- ggplot(data = .df, aes(x = Year.1, y = Not.Covered.Pct, colour = cat,shape = cat)) +
437 | geom_point() + geom_line(size = 1) + scale_shape_manual(values = seq(0,15)) +
438 | ylab("Not.Covered.Pct") + labs(colour = "cat", shape = "cat") + theme_bw(base_size = 14,base_family = "serif")
439 |
440 |
441 | print(.plot)
442 |
443 |
444 | # Adding labels
445 | # It can sometimes be a bit difficult to discern the differences among the categories based on the legend, especially as the number of categories increases, so we can use the directlabels library to mark each line with the category name. Now we can see, for example, that ALL MALES have a higher uninsured rate than ALL FEMALES over all time periods.
446 | library(directlabels) 447 | direct.label(.plot, list(last.points, hjust = 0.75, cex = 0.75, vjust = 0)) 448 | 449 | # Performing some automated forecasting using the ets function 450 | # So far we have looked at ways in which we can explore any linear trends which may be inherent in our data. That provided a solid foundation for the next step, prediction. Now we will begin to start to look at how we can perform some actual forecasting. 451 | # Converting the dataframe to a time series object 452 | # As a preparation step, We will use the ts function to convert our dataframe to a time series object. It is important that time series be equally spaced before converting to a ts object. At minimum, you supply the time series variable, and start and end dates as argument to the ts function. 453 | # After creating a new object "x", run a str() function to verify that all of the 14 time series from 1999 to 2012 have been created. 454 | # only extract the 'ALL' timeseries 455 | x <- ts(x2$Not.Covered.Pct[1:14], start = c(1999), end = c(2012), frequency = 1) 456 | 457 | str(x) 458 | # > Time-Series [1:14] from 1999 to 2012: 0.154 0.157 0.163 0c.161 0.149 ... 459 | # Smoothing the data using Moving Averages 460 | # One simple technique used to analyze time series are simple and exponential moving averages. Both Simple Moving averages and Exponential Moving Average are ways in which we can smooth out the random noise in the series and observe cycles and trends. 461 | # Simple Moving Average 462 | # A simple Moving Average will simply take the sum of the time series variable for the last k periods and then will divide it by the number of periods. In this sense, it is identical to the calculation for the mean. However what makes it different from a simple mean is that: 463 | # . the average will shift for every additional time period. Moving averages are backward looking, and every time a time period shifts, so will the average. That is why they are called "Moving". Moving Average are sometimes called Rolling Averages. 464 | # . The look backwards period can shift. That is the 2nd characteristic of a moving average. A 10 period moving average will take the average of the last 10 data elements, while a 20 period moving average will take the sum of the last 20 data points, and then divide by 20. 465 | # Computing the SMA using a function 466 | # To compute a rolling 5 period moving average for our data, we will use the Simple Moving Average (SMA) function from the TTR package, and then display the first few rows. 467 | library(TTR) 468 | MA <- SMA(x, n = 5) 469 | cbind(head(x, 14), head(MA, 14)) 470 | # > [,1] [,2] 471 | # > [1,] 0.1541247 NA 472 | # > [2,] 0.1574131 NA 473 | # > [3,] 0.1629424 NA 474 | # > [4,] 0.1609860 NA 475 | # > [5,] 0.1485338 0.1568000 476 | # > [6,] 0.1474000 0.1554551 477 | # > [7,] 0.1523271 0.1544379 478 | # > [8,] 0.1464598 0.1511414 479 | # > [9,] 0.1433967 0.1476235 480 | # > [10,] 0.1455136 0.1470194 481 | # > [11,] 0.1391090 0.1453612 482 | # > [12,] 0.1347953 0.1418549 483 | # > [13,] 0.1308892 0.1387408 484 | # > [14,] 0.1362041 0.1373023 485 | # There are many ways in which you can plot the moving average with the original data. Using base R, use the ts.plot function to do this, which takes the original series and the moving average of the series as arguments. 
486 | ts.plot(x, MA, gpars = list(xlab = "year", ylab = "Percentage of Non-Insured",lty = c(1:2)))
487 | title("Percentage of Non-Insured 1999-2012 - With SMA")
488 |
489 | # You can see how moving averages are helpful in showing the upward and downward movements of the data, and also help smooth the data to help eliminate some of the noise. Also notice that a moving average needs some "starter" data to begin calculations, which is why the dotted moving average line is missing from the first 4 time periods of the graph. Only by the 5th period is it able to determine the calculation, by summing up the values which correspond to the time period 1999-2003 and then dividing by 5. The next point is derived by summing up the values corresponding to the time periods 2000-2004 and then again dividing by 5.
490 | # Verifying the SMA Calculation
491 | # It is always important to be able to verify calculations, to ensure that they have been performed correctly, and to promote understanding.
492 | # In the case of the Simple Moving Average, we can switch to the console, and calculate the value of the SMA for the last 5 data points:
493 | # First we calculate the sum of the elements and then divide by the number of data points in the moving average (5):
494 | sum(x[10:14])/5
495 | # > [1] 0.1373023
496 | # That matches exactly with the column given for the Simple Moving Average (SMA)
497 | # Exponential Moving Average
498 | # For a Simple Moving Average, equal weight is given to all data points, regardless of how old or how recent they are. An Exponential Moving Average (EMA) gives more weight to recent data, under the assumption that the future is more likely to look like the recent past, rather than the "older past".
499 | # The Exponential Moving Average is actually a much simpler calculation. An EMA begins by calculating a simple moving average. When it reaches the specified number of lookback periods (n), it computes the current value by weighting the current data point against the previous value of the EMA.
500 | # This weighting is specified by the smoothing (or ratio) factor. When ratio=1, the predicted value is based entirely upon the last observed value. For ratios close to 0, the prediction is based upon the average of the entire lookback period. Therefore, the closer the smoothing factor is to 1, the more weight it will give to recent data. If you want to give additional weight to older data, decrease the smoothing factor towards 0.
501 | # Generally the formula for an EMA is:
502 | # {Current Data Point - EMA(previous)} x smoothing factor + EMA(previous period)
503 | # To compute the EMA, you can use the EMA function (from the TTR package). You need to specify a smoothing constant (ratio), as well as a lookback period (n).
504 | # Computing the EMA using a function
505 | # The following code computes the EMA, along with the Simple Moving Average which was computed in the last section.
506 | # The plot that follows shows the data in graph form. You can see that each data point is closer to its EMA than to the SMA, which, as mentioned earlier, weights all previous data points equally within the lookback period. In this regard EMAs react more quickly to recent data, while SMAs are slower moving, and have less variability. Of course both are affected by the parameters, especially the lookback period. Longer lookbacks will make for slower moving averages, in both cases.
507 | ExpMA <- EMA(x, n = 5, ratio = 0.8) 508 | cbind(head(x, 15), head(MA, 15), head(ExpMA, 15)) 509 | # > [,1] [,2] [,3] 510 | # > [1,] 0.1541247 NA NA 511 | # > [2,] 0.1574131 NA NA 512 | # > [3,] 0.1629424 NA NA 513 | # > [4,] 0.1609860 NA NA 514 | # > [5,] 0.1485338 0.1568000 0.1568000 515 | # > [6,] 0.1474000 0.1554551 0.1492800 516 | # > [7,] 0.1523271 0.1544379 0.1517177 517 | # > [8,] 0.1464598 0.1511414 0.1475114 518 | # > [9,] 0.1433967 0.1476235 0.1442196 519 | # > [10,] 0.1455136 0.1470194 0.1452548 520 | # > [11,] 0.1391090 0.1453612 0.1403382 521 | # > [12,] 0.1347953 0.1418549 0.1359039 522 | # > [13,] 0.1308892 0.1387408 0.1318922 523 | # > [14,] 0.1362041 0.1373023 0.1353417 524 | ts.plot(x, ExpMA, gpars = list(xlab = "year", ylab = "Percentage of Non-Insured",lty = c(1:2))) 525 | title("Percentage of Non-Insured 1999-2012 - With EMA") 526 | #using 527 | # Using the ets function 528 | # We will now use the ets function (forecast package) to compute an exponentially smoothed model for the "ALL AGES" category. 529 | # The ets function is flexible in that it can also incorporate trend, as well as seasonality for its forecasts. 530 | # We will just be illustrating a simple exponentially smoothed model ("ANN"). However, for completeness, you should know that you specify three letters when calling the ets function, and you should be aware of what each letter represents. Otherwise, it will model based upon the default parameters. 531 | # Here is the description as specified by the package author, Hydman: 532 | # . The first letter denotes the error type ("A", "M" or "Z") 533 | # . The second letter denotes the trend type ("N","A","M" or "Z") 534 | # . The third letter denotes the season type ("N","A","M"or "Z") 535 | # In all cases, "N"=none, "A"=additive, "M"=multiplicative and "Z"=automatically selected. 536 | # So for our example, if we want to model a simple exponentially smoothed model, as we did in our manual calculations, so we would specify model=ANN. 537 | # Forecasting using "ALL AGES" 538 | # The code below will perform the following steps: 539 | # . First in will filter out the "ALL AGES" category 540 | # . Create a time series object 541 | # . Run a simple exponential model, using the ets function 542 | # Note that we did not specify a smoothing factor. The ets functions calculates the optimal smoothing factor, (alpha, shown via the summary function) which in this case is .99, which means that model time series takes about 99% of the previous value to incorporate into the next time series prediction. 
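# As an aside (not in the original code), if you would rather fix the smoothing factor yourself instead
# of letting ets() estimate it, the function also accepts an alpha argument directly. The value 0.2
# below is arbitrary and chosen only for illustration; x is the "ALL AGES" time series created earlier.
library(forecast)
fit.fixed <- ets(x, model = "ANN", alpha = 0.2)
# summary(fit.fixed) would then report alpha = 0.2 rather than an estimated value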
543 | library(dplyr) 544 | # > 545 | # > Attaching package: 'dplyr' 546 | # > The following objects are masked from 'package:stats': 547 | # > 548 | # > filter, lag 549 | # > The following objects are masked from 'package:base': 550 | # > 551 | # > intersect, setdiff, setequal, union 552 | library(forecast) 553 | # > Loading required package: zoo 554 | # > 555 | # > Attaching package: 'zoo' 556 | # > The following objects are masked from 'package:base': 557 | # > 558 | # > as.Date, as.Date.numeric 559 | # > Loading required package: timeDate 560 | # > This is forecast 7.1 561 | x4 <- x2[x2$cat == "ALL AGES", ] 562 | 563 | # set up as a time series object 564 | x <- ts(x4$Not.Covered.Pct, start = c(1999), end = c(2012), frequency = 1) 565 | 566 | fit <- ets(x, model = "ANN") 567 | summary(fit) 568 | # > ETS(A,N,N) 569 | # > 570 | # > Call: 571 | # > ets(y = x, model = "ANN") 572 | # > 573 | # > Smoothing parameters: 574 | # > alpha = 0.9999 575 | # > 576 | # > Initial states: 577 | # > l = 0.1541 578 | # > 579 | # > sigma: 0.0052 580 | # > 581 | # > AIC AICc BIC 582 | # > -106.3560 -105.2651 -105.0779 583 | # > 584 | # > Training set error measures: 585 | # > ME RMSE MAE MPE MAPE 586 | # > Training set -0.001279923 0.005191075 0.00430566 -0.9445532 2.955436 587 | # > MASE ACF1 588 | # > Training set 0.9286549 0.004655079 589 | # Plotting the Predicted and Actual Values 590 | # Next, we can plot the predicted vs actual values. Notice that the predicted values are almost identical to the actual values, however they are always one step ahead. 591 | plot(x) 592 | lines(fit$fitted, col = "red") 593 | # We can 594 | # The forecast(fit) method 595 | # The forecast method contains many objects which you can display, so as the fitted value, original values, confidence intervals and residuals. Use str(forecast(fit)) to see which objects are available. 596 | # We will use cbind to print out the original data point, fitted data point, and model fitting method. 597 | # We can also use View to show some of the forecast object in matrix form. 
598 | cbind(forecast(fit)$method, forecast(fit)$x, forecast(fit)$fitted, forecast(fit)$residuals) 599 | # Time Series: 600 | # Start = 1999 601 | # End = 2012 602 | # Frequency = 1 603 | # forecast(fit)$method forecast(fit)$x forecast(fit)$fitted forecast(fit)$residuals 604 | # 1999 ETS(A,N,N) 0.15412470117969 0.154120663632029 4.03754766081788e-06 605 | # 2000 ETS(A,N,N) 0.157413125646824 0.154124700770241 0.00328842487658335 606 | # 2001 ETS(A,N,N) 0.162942355969924 0.157412792166205 0.00552956380371911 607 | # 2002 ETS(A,N,N) 0.160986044554207 0.162941795214416 -0.001955750660209 608 | # 2003 ETS(A,N,N) 0.148533847659868 0.160986242887746 -0.0124523952278778 609 | # 2004 ETS(A,N,N) 0.147400008880004 0.148535110462768 -0.00113510158276331 610 | # 2005 ETS(A,N,N) 0.152327126236553 0.147400123991157 0.00492700224539561 611 | # 2006 ETS(A,N,N) 0.146459794092561 0.152326626587079 -0.00586683249451758 612 | # 2007 ETS(A,N,N) 0.14339666192983 0.146460389050636 -0.00306372712080566 613 | # 2008 ETS(A,N,N) 0.145513631588618 0.143396972623751 0.00211665896486724 614 | # 2009 ETS(A,N,N) 0.139109023459534 0.145513416937297 -0.00640439347776356 615 | # 2010 ETS(A,N,N) 0.134795323545856 0.139109672931905 -0.00431434938604935 616 | # 2011 ETS(A,N,N) 0.130889234985064 0.134795761065932 -0.00390652608086872 617 | # 2012 ETS(A,N,N) 0.136204104247743 0.130889631147599 0.00531447310014455 618 | 619 | View(forecast(fit)) 620 | 621 | 622 | # Plotting the future values with confidence bands 623 | # Use the plot function to plot the future predictions. Notice that the forecast beyond the last observed value is a horizontal prediction line surrounded by upper and lower confidence bands. But why a horizontal prediction line? Because the simple exponential model has no trend or seasonality, the best prediction is simply the last value of the smoothed average. However, we can see that there is significant variation in the prediction, based upon the confidence bands. The confidence bands also increase in size as the forecast period increases, to reflect the growing uncertainty associated with the forecast. 624 | plot(forecast(fit)) 625 | 626 | 627 | 628 | 629 | # 630 | # Modifying the model to include a trend component 631 | # Earlier, we added a linear trend line to the data. If we wanted to incorporate a linear trend into the forecast as well, we can substitute "A" for the second letter (the trend parameter), which yields an "AAN" model (Holt's linear trend). This type of method allows exponential smoothing with a trend. 632 | fit <- ets(x, model = "AAN") 633 | summary(fit) 634 | # > ETS(A,A,N) 635 | # > 636 | # > Call: 637 | # > ets(y = x, model = "AAN") 638 | # > 639 | # > Smoothing parameters: 640 | # > alpha = 0.0312 641 | # > beta = 0.0312 642 | # > 643 | # > Initial states: 644 | # > l = 0.1641 645 | # > b = -0.0021 646 | # > 647 | # > sigma: 0.0042 648 | # > 649 | # > AIC AICc BIC 650 | # > -108.5711 -104.1267 -106.0149 651 | # > 652 | # > Training set error measures: 653 | # > ME RMSE MAE MPE MAPE 654 | # > Training set -0.000290753 0.004157744 0.003574276 -0.2632899 2.40212 655 | # > MASE ACF1 656 | # > Training set 0.7709083 0.05003007 657 | plot(forecast(fit)) 658 | 659 | # Running the ets function iteratively over all of the categories 660 | # Now that we have run an ets model on one category, we can construct some code to automate model construction over ALL of the categories. 
In the process, we will also save some of the accuracy measures so that we can see how our models performed. 661 | # . First, we will sort the dataframe by category, and then by year. 662 | # . Then we will initialize a new dataframe (onestep.df) that we will use to store the accuracy results for each training/test prediction window 663 | # . Then we will process each of the groups, all of which have 14 time periods, as an iteration in a for loop 664 | # . For each iteration, extract a test and a training dataframe 665 | # . Fit a simple exponentially smoothed model to the training dataset 666 | # . Apply the fitted model to the test dataset 667 | # . Apply the accuracy function in order to extract the validation statistics 668 | # . Store each of them in the onestep.df dataframe which was initialized in the previous step 669 | df <- x2 %>% arrange(cat, Year.1) 670 | 671 | # create results data frame 672 | onestep.df <- data.frame(cat = character(), rmse = numeric(), mae = numeric(), 673 | mape = numeric(), acf1 = numeric(), stringsAsFactors = FALSE) 674 | 675 | # 676 | # 677 | library(forecast) 678 | iterations <- 0 679 | for (i in seq(from = 1, to = 999, by = 14)) { 680 | j <- i + 13 681 | # pull out the next category. It will always be 14 records. 682 | x4 <- df[i:j, ] 683 | x <- ts(x4$Not.Covered.Pct, start = c(1999), end = c(2012), frequency = 1) 684 | 685 | # assign the first 10 records to the training data, and the next 4 to the 686 | # test data. 687 | 688 | 689 | trainingdata <- window(x, start = c(1999), end = c(2008)) 690 | testdata <- window(x, start = c(2009), end = c(2012)) 691 | 692 | 693 | par(mfrow = c(2, 2)) 694 | 695 | # first fit the training data, then the test data. 696 | # Use simple exponential smoothing 697 | fit <- ets(trainingdata, model = "ANN") 698 | # summary(fit) 699 | fit2 <- ets(testdata, model = fit) 700 | onestep <- fitted(fit2) 701 | 702 | iterations <- iterations + 1 703 | onestep.df[iterations, 1] <- paste(x4$cat[1]) 704 | onestep.df[iterations, 2] <- accuracy(onestep, testdata)[, 2] #RMSE 705 | onestep.df[iterations, 3] <- accuracy(onestep, testdata)[, 3] #MAE 706 | onestep.df[iterations, 4] <- accuracy(onestep, testdata)[, 5] #MAPE 707 | onestep.df[iterations, 5] <- accuracy(onestep, testdata)[, 7] #ACF1 708 | 709 | if (iterations == 24) 710 | break 711 | } 712 | 713 | # Viewing the prediction with the original values 714 | # In the code above, the fit object contains the original training data along with the fitted values. We can also print the residuals, which are the actual values minus the fitted values. The residuals are important, since they are usually the basis for the accuracy measures; a short sketch of how those measures derive from the residuals appears just below. 
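# As an illustrative aside (a sketch, not part of the book's codebase), the measures stored in onestep.df can be reproduced by hand from the test residuals left over from the final loop iteration, which is a useful sanity check against the accuracy() output.
err  <- testdata - onestep                  # residuals: actual values minus one-step fitted values
rmse <- sqrt(mean(err^2))                   # Root Mean Squared Error
mae  <- mean(abs(err))                      # Mean Absolute Error
mape <- 100 * mean(abs(err / testdata))     # Mean Absolute Percentage Error
c(RMSE = rmse, MAE = mae, MAPE = mape)      # compare with accuracy(onestep, testdata)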
715 | tail(x4) 716 | # > Source: local data frame [6 x 9] 717 | # > 718 | # > cat Avg.Total.Insured Avg.People Year Year.1 719 | # > (fctr) (dbl) (dbl) (fctr) (int) 720 | # > 1 UNDER 18 YEARS 66200.46 73752.5 2007 2007 721 | # > 2 UNDER 18 YEARS 66200.46 73752.5 2008 2008 722 | # > 3 UNDER 18 YEARS 66200.46 73752.5 2009 2009 723 | # > 4 UNDER 18 YEARS 66200.46 73752.5 2010 2010 724 | # > 5 UNDER 18 YEARS 66200.46 73752.5 2011 2011 725 | # > 6 UNDER 18 YEARS 66200.46 73752.5 2012 2012 726 | # > Variables not shown: Total.People (dbl), Total (dbl), Not.Covered (dbl), 727 | # > Not.Covered.Pct (dbl) 728 | head(onestep.df) 729 | # > cat rmse mae mape acf1 730 | # > 1 18 to 24 YEARS 0.013772470 0.009869590 3.752381 1.0000552 731 | # > 2 25 to 34 YEARS 0.004036661 0.003380938 1.217612 1.0150588 732 | # > 3 35 to 44 YEARS 0.006441549 0.004790155 2.231886 0.9999469 733 | # > 4 45 to 54 YEARS 0.004261185 0.003129072 1.734022 0.9999750 734 | # > 5 55 to 64 YEARS 0.005160212 0.004988592 3.534093 0.7878765 735 | # > 6 65 YEARS AND OVER 0.002487451 0.002096323 12.156875 0.9999937 736 | cbind(fit$x, fit$fitted, fit$residuals) 737 | # > Time Series: 738 | # > Start = 1999 739 | # > End = 2008 740 | # > Frequency = 1 741 | # > fit$x fit$fitted fit$residuals 742 | # > 1999 0.11954420 0.1056241 0.0139200744 743 | # > 2000 0.10725759 0.1056255 0.0016320737 744 | # > 2001 0.10649619 0.1056257 0.0008705016 745 | # > 2002 0.10291788 0.1056258 -0.0027078918 746 | # > 2003 0.10393522 0.1056255 -0.0016902769 747 | # > 2004 0.09942592 0.1056253 -0.0061994140 748 | # > 2005 0.10320781 0.1056247 -0.0024169001 749 | # > 2006 0.11231138 0.1056245 0.0066869135 750 | # > 2007 0.10587369 0.1056251 0.0002485556 751 | # > 2008 0.09527547 0.1056252 -0.0103496921 752 | 753 | mean(fit$residuals) 754 | # > [1] -6.056125e-07 755 | # We should also plot the absolute residuals for this category. 756 | absresid <- abs(fit2$residuals) 757 | plot(absresid) 758 | 759 | # fit2 contains the test data along with the fitted values from the model developed from the training data (fit). 760 | cbind(fit2$x, fit2$fitted, fit2$residuals) 761 | # > Time Series: 762 | # > Start = 2009 763 | # > End = 2012 764 | # > Frequency = 1 765 | # > fit2$x fit2$fitted fit2$residuals 766 | # > 2009 0.09745024 0.09451361 0.0029366252 767 | # > 2010 0.09785322 0.09451391 0.0033393083 768 | # > 2011 0.09397731 0.09451424 -0.0005369347 769 | # > 2012 0.08877369 0.09451419 -0.0057404940 770 | 771 | mean(fit2$residuals) 772 | # > [1] -3.738216e-07 773 | # Accuracy Measures 774 | # Using the residuals, we can measure the error between the predicted and actual values with three popular accuracy measures. 775 | # Mean Absolute Error (MAE): This measure takes the mean of the absolute values of all of the errors (residuals). 776 | # Root Mean Squared Error (RMSE): The root mean squared error measures the error by first taking the mean of all of the squared errors, and then taking the square root of that mean in order to revert back to the original scale. This is a standard statistical method of measuring errors. 777 | # Tip: Both MAE and RMSE are scale-dependent measures, which means that they should only be used to compare models built on similar scales. When comparing accuracy among models with different scales, scale-independent measures such as MAPE should be used. 778 | # Mean Absolute Percentage Error (MAPE): This is the mean of the absolute differences between the actual and forecasted values, expressed as a percentage of the actual values. 
This is intuitively easy to understand and is a very popular measure. 779 | # 780 | # We can look at the worst performing models, in terms of a metric such as MAPE, by simply sorting the onestep.df object by the mape column. 781 | onestep.df %>% arrange(., desc(mape)) %>% head() 782 | # > cat rmse mae mape acf1 783 | # > 1 MALE 65 YEARS AND OVER 0.002647903 0.002226841 17.440671 0.5781697 784 | # > 2 MALE 35 to 44 YEARS 0.039044319 0.024111701 14.218445 0.9999903 785 | # > 3 65 YEARS AND OVER 0.002487451 0.002096323 12.156875 0.9999937 786 | # > 4 MALE 25 to 34 YEARS 0.024195057 0.019901117 6.748294 0.9999310 787 | # > 5 MALE 45 to 54 YEARS 0.017711280 0.010388681 6.380865 0.9999900 788 | # > 6 FEMALE 55 to 64 YEARS 0.006749771 0.005610275 4.021224 0.5038707 789 | # Charting all of the MAPEs at once shows that the exponential model works better for the younger groups and seems to degrade for the older groups, especially for males. 790 | lattice::barchart(cat ~ mape, data = onestep.df) 791 | 792 | 793 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Practical Predictive Analytics 5 | This is the code repository for [Practical Predictive Analytics](https://www.packtpub.com/big-data-and-business-intelligence/practical-predictive-analytics?utm_source=github&utm_medium=repository&utm_campaign=9781785886188), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the book from start to finish. 6 | ## About the Book 7 | This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions with examples from the world of marketing, healthcare, and retail. We'll get started with a brief history of predictive analytics and learn about different roles and functions people play within a predictive analytics project. Then, we will learn about various ways of installing R along with their pros and cons, combined with a step-by-step installation of RStudio, and a description of the best practices for organizing your projects. 
8 | 9 | On completing the installation, we will begin to acquire the skills necessary to input, clean, and prepare your data for modeling. We will learn the six specific steps needed to implement and successfully deploy a predictive model, starting from asking the right questions, through model development, and ending with deploying your predictive model into production. We will learn why collaboration is important and how agile iterative modeling cycles can increase your chances of developing and deploying the most successful model. 10 | 11 | 12 | ## Instructions and Navigation 13 | All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02. 14 | 15 | 16 | 17 | The code will look like the following: 18 | ``` 19 | #run the model 20 | model <- OneR(train_data, frisked ~ ., verbose = TRUE) 21 | #summarize the model 22 | summary(model) 23 | #run the sql function from the SparkR package 24 | SparkR::sql("SELECT sample_bin , count(*) 25 | FROM out_tbl group by sample_bin") 26 | ``` 27 | 28 | This is neither an introductory predictive analytics book, nor an introductory book for 29 | learning R or Spark. Some knowledge of base R data manipulation techniques is expected. 30 | Some prior knowledge of predictive analytics is useful. As mentioned earlier, knowledge of 31 | basic statistical concepts such as hypothesis testing, correlation, means, standard deviations, 32 | and p-values will also help you navigate this book. 33 | 34 | ## Related Products 35 | * [Practical Real-time Data Processing and Analytics](https://www.packtpub.com/big-data-and-business-intelligence/practical-real-time-data-processing-and-analytics?utm_source=github&utm_medium=repository&utm_campaign=9781787281202) 36 | 37 | * [Mastering Predictive Analytics with Python](https://www.packtpub.com/big-data-and-business-intelligence/mastering-predictive-analytics-python?utm_source=github&utm_medium=repository&utm_campaign=9781785882715) 38 | 39 | * [Learning Predictive Analytics with R](https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-r?utm_source=github&utm_medium=repository&utm_campaign=9781782169352) 40 | 41 | ### Suggestions and Feedback 42 | [Click here](https://docs.google.com/forms/d/e/1FAIpQLSe5qwunkGf6PUvzPirPDtuy1Du5Rlzew23UBp2S-P3wB-GcwQ/viewform) if you have any feedback or suggestions. 43 | 44 | ### Download a free PDF 45 | 46 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
47 |

https://packt.link/free-ebook/9781785886188

--------------------------------------------------------------------------------