├── Chapter01 └── ch 1 codebase June 6.R ├── Chapter02 └── Ch 2 Codebase June 6.R ├── Chapter03 └── ch 3 Codebase.R ├── Chapter04 ├── Ch 4 Codebase.R └── Consumer_Complaints (1).zip ├── Chapter05 └── ch 5 codebase as of May 31.R ├── Chapter06 └── ch 6 codebase June 6.R ├── Chapter07 └── ch 7 codebase.R ├── Chapter08,09 └── B05033_08_09_Codebase.html ├── Chapter10 ├── B05033_10_Regression_Codebase.html └── B05033_10_kmeans_Codebase.html ├── Chapter11 └── B05033_11_Codebase.html ├── LICENSE └── README.md /Chapter01/ch 1 codebase June 6.R: -------------------------------------------------------------------------------- 1 | setwd("C:/PracticalPredictiveAnalytics/R") 2 | getwd() 3 | #[1] "C:/PracticalPredictiveAnalytics/R" 4 | # 5 | # 6 | # I suggest using getwd and setwd liberally, especially if you are working on multiple projects, and want to avoid reading or writing the wrong files. 7 | # The Source Window 8 | # The Source window is where all of your R Code appears. It is also where you will be probably spending most of your time. You can have several script windows open, all at once. 9 | # Creating a new Script 10 | # To create a new script select "File/New File/R Script" from the top navigation bar. A new blank script window will appear with the name "Untitled1". 11 | # Once it appears you can start entering code! 12 | # 13 | # B05033_01_05.png 14 | # Our First Predictive Model 15 | # Now that all of the preliminaries are out of the way, it is time to jump into starting to code our first simple predictive model. There will be 2 scripts written to accomplish this. 16 | # Our first R script is not a predictive model (yet), but it is a preliminary program which will view and plot some data. The data set we will use is already built into the R package system, and is not necessary to load externally. For quickly illustrating techniques, I will sometimes use sample data contained within specific R packages themselves in order to demonstrate ideas, rather than pulling data in from an external file. In this case our data will be pulled from the "datasets" package, which is loaded by default at startup. 17 | # . Paste the following code into the "Untitled1" scripts that was just created. Don't worry about what each line means yet. I will cover the specific lines after the code is executed. 18 | require(graphics) 19 | data(women) 20 | head(women) 21 | View(women) 22 | plot(women$height,women$weight) 23 | # 24 | # . Within the code pane, you will see a menu bar right beneath the "Untitled1" tab. It should look something like this: 25 | # 26 | # 27 | # 28 | # . To execute the code, Click the "Source" Icon. The display should then change to the diagram below: 29 | # 30 | # 31 | # B05033_01_06.png 32 | # Notice from the picture below that three things have changed: 33 | # 1. Output has been written to the console pane 34 | # 2. The View pane has popped up which contains a two column table. 35 | # 3. Additionally, a plot will appear in the Plot pane 36 | # Code Description 37 | # 38 | # Here are some more details on what the code has accomplished. 39 | # 40 | # Line 1 of the code contains the function "require", which is just a way of saying that R needs a specific package to run. In this case require(graphics) specifies that the graphics package is needed for the analysis, and it will load it into memory. If it is not available, you will get an error message. 
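# (Strictly speaking, require() returns FALSE with a warning when a package cannot be
# found, whereas library() stops with an error. A minimal sketch of the difference --
# the package name "notARealPackage" is deliberately fictitious:)
if (!require(graphics)) {
  message("graphics could not be loaded")   # require() returns TRUE/FALSE
}
# library(notARealPackage)   # library() would stop with an error rather than a warning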
However, "graphics" is a base package and should be available 41 | # 42 | # Line 2 of the code loads the "Women" data object into memory using the data(women) function. 43 | # Lines 3-5 of the code display the raw data in three different ways: 44 | # 45 | # 1. View(women) - This will visually display the dataframe. Although this is part of the actual R script, viewing a dataframe is a very common task, and is often issued directly as a command via the R Console. As you can see in the figure above, the "Women" data frame has 15 rows, and 2 columns named height and weight. 46 | # 47 | # 2. plot(women$height, women$weight) - This uses the native R plot function which plots the values of the two variables against each other. It is usually the first step one does to begin to understand the relationship between 2 variables. As you can see the relationship is very linear. 48 | # 49 | # 3. head(women) - This displays the first N rows of the women data frame to the console. If you want no more than a certain number of rows, add that as a 2nd argument of the function. E.g. Head(women,99) will display UP TO 99 rows in the console. The tail() function works similarly, but displays the last rows of data. 50 | # 51 | # Saving the script 52 | # To save this script, navigate to the top navigation menu bar and select "File/Save". When the file selector appears navigate to the PracticalPredictiveAnalytics/R folder that was created, and name the file "Chapter1_DataSource". Then select "Save". 53 | # 54 | # Your 2nd script 55 | # Our 2nd R script is a simple two variable regression model which predicts women's height based upon weight. 56 | # Begin by creating another Rscript by selecting "File/New File/R Script" from the top navigation bar. 57 | # If you create new scripts via "File/New File/R Script" often enough you might get "Click Fatigue" (uses 3 clicks), so you can also save a click by selecting the icon in the top left with the + sign. 58 | # 59 | # 60 | # 61 | # Whichever way you choose , a new blank script window will appear with the name "Untitled2" 62 | # Now Paste the following code into the new script window 63 | require(graphics) 64 | data(women) 65 | lm_output <- lm(women$height ~ women$weight) 66 | summary(lm_output) 67 | prediction <- predict(lm_output) 68 | error <- women$height-prediction 69 | plot(women$height,error) 70 | 71 | # Press the "Source" icon to run the entire code. The display will change to something similar to what is displayed below: 72 | # 73 | # 74 | # B05033_01_07.png 75 | # 76 | # Code Description 77 | # Here are the notes and explanations for the script code that you have just ran: 78 | # Line 3 - lm() function: This function runs a simple linear regression using the lm() function. This function will run a linear regression which predicts woman's height based upon the value of their weight. In statistical parlance, you will be 'regressing' height on weight. The line of code which accomplishes this is: 79 | # 80 | # lm_output <- lm(women$height ~ women$weight) 81 | # 82 | # There are two operators, and one function that you will become very familiar with when running Predictive Models in R. 83 | # 84 | # 1. The ~ operator (also called the tilde) is an expression that is used when specifying a formula in R. In a predictive context, is a shorthand way for separating what you want to predict, with what you are using to predict. 
What you are predicting (the dependent or Target variable) is usually on the left side of the formula, and the predictors (independent variables, features) are on the right side. In the formula, I have specified them explicitly by using the data frame name together with the column name, i.e. women$height (what you are predicting) and women$weight (predictor) 85 | # 86 | # 2. The <- operator (also called assignment) says assign whatever function operators are on the right side to whatever object is on the left side. This will always create or replace a new object that you can further display or manipulate. In this case we will be creating a new object called lm_output, which is created using the function lm(), which creates a Linear model based on the formula contained within the parentheses. 87 | # 88 | # 89 | # Note that the execution of this line does not produce any displayed output. You can see if the line was executed by checking the console. If there is any problem with running the line (or any line for that matter) you will see an error message in the console. 90 | # Line 4 - summary(lm_output): The summary function is a generic R function which works on many different types of R objects. In this case, summary() will summarize the results of the linear regression output. 91 | # The results will appear in the Console window as pictured in the figure above. Just to keep thing a little bit simpler for now, I will just show the first few lines of the output, and underline what you should be looking at. Do not be discouraged by the amount of output produced. 92 | # First look at the lines marked (Intercept), and women$weight which appear under the Coefficients line in the console. 93 | # Coefficients: 94 | # Estimate Std. Error t value Pr(>|t|) 95 | # (Intercept) 25.723456 1.043746 24.64 2.68e-12 *** 96 | # women$weight 0.287249 0.007588 37.85 1.09e-14 *** 97 | # 98 | # The Estimate Column illustrates the linear regression formula needed to derive height from weight. We can actually use these numbers along with a calculator to determine the prediction ourselves. For our example the output tells us that we should perform the following steps for ALL of the observations in our dataframe in order to obtain the prediction for height. We will obviously not want to do all of the observations (R will do that via the predict() function below), but we will illustrate the calculation for 1 data point. 99 | # . Take the weight value for each observation. Let's take the weight of the first woman which is 115 lbs. 100 | # . Then,multiply weight by 0.2872 . That is the number that is listed under Estimate for womens$weight. Multiplying 115 lbs. by 0.2872 yield 33.028 101 | # . Then add 25.7235 which is the estimate of the (intercept) row. That will yield a prediction of 58.75 inches. 102 | # If you do not have a calculator handy, the calculation is easily done in 'calculator' mode via the R Console, by typing the following: 103 | # 104 | # The Predict function 105 | # To predict the value for all of the values we will use a function called predict(). This function reads each input (independent) variable and then predicts a target (dependent) variable based on the linear regression equation. In the code we have assigned the output of this function to a new object named "prediction". 106 | # Switch over to the console area, and type "prediction", then Enter, to see the predicted values for the 15 women. The following should appear in the console. 
107 | 108 | prediction 109 | length(error) 110 | #[1] 15 111 | # 112 | # In all of the above cases, the counts all compute as 15, so all is good. 113 | # If we want to see the raw data, predictions, and the prediction errors for all of the data, we can use the cbind() function (Column bind) to concatenate all three of those values, and display as a simple table. 114 | # At the console enter the follow cbind command. 115 | cbind(height=women$height,PredictedHeight=prediction,ErrorInPrediction=error) 116 | height PredictedHeight ErrorInPrediction 117 | # 1 58 58.75712 -0.75711680 118 | # 2 59 59.33162 -0.33161526 119 | # 3 60 60.19336 -0.19336294 120 | # 4 61 61.05511 -0.05511062 121 | # 5 62 61.91686 0.08314170 122 | # 6 63 62.77861 0.22139402 123 | # 7 64 63.64035 0.35964634 124 | # 8 65 64.50210 0.49789866 125 | # 9 66 65.65110 0.34890175 126 | # 10 67 66.51285 0.48715407 127 | # 11 68 67.66184 0.33815716 128 | # 12 69 68.81084 0.18916026 129 | # 13 70 69.95984 0.04016335 130 | # 14 71 71.39608 -0.39608278 131 | # 15 72 72.83233 -0.83232892 132 | # 133 | # 134 | # From the output above, we can see that there are a total 15 predictions. If you compare the ErrorInPrediction with the error plot shown above, you can see that for this very simple model, the prediction errors are much larger for extreme values in height (shaded values). 135 | # Just to verify that we have one for each of our original observations we will use the nrow() function to count the number of rows. 136 | # At the command prompt in the console area, enter the command: 137 | nrow(women) 138 | # 139 | # The following should appear: 140 | # 141 | # >nrow(women) 142 | # [1] 15 143 | # 144 | # Line 7 - plot(women$height,error) :This plots the predicted height vs. the errors. It shows how much the prediction was 'off' from the original value. You can see that the errors show a non-random pattern. 145 | # 146 | # After you are done, save the file using "File/File Save", navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_LinearRegression 147 | # R packages 148 | # An R package extends the functionality of basic R. Base R, by itself, is very capable, and you can do an incredible amount of analytics without adding any additional packages. However adding a package may be beneficial if it adds a functionality which does not exist in base R, improves or builds upon an existing functionality, or just makes something that you can already do easier. 149 | # For example, there are no built in packages in base R which enable you to perform certain types of machine learning (such as Random Forests). As a result, you need to search for an add on package which performs this functionality. Fortunately you are covered. There are many packages available which implement this algorithm. 150 | # Bear in mind that there are always new packages coming out. I tend to favor packages which have been on CRAN for a long time and have large user base. When installing something new, I will try to reference the results against other packages which do similar things. Speed is another reason to consider adopting a new package. 151 | # 152 | # 153 | # The Stargazer Package 154 | # 155 | # For an example of a package which can just make life easier, first let's consider the output produced by running a summary function on the regression results, as we did above. You can run it again if you wish. 
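# (A small aside, using base R only: if you just want the coefficient table rather than
# the full summary printout, you can extract it directly before re-running summary().)
coef(summary(lm_output))   # just the Estimate / Std. Error / t value / Pr(>|t|) columns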
156 | # 157 | summary(lm_output) 158 | # 159 | # The amount of statistical information output by the summary function can be overwhelming to the initiated. This is not only related to the amount of output, but the formatting. That is why I did not show the entire output in the above example. 160 | # One way to make output easier to look at is to first reduce the amount of output that is presented, and then reformat it so it is easier on the eyes. 161 | # To accomplish this, we can utilize a package called "stargazer", which will reformat the large volume of output produced by summary and simplify the presentations. Stargazer excels at reformatting the output of many regression models, and displaying the results as HTML, PDF, Latex, or as simple formatted text. By default, it will show you the most important statistical output for various models, and you can always specify the types of statistical output that you want to see. 162 | # To obtain more information on the stargazer package you can first go to CRAN, and search for documentation about stargazer, and/or you can use the R help system: 163 | # IF you already have installed stargazer you can use the following command: 164 | # packageDescription("stargazer") 165 | # If you haven't installed the package, information about stargazer, (or other packages) can also be found using R specific internet searches: 166 | RSiteSearch("stargazer") 167 | # If you like searching for documentation within R, you can obtain more information about the R help system at:: 168 | # https://www.r-project.org/help.html 169 | # Installing Stargazer 170 | # Now, on to installing stargazer: 171 | # . First Create a new R script (File/New File/R Script). 172 | # . Enter the following lines and then select "Source" from the menu bar in the code pane, which will submit the entire script. 173 | # 174 | install.packages("stargazer") 175 | library(stargazer) 176 | stargazer(lm_output, title="Lm Regression on Height", type="text") 177 | # 178 | # After the script has been run, the following should appear in the Console: 179 | # 180 | # B05033_01_08.png 181 | # Code Description 182 | # Line 1: install.packages("stargazer") 183 | # The line above will install the package to the default package directory on your machine. If you will be rerunning this code again,, you can comment out this line, since the package will have already be installed in your R repository. 184 | # 185 | # Line 2: library(stargazer) 186 | # Installing a package does not make the package automatically available. You need to run a library (or require) function in order to actually load the stargazer package. 187 | # 188 | # Line 3: stargazer(lm_output, title="Lm Regression on Height", type="text") 189 | # 190 | # This part of the code will take the output object lm_output, that was created in the first script, condense the output, and write it out to the console in a simpler, more readable format. There are many other options in stargazer, which will format the output as HTML, or Latex. Please refer to the Reference manual at https://cran.r-project.org/web/packages/stargazer/index.html 191 | # 192 | # 193 | # The reformatted results will appear in the R Console. As you can see, the output written to the console is much cleaner and easier to read 194 | # Saving your work 195 | # After you are done, select "File/File Save" from the menu bar 196 | # Then navigate to the PracticalPredictiveAnalytics/Outputs folder that was created, and name it Chapter1_LinearRegressionOutput. Press Save. 
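# (If you would also like a copy of the reformatted table kept with your other outputs,
# stargazer can write HTML directly to a file. A sketch -- the file name below is only
# an example, not one used elsewhere in this chapter:)
library(stargazer)
stargazer(lm_output, title = "Lm Regression on Height", type = "html",
          out = "C:/PracticalPredictiveAnalytics/Outputs/Chapter1_LinearRegressionOutput.html")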
197 | # 198 | # Summary 199 | # In this chapter, we have learned a little about what predictive analytics is and how they can be used in various industries. We learned some things about data, and how they can be organized in projects. Finally, we installed RStudio, and ran a simple linear regression, and installed and ran our first package. We learned that it is always good practice to examine data after it has been loaded into memory, and a lot can be learned from simply displaying and plotting the data. 200 | # In the next chapter, we will discuss the overall predictive modeling process itself, introduce some key model packages using R, and provide some guidance on avoiding some predictive modeling pitfalls. 201 | # References 202 | # Computing and the Manhattan Project. (n.d.). Retrieved from http://www.atomicheritage.org/history/computing-and-manhattan-project 203 | # Gladwell, M. (2005). Blink :the power of thinking without thinking. New York: Little, Brown and Co.,. 204 | # Linda Miner et al. (2014). Practical Predictive Analytics and Decisioning Systems for Medicine. Elsevier. 205 | # Watson (Computer). (n.d.). Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Watson_(computer) 206 | # Weather Forecasting through the Ages. (n.d.). Retrieved from http://earthobservatory.nasa.gov/Features/WxForecasting/wx2.php 207 | -------------------------------------------------------------------------------- /Chapter02/Ch 2 Codebase June 6.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter02/Ch 2 Codebase June 6.R -------------------------------------------------------------------------------- /Chapter03/ch 3 Codebase.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter03/ch 3 Codebase.R -------------------------------------------------------------------------------- /Chapter04/Ch 4 Codebase.R: -------------------------------------------------------------------------------- 1 | rm(df) 2 | df <- read.table(text = 'Treatment Gender Age Duration Pain 3 | P F 68 1 0 4 | P M 66 26 1 5 | A F 71 12 0 6 | A M 71 17 1 7 | B F 66 12 0 8 | A F 64 17 0 9 | P M 70 1 1 10 | A F 64 30 0 11 | B F 78 1 0 12 | B M 75 30 1 13 | A M 70 12 0 14 | B M 70 1 0 15 | P M 78 12 1 16 | P M 66 4 1 17 | A M 78 15 1 18 | P F 72 27 0 19 | B F 65 7 0 20 | P M 67 17 1 21 | P F 67 1 1 22 | A F 74 1 0 23 | B M 74 16 0 24 | B F 67 28 0 25 | B F 72 50 0 26 | A F 63 27 0 27 | A M 62 42 0 28 | P M 74 4 0 29 | B M 66 19 0 30 | A M 70 28 0 31 | P M 83 1 1 32 | P M 77 29 1 33 | A F 69 12 0 34 | B M 67 23 0 35 | B M 77 1 1 36 | P F 65 29 0 37 | B M 75 21 1 38 | P F 70 13 1 39 | P F 68 27 1 40 | B M 70 22 0 41 | A M 67 10 0 42 | B M 80 21 1 43 | P F 67 30 0 44 | B F 77 16 0 45 | B F 76 9 1 46 | A F 69 18 1 47 | P F 64 1 1 48 | A F 72 25 0 49 | B M 59 29 0 50 | A M 69 1 0 51 | B F 69 42 0 52 | P F 79 20 1 53 | B F 65 14 0 54 | A M 76 25 1 55 | B F 69 24 0 56 | P M 60 26 1 57 | A F 67 11 0 58 | A M 75 6 1 59 | P M 68 11 1 60 | A M 65 15 0 61 | P F 72 11 1 62 | A F 69 3 0',header = TRUE) 63 | 64 | str(df) 65 | table(df$Pain) 66 | prop.table(table(df$Pain)) 67 | setwd("C:/PracticalPredictiveAnalytics/Data") 68 | save(df,file="pain_raw.Rda") 69 | 70 | 71 | PainGLM <- glm(Pain ~ Treatment + Gender + Age + Duration, data=df, 
family="binomial") 72 | 73 | summary(PainGLM) 74 | cbind(exp(coef(PainGLM)),confint(PainGLM)) 75 | 76 | quantile(residuals(PainGLM, type="deviance")) # residuals 77 | 78 | mean(residuals(PainGLM, type="deviance")) 79 | plot(residuals(PainGLM, type="deviance")) 80 | 81 | #install.packages("car") 82 | library(car) 83 | avPlots(PainGLM, id.method="mahal", id.n=2) 84 | 85 | df[c(26,27,28,25),] 86 | 87 | set.seed(1020) 88 | lottery <- data.frame( 89 | cbind(x=rnorm(n=1000000,1000000,100),y=rnorm(1000000,1000001,100) ) 90 | ) 91 | summary(lottery) 92 | 93 | t.test(lottery$x,lottery$y) 94 | 95 | 96 | set.seed(1020) 97 | lottery <- data.frame( 98 | cbind(x=rnorm(n=100000,1000000,100),y=rnorm(n=100000,1000001,100) ) 99 | ) 100 | summary(lottery) 101 | t.test(lottery$x,lottery$y) 102 | 103 | library(RcmdrMisc) 104 | 105 | stepwise(PainGLM, direction='forward/backward', criterion='AIC') 106 | 107 | 108 | library(effects) 109 | output <- glm(Pain ~ Treatment + Gender + Age + Duration + Age:Treatment 110 | ,data=df, family="binomial") 111 | summary(output) 112 | 113 | install.package(effects) 114 | library(effects) 115 | output <- glm(Pain ~ Treatment + Gender + Age + Duration + Age:Treatment , data=df, family="binomial") 116 | summary(output) 117 | 118 | plot(effect("Treatment*Age", output, xlevels=list(age=0:99)), 119 | ticks=list(at=c(.001, .005, .01, .05, seq(.1,.9,by=.2), .95, .99, .995))) 120 | 121 | install.packages("pscl") 122 | 123 | library(pscl) 124 | pR2(PainGLM) 125 | 126 | 127 | 128 | oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,2)) 129 | plot(PainGLM) 130 | par(oldpar) 131 | 132 | 133 | library(TeachingDemos) 134 | set.seed(1) 135 | if(interactive()) {vis.test(residuals(PainGLM, type="response"), vt.qqnorm, nrow=4, ncol=4, npage=1)} 136 | 137 | install.packages("ResourceSelection") 138 | library(ResourceSelection) 139 | hoslem.test(PainGLM$y,fitted(PainGLM)) 140 | str(hoslem$observed) 141 | hoslem <- hoslem.test(PainGLM$y,fitted(PainGLM)) 142 | cbind(hoslem$observed,hoslem$expected) 143 | 144 | dummy.vars <- model.matrix(df$Pain ~ df$Treatment + df$Gender + df$Age + df$Duration)[,-1] 145 | x <- as.matrix(data.frame(df$Duration,dummy.vars)) 146 | head(x) 147 | 148 | 149 | install.packages("glmnet") 150 | library(glmnet) 151 | options(scipen = 999) 152 | mod.result<-glmnet(x,y=as.factor(df$Pain),alpha=1,family='binomial') 153 | summary(mod.result$lambda) 154 | 155 | plot(mod.result,xvar="lambda") 156 | 157 | 158 | head(print(mod.result)) 159 | tail(print(mod.result)) 160 | 161 | 162 | coef(mod.result,s=0.1957000) # 4 of 5 coefficient are at 0, too much shrinkage 163 | coef(mod.result,s=0.0131800) # only 1 coefficient is set to 0 164 | 165 | coef(mod.result,s=0.0120100) # All coefficients are included 166 | 167 | coef(mod.result,s=0.01) 168 | 169 | 170 | statedf <- as.data.frame(state.x77) 171 | x <- data.frame(statedf,state.abb,state.region) 172 | summary(x) 173 | 174 | hist(x$Life.Exp) 175 | 176 | install.packages("rpart") 177 | install.packages("rattle") 178 | install.packages("rpart.plot") 179 | install.packages("RColorBrewer") 180 | 181 | library(rpart) 182 | library(rattle) 183 | library(rpart.plot) 184 | library(RColorBrewer) 185 | set.seed(1020) 186 | y1 <- rpart(Life.Exp ~ .,data=x,cp=.01) 187 | prp(y1, type=4, extra=1) 188 | 189 | 190 | print(y1) 191 | 192 | 193 | y2 <- rpart(Life.Exp ~ Population + Income + Illiteracy + 194 | Murder + 195 | HS.Grad + 196 | Frost + 197 | Area + 198 | state.region,method='anova',data=x 199 | ) 200 | 201 | prp(y2, type=4, extra=1) 202 | 203 | 
install.packages("partykit") 204 | library(partykit) 205 | y2 <- ctree(Life.Exp ~ .,data=x) 206 | y2 207 | plot(y2) 208 | 209 | PrunedTree <- prp(y1,type=4, extra=1,snip=TRUE)$obj 210 | 211 | prp(PrunedTree, type=4, extra=1) 212 | 213 | 214 | library(graphics) 215 | library(dplyr) 216 | 217 | x <- read.table("C:/PracticalPredictiveAnalytics/Data/CDNOW_master.txt", quote="\"", stringsAsFactors=FALSE) 218 | 219 | x$xd <- as.Date(as.character(x$V2), "%Y%m%d") 220 | x$diffdate <- as.integer(as.Date("1998-07-01") - x$xd) 221 | #rename the columns 222 | 223 | colnames(x) <- c("id","orig.date","units.bought","TotalPaid","purch.date","Days.since") 224 | str(x) 225 | summary(x) 226 | 227 | attach(x) 228 | #cluster on the RFM variables 229 | y <- subset(x, select = c(units.bought,TotalPaid,Days.since)) 230 | 231 | #always set seed before clustering 232 | 233 | set.seed(1020) 234 | 235 | clust3 <- kmeans( y,3) 236 | clust3$betweenss/clust3$totss 237 | 238 | 239 | clust5 <- kmeans( y,5) 240 | clust5$betweenss/clust5$totss 241 | 242 | clust7 <- kmeans( y,7) 243 | clust7$betweenss/clust7$totss 244 | 245 | #elbow method 246 | set.seed(1020) 247 | # Compute and plot wss for k = 3 to k = 15 248 | df <- sapply(3:15,function(k){kmeans(y,k)$tot.withinss}) 249 | 250 | plot(3:15, df,type='b',xlab="# of clusters",ylab="Total Within Clusters SS") 251 | 252 | #cluster assignments 253 | clusters3 <- clust3$cluster 254 | clusters5 <- clust5$cluster 255 | clusters7 <- clust7$cluster 256 | 257 | #usually but now always the middle cluster is the average cluster 258 | par(mfrow=c(1,3)) 259 | hist(clusters3) 260 | hist(clusters5) 261 | hist(clusters7) 262 | 263 | #append the clusters the original data 264 | append.clust <- data.frame(x, clusters3,clusters5,clusters7) 265 | 266 | install.packages("cluster") 267 | library(cluster) 268 | set.seed(1020) 269 | sampleit <- append.clust[sample(nrow(append.clust), 100), ] 270 | str(sampleit) 271 | 272 | prcomp(append.clust[,c(3,4,6)], scale = TRUE) 273 | par(mfrow=c(1,3)) 274 | clusplot(sampleit[,c(3,4,6)], sampleit$clusters3, color=TRUE, shade=TRUE,labels=2, lines=0) 275 | clusplot(sampleit[,c(3,4,6)], sampleit$clusters5, color=TRUE, shade=TRUE,labels=2, lines=0) 276 | clusplot(sampleit[,c(3,4,6)], sampleit$clusters7, color=TRUE, shade=TRUE,labels=2, lines=0) 277 | 278 | head(append.clust) 279 | library(dplyr) 280 | attach(append.clust) 281 | append.clust %>% select(units.bought,TotalPaid,Days.since,clusters3) %>% 282 | group_by(clusters3) %>% 283 | summarise_each(funs(n(),mean)) 284 | append.clust %>% select(units.bought,TotalPaid,Days.since,clusters5) %>% 285 | group_by(clusters5) %>% 286 | summarise_each(funs(n(),mean)) 287 | append.clust %>% select(units.bought,TotalPaid,Days.since,clusters7) %>% 288 | group_by(clusters7) %>% 289 | summarise_each(funs(n(),mean)) 290 | 291 | 292 | require(graphics) 293 | setwd("C:/PracticalPredictiveAnalytics/Data") 294 | load("pain_raw.Rda") 295 | 296 | df2 <- subset(df, select=c(Age,Duration,Pain)) 297 | df2 <- scale(df2) 298 | head(df2) 299 | fit <- hclust(dist(df2), "average") 300 | 301 | groups <- cutree(fit, k=3) 302 | rect.hclust(fit, k=3, border="red") 303 | cluster1 <- df[c(35,52,10,30),] 304 | View(cluster1) 305 | 306 | cluster2 <- df[c(23,49,25,47),] 307 | 308 | View(cluster2) 309 | 310 | cluster3 <- df[c(20,26,38,53),] 311 | View(cluster3) 312 | 313 | 314 | #generate a non-linear circle of point 315 | 316 | radius <- 2 317 | t2 <- data.frame(x=radius * cos(seq(0,6,length = 20)),y = radius * sin(seq(0, 6, length = 20))) 318 | 
names(t2) <- c("Latitude","High.Low.Temp") 319 | plot(t2$Latitude,t2$High.Low.Temp) 320 | 321 | # create a new variable and plot it against on the original points 322 | 323 | t2$z = (t2$Latitude^2*t2$High.Low.Temp^2) 324 | 325 | plot(t2$High.Low.Temp,t2$z) 326 | 327 | 328 | install.packages("e1071") 329 | install.packages("RTextTools") 330 | 331 | library(e1071) 332 | library(RTextTools) 333 | data <- read.csv("C:/PracticalPredictiveAnalytics/Data/Consumer_Complaints.csv", sep=",") 334 | 335 | 336 | 337 | data <- subset(data, select=c(Issue,Consumer.complaint.narrative)) 338 | data.samp <- subset(data[1:50,], select=c(Issue,Consumer.complaint.narrative)) 339 | str(data) 340 | View(data.samp) 341 | 342 | 343 | # Create the document term matrix 344 | dtMatrix <- create_matrix(data.samp["Consumer.complaint.narrative"],minDocFreq = 1, removeNumbers=TRUE, 345 | minWordLength=4,removeStopwords=TRUE,removePunctuation=TRUE,stemWords = FALSE) 346 | 347 | 348 | xx = as.data.frame( t(as.matrix( dtMatrix )) ) 349 | head(xx) 350 | 351 | 352 | dtMatrix <- create_matrix(data["Consumer.complaint.narrative"],minDocFreq = 1, removeNumbers=TRUE, 353 | minWordLength=4,removeStopwords=TRUE,removePunctuation=TRUE,stemWords = FALSE) 354 | 355 | 356 | freq <- colSums(as.matrix(dtMatrix)) 357 | length(freq) 358 | head(freq) 359 | freq.df <- as.data.frame(freq) 360 | View(freq.df) 361 | 362 | container <- create_container(dtMatrix, data$Issue, trainSize=1:500,virgin=FALSE) 363 | str(container) 364 | 365 | 366 | # train a SVM Model 367 | model <- train_model(container, "SVM", kernel="linear", cost=1) 368 | str(model) 369 | head(model) 370 | summary(model) 371 | 372 | 373 | predictionData <- data$Consumer.complaint.narrative[501:1000] 374 | 375 | # create a prediction document term matrix 376 | predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix) 377 | 378 | # create the corresponding container 379 | 380 | plength = length(predictionData); 381 | predictionContainer <- create_container(predMatrix, labels=rep(0,plength), testSize=1:plength, virgin=FALSE) 382 | 383 | 384 | # predict 385 | results <- classify_model(predictionContainer, model) 386 | head(results) 387 | aggregate(results$SVM_PROB, by=list(results$SVM_LABEL), FUN=mean, na.rm=TRUE) 388 | 389 | 390 | -------------------------------------------------------------------------------- /Chapter04/Consumer_Complaints (1).zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter04/Consumer_Complaints (1).zip -------------------------------------------------------------------------------- /Chapter05/ch 5 codebase as of May 31.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Practical-Predictive-Analytics/8bdd60cc74f0f0c34a0c0add3a9e4d02c7380afb/Chapter05/ch 5 codebase as of May 31.R -------------------------------------------------------------------------------- /Chapter06/ch 6 codebase June 6.R: -------------------------------------------------------------------------------- 1 | # 6 2 | # Using Market Basket Analysis as a Recommender Engine 3 | # "It's not wise to violate the rules until you know how to observe them." - T.S. Eliot 4 | # What is Market Basket Analysis? 5 | # If you have "survived" the last chapter, you will now be introduced to the world of Market Basket Analysis. 
Market Basket Analysis (also sometimes called affinity analysis) is a predictive analytics technique used heavily in the retail industry to identify "baskets" of items that are purchased together. The typical use case is the supermarket shopping cart, in which a shopper purchases an assortment of items such as milk, bread, and cheese, and the algorithm predicts how purchasing certain items together will affect the purchase of other items. It is one of the methods retailers use to decide when to start sending you coupons and emails for things that you didn't know you needed! 
6 | # One of the most quoted examples of MBA is the relationship between diapers and beer: 
7 | # "One super market chain discovered in its analysis that customers that bought diapers often bought beer as well, have put the diapers close to beer coolers, and their sales increased dramatically" - http://en.wikipedia.org/wiki/Market_basket 
8 | # However, it is not restricted to the retail industry. Market Basket Analysis (or MBA) can also be used in the insurance industry to look at the various products that an insured currently has, such as car and home coverage, and suggest other possible products such as life, disability, or investment products. 
9 | # MBA is generally considered an unsupervised learning algorithm, in that target variables are usually not specified. However, as you will see later, it is possible to refine the association rules so that specific items can be specified as target variables. 
10 | # MBA is also considered a type of recommender engine, in which purchases of a set of items imply the purchases of others. Certainly MBA and other recommender engines can share the same types of input data. However, MBA was developed before the advent of collaborative filtering techniques as pioneered by Amazon; collaborative filtering is more suggestive of the integration of collected web data, while MBA is more associated with the RFID and bar coding technologies found in scanners. In both cases, suggesting future purchases based on past purchases is the goal. 
11 | # Terminology 
12 | # Critical to the understanding of MBA are the concepts of support, confidence, and lift. These are the measures which evaluate the 'goodness of fit' of a set of association rules. You will also learn some specific definitions that are used in MBA, such as consequence, antecedent, and itemsets. 
13 | # To introduce these concepts, we will first illustrate these terms through a very simplistic example. We will use only 10 transactions (transactions 10 through 19) contained in the "Groceries" transaction file, which is part of the package "arules". 
14 | # To see an example of a market basket, run the following code. Examine the output produced from the 'inspect' function, which prints the items in each market basket. 
15 | rm(list = ls()) 16 | library(arules) 17 | #library(datasets) 18 | 19 | data(Groceries) 20 | str(Groceries) 21 | #RW changed added 22 | summary(Groceries) 23 | 24 | inspect(Groceries[10:19]) 25 | # > items 26 | # > 1 {whole milk, 27 | # > cereals} 28 | # > 2 {tropical fruit, 29 | # > other vegetables, 30 | # > white bread, 31 | # > bottled water, 32 | # > chocolate} 33 | # > 3 {citrus fruit, 34 | # > tropical fruit, 35 | # > whole milk, 36 | # > butter, 37 | # > curd, 38 | # > yogurt, 39 | # > flour, 40 | # > bottled water, 41 | # > dishes} 42 | # > 4 {beef} 43 | # > 5 {frankfurter, 44 | # > rolls/buns, 45 | # > soda} 46 | # > 6 {chicken, 47 | # > tropical fruit} 48 | # > 7 {butter, 49 | # > sugar, 50 | # > fruit/vegetable juice, 51 | # > newspapers} 52 | # > 8 {fruit/vegetable juice} 53 | # > 9 {packaged fruit/vegetables} 54 | # > 10 {chocolate} 55 | # The sample Market Basket 56 | # Each transaction numbered 1-10 listed above represents a basket of items purchased by a shopper. hese are typically all items that are associated with a particular transaction, or invoice. Each basket is enclosed within braces {}, and is referred to as an itemset. An itemset is a group of items that occur together. 57 | # Market Basket algorithms construct rules in the form of 58 | # Itemset{x1,x2,x3 ...} --> Itemset{y1,y2,y3.}. 59 | # 60 | # This notation states that buyers who have purchased items on the left hand side of the formula (lhs) have a propensity to purchase items on the right hand side (rhs). The association is stated using the ??? symbol which can be interpreted as "implies" 61 | # The lhs of the notation is also known as the antecedent, and the rhs is known as the consequence. If nothing appears on either the left hand side or right hand side there is no specific association rule for those items, however it also means that those items has appeared in the basket. 62 | # Association rule algorithms 63 | # Without an association rule algorithm, you are left with the computationally very expensive task of generating all possible pairs of itemsets, and then trying to prune through the best ones yourself. Associate rule algorithms help with filtering this. 64 | # The most popular algorithm for MBA is the apriori algorithm which is contained within the arules package. (The other popular algorithm is the eclat algorithm). 65 | # Running apriori is fairly simple. We will demonstrate this using our demo 10 transaction itemset that we just printed. 66 | # The apriori algorithm is based upon the principle that if a particular itemset is frequent, then all of its subsets must also be frequent. That principle it itself is helpful for reducing the number of itemsets which need to be evaluated, since it only needs to look at the largest items sets first, and then be able to filter down. 67 | # 1. First, some housekeeping. Fix the number of printable digits to 2. 68 | options(digits = 2) 69 | 70 | # 2. Next run the apriori algorithm. However only run it on rows 10 through 19 of the Groceries data set. We want to keep the results very small. Ignore warning messages for now. 
71 | rules <- apriori(Groceries[10:19], parameter = list(supp = 0.1, conf = 0.6)) 72 | summary(rules) 73 | # > Apriori 74 | # > 75 | # > Parameter specification: 76 | # > confidence minval smax arem aval originalSupport support minlen maxlen 77 | # > 0.6 0.1 1 none FALSE TRUE 0.1 1 10 78 | # > target ext 79 | # > rules FALSE 80 | # > 81 | # > Algorithmic control: 82 | # > filter tree heap memopt load sort verbose 83 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 84 | # > 85 | # > Absolute minimum support count: 1 86 | # > Warning in apriori(Groceries[10:19], parameter = list(supp = 0.1, conf = 0.6)): You chose a very low absolute support count of 1. You might run out of memory! Increase minimum support. 87 | # > set item appearances ...[0 item(s)] done [0.00s]. 88 | # > set transactions ...[22 item(s), 10 transaction(s)] done [0.00s]. 89 | # > sorting and recoding items ... [22 item(s)] done [0.00s]. 90 | # > creating transaction tree ... done [0.00s]. 91 | # > checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.00s]. 92 | # > writing ... [2351 rule(s)] done [0.00s]. 93 | # > creating S4 object ... done [0.00s]. 94 | # 3. Sort the rules by support, which is one of the important evaluation metrics. 95 | rules <- sort(rules, by = "support", decreasing = TRUE) # 'high-confidence' rules. 96 | 97 | # 4. Look at the first 5 rules. Observe that it has a support of .2 98 | inspect(head(rules, 5)) 99 | # > lhs rhs support confidence lift 100 | # > 63 {bottled water} => {tropical fruit} 0.2 1.00 3.3 101 | # > 64 {tropical fruit} => {bottled water} 0.2 0.67 3.3 102 | # > 1 {cereals} => {whole milk} 0.1 1.00 5.0 103 | # > 2 {chicken} => {tropical fruit} 0.1 1.00 3.3 104 | # > 3 {soda} => {rolls/buns} 0.1 1.00 10.0 105 | # Antecedents and Descendants 106 | # The rules shown above are expressed as an implication between the antecedent (left hand side) and the Consequence (right hand side) 107 | # The first rule above, in bold, describes about customers who buy bottle water as also buying tropical fruit. The 3rd rule says that customers who buy cereals have a tendency to buy whole milk. 108 | # Evaluating the accuracy of a rule 109 | # Three main metrics have been developed which measure the importance, or accuracy of an association rule: Support, Confidence, and Lift 110 | # Support 111 | # Support measures how frequently the items occur together. Imagine having a shopping cart in which there can be a very large number of combinations of items. Some items which occur rarely could be excluded from the analysis. When an item occurs frequently you will have more confidence in the association among the items, since it will be a more popular item. Often your analysis will be centered around items with high support. 112 | # Calculating Support 113 | # Calculating support is simple. You take the proportion of the number of times that the items in the rule appear in the basket divided by the number of itemsets. 114 | # . We can see that for rule number 1, both bottled water and Tropical fruit appear together in 2 of the itemsets, therefore the support for that rule is 2/10 or 20%. 115 | # . In the above examples, rules number 1 and 2 have the highest support since bottled water appears times out of 10 Tropical fruit 3 times. Bottled water appears two times out of the three time for 2/3 116 | # Confidence 117 | # Confidence is the conditional probability that the event on the right hand side (consequence) will occur, given that the items on the left hand has occurred (antecedent). 
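# (In symbols -- a standard definition, not arules-specific notation:
#    confidence(X => Y) = support(X and Y) / support(X)
# For rule 64 above, support({tropical fruit, bottled water}) = 0.2 and
# support({tropical fruit}) = 0.3, giving a confidence of 0.2/0.3 = 0.67.)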
This is computed by counting the number of occurrences. 118 | # . For example, if we take a closer look at rule # 64, {tropical fruit} => {bottled water}, we can see that tropical fruit occurs in 3 separate itemsets, itemset 2,3 and 6. Therefore the denominator of the formula is 3. 119 | # . Of those 3 itemsets, bottled water occurs in 2 of them (itemsets 2 and 3 only, but NOT 6). So the confidence is 2/3 or 67%.Note that the confidence for the reverse itemset {bottled water} => {tropical fruit} is higher, since every time bottled water is purchased, tropical fruit is purchased as well. You can easily verify that by inspecting and counting the elements by hand. 120 | # Lift 121 | # Lift is determined by dividing the confidence just calculated, by the independent probability of the consequent. In some respects, Lift is a better measure than support or confidence by itself since it incorporates features of both. 122 | # . To calculate the lift for rule 64, we only need to determine the unconditional probability of the consequence. Since bottled water appears 2 out of 10 times (20%) as the consequence, we divide .67 by .20 to yield 3.4, which is the lift for rule 64. 123 | # . When evaluating the lift metric, use 1 as a baseline lift measure, since a lift of 1 implies no relationship between the antecedent and the consequence. 124 | # Preparing the raw data file for analysis. 125 | # Now that we have had a short introduction to the association rules algorithm, we will illustrate applying association rules to a more meaningful example. 126 | # We will be using the Online Retail dataset which can be obtained from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Online+Retail. 127 | # As described by the source, the data is: 128 | # ". A transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers". 129 | # Reading the Transaction file: 130 | # We will input the data using the read.csv function. 131 | # Set stringsAsFactors to FALSE since we will be manipulating the variables as character strings later. 132 | # The code illustrates the capture.output function. The capture.output function will save the metadata for the raw input file. This is done because we want to track changes done to the input, and we want to capture the contents of the same dataframe at different points in time. That will enable us to save the values of metadata and compare them points. 133 | # We can use the file.show function to directly examine the input file if needed. This is sometimes needed if you find that there are errors in the input. It has been commented out in the code, but you are encouraged to try it out yourself. 134 | # The knitr library will be used mostly for the purposes of display output via html via the kable function. If you wish, you can replace kable function calls with head or print functions. 
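# (Before the real input code below, a tiny illustration of that before/after metadata
# comparison idea, using a throw-away data frame -- tmp.df exists only for this sketch:)
tmp.df <- data.frame(a = 1:3, b = letters[1:3])
meta.before <- capture.output(str(tmp.df))
tmp.df$b <- NULL                      # simulate a change to the data
meta.after <- capture.output(str(tmp.df))
setdiff(meta.before, meta.after)      # the lines of metadata that changed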
135 | library(sqldf) 136 | library(knitr) 137 | 138 | setwd("C:/PracticalPredictiveAnalytics/Data") 139 | options(stringsAsFactors = F) 140 | 141 | OnlineRetail <- read.csv("Online Retail.csv", strip.white = TRUE) 142 | setwd("C:/PracticalPredictiveAnalytics/Outputs") 143 | 144 | # Look at the first few records 145 | head(OnlineRetail) 146 | # > InvoiceNo StockCode Description Quantity 147 | # > 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 148 | # > 2 536365 71053 WHITE METAL LANTERN 6 149 | # > 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 150 | # > 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 151 | # > 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 152 | # > 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 153 | # > InvoiceDate UnitPrice CustomerID Country 154 | # > 1 12/1/2010 8:26 2.5 17850 United Kingdom 155 | # > 2 12/1/2010 8:26 3.4 17850 United Kingdom 156 | # > 3 12/1/2010 8:26 2.8 17850 United Kingdom 157 | # > 4 12/1/2010 8:26 3.4 17850 United Kingdom 158 | # > 5 12/1/2010 8:26 3.4 17850 United Kingdom 159 | # > 6 12/1/2010 8:26 7.6 17850 United Kingdom 160 | # file.show('C:/Users/randy/Downloads/Online Retail.csv') NOT RUN 161 | 162 | # Save it in case we need to look at the metadata later on. 163 | 164 | OnlineRetail.Metadata <- capture.output(str(OnlineRetail)) 165 | 166 | # print it now. We can see that the capture.output contains the str function, and that there are 541,909 observations 167 | 168 | 169 | cat(OnlineRetail.Metadata, sep = "\n") 170 | # > 'data.frame': 541909 obs. of 8 variables: 171 | # > $ InvoiceNo : chr "536365" "536365" "536365" "536365" ... 172 | # > $ StockCode : chr "85123A" "71053" "84406B" "84029G" ... 173 | # > $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ... 174 | # > $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ... 175 | # > $ InvoiceDate: chr "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ... 176 | # > $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ... 177 | # > $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ... 178 | # > $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ... 179 | # change stringsasfactors back to TRUE 180 | 181 | options(stringsAsFactors = T) 182 | 183 | 184 | # Analysing the Input file 185 | # After reading in the file, the nrow function, shows that transaction file containing 541909 rows. We can use our handy View function to peruse the contents. Alternatively, you can use the kable function from the knitr library to display a simple tabular display of the dataframe in the console, as indicated below. 186 | # . Note that if you are also using the Rmarkdown package, the output can be formatted to appear as an HTML table in the markdown file. Otherwise it will appear as plain ASCII text. 187 | # . We can also look at the distribution plots of InvoiceDate, although it will need to be transformed to date format, and sorted first. 188 | # # View(OnlineRetail) 189 | nrow(OnlineRetail) 190 | # > [1] 541909 191 | library(knitr) 192 | 193 | InvoiceDate <- gsub(" .*$", "", OnlineRetail$InvoiceDate) 194 | InvoiceDate <- (as.Date(InvoiceDate, format = "%m/%d/%Y")) 195 | InvoiceDate <- sort(InvoiceDate, decreasing = FALSE) 196 | 197 | # . We can see from the head and tail commands, as well as the plots, that the Invoices encompass the period from 12/1/2011 through 12/9/2011. 
We can also see several spikes in orders around the December holiday season. 198 | #RW This was changed!! 199 | kable(head(as.data.frame(InvoiceDate))) 200 | 201 | # 2010-12-01 202 | # 2010-12-01 203 | # 2010-12-01 204 | # 2010-12-01 205 | # 2010-12-01 206 | # 2010-12-01 207 | # 208 | #RW This was changed!! 209 | #kable(tail(InvoiceDate)) 210 | kable(tail(as.data.frame(InvoiceDate))) 211 | 212 | # 2011-12-09 213 | # 2011-12-09 214 | # 2011-12-09 215 | # 2011-12-09 216 | # 2011-12-09 217 | # 2011-12-09 218 | # 219 | # Plotting the dates 220 | par(las = 2) 221 | barplot(table(InvoiceDate), cex.lab = 0.5, cex.main = 0.5, cex.names = 0.5, 222 | col = c("blue")) 223 | # Insert Image B05033_06_01.png 224 | # Scrubbing and Cleaning the data 225 | # Here comes the cleaning part! Unstructured data such as text usually needs to be scrubbed and cleaned before it is in a form that is suitable for analysis. Fortunately, there are several R functions available which help you do that. 226 | # Removing Unneeded character spaces 227 | # We can start by removing leading and trailing blanks from each of the product description, since they add no value to the analysis and just take up extra space. The trimws is a handy function to accomplish this, since it removes both leading and trailing spaces. The nchar function will count the number of bytes in a character string, so we can run this function on OnlineRetail$Description before and after performing the string trim to see how much space we will actually be saving. 228 | sum(nchar(OnlineRetail$Description)) 229 | # > [1] 14284888 230 | OnlineRetail$Description <- trimws(OnlineRetail$Description) 231 | sum(nchar(OnlineRetail$Description)) 232 | # > [1] 14283213 233 | # We see that trimws reduced the size of the Description. Had the number increased after using the function, that would be a clue to check your code! 234 | # Simplifying the Descriptions 235 | # Market basket analysis using color choice analysis is an interesting topic in itself. However, for this analysis, we will remove the colors in order to simplify some of the descriptions. We can use the gsub function to remove some of the specific colors which appear as part of the product description. 236 | #CHANGED NEW CODE 237 | #OnlineRetail.save <- OnlineRetail 238 | #OnlineRetail <- OnlineRetail.save 239 | #View(OnlineRetail.save) 240 | gsub_multiple <- function(from, to, x) { 241 | updated <- x 242 | for (i in 1:length(from)) { 243 | updated <- gsub(from[i], to[i], updated) 244 | } 245 | return(updated) 246 | } 247 | 248 | 249 | OnlineRetail$Description <- gsub_multiple(c("RED","PINK","GREEN","SMALL","MEDIUM","LARGE","JUMBO","STRAWBERRY"), rep("",8), OnlineRetail$Description) 250 | 251 | 252 | View(OnlineRetail) 253 | 254 | 255 | 256 | #RW now not needed? 
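# (An equivalent alternative to the gsub_multiple() helper above is a single alternation
# pattern -- a sketch, equivalent in effect for these fixed words:)
OnlineRetail$Description <- gsub("RED|PINK|GREEN|SMALL|MEDIUM|LARGE|JUMBO|STRAWBERRY",
                                 "", OnlineRetail$Description)
# (The individual gsub() calls below were the earlier, word-by-word version of the same clean-up.)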
257 | 258 | # OnlineRetail$Description2 <- gsub("RED", "", saveit$Description) 259 | # OnlineRetail$Description2 <- gsub("PINK", "", OnlineRetail$Description2) 260 | # OnlineRetail$Description2 <- gsub("GREEN", "", OnlineRetail$Description2) 261 | # OnlineRetail$Description2 <- gsub("SMALL", "", OnlineRetail$Description2) 262 | # OnlineRetail$Description2 <- gsub("MEDIUM", "", OnlineRetail$Description2) 263 | # OnlineRetail$Description2 <- gsub("LARGE", "", OnlineRetail$Description2) 264 | # OnlineRetail$Description2 <- gsub("JUMBO", "", OnlineRetail$Description2) 265 | # OnlineRetail$Description2 <- gsub("PINK", "", OnlineRetail$Description2) 266 | # OnlineRetail$Description2 <- gsub("STRAWBERRY", "", OnlineRetail$Description2) 267 | # OnlineRetail$changed <- ifelse(OnlineRetail$Description==OnlineRetail$Description2,"1","0") 268 | # all.equal(OnlineRetail$Description,OnlineRetail$Description2) 269 | # View(OnlineRetail[87500:875021,]) 270 | View(OnlineRetail) 271 | # Removing colors automatically 272 | # If we want to remove colors automatically, we can also do that as well. The colors() function returns a list of colors which are used in the current palette. We can then perform a little code manipulation in conjunction with the gsub function that we just used, to replace all of the specified colors from OnlineRetail$Description with blanks 273 | # We will also use the kable function, which is contained within the knitr package in order to produce simple html tables of the results. 274 | # compute the length of the field before changes 275 | before <- sum(nchar(OnlineRetail$Description)) 276 | # get the unique colors returned from the colors function, and remove any 277 | # digits found at the end of the string 278 | 279 | # get the unique colors 280 | col2 <- unique(gsub("[0-9]+", "", colors(TRUE))) 281 | 282 | # Filter out any colors with a length > 7. This is just done to reduce the 283 | # #number of colors to the 'popular' ones. We may miss a couple, but we can 284 | # #filter out later. 285 | 286 | for (i in 1:length(col2)) { 287 | col2[i] <- ifelse(nchar(col2[i]) > 7, "", col2[i]) 288 | } 289 | 290 | col2 <- unique(col2) 291 | 292 | 293 | cat("Unique Colors\n") 294 | # > Unique Colors 295 | #RW Changed 296 | kable(head(data.frame(col2), 10)) 297 | 298 | # white 299 | # azure beige bisque black blue brown coral cyan 300 | # Cleaning up the colors 301 | # Clean up the colors a little more by capitalizing all of the colors and inserting a delimiter, and then pass the result to gsub 302 | #RW changed 303 | col <- toupper(paste0(col2, collapse = "|")) 304 | cat("Pass to gsub\n", head(col, 9)) 305 | # > Pass to gsub 306 | # > WHITE||AZURE|BEIGE|BISQUE|BLACK|BLUE|BROWN|CORAL|CYAN|DARKRED|DIMGRAY|GOLD|GRAY|GREEN|HOTPINK|IVORY|KHAKI|LINEN|MAGENTA|MAROON|NAVY|OLDLACE|ORANGE|ORCHID|PERU|PINK|PLUM|PURPLE|RED|SALMON|SIENNA|SKYBLUE|SNOW|TAN|THISTLE|TOMATO|VIOLET|WHEAT|YELLOW 307 | # 308 | # To replace the colors in the dataframe with blanks: 309 | # 310 | OnlineRetail$Description <- gsub(col, "", OnlineRetail$Description) 311 | 312 | # Check the length to see how much was removed. As before, print the character count before and after, to insure that the Description is reduced in size. 313 | after <- sum(nchar(OnlineRetail$Description)) 314 | print(before) 315 | # > [1] 13682222 316 | print(after) 317 | # > [1] 13341097 318 | # Verify that there are no more colors by inspection. 
We will look at the first 5, although you will probably want to look at more, all from different parts of the data set. We will leave "Cream" in for now, but we would remove that as well if we felt it would not help in the analysis. 319 | #RW changed 320 | kable(data.frame(OnlineRetail$Description[1:5])) 321 | # HANGING HEART T-LIGHT HOLDER 322 | # METAL LANTERN 323 | # CREAM CUPID HEARTS COAT HANGER 324 | # KNITTED UNION FLAG HOT WATER BOTTLE 325 | # WOOLLY HOTTIE HEART. 326 | # Filtering out single item transactions 327 | # Since we will want to have a 'basket' of items to perform some association rules on, we will want to filter out the transactions that only have 1 item per invoice. (That might be useful for a separate analysis of customer who only purchased one item). 328 | # . Let's use sqldf to first all of the single item transactions, and then we will create a separate dataframe consisting of the number of items per customer invoice. 329 | 330 | library(sqldf) 331 | 332 | #. Construct a query: How many distinct invoices were there? 333 | # 334 | # sqldf("select count(distinct InvoiceNo) from OnlineRetail") 335 | # # > Loading required package: tcltk 336 | # # > count(distinct InvoiceNo) 337 | # # > 1 25900 338 | # # 339 | # # . How many invoices contain only single transactions? 340 | 341 | single.trans <- sqldf("select InvoiceNo, count(*) as itemcount from OnlineRetail group by InvoiceNo having count(*)==1") 342 | 343 | # . Add them up. This shows us that there are not a lot of single transaction items. 344 | 345 | sum(single.trans$itemcount) 346 | # > [1] 5841 347 | # . SQL Query: How many have multiple transactions? 348 | 349 | x2 <- sqldf("select InvoiceNo, count(*) as itemcount from OnlineRetail group by InvoiceNo having count(*) > 1") 350 | 351 | sum(x2$itemcount) 352 | # > [1] 536068 353 | # 354 | # . Show a tabulation of the number of items per invoice to verify that they all have at least 2 items. 355 | 356 | kable(head(x2)) 357 | # 358 | # 359 | # 360 | # 536365 7 361 | # 536366 2 362 | # 536367 12 363 | # 536368 4 364 | # 536370 20 365 | # 536372 2 366 | # InvoiceNo itemcount 367 | # 536365 7 368 | # 536366 2 369 | # 536367 12 370 | # 536368 4 371 | # 536370 20 372 | # 536372 2 373 | # 374 | # Looking at the distributions 375 | # Now we can take a look at the distribution of the number of items. We can see that there is an average of 27.94 items. This will be a large enough assortment of items to do a meaningful analysis. 376 | mean(x2$itemcount) 377 | # > [1] 27 378 | # 379 | # We can also plot a histogram. The histogram shows a definite spike at a lower item number that we know is not equal to 1 which we have filtered out. We can see what that count is by using the min() function 380 | # 381 | hist(x2$itemcount, breaks = 500, xlim = c(0, 50)) 382 | 383 | # 384 | # 385 | # Insert Image: B05033_06_02.extension 386 | # 387 | # 388 | min(x2$itemcount) 389 | # > [1] 2 390 | # Merging the results back into the original data 391 | # We will want to hold on to the number of total items for each invoice on the original data frame. Merge the number of items contained in each invoice back to the original transactions, with the merge function, using Invoicenum as the key. 392 | # . If you count the number of distinct invoices before and after the merge, you can see that the invoice count is lower than prior to the merge. 
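# (merge() performs an inner join by default, so the single-item invoices that were
# excluded from x2 drop out of the merged result: 25900 - 5841 = 20059 distinct
# invoices, which the counts below confirm.)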
393 | nrow(OnlineRetail)
394 | # > [1] 541909
395 | sqldf("select count(distinct InvoiceNo) from OnlineRetail")
396 | # > count(distinct InvoiceNo)
397 | # > 1 25900
398 | tmp <- merge(OnlineRetail, x2, by = "InvoiceNo")
399 | nrow(tmp)
400 | # > [1] 536068
401 | sqldf("select count(distinct InvoiceNo) from tmp")
402 | # > count(distinct InvoiceNo)
403 | # > 1 20059
404 | # # we can see that the merge filtered out the single-item invoices
405 | #
406 | OnlineRetail <- merge(OnlineRetail, x2, by = "InvoiceNo")
407 | #
408 | # . Print the OnlineRetail table along with the merged itemcount
409 | #
410 | kable(OnlineRetail[1:5, ], padding = 0)
411 | #
412 | # InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country itemcount
414 | # 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.8 17850 United Kingdom 7
415 | # 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 12/1/2010 8:26 7.6 17850 United Kingdom 7
416 | # 536365 85123A HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.5 17850 United Kingdom 7
417 | # 536365 84029E WOOLLY HOTTIE HEART. 6 12/1/2010 8:26 3.4 17850 United Kingdom 7
418 | # 536365 71053 METAL LANTERN 6 12/1/2010 8:26 3.4 17850 United Kingdom 7
419 | #
470 | # Compressing the descriptions using CamelCase
471 | # For long descriptions, it is sometimes beneficial to compress the descriptions into CamelCase to improve readability. This is especially valuable when the descriptions are used as labels on x or y axes.
472 | # CamelCase is a convention which some programmers use for writing compound words: spaces are removed and each word begins with a capital letter. It is also a way of conserving space.
473 | # To accomplish this, we can write a small function called .simpleCap which performs this conversion. To illustrate how it works, we will pass it a two-element character vector c("A certain good book", "A very easy book") and observe the results.
474 | # Custom function to map to camelcase
475 | # This simple example maps the two-element character vector c("A certain good book", "A very easy book") to CamelCase. The vector is mapped to 2 new elements:
476 | # [1] "ACertainGoodBook", and [2] "AVeryEasyBook"
477 |
478 | # change descriptions to camelcase maybe append to itemnumber for uniqueness
479 | .simpleCap <- function(x) {
480 | # s <- strsplit(x, ' ')[[1]]
481 | s <- strsplit(tolower(x), " ")[[1]]
482 |
483 | aa <- paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = " ")
484 | gsub(" ", "", aa, fixed = TRUE)
485 |
486 | }
487 |
488 | a <- c("A certain good book", "A very easy book")
489 | a4 <- gsub(" ", "", .simpleCap(a), fixed = TRUE)
490 | a4
491 | # > [1] "ACertainGoodBook"
492 | lapply(a, .simpleCap)
493 | # > [[1]]
494 | # > [1] "ACertainGoodBook"
495 | # >
496 | # > [[2]]
497 | # > [1] "AVeryEasyBook"
498 | #
499 | # Let's use the .simpleCap function to create a new version of Description from our OnlineRetail dataset, and call it Desc2, which removes the blanks and capitalizes the first letter of each word.
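# (Note: lapply, used in the next step, returns a list column. If a plain
# character vector is preferred for Desc2, a hedged alternative using the same
# .simpleCap function would be:
# OnlineRetail$Desc2 <- vapply(as.character(OnlineRetail$Description),
#                              .simpleCap, FUN.VALUE = character(1), USE.NAMES = FALSE)
# The lapply version is kept below to stay consistent with the rest of the chapter.)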
500 | OnlineRetail$Desc2 <- lapply(as.character(OnlineRetail$Description), .simpleCap)
501 |
502 | kable(OnlineRetail[1:5, c(3, 10)], padding = 0)
503 | # Description Desc2
504 | # CREAM CUPID HEARTS COAT HANGER CreamCupidHeartsCoatHanger
505 | # SET 7 BABUSHKA NESTING BOXES Set7BabushkaNestingBoxes
506 | # HANGING HEART T-LIGHT HOLDER HangingHeartT-lightHolder
507 | # WOOLLY HOTTIE HEART. WoollyHottieHeart.
508 | # METAL LANTERN MetalLantern
509 | # Extracting the 'last word'
510 | # Often the first and last words of product descriptions contain useful information, and sometimes you can use a single word or phrase in place of the original longer description. This may not always be the case, but it is worth trying. In order to extract the last word from the descriptions, we can use the word function from the stringr package.
511 | library(stringr)
512 | OnlineRetail$lastword <- word(OnlineRetail$Description, -1) #supply -1 to extract the last word
513 | OnlineRetail$Description <- trimws(OnlineRetail$Description, "l")
514 | OnlineRetail$firstword <- word(OnlineRetail$Description, 1)
515 | # use head(OnlineRetail) if you are not using R Markdown
516 |
517 | kable(OnlineRetail[1:5, c(3, 10:12)], padding = 0)
533 | # Description Desc2 lastword firstword
534 | # CREAM CUPID HEARTS COAT HANGER CreamCupidHeartsCoatHanger HANGER CREAM
536 | # SET 7 BABUSHKA NESTING BOXES Set7BabushkaNestingBoxes BOXES SET
537 | # HANGING HEART T-LIGHT HOLDER HangingHeartT-lightHolder HOLDER HANGING
538 | # WOOLLY HOTTIE HEART. WoollyHottieHeart. HEART. WOOLLY
539 | # METAL LANTERN MetalLantern LANTERN METAL
540 | #
541 | # In order to see whether this 'lastword' mapping makes any sense, we will sort the results so that we can see the most frequently occurring ending words. We could ultimately use this information to create product subcategories, such as CASES, BAGS, SIGNS, etc.
542 | kable(head(as.data.frame(sort(table(OnlineRetail$lastword[]), decreasing = TRUE)),
543 | 10))
544 |
545 | sort(table(OnlineRetail$lastword[]), decreasing = TRUE)
546 | # 36640
547 | # DESIGN 25557
548 | # HOLDER 13528
549 | # RETROSPOT 13013
550 | # BOX 12939
551 | # SIGN 12210
552 | # CASES 10888
553 | # BAG 9723
554 | # SET 9056
555 | # CHRISTMAS 7868
556 | #
557 | #
558 | nrow(OnlineRetail)
559 | # > [1] 536068
560 | # The lastword description looks usable, so we will keep it as part of the analytics dataset.
561 | # Creating the Test and Training Datasets
562 | # Now that we are finished with our transformations, we will create the training and test data frames. We will perform a 50/50 split between training and test.
563 | # # Take a sample of full vector
564 | #
565 | nrow(OnlineRetail)
566 | # > [1] 536068
567 | pctx <- round(0.5 * nrow(OnlineRetail))
568 | set.seed(1)
569 |
570 | # randomize rows
571 |
572 | df <- OnlineRetail[sample(nrow(OnlineRetail)), ]
573 | rows <- nrow(df)
574 | OnlineRetail <- df[1:pctx, ] #training set
575 | OnlineRetail.test <- df[(pctx + 1):rows, ] #test set
576 | rm(df)
577 |
578 |
579 | # Display the number of rows in the training and test datasets.
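# (One caveat on the row-level split above: sampling individual rows can place
# rows from the same invoice in both the training and test sets. If whole
# invoices should stay together, a hedged sketch of an invoice-level split,
# with full.df standing in for the data frame before it was split, would be:
# invs <- unique(full.df$InvoiceNo)
# train.invs <- sample(invs, round(0.5 * length(invs)))
# train.df <- full.df[full.df$InvoiceNo %in% train.invs, ]
# test.df <- full.df[!full.df$InvoiceNo %in% train.invs, ]
# where train.df and test.df are illustrative names.)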
580 | 581 | nrow(OnlineRetail) 582 | # > [1] 268034 583 | nrow(OnlineRetail.test) 584 | # > [1] 268034 585 | # Saving the results 586 | # It is a good idea to periodically save your data frames, so that you can pick up your analysis from various checkpoints. 587 | # In this example I will first sort them both by InvoiceNo, and then save the test and train data sets to disk, where I can always load them back into memory as needed. 588 | #RW changed 589 | 590 | OnlineRetail <- OnlineRetail[order(OnlineRetail$InvoiceNo), ] 591 | OnlineRetail.test <- OnlineRetail.test[order(OnlineRetail.test$InvoiceNo), ] 592 | # 593 | # # save(OnlineRetail,file='OnlineRetail.full.Rda') 594 | # # save(OnlineRetail.test,file='OnlineRetail.test.Rda') 595 | # 596 | # # load('OnlineRetail.full.Rda') load('OnlineRetail.test.Rda') 597 | # 598 | # nrow(OnlineRetail) 599 | # > [1] 268034 600 | nrow(OnlineRetail.test) 601 | # > [1] 268034 602 | nrow(OnlineRetail) 603 | # > [1] 268034 604 | 605 | # At this point we have prepared our analytics data sets and are ready to move on to the actual analysis. 606 | # If you wish, you can save the entire workspace to disk as follows: 607 | save.image(file = "ch6 part 1.Rdata") 608 | # Loading the Analytics file 609 | # If you are still in a session in which OnlineRetail is still in memory, you are OK! However, if you are picking up where we left off, you will need to load the data that we saved in the last session. Start by setting the working directory and then loading the OnlineRetail training dataset. 610 | #RW temporarily disabling loads and saves and rm's 611 | #rm(list = ls()) 612 | #setwd("C:/PracticalPredictiveAnalytics/Outputs") 613 | #load("ch6 part 1.Rdata") 614 | # works for small data OnlineRetail <- OnlineRetail[1:10000,] 615 | cat(nrow(OnlineRetail), "rows loaded\n") 616 | # > 268034 rows loaded 617 | # The cat function in the previous step should reflect the number of rows in the training data set which is 268034 618 | # Determining the Consequent Rules 619 | # We have seen in the data prep stage that there are a large number of itemsets generated for each invoice. To begin to demonstrate the algorithm, we will extract one representative word from each product description, and use that word as the Consequent (or rhs) to build some association rules. We have already saved the first and last words from each product description. We would examine those words more closely and see if we can filter them to result in a manageable set of transactions. 620 | # Let's first preview the frequency of the first and last word of the descriptions in descending order. That should give us a clue as to what the popular products are. 621 | library(arules) 622 | # > Loading required package: Matrix 623 | # > 624 | # > Attaching package: 'arules' 625 | # > The following objects are masked from 'package:base': 626 | # > 627 | # > abbreviate, write 628 | library(arulesViz) 629 | # > Loading required package: grid 630 | # Print the popular "first words" of the description. 
We will do that by sorting the frequency of "first word" in descending order: 631 | # 632 | kable(head(as.data.frame(sort(table(OnlineRetail$firstword[]), decreasing = TRUE)), 10)) 633 | sort(table(OnlineRetail$firstword[]), decreasing = TRUE) 634 | # SET 17381 635 | # BAG 8720 636 | # LUNCH 7692 637 | # RETROSPOT 7155 638 | # PACK 6861 639 | # VINTAGE 6204 640 | # HEART 4799 641 | # HANGING 4457 642 | # DOORMAT 4175 643 | # REGENCY 3452 644 | # 645 | # Similarly, print the popular "last words" of the description 646 | 647 | kable(head(as.data.frame(sort(table(OnlineRetail$lastword[]), decreasing = TRUE)), 648 | 10)) 649 | sort(table(OnlineRetail$lastword[]), decreasing = TRUE) 650 | # 18376 651 | # DESIGN 12713 652 | # HOLDER 6792 653 | # BOX 6528 654 | # RETROSPOT 6517 655 | # SIGN 6184 656 | # CASES 5465 657 | # BAG 4826 658 | # SET 4418 659 | # CHRISTMAS 3963 660 | # Looking at the popular terms above show that many transactions concern the purchases of Boxes, Cases, Signs, Bags, etc. 661 | # Replacing Missing Values 662 | # For lastword, we see that there are some blank values, so we will replace any blank values with any value found in firstword. 663 | 664 | # replace blank values in lastword, with first word. 665 | 666 | OnlineRetail$lastword <- ifelse(OnlineRetail$lastword == "", OnlineRetail$firstword, 667 | OnlineRetail$lastword) 668 | 669 | 670 | # After we are done with this, we will take another look at the frequencies and observe that the blank values have disappeared. 671 | 672 | head(as.data.frame(sort(table(OnlineRetail$lastword[]), decreasing = TRUE)), 673 | 10) 674 | sort(table(OnlineRetail$lastword[]), decreasing = TRUE) 675 | # > DESIGN 12713 676 | # > HOLDER 6792 677 | # > RETROSPOT 6574 678 | # > BOX 6528 679 | # > SIGN 6184 680 | # > BAG 5761 681 | # > CASES 5465 682 | # > SET 4418 683 | # > HEART 4027 684 | # > CHRISTMAS 4005 685 | # Making the final subset 686 | # Based upon these frequencies, we will filter the data to only include a subset of the top categories. We will exclude some of the terms which do not apply to the physical product, such as "DESIGN", "SET", and any associated colors. 687 | # Testing OnlineRetail2 <- OnlineRetail 688 | OnlineRetail2 <- subset(OnlineRetail, lastword %in% c("BAG", "CASES", "HOLDER", 689 | "BOX", "SIGN", "CHRISTMAS", "BOTTLE", "BUNTING", "MUG", "BOWL", "CANDLES", 690 | "COVER", "HEART", "MUG", "BOWL")) 691 | 692 | # Run the table function again on the results to see the new frequencies. 693 | 694 | head(as.data.frame(sort(table(OnlineRetail2$lastword[]), decreasing = TRUE)), 695 | 10) 696 | sort(table(OnlineRetail2$lastword[]), decreasing = TRUE) 697 | # > HOLDER 6792 698 | # > BOX 6528 699 | # > SIGN 6184 700 | # > BAG 5761 701 | # > CASES 5465 702 | # > HEART 4027 703 | # > CHRISTMAS 4005 704 | # > BOTTLE 3795 705 | # > BUNTING 3066 706 | # > MUG 2900 707 | # 708 | # Use the nrow function to see how much of the data was filtered from the original. 709 | 710 | cat(nrow(OnlineRetail), "Original before subsetting\n") 711 | # > 268034 Original before subsetting 712 | cat(nrow(OnlineRetail2), "After Subsetting\n") 713 | # > 55609 After Subsetting 714 | # Creating the Market Basket Transaction file 715 | # We are almost there! There is extra step that we need to do in order to prepare our data for market basket analysis. 716 | # The association rules package requires that the data be in transaction format. Transactions can either be specified in two different formats: 717 | # 1. One transaction per itemset with an identifier. 
This shows the entire basket on one line, just as we saw with the Groceries data.
718 | # 2. One single item per line, with an identifier.
719 | # Additionally, you can create the actual transaction file in two different ways, by either:
720 | # 1. Physically writing a transactions file
721 | # 2. Coercing a dataframe to transaction format.
722 | # For smaller amounts of data, coercing the dataframe to a transaction file is simpler, but for large transaction files, writing the transaction file first is preferable, since such files can be appended to incrementally from large operational transaction systems. We will illustrate both ways.
723 | # Method 1 - Coercing a dataframe to a transaction file.
724 | # Now we are ready to coerce the data frame. We will create a temporary data frame containing just the transaction id (InvoiceNo), and the descriptor (lastword).
725 | # First we will verify the column names and numbers for these two variables. We can see that they correspond to columns 1 and 11 of the dataframe by first running a colnames function on OnlineRetail2.
726 | colnames(OnlineRetail2)
727 | # > [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
728 | # > [6] "UnitPrice" "CustomerID" "Country" "itemcount" "Desc2"
729 | # > [11] "lastword" "firstword"
730 | #
731 | # As a double check, display the first 5 rows, specifying the indices found above.
732 | kable(head(OnlineRetail2[, c(1, 11)], 5))
733 | # InvoiceNo lastword
734 | # 6 536365 HOLDER
735 | # 45 536370 BOX
736 | # 39 536370 BOX
737 | # 57 536373 HOLDER
738 | # 59 536373 BOTTLE
739 | # Next we will create a temporary data frame from just these two columns, give them names, and eliminate duplicates.
740 | #
741 | # First, create the dataframe with only two columns, named TransactionID and Items.
742 | tmp <- data.frame(OnlineRetail2[, 1], OnlineRetail2[, 11])
743 | names(tmp)[1] <- "TransactionID"
744 | names(tmp)[2] <- "Items"
745 |
746 | tmp <- unique(tmp)
747 | nrow(tmp)
748 | # > [1] 33182
749 | # Verify the results
750 | kable(head(tmp))
751 | # TransactionID Items
752 | # 1 536365 HOLDER
753 | # 2 536370 BOX
754 | # 4 536373 HOLDER
755 | # 5 536373 BOTTLE
756 | # 7 536373 MUG
757 | # 9 536375 HOLDER
758 | # We will now use the split function to group the descriptions (lastword, which is column 2) by InvoiceNo (column 1). The "as" coercion is the critical step, as it converts the results of the split to transaction form.
759 |
760 | trans4 <- as(split(tmp[, 2], tmp[, 1]), "transactions")
761 | # Inspecting the transaction file
762 | # Once the data has been coerced to transaction form, we can use the inspect function to examine the data. Note: when inspecting transaction files, you do not use the normal print or head functions directly, since the objects are in sparse format. You use the "inspect" function instead.
763 | # If you happen to have the tm package loaded (which we will use later), you MUST preface inspect with arules::inspect, since there is also an inspect function in the tm package which serves a different purpose.
764 | # If you run an inspect command on the first 5 records, you can see that the data is in 'basket' format, i.e. each invoice shows the itemset, delimited by {}, that is associated with that invoice.
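# Before inspecting individual transactions, a quick check (using the
# itemLabels function from arules on the trans4 object created above) lists the
# distinct items that made it into the transaction object:
itemLabels(trans4)
# This should return the 13 'lastword' categories kept earlier (BAG, BOTTLE, BOX, and so on).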
765 | arules::inspect(trans4[1:5]) 766 | # > items transactionID 767 | # > 1 {HOLDER} 536365 768 | # > 2 {BOX} 536370 769 | # > 3 {BOTTLE,HOLDER,MUG} 536373 770 | # > 4 {BOTTLE,HOLDER} 536375 771 | # > 5 {HOLDER} 536376 772 | # Another way of displaying transactions, other than using inspect, is to first coerce it to a matrix and then display the items as boolean values. If you have many items you will need to subset the column vectors so that they fit on the screen. 773 | as(trans4, "matrix")[1:5, 1:5] 774 | # > BAG BOTTLE BOWL BOX BUNTING 775 | # > 536365 FALSE FALSE FALSE FALSE FALSE 776 | # > 536370 FALSE FALSE FALSE TRUE FALSE 777 | # > 536373 FALSE TRUE FALSE FALSE FALSE 778 | # > 536375 FALSE TRUE FALSE FALSE FALSE 779 | # > 536376 FALSE FALSE FALSE FALSE FALSE 780 | as(trans4, "matrix")[1:5, 6:ncol(trans4)] 781 | # > CANDLES CASES CHRISTMAS COVER HEART HOLDER MUG SIGN 782 | # > 536365 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 783 | # > 536370 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 784 | # > 536373 FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE 785 | # > 536375 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 786 | # > 536376 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 787 | # Obtaining the topN purchased items 788 | # Even before we run any association rules we can obtain counts of the topN items, as shown in the first plot. 789 | # If we are looking for items with a specified support level, we can include that as a parameter to the function as well. 790 | # The second plot below is another way of plotting the items, which specifies support as a filter, and it indicates higher support for "Bags","Boxes","Cases", and "Holders" 791 | par(mfrow = c(1, 2), bg = "white", col = c("blue")) 792 | 793 | itemFrequencyPlot(trans4, topN = 10, type = "absolute", cex.names = 0.7) 794 | itemFrequencyPlot(trans4, support = 0.2, cex.names = 0.75) 795 | 796 | # Insert Image: B05033_06_03.extension 797 | # 798 | # 799 | # Reset the graphics parameters 800 | dev.off() 801 | # > null device 802 | # > 1 803 | # Finding the association rules 804 | # As shown earlier, the association rules are run using the apriori function. The apriori function has several filtering parameters which are used to control the number of rules which are produced. In our example we will specify the minimum support and confidence threshold that a rule needs to pass in order to be considered. 805 | # The number that you pass to aprior depends upon how you want to look at the rules. It can be an initial screening, or it can be a 'deeper dive', after you have performed several passes. But generally, If we want many rules we can decrease the support and confidence parameters. If we want to focus on items which appear frequently we raise the support threshold. If we want to concentrate on higher quality and more 'accurate' rules, we would raise the confidence level. 806 | # These numbers are not absolutes. They are numbers that you can tweak in order to limit the number of rules that you will look at, versus their quality. 807 | # The minlen=2 parameter is often specified in order to guarantee that there are itemsets included in the left hand side of the rule. 
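# For reference, the standard definitions of the three rule metrics used below are:
#   support(X => Y)    = fraction of transactions containing both X and Y
#   confidence(X => Y) = support(X => Y) / support(X), an estimate of P(Y | X)
#   lift(X => Y)       = confidence(X => Y) / support(Y); lift > 1 means X and Y
#                        appear together more often than expected if they were independent.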
808 | rulesx <- apriori(trans4, parameter = list(minlen = 2, support = 0.02, confidence = 0.01)) 809 | # > Apriori 810 | # > 811 | # > Parameter specification: 812 | # > confidence minval smax arem aval originalSupport support minlen maxlen 813 | # > 0.01 0.1 1 none FALSE TRUE 0.02 2 10 814 | # > target ext 815 | # > rules FALSE 816 | # > 817 | # > Algorithmic control: 818 | # > filter tree heap memopt load sort verbose 819 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 820 | # > 821 | # > Absolute minimum support count: 272 822 | # > 823 | # > set item appearances ...[0 item(s)] done [0.00s]. 824 | # > set transactions ...[13 item(s), 13617 transaction(s)] done [0.00s]. 825 | # > sorting and recoding items ... [13 item(s)] done [0.00s]. 826 | # > creating transaction tree ... done [0.00s]. 827 | # > checking subsets of size 1 2 3 4 5 done [0.00s]. 828 | # > writing ... [829 rule(s)] done [0.00s]. 829 | # > creating S4 object ... done [0.00s]. 830 | # The output from the apriori algorithm will tell us how many rules were generated. If you also perform a str function on rulesx, that will show you all of the sublists that are contained within the rulesx object, which can get complex. Ordinarily the output is sufficient for analyzing the rules, but these sublists can be used programmatically to print the number of rules generated if you vary the paramaters in a loop. For our example, printing the second column of one of the sublists (rulesx@lhs@data@Dim[2]) shows that here have been 829 rules generated. This is a manageable number of rules to look at. 831 | str(rulesx) 832 | # > Formal class 'rules' [package "arules"] with 4 slots 833 | # > ..@ lhs :Formal class 'itemMatrix' [package "arules"] with 3 slots 834 | # > .. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots 835 | # > .. .. .. .. ..@ i : int [1:1752] 8 5 8 7 8 11 8 2 8 4 ... 836 | # > .. .. .. .. ..@ p : int [1:830] 0 1 2 3 4 5 6 7 8 9 ... 837 | # > .. .. .. .. ..@ Dim : int [1:2] 13 829 838 | # > .. .. .. .. ..@ Dimnames:List of 2 839 | # > .. .. .. .. .. ..$ : NULL 840 | # > .. .. .. .. .. ..$ : NULL 841 | # > .. .. .. .. ..@ factors : list() 842 | # > .. .. ..@ itemInfo :'data.frame': 13 obs. of 1 variable: 843 | # > .. .. .. ..$ labels: chr [1:13] "BAG" "BOTTLE" "BOWL" "BOX" ... 844 | # > .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables 845 | # > ..@ rhs :Formal class 'itemMatrix' [package "arules"] with 3 slots 846 | # > .. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots 847 | # > .. .. .. .. ..@ i : int [1:829] 5 8 7 8 11 8 2 8 4 8 ... 848 | # > .. .. .. .. ..@ p : int [1:830] 0 1 2 3 4 5 6 7 8 9 ... 849 | # > .. .. .. .. ..@ Dim : int [1:2] 13 829 850 | # > .. .. .. .. ..@ Dimnames:List of 2 851 | # > .. .. .. .. .. ..$ : NULL 852 | # > .. .. .. .. .. ..$ : NULL 853 | # > .. .. .. .. ..@ factors : list() 854 | # > .. .. ..@ itemInfo :'data.frame': 13 obs. of 1 variable: 855 | # > .. .. .. ..$ labels: chr [1:13] "BAG" "BOTTLE" "BOWL" "BOX" ... 856 | # > .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables 857 | # > ..@ quality:'data.frame': 829 obs. of 3 variables: 858 | # > .. ..$ support : num [1:829] 0.0239 0.0239 0.0206 0.0206 0.0245 ... 859 | # > .. ..$ confidence: num [1:829] 0.278 0.21 0.24 0.129 0.285 ... 860 | # > .. ..$ lift : num [1:829] 2.43 2.43 1.5 1.5 2.03 ... 861 | # > ..@ info :List of 4 862 | # > .. ..$ data : symbol trans4 863 | # > .. ..$ ntransactions: int 13617 864 | # > .. ..$ support : num 0.02 865 | # > .. 
..$ confidence : num 0.01 866 | rulesx@lhs@data@Dim[2] 867 | # > [1] 829 868 | # Examining the rules summary 869 | # The rule length distribution indicates the number of itemsets which appear in both the left and right side of the association. The most frequent number of item is 3, which either means that the purchase of 2 items together, implies purchasing a single item, or conversely, purchasing a single item implies purchasing two additional item. 870 | # Examining the rules quality and observe the highest support 871 | # The quality measures tell you something about the distribution of support, confidence, and lift. The quality function will give the support, list, and confidence for each of the rules. You can also sort the rules by each of these important measures and observe which rules have the highest measures. 872 | # We can see from the low support distribution that there are no particular itemsets which occur more than others. If you inspect the rules sorted by the support level, you will see the highest support level at .046 Notice that this agrees with the Max. support level given in the summary. The top 3 support levels are for those customers who purchase Holders, Boxes, or signs. 873 | summary(rulesx) 874 | # > set of 829 rules 875 | # > 876 | # > rule length distribution (lhs + rhs):sizes 877 | # > 2 3 4 5 878 | # > 156 438 220 15 879 | # > 880 | # > Min. 1st Qu. Median Mean 3rd Qu. Max. 881 | # > 2.000 3.000 3.000 3.113 4.000 5.000 882 | # > 883 | # > summary of quality measures: 884 | # > support confidence lift 885 | # > Min. :0.02005 Min. :0.1293 Min. :1.000 886 | # > 1st Qu.:0.02210 1st Qu.:0.3672 1st Qu.:1.948 887 | # > Median :0.02512 Median :0.5110 Median :2.393 888 | # > Mean :0.02987 Mean :0.5068 Mean :2.444 889 | # > 3rd Qu.:0.03202 3rd Qu.:0.6533 3rd Qu.:2.899 890 | # > Max. :0.09885 Max. :0.8571 Max. :5.419 891 | # > 892 | # > mining info: 893 | # > data ntransactions support confidence 894 | # > trans4 13617 0.02 0.01 895 | head(quality(rulesx)) #also look at the quality measures for each of the rules 896 | # > support confidence lift 897 | # > 1 0.02394066 0.2779199 2.432156 898 | # > 2 0.02394066 0.2095116 2.432156 899 | # > 3 0.02063597 0.2395567 1.500480 900 | # > 4 0.02063597 0.1292548 1.500480 901 | # > 5 0.02452816 0.2847400 2.026819 902 | # > 6 0.02452816 0.1745949 2.026819 903 | tmp <- as.data.frame(inspect(head(sort(rulesx, by = "support"), 10))) 904 | # > lhs rhs support confidence lift 905 | # > 155 {BOX} => {HOLDER} 0.09884703 0.3417111 1.063075 906 | # > 156 {HOLDER} => {BOX} 0.09884703 0.3075166 1.063075 907 | # > 135 {HEART} => {HOLDER} 0.09620328 0.4768839 1.483602 908 | # > 136 {HOLDER} => {HEART} 0.09620328 0.2992918 1.483602 909 | # > 151 {BAG} => {BOX} 0.09304546 0.3859275 1.334139 910 | # > 152 {BOX} => {BAG} 0.09304546 0.3216552 1.334139 911 | # > 149 {SIGN} => {HOLDER} 0.09069545 0.4203540 1.307736 912 | # > 150 {HOLDER} => {SIGN} 0.09069545 0.2821567 1.307736 913 | # > 141 {CASES} => {BOX} 0.08511420 0.4048201 1.399451 914 | # > 142 {BOX} => {CASES} 0.08511420 0.2942371 1.399451 915 | # Confidence and Lift measures. 916 | # Similar to above, sort the rules and examine the highest confidence and lift measures. 
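# (Aside: length() on a rules object returns the same rule count as the slot
# access shown above, and is simpler to use programmatically.)
length(rulesx)
# > [1] 829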
917 | tmp <- as.data.frame(arules::inspect(head(sort(rulesx, by = "confidence"), 10))) 918 | # > lhs rhs support confidence lift 919 | # > 815 {BAG,BOWL,BOX,SIGN} => {HOLDER} 0.02070941 0.8571429 2.666601 920 | # > 631 {BAG,BOTTLE,BOWL} => {BOX} 0.02181097 0.8510029 2.941890 921 | # > 643 {BOWL,HEART,SIGN} => {HOLDER} 0.02151722 0.8492754 2.642125 922 | # > 820 {BAG,BOX,HEART,SIGN} => {HOLDER} 0.02137035 0.8434783 2.624090 923 | # > 632 {BOTTLE,BOWL,BOX} => {BAG} 0.02181097 0.8413598 3.489734 924 | # > 816 {BAG,BOWL,HOLDER,SIGN} => {BOX} 0.02070941 0.8392857 2.901385 925 | # > 817 {BOWL,BOX,HOLDER,SIGN} => {BAG} 0.02070941 0.8392857 3.481131 926 | # > 707 {BOTTLE,HEART,SIGN} => {HOLDER} 0.02144378 0.8366762 2.602929 927 | # > 651 {BAG,BOWL,HEART} => {HOLDER} 0.02115003 0.8347826 2.597038 928 | # > 719 {BOTTLE,CASES,SIGN} => {BAG} 0.02063597 0.8289086 3.438089 929 | tmp <- as.data.frame(arules::inspect(head(sort(rulesx, by = "lift"), 10))) 930 | # > lhs rhs support confidence lift 931 | # > 602 {BAG,HOLDER,SIGN} => {COVER} 0.02012191 0.4667802 5.418710 932 | # > 598 {BAG,BOX,SIGN} => {COVER} 0.02019534 0.4661017 5.410833 933 | # > 819 {BAG,BOX,HOLDER,SIGN} => {BOWL} 0.02070941 0.6698337 5.254105 934 | # > 610 {BAG,BOX,HOLDER} => {COVER} 0.02070941 0.4483307 5.204534 935 | # > 606 {BOX,HOLDER,SIGN} => {COVER} 0.02004847 0.4333333 5.030435 936 | # > 634 {BAG,BOTTLE,BOX} => {BOWL} 0.02181097 0.6279070 4.925236 937 | # > 638 {BAG,HEART,SIGN} => {BOWL} 0.02034222 0.6141907 4.817647 938 | # > 662 {BAG,CASES,SIGN} => {BOWL} 0.02144378 0.6134454 4.811801 939 | # > 159 {BAG,BOWL} => {COVER} 0.02092972 0.4025424 4.672992 940 | # > 670 {CASES,HOLDER,SIGN} => {BOWL} 0.02078284 0.5811088 4.558156 941 | # Filtering a large number of rules 942 | # Once we have the rules built, we can use special subsetting functions to filter items from itemsets on either the left(lhs) or right(rhs) side of the association rule. This is valuable if you are looking for particular items within the itemsets. 943 | # Use the %in% operator to perform an exact match, or %pin% to perform a partial match. 944 | # # to see what 'Christmas' purchases imply. 
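# For comparison, an exact item match (as opposed to the partial %pin% match
# used below) would look like this; exact.rules is just an illustrative name:
# exact.rules <- subset(rulesx, subset = lhs %in% "CHRISTMAS")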
945 | 946 | lhs.rules <- subset(rulesx, subset = lhs %pin% "CHRISTMAS") 947 | lhs.rules 948 | # > set of 44 rules 949 | inspect(lhs.rules) 950 | # > lhs rhs support confidence lift 951 | # > 4 {CHRISTMAS} => {COVER} 0.02063597 0.1292548 1.500480 952 | # > 26 {CHRISTMAS} => {CANDLES} 0.02702504 0.1692732 1.481358 953 | # > 47 {CHRISTMAS} => {MUG} 0.02379379 0.1490340 1.060845 954 | # > 49 {CHRISTMAS} => {BOWL} 0.02680473 0.1678933 1.316937 955 | # > 51 {CHRISTMAS} => {BUNTING} 0.03018286 0.1890524 1.108191 956 | # > 53 {CHRISTMAS} => {BOTTLE} 0.04553132 0.2851886 1.797876 957 | # > 55 {CHRISTMAS} => {HEART} 0.03708600 0.2322907 1.151475 958 | # > 57 {CHRISTMAS} => {CASES} 0.04751414 0.2976081 1.415484 959 | # > 59 {CHRISTMAS} => {SIGN} 0.03547037 0.2221711 1.029715 960 | # > 61 {CHRISTMAS} => {BAG} 0.04663289 0.2920883 1.211504 961 | # > 63 {CHRISTMAS} => {BOX} 0.05500477 0.3445262 1.191016 962 | # > 65 {CHRISTMAS} => {HOLDER} 0.05133289 0.3215271 1.000282 963 | # > 253 {BOTTLE,CHRISTMAS} => {CASES} 0.02210472 0.4854839 2.309058 964 | # > 254 {CASES,CHRISTMAS} => {BOTTLE} 0.02210472 0.4652241 2.932850 965 | # > 256 {BOTTLE,CHRISTMAS} => {BAG} 0.02283910 0.5016129 2.080555 966 | # > 257 {BAG,CHRISTMAS} => {BOTTLE} 0.02283910 0.4897638 3.087552 967 | # > 259 {BOTTLE,CHRISTMAS} => {BOX} 0.02489535 0.5467742 1.890181 968 | # > 260 {BOX,CHRISTMAS} => {BOTTLE} 0.02489535 0.4526035 2.853288 969 | # > 262 {BOTTLE,CHRISTMAS} => {HOLDER} 0.02504223 0.5500000 1.711069 970 | # > 263 {CHRISTMAS,HOLDER} => {BOTTLE} 0.02504223 0.4878398 3.075423 971 | # > 265 {CHRISTMAS,HEART} => {BAG} 0.02004847 0.5405941 2.242239 972 | # > 266 {BAG,CHRISTMAS} => {HEART} 0.02004847 0.4299213 2.131139 973 | # > 268 {CHRISTMAS,HEART} => {BOX} 0.02203128 0.5940594 2.053645 974 | # > 269 {BOX,CHRISTMAS} => {HEART} 0.02203128 0.4005340 1.985465 975 | # > 271 {CHRISTMAS,HEART} => {HOLDER} 0.02247191 0.6059406 1.885102 976 | # > 272 {CHRISTMAS,HOLDER} => {HEART} 0.02247191 0.4377682 2.170036 977 | # > 274 {CASES,CHRISTMAS} => {BAG} 0.02291254 0.4822257 2.000142 978 | # > 275 {BAG,CHRISTMAS} => {CASES} 0.02291254 0.4913386 2.336904 979 | # > 277 {CASES,CHRISTMAS} => {BOX} 0.02548285 0.5363215 1.854047 980 | # > 278 {BOX,CHRISTMAS} => {CASES} 0.02548285 0.4632844 2.203473 981 | # > 280 {CASES,CHRISTMAS} => {HOLDER} 0.02283910 0.4806801 1.495412 982 | # > 281 {CHRISTMAS,HOLDER} => {CASES} 0.02283910 0.4449213 2.116135 983 | # > 283 {CHRISTMAS,SIGN} => {BAG} 0.02137035 0.6024845 2.498943 984 | # > 284 {BAG,CHRISTMAS} => {SIGN} 0.02137035 0.4582677 2.123973 985 | # > 286 {CHRISTMAS,SIGN} => {BOX} 0.02342660 0.6604555 2.283174 986 | # > 287 {BOX,CHRISTMAS} => {SIGN} 0.02342660 0.4259012 1.973961 987 | # > 289 {CHRISTMAS,SIGN} => {HOLDER} 0.02276566 0.6418219 1.996731 988 | # > 290 {CHRISTMAS,HOLDER} => {SIGN} 0.02276566 0.4434907 2.055484 989 | # > 292 {BAG,CHRISTMAS} => {BOX} 0.02687817 0.5763780 1.992521 990 | # > 293 {BOX,CHRISTMAS} => {BAG} 0.02687817 0.4886515 2.026795 991 | # > 295 {BAG,CHRISTMAS} => {HOLDER} 0.02460160 0.5275591 1.641255 992 | # > 296 {CHRISTMAS,HOLDER} => {BAG} 0.02460160 0.4792561 1.987825 993 | # > 298 {BOX,CHRISTMAS} => {HOLDER} 0.02812661 0.5113485 1.590823 994 | # > 299 {CHRISTMAS,HOLDER} => {BOX} 0.02812661 0.5479256 1.894162 995 | # # what purchases yielded Candles? 
996 | 997 | rhs.rules <- subset(rulesx, subset = rhs %pin% c("CANDLES")) 998 | rhs.rules 999 | # > set of 26 rules 1000 | # tmp <- as.data.frame(arules::inspect(head(sort(rhs.rules, by = "support"), 10))) 1001 | # > lhs rhs support confidence lift 1002 | # > 46 {HOLDER} => {CANDLES} 0.04920320 0.1530729 1.339584 1003 | # > 44 {BOX} => {CANDLES} 0.04714695 0.1629855 1.426333 1004 | # > 38 {CASES} => {CANDLES} 0.03958287 0.1882641 1.647552 1005 | # > 42 {BAG} => {CANDLES} 0.03789381 0.1571733 1.375469 1006 | # > 36 {HEART} => {CANDLES} 0.03473599 0.1721878 1.506865 1007 | # > 40 {SIGN} => {CANDLES} 0.03429537 0.1589517 1.391031 1008 | # > 34 {BOTTLE} => {CANDLES} 0.02908130 0.1833333 1.604402 1009 | # > 30 {BOWL} => {CANDLES} 0.02805317 0.2200461 1.925686 1010 | # > 252 {BOX,HOLDER} => {CANDLES} 0.02768598 0.2800892 2.451140 1011 | # > 32 {BUNTING} => {CANDLES} 0.02739223 0.1605682 1.405178 1012 | inspect(head(sort(rhs.rules, by = "confidence"))) 1013 | # > lhs rhs support confidence lift 1014 | # > 219 {BAG,HEART} => {CANDLES} 0.02056253 0.3517588 3.078342 1015 | # > 213 {BOWL,BOX} => {CANDLES} 0.02019534 0.3467844 3.034809 1016 | # > 216 {BOTTLE,HOLDER} => {CANDLES} 0.02026878 0.3183391 2.785876 1017 | # > 222 {BOX,HEART} => {CANDLES} 0.02247191 0.3122449 2.732544 1018 | # > 237 {BAG,SIGN} => {CANDLES} 0.02232503 0.3111566 2.723020 1019 | # > 234 {CASES,HOLDER} => {CANDLES} 0.02232503 0.3095723 2.709156 1020 | # 1021 | # The plot function in arules is also very flexible. You can look at various scatterplots across the various metrics, or even count and group a small number of rules and show the metrics as bubbles. 1022 | 1023 | plot(rhs.rules, method = "scatterplot") 1024 | 1025 | # Insert Image: B05033_06_04.extension 1026 | # 1027 | # 1028 | plot(rhs.rules, method = "grouped") 1029 | 1030 | # Insert Image: B05033_06_05.extension 1031 | # 1032 | # Generating Many Rules 1033 | # If you wish to generate as many rules as possible, set support and confidence to a very low number. 1034 | 1035 | many_rules <- apriori(trans4, parameter = list(minlen = 1, support = 0.01, confidence = 0.01)) 1036 | # > Apriori 1037 | # > 1038 | # > Parameter specification: 1039 | # > confidence minval smax arem aval originalSupport support minlen maxlen 1040 | # > 0.01 0.1 1 none FALSE TRUE 0.01 1 10 1041 | # > target ext 1042 | # > rules FALSE 1043 | # > 1044 | # > Algorithmic control: 1045 | # > filter tree heap memopt load sort verbose 1046 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 1047 | # > 1048 | # > Absolute minimum support count: 136 1049 | # > 1050 | # > set item appearances ...[0 item(s)] done [0.00s]. 1051 | # > set transactions ...[13 item(s), 13617 transaction(s)] done [0.00s]. 1052 | # > sorting and recoding items ... [13 item(s)] done [0.00s]. 1053 | # > creating transaction tree ... done [0.00s]. 1054 | # > checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s]. 1055 | # > writing ... [8898 rule(s)] done [0.00s]. 1056 | # > creating S4 object ... done [0.00s]. 1057 | many_rules 1058 | # > set of 8898 rules 1059 | 1060 | # Plotting many rules 1061 | # Plots are especially helpful for scenarios in which there are many rules generated, and you need to filter on specific support and confidence ranges. 1062 | # 1063 | # Here is a plot which shows 2 of the 3 metrics, along the x and y axis, and the 3rd metric (lift, support, or confidence) as shading. As the plot below suggests, there is a cluster of rules with high lift (>8), high confidence (> .6), but all with low support. 
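# That high-lift, high-confidence cluster can also be extracted directly by
# filtering on the quality columns (a hedged sketch; high.rules is an
# illustrative name and the thresholds are the ones mentioned above):
high.rules <- subset(many_rules, subset = lift > 8 & confidence > 0.6)
high.rules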
1064 | 1065 | sel <- plot(many_rules, measure = c("support", "confidence"), shading = "lift", 1066 | interactive = FALSE) 1067 | 1068 | # Insert Image: B05033_06_06.extension 1069 | # Method 2 - Creating a physical transactions file 1070 | # Now that you know have to run association rules using the "coerce to dataframe" method, we will now illustrate the "write to file" method. 1071 | # . In the write to file method, each item is written to a separate line, along with the identifying key, which in our case is the InvoiceId. 1072 | # . The advantage to the write to file method is that very large data files can be accumulated separately, and them combined together if needed. 1073 | # . You can use the file.show function to display the contents of the file that will be input to the association rules algorithm. 1074 | 1075 | setwd("C:/PracticalPredictiveAnalytics/Outputs") 1076 | #RW is this needed? 1077 | #load("OnlineRetail.full.Rda") 1078 | # OnlineRetail <- OnlineRetail[1:100,] 1079 | nrow(OnlineRetail) 1080 | # > [1] 268034 1081 | head(OnlineRetail) 1082 | # > InvoiceNo StockCode Description Quantity 1083 | # > 5 536365 71053 METAL LANTERN 6 1084 | # > 6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 1085 | # > 2 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 1086 | # > 4 536365 84029E WOOLLY HOTTIE HEART. 6 1087 | # > 1 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 1088 | # > 8 536366 22632 HAND WARMER POLKA DOT 6 1089 | # > InvoiceDate UnitPrice CustomerID Country itemcount 1090 | # > 5 12/1/2010 8:26 3.39 17850 United Kingdom 7 1091 | # > 6 12/1/2010 8:26 4.25 17850 United Kingdom 7 1092 | # > 2 12/1/2010 8:26 7.65 17850 United Kingdom 7 1093 | # > 4 12/1/2010 8:26 3.39 17850 United Kingdom 7 1094 | # > 1 12/1/2010 8:26 2.75 17850 United Kingdom 7 1095 | # > 8 12/1/2010 8:28 1.85 17850 United Kingdom 2 1096 | # > Desc2 lastword firstword 1097 | # > 5 MetalLantern LANTERN METAL 1098 | # > 6 GlassStarFrostedT-lightHolder HOLDER GLASS 1099 | # > 2 Set7BabushkaNestingBoxes BOXES SET 1100 | # > 4 WoollyHottieHeart. HEART. WOOLLY 1101 | # > 1 CreamCupidHeartsCoatHanger HANGER CREAM 1102 | # > 8 HandWarmerPolkaDot DOT HAND 1103 | # concatenate the Invoice Number to the Description separated by a delimiter 1104 | data2 <- paste(OnlineRetail$InvoiceNo, OnlineRetail$Desc2, sep = "!") 1105 | 1106 | # eliminate duplicates 1107 | data2 <- unique(data2) 1108 | # 1109 | 1110 | write(data2, file = "demo_single") 1111 | # file.show('demo_single') 1112 | # Reading the transaction file back in 1113 | # Use the read.transaction to read the delimited file back into memory formatted as a transaction file. This will have the same results as coercing the dataframe to a transaction file as we did earlier. 1114 | # The difference is that the transactions are formatted one transaction per line, as specified by the format='single' option. We also specify where the transactionid and item desriptions are via the cols option. It is possible to have multiple descriptors and transaction id's, and identify them with the cols options. The sep keyword designates the delimiter, which in this case is the ! character. There is also a remove duplicate transactions option, which is a logical value which determines whether or not you want to eliminate duplicates. 1115 | # The returned object, trans, is an itemMatrix. You can type 'trans' to see the dimension, or run the dim(trans) to see the dimensions. This will tell you how many transactions the itemMatrix is based upon. 
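# For reference, each line of the 'demo_single' file has the form InvoiceNo!Item,
# so the first invoice shown earlier would contribute lines such as:
# 536365!MetalLantern
# 536365!GlassStarFrostedT-lightHolder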
1116 | # As before, to view the items in the trans object, use the inspect function. 1117 | library(arules) 1118 | library(arulesViz) 1119 | setwd("C:/PracticalPredictiveAnalytics/Outputs") 1120 | 1121 | # file.show('demo_single') 1122 | trans <- read.transactions("demo_single", format = "single", sep = "!", cols = c(1, 1123 | 2), rm.duplicates = FALSE, quote = "") 1124 | trans 1125 | # > transactions in sparse format with 1126 | # > 19403 transactions (rows) and 1127 | # > 3462 items (columns) 1128 | dim(trans) 1129 | # > [1] 19403 3462 1130 | inspect(trans[1:5]) 1131 | # > items transactionID 1132 | # > 1 {CreamCupidHeartsCoatHanger, 1133 | # > GlassStarFrostedT-lightHolder, 1134 | # > MetalLantern, 1135 | # > Set7BabushkaNestingBoxes, 1136 | # > WoollyHottieHeart.} 536365 1137 | # > 2 {HandWarmerPolkaDot} 536366 1138 | # > 3 {FeltcraftPrincessCharlotteDoll, 1139 | # > KnittedMugCosy, 1140 | # > LoveBuildingBlockWord, 1141 | # > Poppy'sPlayhouseKitchen} 536367 1142 | # > 4 {CoatRackParisFashion} 536368 1143 | # > 5 {CircusParadeLunchBox, 1144 | # > LunchBoxILoveLondon, 1145 | # > MiniJigsawSpaceboy, 1146 | # > PandaAndBunniesStickerSheet, 1147 | # > Postage, 1148 | # > RoundSnackBoxesSetOf4Woodland, 1149 | # > SpaceboyLunchBox, 1150 | # > ToadstoolLedNightLight} 536370 1151 | # Take a look at some of the frequently purchases items. Use the itemFrequencePlot function to see a simple barchart of the top item purchases. 1152 | dim(trans) 1153 | # > [1] 19403 3462 1154 | # look up any item in labels to see if it is there. 1155 | 1156 | itemFrequencyPlot(trans, topN = 10, cex.names = 1) 1157 | 1158 | # Insert Image: B05033_06_07.extension 1159 | # 1160 | # The itemLabels function lists all of the labels associated with the itemset. Since the top ranked item has an unusual abbreviation in it (T-light), you could check to see if there are other items which have that term in it. To accomplish this, use the grep function. 1161 | 1162 | result <- grep("T-light", itemLabels(trans), value = TRUE) 1163 | str(result) 1164 | # > chr [1:96] "6ChocolateLoveHeartT-lights" ... 1165 | head(result) 1166 | # > [1] "6ChocolateLoveHeartT-lights" "AgedGlassSilverT-lightHolder" 1167 | # > [3] "AntiqueSilverT-lightGlass" "AssortedColourT-lightHolder" 1168 | # > [5] "BeadedChandelierT-lightHolder" "BonneJamJarT-lightHolder" 1169 | # Apply the rules engine again. We will use a small support and confidence level to generate many rules. 1170 | rules1 <- apriori(trans, parameter = list(minlen = 1, support = 0.001, confidence = 0.001)) 1171 | # > Apriori 1172 | # > 1173 | # > Parameter specification: 1174 | # > confidence minval smax arem aval originalSupport support minlen maxlen 1175 | # > 0.001 0.1 1 none FALSE TRUE 0.001 1 10 1176 | # > target ext 1177 | # > rules FALSE 1178 | # > 1179 | # > Algorithmic control: 1180 | # > filter tree heap memopt load sort verbose 1181 | # > 0.1 TRUE TRUE FALSE TRUE 2 TRUE 1182 | # > 1183 | # > Absolute minimum support count: 19 1184 | # > 1185 | # > set item appearances ...[0 item(s)] done [0.00s]. 1186 | # > set transactions ...[3462 item(s), 19403 transaction(s)] done [0.03s]. 1187 | # > sorting and recoding items ... [2209 item(s)] done [0.01s]. 1188 | # > creating transaction tree ... done [0.01s]. 1189 | # > checking subsets of size 1 2 3 4 done [0.20s]. 1190 | # > writing ... [63121 rule(s)] done [0.02s]. 1191 | # > creating S4 object ... done [0.02s]. 
1192 | rules1 1193 | # > set of 63121 rules 1194 | # Sort the rules by the 3 measures support, confidence, lift to get an idea of some of the more valuable rules. Sort by confidence,support, and lift to look at the highest scoring rules in each category. 1195 | tmp <- as.data.frame(inspect(tail(sort(rules1, by = "lift")))) 1196 | # > lhs rhs 1197 | # > 38860 {AssortedColourBirdOrnament} => {StorageBagSuki} 1198 | # > 38861 {StorageBagSuki} => {AssortedColourBirdOrnament} 1199 | # > 38893 {BagRetrospot} => {AssortedColourBirdOrnament} 1200 | # > 38892 {AssortedColourBirdOrnament} => {BagRetrospot} 1201 | # > 11539 {AlarmClockBakelike} => {RexCash+carryShopper} 1202 | # > 11538 {RexCash+carryShopper} => {AlarmClockBakelike} 1203 | # > support confidence lift 1204 | # > 38860 0.001030768 0.02724796 0.8976097 1205 | # > 38861 0.001030768 0.03395586 0.8976097 1206 | # > 38893 0.001700768 0.03122044 0.8252999 1207 | # > 38892 0.001700768 0.04495913 0.8252999 1208 | # > 11539 0.001082307 0.01621622 0.7183636 1209 | # > 11538 0.001082307 0.04794521 0.7183636 1210 | tmp <- as.data.frame(inspect(head(sort(rules1, by = "support")))) 1211 | # > lhs rhs support confidence lift 1212 | # > 2207 {} => {HangingHeartT-lightHolder} 0.07256610 0.07256610 1 1213 | # > 2208 {} => {HeartOfWicker} 0.07004072 0.07004072 1 1214 | # > 2209 {} => {AlarmClockBakelike} 0.06674226 0.06674226 1 1215 | # > 2205 {} => {BagRetrospot} 0.05447611 0.05447611 1 1216 | # > 2201 {} => {RegencyCakesd3Tier} 0.05251765 0.05251765 1 1217 | # > 2190 {} => {PartyBunting} 0.04184920 0.04184920 1 1218 | tmp <- as.data.frame(inspect(head(sort(rules1, by = "confidence")))) 1219 | # > lhs rhs support confidence lift 1220 | # > 1 {PolkadotCup, 1221 | # > RetrospotCharlotteBag, 1222 | # > SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.001082307 1.0000000 14.98301 1223 | # > 2 {AlarmClockBakelike, 1224 | # > Charlie+lolaHotWaterBottle, 1225 | # > ChristmasGinghamTree} => {BabushkaNotebook} 0.001185384 0.9200000 48.24530 1226 | # > 3 {ChristmasHangingStarWithBell, 1227 | # > RegencyTeacupAndSaucer} => {AlarmClockBakelike} 0.001133845 0.9166667 13.73443 1228 | # > 4 {PolkadotBowl, 1229 | # > RetrospotCharlotteBag, 1230 | # > SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.001133845 0.9166667 13.73443 1231 | # > 5 {AlarmClockBakelikeChocolate, 1232 | # > PolkadotCup} => {AlarmClockBakelike} 0.001030768 0.9090909 13.62092 1233 | # > 6 {BabushkaNotebook, 1234 | # > Charlie+lolaHotWaterBottle, 1235 | # > HeartMeasuringSpoons} => {AlarmClockBakelike} 0.001339999 0.8965517 13.43304 1236 | # You can also coerce the rules to a dataframe and use kable to print the first 10 rows, or subset as you choose. 
1237 | rules1 <- sort(rules1, by = "confidence") 1238 | rules1.df <- as(rules1, "data.frame") 1239 | cat("using kable to print rules") 1240 | # > using kable to print rules 1241 | library(knitr) 1242 | kable(rules1.df[1:10, ]) 1243 | # 1244 | # Rules support confidence lift 1245 | # 62966 {PolkadotCup,RetrospotCharlotteBag,SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.0010823 1.0000000 14.98301 1246 | # 62971 {AlarmClockBakelike,Charlie+lolaHotWaterBottle,ChristmasGinghamTree} => {BabushkaNotebook} 0.0011854 0.9200000 48.24530 1247 | # 51467 {ChristmasHangingStarWithBell,RegencyTeacupAndSaucer} => {AlarmClockBakelike} 0.0011338 0.9166667 13.73443 1248 | # 62982 {PolkadotBowl,RetrospotCharlotteBag,SpotCeramicDrawerKnob} => {AlarmClockBakelike} 0.0011338 0.9166667 13.73443 1249 | # 51338 {AlarmClockBakelikeChocolate,PolkadotCup} => {AlarmClockBakelike} 0.0010308 0.9090909 13.62092 1250 | # 63058 {BabushkaNotebook,Charlie+lolaHotWaterBottle,HeartMeasuringSpoons} => {AlarmClockBakelike} 0.0013400 0.8965517 13.43304 1251 | # 62970 {BabushkaNotebook,Charlie+lolaHotWaterBottle,ChristmasGinghamTree} => {AlarmClockBakelike} 0.0011854 0.8846154 13.25420 1252 | # 62972 {AlarmClockBakelike,BabushkaNotebook,ChristmasGinghamTree} => {Charlie+lolaHotWaterBottle} 0.0011854 0.8846154 76.62586 1253 | # 51347 {AlarmClockBakelikeChocolate,DinerWallClock} => {AlarmClockBakelike} 0.0011338 0.8800000 13.18505 1254 | # 51356 {AlarmClockBakelikeChocolate,BoxOf24CocktailParasols} => {AlarmClockBakelike} 0.0011338 0.8800000 13.18505 1255 | # Plotting the rules 1256 | # The default plot of the rules will give you a scatterplot of all of the rules showing support on the x-axis and confidence on the y-axis. We can see from the density that confidence can vary from high to low with most of the density occurring at the .5 level. Support tends to be low, and the highest support level attains ~.07. 1257 | plot(rules1) 1258 | 1259 | # Insert Image: B05033_06_08.extension 1260 | # Creating subsets of the rules 1261 | # As we did before, we can look at some of the subsets by parsing left or right side. 1262 | # For example, we might be interested in seeing what items yielded purchasing Chocolate things. 1263 | # . Subset the rules set using the %pin% operator (Partial match), and look for any transactions where "Chocolate" appears in the right hand side. 1264 | 1265 | purchased.this <- "Chocolate" 1266 | 1267 | lhs.rules <- subset(rules1, subset = rhs %pin% purchased.this) 1268 | 1269 | # . Printing lhs.rules shows that there are 487 of them. 1270 | # 1271 | print(lhs.rules) 1272 | # > set of 487 rules 1273 | # 1274 | # . Sort them by lift, inspect them, and plot the first 15 as a graph. 1275 | # 1276 | lhs.rules <- sort(lhs.rules, by = "lift") 1277 | 1278 | inspect(head(sort(lhs.rules, by = "lift"))) 1279 | # > lhs rhs support confidence lift 1280 | # > 1 {CakeTowelSpots} => {CakeTowelChocolateSpots} 0.001185384 0.2911392 89.66626 1281 | # > 2 {BiscuitsBowlLight, 1282 | # > DollyMixDesignBowl} => {ChocolatesBowl} 0.001030768 0.4444444 68.98844 1283 | # > 3 {BakingMouldHeartChocolate} => {BakingMouldHeartMilkChocolate} 0.001030768 0.2941176 57.64409 1284 | # > 4 {BakingMouldHeartMilkChocolate} => {BakingMouldHeartChocolate} 0.001030768 0.2020202 57.64409 1285 | # > 5 {BiscuitsBowlLight} => {ChocolatesBowl} 0.001700768 0.3586957 55.67817 1286 | # > 6 {MarshmallowsBowl} => {ChocolatesBowl} 0.002422306 0.2397959 37.22208 1287 | # 1288 | # . 
The directional graph is a good way to illustrate which purchases influence other purchases once you have narrowed the itemsets down to a small number.
1289 | # . The DollyMixDesignBowl and MarshmallowsBowl appear as larger and darker bubbles, indicating that they are better predictors for a "Chocolate" purchase relative to support, confidence and lift.
1290 | plot(lhs.rules[1:15], method = "graph", control = list(type = "items", cex = 0.5))
1291 |
1292 | # Insert Image: B05033_06_09.extension
1293 | #
1294 | # . Finally, if you wish to save your workspace, use the save.image command.
1295 | save.image(file = "ch6 part 2.Rdata")
1296 |
1297 | # Text Clustering
1298 | # In the previous sections we used the lastword technique for categorizing the types of purchases by simple keywords. We could also use more sophisticated techniques, such as word clustering, to try to identify which types of purchasing clusters occur, and then use that to subset the association rules. To illustrate text clustering on our OnlineRetail dataset we will first need to load the training and test dataframes that we previously saved. Also, issue a set.seed command, since we will be doing some sampling later on.
1299 | #setwd("C:/Users/randy/Desktop/ch6")
1300 | # load the training data
1301 | #load("OnlineRetail.full.Rda")
1302 | set.seed(1)
1303 |
1304 | # We previously demonstrated some text mining examples using a package called RTextTools. Another popular text mining package is tm. tm has been around for a long time, and it will be useful to know how this package works. tm requires that all text data be converted to a corpus first. That can be done using the VCorpus function. We can use vector input, since we already have the data available in an existing data frame, and there is no need to read in additional external data.
1305 | library(tm)
1306 | # > Loading required package: NLP
1307 | attach(OnlineRetail)
1308 | nrow(OnlineRetail)
1309 | # > [1] 268034
1310 | corp <- VCorpus(VectorSource(OnlineRetail$Description))
1311 |
1312 | # Displaying the corp object shows you some information about the metadata
1313 |
1314 | head(corp)
1315 | # > <>
1316 | # > Metadata: corpus specific: 0, document level (indexed): 0
1317 | # > Content: documents: 6
1318 | # Converting to a Document Term Matrix
1319 | # Once we have a corpus, we can proceed to convert it to a document term matrix (DTM). When building a DTM, care must be given to limiting the amount of data and the number of resulting terms that are processed. If not parameterized correctly, it can take a very long time to run. Parameterization is accomplished via the control options. We will remove any stopwords, punctuation and numbers. Additionally, we will only include terms with a minimum word length of 4.
1320 |
1321 | library(tm)
1322 | dtm <- DocumentTermMatrix(corp, control = list(removePunctuation = TRUE, wordLengths = c(4,
1323 | 999), stopwords = TRUE, removeNumbers = TRUE, stemming = FALSE, bounds = list(global = c(5,
1324 | Inf))))
1325 | save.image(file = "OnlineRetail-dtm.Rdata")
1326 |
1327 |
1328 | # We can begin to look at the data by using the inspect function.
Note: this is different from the inspect function in arules, and if you have the arules package loaded, you will want to preface this inspect with tm::inspect 1329 | #RW put a save file here since it is the longest running part 1330 | 1331 | inspect(dtm[1:10, 1:10]) 1332 | # > <> 1333 | # > Non-/sparse entries: 0/100 1334 | # > Sparsity : 100% 1335 | # > Maximal term length: 8 1336 | # > Weighting : term frequency (tf) 1337 | # > 1338 | # > Terms 1339 | # > Docs abstract acapulco account acrylic address adult advent afghan aged 1340 | # > 1 0 0 0 0 0 0 0 0 0 1341 | # > 2 0 0 0 0 0 0 0 0 0 1342 | # > 3 0 0 0 0 0 0 0 0 0 1343 | # > 4 0 0 0 0 0 0 0 0 0 1344 | # > 5 0 0 0 0 0 0 0 0 0 1345 | # > 6 0 0 0 0 0 0 0 0 0 1346 | # > 7 0 0 0 0 0 0 0 0 0 1347 | # > 8 0 0 0 0 0 0 0 0 0 1348 | # > 9 0 0 0 0 0 0 0 0 0 1349 | # > 10 0 0 0 0 0 0 0 0 0 1350 | # > Terms 1351 | # > Docs ahoy 1352 | # > 1 0 1353 | # > 2 0 1354 | # > 3 0 1355 | # > 4 0 1356 | # > 5 0 1357 | # > 6 0 1358 | # > 7 0 1359 | # > 8 0 1360 | # > 9 0 1361 | # > 10 0 1362 | # After the DTM has been created, we can look at the metadata that has been producted by issuing a print(dtm) command. We can see the number of documents and terms by looking at the first line. 1363 | print(dtm) 1364 | # > <> 1365 | # > Non-/sparse entries: 826898/448130052 1366 | # > Sparsity : 100% 1367 | # > Maximal term length: 20 1368 | # > Weighting : term frequency (tf) 1369 | # Remove Sparse Terms 1370 | # Most TDM's are initially filled with a lot of empty space. That is because every word in a corpus is indexed, and there are many words which occur so infrequently, that they do not matter analytically. Removing sparse terms is a method in which we can reduce the number of terms to a manageable size, and also save space at the same time. 1371 | # The removeSparseTerms function will reduce the number of terms in the Description from 268034 to 62. 1372 | dtms <- removeSparseTerms(dtm, 0.99) 1373 | dim(dtms) 1374 | # > [1] 268034 62 1375 | 1376 | # As an alternative to inspect, we can also view it in matrix form 1377 | 1378 | View(as.matrix(dtms)) 1379 | 1380 | 1381 | 1382 | # Insert Image: B05033_06_10.extension 1383 | # Finding Frequent Terms 1384 | # The tm package has a useful function called findFreqTerms which is useful to find the frequency of the popular terms used. The second argument to the function restricts the results to terms which have a minimum frequency specified. We can also compute the occurrences by summing up the 1's and 0's for each term in the TDM. Then we can sort the list and display the highest and lowest frequency occurrences. 1385 | # 1386 | data.frame(findFreqTerms(dtms, 10000, Inf)) 1387 | # > findFreqTerms.dtms..10000..Inf. 
1388 | # > 1 cake 1389 | # > 2 christmas 1390 | # > 3 design 1391 | # > 4 heart 1392 | # > 5 metal 1393 | # > 6 retrospot 1394 | # > 7 vintage 1395 | freq <- colSums(as.matrix(dtms)) 1396 | # there are xx terms 1397 | length(freq) 1398 | # > [1] 62 1399 | ord <- order(freq) 1400 | # look at the top and bottom number of terms 1401 | freq[head(ord, 12)] 1402 | # > union skull zinc bird wood wall birthday 1403 | # > 2752 2770 2837 2974 2993 3042 3069 1404 | # > colour charlotte star antique silver 1405 | # > 3089 3114 3121 3155 3175 1406 | freq[tail(ord, 10)] 1407 | # > hanging sign lunch metal cake christmas design 1408 | # > 8437 8580 9107 10478 10623 12534 14884 1409 | # > vintage retrospot heart 1410 | # > 16755 17445 19520 1411 | # 1412 | # For presentation purposes, a barplot is also useful for displaying the relative frequencies. 1413 | # 1414 | barplot(freq[tail(ord, 10)], cex.names = 0.75, col = c("blue")) 1415 | 1416 | # B05033_06_11.pngWe could also do a little code manipulation to only display the topN most frequent term. 1417 | 1418 | dtmx <- dtms[, names(tail(sort(colSums(as.matrix(dtms))), 12))] 1419 | inspect(dtmx[1:10, ]) 1420 | # > <> 1421 | # > Non-/sparse entries: 3/117 1422 | # > Sparsity : 98% 1423 | # > Maximal term length: 9 1424 | # > Weighting : term frequency (tf) 1425 | # > 1426 | # > Terms 1427 | # > Docs pack holder hanging sign lunch metal cake christmas design vintage 1428 | # > 1 0 0 0 0 0 1 0 0 0 0 1429 | # > 2 0 1 0 0 0 0 0 0 0 0 1430 | # > 3 0 0 0 0 0 0 0 0 0 0 1431 | # > 4 0 0 0 0 0 0 0 0 0 0 1432 | # > 5 0 0 0 0 0 0 0 0 0 0 1433 | # > 6 0 0 0 0 0 0 0 0 0 0 1434 | # > 7 0 0 0 0 0 0 0 0 0 0 1435 | # > 8 0 0 0 0 0 0 0 0 0 0 1436 | # > 9 0 0 0 0 0 0 0 0 0 0 1437 | # > 10 0 0 0 0 0 0 0 0 0 0 1438 | # > Terms 1439 | # > Docs retrospot heart 1440 | # > 1 0 0 1441 | # > 2 0 0 1442 | # > 3 0 0 1443 | # > 4 0 1 1444 | # > 5 0 0 1445 | # > 6 0 0 1446 | # > 7 0 0 1447 | # > 8 0 0 1448 | # > 9 0 0 1449 | # > 10 0 0 1450 | # Kmeans clustering of terms 1451 | # Now we can cluster the Term Document Matrix using Kmeans. We will specify that 5 clusters be generated. 1452 | 1453 | kmeans5 <- kmeans(dtms, 5) 1454 | 1455 | # Once kmeans is done, we will append the cluster number to the original data, and then create 5 subsets based upon the cluster. 1456 | 1457 | kw_with_cluster <- as.data.frame(cbind(OnlineRetail, Cluster = kmeans5$cluster)) 1458 | 1459 | # subset the five clusters 1460 | cluster1 <- subset(kw_with_cluster, subset = Cluster == 1) 1461 | cluster2 <- subset(kw_with_cluster, subset = Cluster == 2) 1462 | cluster3 <- subset(kw_with_cluster, subset = Cluster == 3) 1463 | cluster4 <- subset(kw_with_cluster, subset = Cluster == 4) 1464 | cluster5 <- subset(kw_with_cluster, subset = Cluster == 5) 1465 | # Examining Cluster 1 1466 | # 1467 | # Print out a sample of the data. 1468 | 1469 | head(cluster1[10:13]) 1470 | # Desc2 lastword firstword Cluster 1471 | # 50 VintageBillboardLove/hateMug MUG VINTAGE 1 1472 | # 86 BagVintagePaisley PAISLEY BAG 1 1473 | # 113 ShopperVintagePaisley PAISLEY SHOPPER 1 1474 | # 145 ShopperVintagePaisley PAISLEY SHOPPER 1 1475 | # 200 VintageHeadsAndTailsCardGame GAME VINTAGE 1 1476 | # 210 PaperChainKitVintageChristmas CHRISTMAS PAPER 1 1477 | # 1478 | # Table the frequencies and print out the most popular terms in the clusters. Observe that many of these items have to do with Chrismas and Paisley items, which seem to occur together. 
1479 | # 1480 | # tail(sort(table(cluster1$lastword)), 10) 1481 | # MUG GAME BUNTING CARDS DESIGN LEAF 1482 | # 427 431 456 482 535 717 911 1483 | # DOILY PAISLEY CHRISTMAS 1484 | # 1073 1699 1844 1485 | # 1486 | # Examining Cluster 2 1487 | # 1488 | # After tabling Cluster 2 below, it looks like this cluster has something to do with "Hanging Holders" 1489 | head(cluster2[10:13]) 1490 | # Desc2 lastword firstword Cluster 1491 | # 6 GlassStarFrostedT-lightHolder HOLDER GLASS 2 1492 | # 57 HangingHeartT-lightHolder HOLDER HANGING 2 1493 | # 62 GlassStarFrostedT-lightHolder HOLDER GLASS 2 1494 | # 70 HangingHeartT-lightHolder HOLDER HANGING 2 1495 | # 81 HangingHeartT-lightHolder HOLDER HANGING 2 1496 | # 156 ColourGlassT-lightHolderHanging HANGING COLOUR 2 1497 | # 1498 | 1499 | tail(sort(table(cluster2$lastword)), 10) 1500 | # ANTIQUE LANTERN HLDR DECORATION GLASS T-LIGHT HANGING HEART HOLDER 1501 | # 167 181 226 260 319 361 531 639 668 6792 1502 | # 1503 | # Examining Cluster 3 1504 | # 1505 | # Cluster 3 may have to do with customers purchasing sets of design boxes. 1506 | 1507 | head(cluster3[10:13]) 1508 | # Desc2 lastword firstword Cluster 1509 | # 5 MetalLantern LANTERN METAL 3 1510 | # 2 Set7BabushkaNestingBoxes BOXES SET 3 1511 | # 4 WoollyHottieHeart. HEART. WOOLLY 3 1512 | # 1 CreamCupidHeartsCoatHanger HANGER CREAM 3 1513 | # 8 HandWarmerPolkaDot DOT HAND 3 1514 | # 10 FeltcraftPrincessCharlotteDoll DOLL FELTCRAFT 3 1515 | 1516 | tail(sort(table(cluster3$lastword)),10) 1517 | head(cluster4[10:13]) 1518 | # Desc2 lastword firstword Cluster 1519 | # 100 LunchBoxWithCutleryRetrospot RETROSPOT LUNCH 4 1520 | # 102 PackOf72RetrospotCakeCases CASES PACK 4 1521 | # 84 60TeatimeFairyCakeCases CASES 60 4 1522 | # 94 3PieceRetrospotCutlerySet SET 3 4 1523 | # 91 LunchBagRetrospot RETROSPOT LUNCH 4 1524 | # 127 RetrospotMilkJug JUG RETROSPOT 4 1525 | 1526 | tail(sort(table(cluster4$lastword)), 10) 1527 | # APRON NAPKINS BANK PC TINS DESIGN 1528 | # 486 511 512 514 531 664 1203 1529 | # BAG CASES RETROSPOT 1530 | # 1395 4318 6485 1531 | # Examining Cluster 5 1532 | # 1533 | # Finally, Cluster 5 seems to be concerned with the purchases of Bottles. Possible having to do with perfumes, elixir's and tonics. 1534 | 1535 | head(cluster5) 1536 | # > head(cluster5[10:13]) 1537 | # Desc2 lastword firstword Cluster 1538 | # 59 KnittedUnionFlagHotWaterBottle BOTTLE KNITTED 5 1539 | # 68 KnittedUnionFlagHotWaterBottle BOTTLE KNITTED 5 1540 | # 181 AssortedBottleTopMagnets MAGNETS ASSORTED 5 1541 | # 206 EnglishRoseHotWaterBottle BOTTLE ENGLISH 5 1542 | # 229 RetrospotHeartHotWaterBottle BOTTLE RETROSPOT 5 1543 | # 261 HotWaterBottleTeaAndSympathy SYMPATHY HOT 5 1544 | # 1545 | 1546 | tail(sort(table(cluster5$lastword)), 10) 1547 | 1548 | # BABUSHKA MAGNETS TONIC ELIXIR PERFUME OPENER SYMPATHY POORLY CALM BOTTLE 1549 | # 87 95 108 112 132 149 318 345 418 3795 1550 | # Creating a DTM from the test data set, and calculating the new cluster assignments 1551 | # The goal in this exercise is to score the test dataset based upon the predict method for the training dataset. 1552 | # First, let's read in our test data set and create a document term matrix. 
In this example we will use the create_matrix function from RTextTools which can create a TDM first without having a separate step of creating a corpus 1553 | 1554 | #load("OnlineRetail.test.Rda") 1555 | library(RTextTools) 1556 | # > Loading required package: SparseM 1557 | # > 1558 | # > Attaching package: 'SparseM' 1559 | # > The following object is masked from 'package:base': 1560 | # > 1561 | # > backsolve 1562 | dtMatrixTest <- create_matrix(OnlineRetail.test$Description, minDocFreq = 1, 1563 | removeNumbers = TRUE, minWordLength = 4, removeStopwords = TRUE, removePunctuation = TRUE, 1564 | stemWords = FALSE, weighting = weightTf) 1565 | 1566 | # As we did before for the training dataset, remove the sparse terms. 1567 | dtMatrixTest <- removeSparseTerms(dtMatrixTest, 0.99) 1568 | dtMatrixTest 1569 | # > <> 1570 | # > Non-/sparse entries: 349663/16000411 1571 | # > Sparsity : 98% 1572 | # > Maximal term length: 10 1573 | # > Weighting : term frequency (tf) 1574 | # kmeans function does not have a prediction method. However we can use the flexclust package which does. Since the prediction method can take a long time to run, we will illustrate it only on a sample number of rows and columns. In order to compare the test and training results, they also need to have the same number of columns. Reload the test and training data we have saved, and create the TDM's from scratch, based upon the sample.size 1575 | 1576 | set.seed(1) 1577 | sample.size <- 10000 1578 | max.cols <- 10 1579 | 1580 | library("flexclust") 1581 | #load("OnlineRetail.full.Rda") 1582 | 1583 | 1584 | # Create the sample 1585 | OnlineRetail <- OnlineRetail[1:sample.size, ] 1586 | require(tm) 1587 | library(RTextTools) 1588 | # Create the DTM for the training data 1589 | 1590 | dtMatrix <- create_matrix(OnlineRetail$Description, minDocFreq = 1, removeNumbers = TRUE, 1591 | minWordLength = 4, removeStopwords = TRUE, removePunctuation = TRUE, stemWords = FALSE, 1592 | weighting = weightTf) 1593 | # Check the dimensions of the data. We can see that there are 1300 terms. Virtually all of them are space terms so we will remove them from the matrix. 1594 | 1595 | dim(dtMatrix) 1596 | # > [1] 10000 1300 1597 | 1598 | dtMatrix <- removeSparseTerms(dtMatrix, 0.99) 1599 | 1600 | # Removing the sparse terms reduces the number of columns from 1300 to 62! 1601 | 1602 | dim(dtMatrix) 1603 | # > [1] 10000 62 1604 | # We will only keep same number of terms (max.cols) in test and train. 1605 | 1606 | dtMatrix <- dtMatrix[, 1:max.cols] 1607 | 1608 | # repeat kmeans using the kcca function. 
Clusters=5 1609 | 1610 | clust1 = kcca(dtMatrix, k = 5, kccaFamily("kmeans")) 1611 | clust1 1612 | # > kcca object of family 'kmeans' 1613 | # > 1614 | # > call: 1615 | # > kcca(x = dtMatrix, k = 5, family = kccaFamily("kmeans")) 1616 | # > 1617 | # > cluster sizes: 1618 | # > 1619 | # > 1 2 3 4 5 1620 | # > 360 120 152 387 8981 1621 | # 1622 | # 1623 | # Print the number of products categorized in each cluster 1624 | 1625 | table(clust1@cluster) 1626 | # > 1627 | # > 1 2 3 4 5 1628 | # > 360 120 152 387 8981 1629 | # 1630 | # Merge the clusters with the training data, and show the cluster assigned to each 1631 | 1632 | kw_with_cluster2 <- as.data.frame(cbind(OnlineRetail, Cluster = clust1@cluster)) 1633 | 1634 | head(kw_with_cluster2) 1635 | # > InvoiceNo StockCode Description Quantity 1636 | # > 5 536365 71053 METAL LANTERN 6 1637 | # > 6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 1638 | # > 2 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 1639 | # > 4 536365 84029E WOOLLY HOTTIE HEART. 6 1640 | # > 1 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 1641 | # > 8 536366 22632 HAND WARMER POLKA DOT 6 1642 | # > InvoiceDate UnitPrice CustomerID Country itemcount 1643 | # > 5 12/1/2010 8:26 3.39 17850 United Kingdom 7 1644 | # > 6 12/1/2010 8:26 4.25 17850 United Kingdom 7 1645 | # > 2 12/1/2010 8:26 7.65 17850 United Kingdom 7 1646 | # > 4 12/1/2010 8:26 3.39 17850 United Kingdom 7 1647 | # > 1 12/1/2010 8:26 2.75 17850 United Kingdom 7 1648 | # > 8 12/1/2010 8:28 1.85 17850 United Kingdom 2 1649 | # > Desc2 lastword firstword Cluster 1650 | # > 5 MetalLantern LANTERN METAL 5 1651 | # > 6 GlassStarFrostedT-lightHolder HOLDER GLASS 5 1652 | # > 2 Set7BabushkaNestingBoxes BOXES SET 5 1653 | # > 4 WoollyHottieHeart. HEART. WOOLLY 5 1654 | # > 1 CreamCupidHeartsCoatHanger HANGER CREAM 5 1655 | # > 8 HandWarmerPolkaDot DOT HAND 5 1656 | # Run the predict method on the training set. We will eventually apply it to the test data 1657 | pred_train <- predict(clust1) 1658 | 1659 | # Load the test data set, take an identical sample size as was taken for the training data, and repeat the procedure starting with creating the term document matrix on the sample. 
1660 | # 1661 | #load("OnlineRetail.test.Rda") 1662 | OnlineRetail.test <- OnlineRetail.test[1:sample.size, ] 1663 | dtMatrix.test <- create_matrix(OnlineRetail.test$Description, minDocFreq = 1, 1664 | removeNumbers = TRUE, minWordLength = 4, removeStopwords = TRUE, removePunctuation = TRUE, 1665 | stemWords = FALSE, weighting = weightTf) 1666 | # Remove sparse terms 1667 | dtMatrix.test <- removeSparseTerms(dtMatrix.test, 0.99) 1668 | 1669 | # reduced to 61 terms 1670 | dim(dtMatrix.test) 1671 | # > [1] 10000 61 1672 | # Take the first max.col terms 1673 | dtMatrix.test <- dtMatrix.test[, 1:max.cols] 1674 | 1675 | dtMatrix.test 1676 | # > <> 1677 | # > Non-/sparse entries: 2072/97928 1678 | # > Sparsity : 98% 1679 | # > Maximal term length: 8 1680 | # > Weighting : term frequency (tf) 1681 | # Verify that the test and training data have the same number of dimensions 1682 | dim(dtMatrix) 1683 | # > [1] 10000 10 1684 | dim(dtMatrix.test) 1685 | # > [1] 10000 10 1686 | # Run prediction function on training data, and apply it to the test data 1687 | 1688 | 1689 | pred_test <- predict(clust1, newdata = dtMatrix.test) 1690 | 1691 | # Table the cluster assignments for the test dat 1692 | 1693 | table(pred_test) 1694 | # > pred_test 1695 | # > 1 2 3 4 5 1696 | # > 171 113 201 146 9369 1697 | # 1698 | # Finally, merge the clusters with the test data, and show the cluster categories assigned to each. For this demonstration, display two transactions for each cluster 1699 | kw_with_cluster2_score <- as.data.frame(cbind(OnlineRetail.test, Cluster=pred_test)) 1700 | head(kw_with_cluster2_score) 1701 | clust1.score=head(subset(kw_with_cluster2_score,Cluster==1),2) 1702 | clust2.score=head(subset(kw_with_cluster2_score,Cluster==2),2) 1703 | clust3.score=head(subset(kw_with_cluster2_score,Cluster==3),2) 1704 | clust4.score=head(subset(kw_with_cluster2_score,Cluster==4),2) 1705 | clust5.score=head(subset(kw_with_cluster2_score,Cluster==5),2) 1706 | head(clust1.score[,10:13]) 1707 | head(clust2.score[,10:13]) 1708 | head(clust3.score[,10:13]) 1709 | head(clust4.score[,10:13]) 1710 | head(clust5.score[,10:13]) 1711 | 1712 | # > head(clust1.score[,10:13]) 1713 | # Desc2 lastword firstword Cluster 1714 | # 89 PackOf60PaisleyCakeCases CASES PACK 1 1715 | # 96 PackOf60DinosaurCakeCases CASES PACK 1 1716 | # > head(clust2.score[,10:13]) 1717 | # Desc2 lastword firstword Cluster 1718 | # 61 WoodenFrameAntique WOODEN 2 1719 | # 140 AntiqueGlassDressingTablePot POT ANTIQUE 2 1720 | # > head(clust3.score[,10:13]) 1721 | # Desc2 lastword firstword Cluster 1722 | # 148 3TierCakeTinAndCream CREAM 3 3 1723 | # 143 3TierCakeTinAndCream CREAM 3 3 1724 | # > head(clust4.score[,10:13]) 1725 | # Desc2 lastword firstword Cluster 1726 | # 126 ZincWillieWinkieCandleStick STICK ZINC 4 1727 | # 488 LoveBirdCandle CANDLE LOVE 4 1728 | # > head(clust5.score[,10:13]) 1729 | # Desc2 lastword firstword Cluster 1730 | # 3 HangingHeartT-lightHolder HOLDER HANGING 5 1731 | # 7 KnittedUnionFlagHotWaterBottle BOTTLE KNITTED 5 1732 | # 1733 | # Running the aprior algorithm on the clusters 1734 | # Circling back to the arules algorithm, we can use the predicted clusters which were generated, instead of "lastword" in order to develop some rules. 1735 | # . We will use the "coerce to dataframe" method to generate the transaction file as previously generated 1736 | # . Create a rules_clust algorithm which builds association rules based upon the itemset of clusters {1,2,3,4,5} 1737 | # . Inspect some of the generated rules by lift. 
1738 | library(arules) 1739 | colnames(kw_with_cluster2_score) 1740 | kable(head(kw_with_cluster2_score[,c(1,13)],5)) 1741 | tmp <- data.frame(kw_with_cluster2_score[,1],kw_with_cluster2_score[,13]) 1742 | names(tmp) [1] <- "TransactionID" 1743 | names(tmp) [2] <- "Items" 1744 | tmp <- unique(tmp) 1745 | trans4 <- as(split(tmp[,2], tmp[,1]), "transactions") 1746 | rules_clust <- apriori(trans4,parameter = list(minlen=2,support = 0.02,confidence = 0.01)) 1747 | summary(rules_clust) 1748 | #RW new arules::inspect 1749 | tmp <- as.data.frame(arules::inspect( head(sort(rules_clust, by="lift"),10))) 1750 | 1751 | # > tmp <- as.data.frame(inspect( head(sort(rules_clust, by="lift"),10) ) ) 1752 | # lhs rhs support confidence lift 1753 | # 22 {2,5} => {4} 0.03065693 0.3088235 3.022059 1754 | # 1 {2} => {4} 0.03065693 0.3043478 2.978261 1755 | # 2 {4} => {2} 0.03065693 0.3000000 2.978261 1756 | # 23 {4,5} => {2} 0.03065693 0.3000000 2.978261 1757 | # 32 {1,5} => {4} 0.02773723 0.2087912 2.043171 1758 | # 9 {4} => {1} 0.02773723 0.2714286 2.020963 1759 | # 10 {1} => {4} 0.02773723 0.2065217 2.020963 1760 | # 31 {4,5} => {1} 0.02773723 0.2714286 2.020963 1761 | # 35 {3,5} => {4} 0.03357664 0.1965812 1.923687 1762 | # 11 {4} => {3} 0.03357664 0.3285714 1.891357 1763 | # Summarizing the metrics 1764 | # Running a summary on the rules_clust object indicates an average support of .05, and average confidence of .43 1765 | # This demonstrates that using clustering can be a viable way to develop association rules, and reduced resources and the number of dimensions at the same time. 1766 | # 1767 | # support confidence lift 1768 | # Min. :0.02044 Min. :0.09985 Min. :0.989 1769 | # 1st Qu.:0.02664 1st Qu.:0.19816 1st Qu.:1.006 1770 | # Median :0.03066 Median :0.27143 Median :1.526 1771 | # Mean :0.05040 Mean :0.43040 Mean :1.608 1772 | # 3rd Qu.:0.04234 3rd Qu.:0.81954 3rd Qu.:1.891 1773 | # Max. :0.17080 Max. :1.00000 Max. :3.022 1774 | # 1775 | # 1776 | # Summary 1777 | 1778 | -------------------------------------------------------------------------------- /Chapter07/ch 7 codebase.R: -------------------------------------------------------------------------------- 1 | # Ch 7. Exploring Health Care Enrollment Data as a Time Series 2 | # "I have seen the future and it is very much like the present, only longer." 3 | # --Kehlog Albran, The Profit 4 | # Time Series Data 5 | # Time Series data is usually a set of ordered data collected over equally spaced intervals. Time series data occurs in most business and scientific disciplines, and the data is closely tied to the concept of forecasting, which uses previously measured data points to predict future data points based upon a specific statistical model. 6 | # Time series data differs from the kind of data that we have been looking at previously, because it is a set of ordered data points, it can contain components such as trend, seasonality, and autocorrelation, which have no meaning in other kinds of analysis. 7 | # Usually time series data is collected in equally spaced intervals, such as days, weeks, quarters, or years, but that is not always the case. Measurement of events such as natural disasters are a prime example. In some cases you can transform uneven data into equally spaced data, Often data needs to be prepped in order to have it conform to a certain model. In other cases, you can use specialized techniques such as Croston's method for forecasting intermittent, or unexpected demand for goods and services, in certain cases. 
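# As a brief aside (not part of the original example), Croston's method mentioned above is available
# via the croston() function in the forecast package. A minimal sketch follows; the demand series is
# made up purely for illustration -- an intermittent series with many zero-demand periods.
library(forecast)
intermittent.demand <- ts(c(0, 2, 0, 0, 1, 0, 3, 0, 0, 0, 2, 0))
# croston() smooths the non-zero demand sizes and the intervals between them separately,
# and combines the two into a flat demand-rate forecast for the next h periods
croston(intermittent.demand, h = 3)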
8 | # Exploring Time Series data
9 | # Many time series studies start out by exploring one data metric which has been measured across equally spaced time intervals. From a data science perspective, we could be interested in identifying various segments of a time series which could be exhibiting interesting trends, cycles, or seasonal patterns. So, we always begin a time series analysis by looking at the data graphically, and producing aggregate measures, before proceeding with the modeling.
10 | # Health Insurance Coverage Data Set
11 | # We will start by reading in a data set which contains Health Care enrollment data over a period for several categories. This data has been sourced from "Table HIB-2. Health Insurance Coverage Status and Type of Coverage All Persons by Age and Sex: 1999 to 2012" and it is available from the Census Bureau website at:
12 | # http://www.census.gov/data/tables/time-series/demo/health-insurance/historical-series/hib.html
13 | # This table shows the number of people covered by government and private insurance, as well as the number of people not covered.
14 | # This table has several embedded time series across all of the 14 years represented. 14 data points would not be considered an extremely long time series; however, we will use this data to demonstrate how we can comb through many time series at once. Since it IS small, it will be easy enough to verify the results via visual inspection and printing subsets of the data. As you become familiar with the methodology, it will enable you to expand to larger, more complex datasets with more data points, in order to isolate the most significant trends.
15 | # Housekeeping
16 | # As we have done in other chapters, we will first clear the workspace, and set our working directory. Obviously, change the setwd to the path that you are using to store your files.
17 | rm(list = ls())
18 | #RW new
19 | setwd("C:/PracticalPredictiveAnalytics/Data")
20 |
21 | # Read the data in
22 | # Next we will read in a few rows of the file (using the nrow parameter), and then run a str() function on the input to see which variables are contained within the file. There are several metrics in the file related to the different types of enrollment. We will just concentrate on the total enrollment metrics, and not utilize some of the others, even though they probably produce separate insights.
23 | x <- read.csv("hihist2bedit.csv", nrow = 10)
24 | str(x)
25 | # > 'data.frame': 10 obs. of 13 variables:
26 | # > $ Year : Factor w/ 10 levels "2003","2004 (4)",..: 10 9 8 7 6 5 4 3 2 1
27 | # > $ Year.1 : int 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003
28 | # > $ Total.People: num 311116 308827 306553 304280 301483 ...
29 | # > $ Total : num 263165 260214 256603 255295 256702 ...
30 | # > $ pritotal : num 198812 197323 196147 196245 202626 ...
31 | # > $ priemp : num 170877 170102 169372 170762 177543 ...
32 | # > $ pridirect : num 30622 30244 30347 29098 28513 ...
33 | # > $ govtotal : num 101493 99497 95525 93245 87586 ...
34 | # > $ govmedicaid : num 50903 50835 48533 47847 42831 ...
35 | # > $ govmedicare : num 48884 46922 44906 43434 43031 ...
36 | # > $ govmilitary : num 13702 13712 12927 12414 11562 ...
37 | # > $ Not.Covered : num 47951 48613 49951 48985 44780 ...
38 | # > $ cat : Factor w/ 1 level "ALL AGES": 1 1 1 1 1 1 1 1 1 1
39 | #
40 | # Subsetting the columns
41 | # For this exercise, we will be using a restricted set of columns from the csv file.
We can either select the specific columns from the dataframe just read in (if we just read in the whole file), or reread the csv file using the colClasses parameter to only read the columns that are required. This method is often preferable when you are reading a large file. Here, colClasses will instruct read.csv to only retain the first four and the last two columns, and ignore the columns "pritotal" through "govmilitary".
42 | # After re-reading the file with a subset of the columns, we print a few records from the beginning and end of the file. We can do this using a combination of the rbind, head and tail functions. This will give us all of the columns we will be using for this chapter, except for some columns which we will derive in the next section.
43 | x <- read.csv("hihist2bedit.csv", colClasses = c(NA,
44 | NA, NA, NA, rep("NULL", 7)))
45 |
46 | rbind(head(x), tail(x))
47 | # > Year Year.1 Total.People Total Not.Covered
48 | # > 1 2012 2012 311116.15 263165.47 47950.6840
49 | # > 2 2011 2011 308827.25 260213.79 48613.4625
50 | # > 3 2010 (10) 2010 306553.20 256602.70 49950.5004
51 | # > 4 2009 2009 304279.92 255295.10 48984.8204
52 | # > 5 2008 2008 301482.82 256702.42 44780.4031
53 | # > 6 2007 2007 299105.71 255017.52 44088.1840
54 | # > 331 2004 (4) 2004 20062.67 19804.54 258.1313
55 | # > 332 2003 2003 19862.49 19615.92 246.5703
56 | # > 333 2002 2002 19705.99 19484.01 221.9879
57 | # > 334 2001 2001 19533.99 19354.19 179.8033
58 | # > 335 2000 (3) 2000 19450.52 19250.63 199.8871
59 | # > 336 1999 (2) 1999 19378.56 19189.17 189.3922
60 | # > cat
61 | # > 1 ALL AGES
62 | # > 2 ALL AGES
63 | # > 3 ALL AGES
64 | # > 4 ALL AGES
65 | # > 5 ALL AGES
66 | # > 6 ALL AGES
67 | # > 331 FEMALE 65 YEARS AND OVER
68 | # > 332 FEMALE 65 YEARS AND OVER
69 | # > 333 FEMALE 65 YEARS AND OVER
70 | # > 334 FEMALE 65 YEARS AND OVER
71 | # > 335 FEMALE 65 YEARS AND OVER
72 | # > 336 FEMALE 65 YEARS AND OVER
73 | # Description of the data
74 | # Year and Year.1 (columns 1 and 2)
75 | # Year is the year for which the annual enrollment figures are taken. You will notice that Year appears twice, in column 1 (as a factor) and then again in column 2 (integer). This is because the data has been previously preprocessed, and appears twice merely for convenience, since there are certain instances in which we will prefer to use a factor, and other instances in which we prefer to use an integer. The numbers in parentheses in Year refer to footnotes in the original data source. While you could always create integers from factors and vice versa in the code, this saves valuable processing time if certain transformations can be made available beforehand.
76 | # Total People (column 3)
77 | # Total People is the population size of the category. They may either be enrolled for health coverage (Total), or not (Not.Covered).
78 | # Total (Column 4)
79 | # Total is the number of people who were enrolled for health coverage for that year and in the category "cat".
80 | # Not.Covered (Column 5)
81 | # Not.Covered is the number of people NOT enrolled in the specified year and category.
82 | # Cat (Column 6)
83 | # Cat is the time series subset.
84 | # This column, along with year, defines the particular row. It defines the specific demographic data for enrollment for that year.
85 | # The "ALL AGES" category represents the entire population for the specified year. All of the other subsets in the file should roll up to this category when totaled together.
86 | # For example, the last category (printed as part of tail()) represents Females Over 65, which is a subset of the ALL AGES category.
87 | # Target Time series variable
88 | # The variable that we will look at initially is "Not.Covered". We will be interested in examining any possible enrollment trends using this variable. Since the population size will differ depending upon the category, we will calculate the percentage of people not covered in a given year by dividing the raw number corresponding to this variable by the total population for that category. This will give us a new variable named "Not.Covered.Pct". This will also standardize the metric across the different sized categories, large and small, and enable us to compare them.
89 | # After calculating the variable, we can print the first few records, and also print some summary statistics for this one variable.
90 | x$Not.Covered.Pct <- x$Not.Covered/x$Total.People
91 | head(x)
92 | # > Year Year.1 Total.People Total Not.Covered cat
93 | # > 1 2012 2012 311116.2 263165.5 47950.68 ALL AGES
94 | # > 2 2011 2011 308827.2 260213.8 48613.46 ALL AGES
95 | # > 3 2010 (10) 2010 306553.2 256602.7 49950.50 ALL AGES
96 | # > 4 2009 2009 304279.9 255295.1 48984.82 ALL AGES
97 | # > 5 2008 2008 301482.8 256702.4 44780.40 ALL AGES
98 | # > 6 2007 2007 299105.7 255017.5 44088.18 ALL AGES
99 | # > Not.Covered.Pct
100 | # > 1 0.1541247
101 | # > 2 0.1574131
102 | # > 3 0.1629424
103 | # > 4 0.1609860
104 | # > 5 0.1485338
105 | # > 6 0.1474000
106 | #
107 | summary(x$Not.Covered.Pct)
108 | #
109 | # Min. 1st Qu. Median Mean 3rd Qu. Max.
110 | # 0.009205 0.109400 0.145300 0.154200 0.210400 0.325200
111 | # #
112 | #
113 | # As is typical with a target variable, we also like to see basic plots showing the distribution of the variable.
114 | #
115 | #
116 | hist(x$Not.Covered.Pct)
117 | # Saving the data
118 | # This may be a good time to save the current data to a file, since we may want to read it back later, starting from the analysis stage. This will avoid having to reread the same file again, unless, of course, the source data changes.
119 | setwd("C:/PracticalPredictiveAnalytics/Outputs")
120 |
121 | save(x, file = "x.RData")
122 | # Determine all of the Subset Groups
123 | # Since we have only looked at parts of the file (via head or tail), we do not know how many categories there are and how they differ in terms of health care coverage. So we will start off by looking at some of the groupings.
124 | # In previous chapters we have used SQL and the aggregate function to group data. For this example we will use dplyr. One advantage of dplyr is that it can also be used with 'pipe' syntax, which allows the result of one function to be passed to the next function without intermediate assignments.
125 | library(dplyr)
126 | # >
127 | # > Attaching package: 'dplyr'
128 | # > The following objects are masked from 'package:stats':
129 | # >
130 | # > filter, lag
131 | # > The following objects are masked from 'package:base':
132 | # >
133 | # > intersect, setdiff, setequal, union
134 | # # str(x)
135 | #
136 | #
137 | # The by.cat object will show the average number insured, and the average total population, for each category. Remember, this data is also grouped by year; however, we just want to get a sense of what the averages are across all of the years for now. Since the arrange function will end up sorting the data by the Total population sizes, from largest to smallest, we can see that the numbers line up as expected.
138 | # 1. "ALL AGES" is the largest category, 139 | # 2. followed by "FEMALE ALL AGES", 140 | # 3. and then "MALE ALL AGES". 141 | # As a sanity check, If you add the totals for the latter 2 categories, you can see that they sum to the "ALL AGES" category. 142 | # From the console, we can see from the str(by.cat) function that there are 24 categories. 143 | by.cat <- x %>% select(cat, Total, Total.People) %>% group_by(cat) %>% summarise(Avg.Total.Insured = mean(Total), 144 | Avg.People = mean(Total.People)) %>% arrange(desc(Avg.People)) 145 | 146 | str(by.cat) 147 | # > Classes 'tbl_df', 'tbl' and 'data.frame': 24 obs. of 3 variables: 148 | # > $ cat : Factor w/ 24 levels "18 to 24 YEARS",..: 7 14 22 24 3 4 2 23 6 15 ... 149 | # > $ Avg.Total.Insured: num 251243 130201 121042 66200 34762 ... 150 | # > $ Avg.People : num 294700 150330 144371 73752 42433 ... 151 | by.cat 152 | # > Source: local data frame [24 x 3] 153 | # > 154 | # > cat Avg.Total.Insured Avg.People 155 | # > (fctr) (dbl) (dbl) 156 | # > 1 ALL AGES 251242.96 294700.47 157 | # > 2 FEMALE ALL AGES 130200.90 150329.73 158 | # > 3 MALE ALL AGES 121042.06 144370.74 159 | # > 4 UNDER 18 YEARS 66200.46 73752.50 160 | # > 5 35 to 44 YEARS 34761.74 42433.12 161 | # > 6 45 to 54 YEARS 35911.82 42100.20 162 | # > 7 25 to 34 YEARS 29973.91 39942.64 163 | # > 8 MALE UNDER 18 YEARS 33832.87 37700.70 164 | # > 9 65 YEARS AND OVER 36199.32 36722.61 165 | # > 10 FEMALE UNDER 18 YEARS 32367.59 36051.79 166 | # > .. ... ... ... 167 | # Merging the aggregate data back into the original data 168 | # Often, you will want to augment your original data with some of the calculated data as derived above. In these cases, you can merge the data back into the original data using a common key. Again, we will use dplyr to take the results just obtained (by.cat), and join it back to the original data (x), using the common key "cat". 169 | # We will be using a left_join just for an example, however we could have used a right join to obtain the same results, since by.cat was completely derived from 'x'. After joining the two dataframes, we will end up with a new dataframe named 'x2' 170 | # # Merge the summary measures back into the original data. Merge by cat. 171 | # 172 | #RW RM NOT NEEDED 173 | rm(x2) 174 | x2 <- by.cat %>% left_join(x, by = "cat") 175 | head(x2) 176 | # > Source: local data frame [6 x 9] 177 | # > 178 | # > cat Avg.Total.Insured Avg.People Year Year.1 Total.People 179 | # > (fctr) (dbl) (dbl) (fctr) (int) (dbl) 180 | # > 1 ALL AGES 251243 294700.5 2012 2012 311116.2 181 | # > 2 ALL AGES 251243 294700.5 2011 2011 308827.2 182 | # > 3 ALL AGES 251243 294700.5 2010 (10) 2010 306553.2 183 | # > 4 ALL AGES 251243 294700.5 2009 2009 304279.9 184 | # > 5 ALL AGES 251243 294700.5 2008 2008 301482.8 185 | # > 6 ALL AGES 251243 294700.5 2007 2007 299105.7 186 | # > Variables not shown: Total (dbl), Not.Covered (dbl), Not.Covered.Pct (dbl) 187 | # checking the time intervals 188 | # Earlier, we mentioned needing to have equally sized time intervals. Additionally, before we perform any time series analysis, we need to check for the number of 'non-missing' time intervals. So, Let's check the number of enrollment years for each category. 189 | # 190 | # Using dplyr, we can use summarize(n()) to count the number of entries for each category. 
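# As a quick cross-check (an aside, not part of the original script), base R's table() should report
# the same number of rows per category as the dplyr summary below.
table(x2$cat)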
191 | # # -- summarize and sort by the number of years -- updated to x2 192 | yr.count <- x2 %>% group_by(cat) %>% summarise(n = n()) %>% arrange(n) 193 | 194 | # - we can see that there are 14 years for all of the groups. That is good! 195 | print(yr.count, 10) 196 | # > Source: local data frame [24 x 2] 197 | # > 198 | # > cat n 199 | # > (fctr) (int) 200 | # > 1 18 to 24 YEARS 14 201 | # > 2 25 to 34 YEARS 14 202 | # > 3 35 to 44 YEARS 14 203 | # > 4 45 to 54 YEARS 14 204 | # > 5 55 to 64 YEARS 14 205 | # > 6 65 YEARS AND OVER 14 206 | # > 7 ALL AGES 14 207 | # > 8 FEMALE 18 to 24 YEARS 14 208 | # > 9 FEMALE 25 to 34 YEARS 14 209 | # > 10 FEMALE 35 to 44 YEARS 14 210 | # > .. ... ... 211 | # We can see from the above that every category has 14 years of data represented. 212 | # So we don't have to worry about having uniform time period for each subset. However, this is often not the case, and if you come across this, you may need to: 213 | # Impute data for years that are missing 214 | # Only use common time periods 215 | # Use specialized time series techniques to account for unequally spaced time series. 216 | # Picking out the top groups in terms of average population size 217 | # In many instances, we will only want to look at the top categories, especially when there are many demographical categories which have been subsetted. In this example there are only 24 categories, but in other examples there may be a much larger number of categories. 218 | # The dataframe x2 is already sorted by Avg.People. Since we know that there are 14 enrollment records for each category, we can get the top 10 categories based upon the highest base population by selecting the first 14*10 (or 140) rows. We will store this in a new dataframe 'x3' and save this to disk. 219 | # since we know each group has 14 years, extracting the top 10 groups is 220 | # easy to calculate 221 | 222 | x3 <- x2[1:(14 * 10), ] 223 | head(x3) 224 | # > Source: local data frame [6 x 9] 225 | # > 226 | # > cat Avg.Total.Insured Avg.People Year Year.1 Total.People 227 | # > (fctr) (dbl) (dbl) (fctr) (int) (dbl) 228 | # > 1 ALL AGES 251243 294700.5 2012 2012 311116.2 229 | # > 2 ALL AGES 251243 294700.5 2011 2011 308827.2 230 | # > 3 ALL AGES 251243 294700.5 2010 (10) 2010 306553.2 231 | # > 4 ALL AGES 251243 294700.5 2009 2009 304279.9 232 | # > 5 ALL AGES 251243 294700.5 2008 2008 301482.8 233 | # > 6 ALL AGES 251243 294700.5 2007 2007 299105.7 234 | # > Variables not shown: Total (dbl), Not.Covered (dbl), Not.Covered.Pct (dbl) 235 | # # 236 | #RW needed? 237 | save(x3, file = "x3.RData") 238 | # Plotting the data using Lattice 239 | # The Lattice package is a useful package to learn especially for analysts who like to work in formula notation (y~x). 240 | # In this example we will run a lattice plot in order to plot Not.Covered.Pct on the y axis, Year on the x axis, and produce separate plots by category. 241 | # The main call is specified by: 242 | # xyplot(Not.Covered.Pct ~ Year | cat, data = x3) 243 | # Since we are plotting the top ten groups, we can specify layout=c(5,2) to indicate we want to arrange the 10 plots in a 5*2 matrix. Not.Covered.Pct is to be arranged on the y axis (left side of the ~ sign), and Year is arranged along the x-axis (Right side of ~ sign). The bar (|) indicates that the data is to be plotted separately by each category. 
244 | library(lattice)
245 | x.tick.number <- 14
246 | at <- seq(1, nrow(x3), length.out = x.tick.number)
247 | labels <- round(seq(1999, 2012, length.out = x.tick.number))
248 |
249 | p <- xyplot(Not.Covered.Pct ~ Year | cat, data = x3, type = "l", main = list(label = "Enrollment by Categories",
250 | cex = 1), par.strip.text = list(cex = 0.5), scales = list(x = list(labels = labels),
251 | cex = 0.4, rot = 45), layout = c(5, 2))
252 |
253 | trellis.device()
254 |
255 | print(p)
256 |
257 | # Plotting the data using ggplot
258 | # If you like using ggplot, a similar set of graphs can be rendered using facets.
259 | require("ggplot2")
260 |
261 | .df <- data.frame(x = x3$Year.1, y = x3$Not.Covered.Pct, z = x3$cat, s = x3$cat)
262 | .df <- .df[order(.df$x), ]
263 | .plot <- ggplot(data = .df, aes(x = x, y = y, colour = z, shape = z)) + geom_point() +
264 | geom_line(size = 1) + scale_shape_manual(values = seq(0, 15)) + scale_y_continuous(expand = c(0.01,
265 | 0)) + facet_wrap(~s) + xlab("Year.1") + ylab("Not.Covered.Pct") + labs(colour = "cat",
266 | shape = "cat") + theme(panel.margin = unit(0.3, "lines"), legend.position = "none")
267 | print(.plot)
268 |
269 |
270 | # Sending output to an external File
271 | # One benefit of assigning plots to a plot object is that you can later send the plots to an external file, such as a pdf, view it externally, and even view it in your browser directly from R. For example, for the Lattice graphs example, you can use trellis.device and specify the output parameters, and then print the object. As we illustrated in an earlier chapter, you can use browseURL to open the PDF in your browser.
272 | # send to pdf
273 | #RW not needed
274 | #setwd("C:/Users/randy/Desktop/ch7")
275 | trellis.device(pdf, file = "x3.pdf", paper = "a4r", height = 8.3, width = 11.7,
276 | col = F)
277 |
278 | # --Not run
279 | # print(p)
280 | #
281 | #RW check this part
282 | dev.off()
283 | browseURL('x3.pdf')
284 |
285 | browseURL('file://c://PracticalPredictiveAnalytics//Outputs//x3.pdf')
286 | # Examining the output
287 | # If we examine our 'top' plots we can see that some groups seem to be trending more than others. For example, the under 18 age group shows an enrollment trend that is declining, while the 25-54 age groups are trending up.
288 | # Detecting linear trend
289 | # In a linear trend model, one constructs a linear regression least squares line by running an lm regression through the data points. These models are good for initially exploring trends visually. We can take advantage of the lm function, which is available in base R, in order to specifically calculate the slope of the trend line.
290 | # For example, we can compute the slope of the trend line for our first group ("ALL AGES"), and we can see that it has an upward trend.
291 | # After running the regression using the lm function, we can subsequently use the coef function to specifically extract the slope and the intercept from the lm model. Since this is a regression in which there is only one independent variable (Time), there will only be one slope coefficient.
292 | lm(Not.Covered.Pct ~ Year.1, data = x2[1:14, ]) 293 | # > 294 | # > Call: 295 | # > lm(formula = Not.Covered.Pct ~ Year.1, data = x2[1:14, ]) 296 | # > 297 | # > Coefficients: 298 | # > (Intercept) Year.1 299 | # > -4.102621 0.002119 300 | coef(lm(Not.Covered.Pct ~ Year.1, data = x2[1:14, ])) 301 | # > (Intercept) Year.1 302 | # > -4.102621443 0.002119058 303 | # Automating the regressions 304 | # Now that we have seen how we can run a single time series regression, we can move on to automating separate regressions and extracting the coefficients over all of the categories. 305 | # There are several ways to do this. One way is by using the do() function within the dplyr package. In the example below: 306 | # . The data is first grouped by category 307 | # . Then, a linear regression (lm function) is run for each category, with Year as the independent variable, and Not.Covered as the dependent variable. 308 | # . The coefficient is extracted from the model. The coefficient will act as a proxy for trend Finally, a dataframe of lists is create (fitted.models) , where the coefficients and intercepts are stored for each regression run on every category. The categories which have the highest positive coefficients exhibit the greatest increasing linear trend, while the declining trend is indicated by negative coefficients. 309 | library(dplyr) 310 | 311 | fitted_models = x2 %>% group_by(cat) %>% do(model = coef(lm(Not.Covered.Pct ~ 312 | Year.1, data = .))) 313 | # All of the generated models are now in the fitted_models object. 314 | # The kable function from knitr gives a simple output which displays the intercept as the first number and the coefficient in the "model" column. As a check, We can see that the coefficients in the "ALL AGES" model, are the identical to those derived in the previous section. 315 | # str(fitted_models) 316 | library(knitr) 317 | kable(fitted_models) 318 | # Cat Model 319 | # 18 to 24 YEARS -0.4061834427, 0.0003367988 320 | # 25 to 34 YEARS -11.375187597, 0.005796182 321 | # 35 to 44 YEARS -10.916822084, 0.005534037 322 | # 45 to 54 YEARS -11.544566448, 0.005829194 323 | # 55 to 64 YEARS -4.709612146, 0.002409908 324 | # 65 YEARS AND OVER -1.2562375095, 0.0006334125 325 | # ALL AGES -4.102621443, 0.002119058 326 | # FEMALE 18 to 24 YEARS -2.677300003, 0.001455388 327 | # FEMALE 25 to 34 YEARS -9.990978769, 0.005088009 328 | # FEMALE 35 to 44 YEARS -9.564724041, 0.004850188 329 | # FEMALE 45 to 54 YEARS -10.36336551, 0.00523537 330 | # FEMALE 55 to 64 YEARS -4.102774957, 0.002108343 331 | # FEMALE 65 YEARS AND OVER -1.3674510779, 0.0006887743 332 | # FEMALE ALL AGES -3.817483824, 0.001970059 333 | # FEMALE UNDER 18 YEARS 3.267593386, -0.001578328 334 | # MALE 18 to 24 YEARS 4.036127727, -0.001862991 335 | # MALE 25 to 34 YEARS -9.715950286, 0.004983621 336 | # MALE 35 to 44 YEARS -7.706624821, 0.003941543 337 | # MALE 45 to 54 YEARS -10.975387917, 0.005549255 338 | # MALE 55 to 64 YEARS -5.370380544, 0.002738269 339 | # MALE 65 YEARS AND OVER -0.4834523450, 0.0002479691 340 | # MALE ALL AGES -4.315036958, 0.002232003 341 | # MALE UNDER 18 YEARS 2.914343264, -0.001401998 342 | # UNDER 18 YEARS 3.086509947, -0.001487938 343 | # Ranking the coefficients 344 | # Now that we have the coefficients, we can begin to rank each of the categories by increasing trend. 
Since the results we have obtained so far are contained in embedded lists, which are a bit difficult to work with, we can perform some code manipulation to transform them into a regular data frame, with one row per category, consisting of the category name, coefficient, and coefficient rank.
345 | #RW remove
346 | library(dplyr)
347 | # extract the coefficients part from the model list, and then transpose the
348 | # data frame so that the coefficients appear one per row, rather than one per
349 | # column.
350 |
351 | xx <- as.data.frame(fitted_models$model)
352 | xx2 <- as.data.frame(t(xx[2, ]))
353 |
354 | # The output does not contain the category name, so we will merge it back
355 | # from the original data frame.
356 |
357 | xx4 <- cbind(xx2, as.data.frame(fitted_models))[, c(1, 2)] #only keep the first two columns
358 |
359 | # rank the coefficients from lowest to highest. Force the format of the rank
360 | # as length 2, with leading zeros
361 |
362 | tmp <- sprintf("%02d", rank(xx4[, 1]))
363 |
364 | # Finally prepend the rank to the actual category
365 | xx4$rankcat <- as.factor(paste(tmp, "-", as.character(xx4$cat)))
366 |
367 | # name the columns
368 | names(xx4) <- c("lm.coef", "cat", "coef.rank")
369 | # and View the results
370 | View(xx4)
371 | # As you can see, the columns now contain a neatly arranged representation of the coefficients, the category, and coef.rank, which was derived by ranking lm.coef from smallest to largest, and then prepending the rank order to the category.
372 | #
373 | # Merging scores back into the original dataframe
374 | # We will augment the original x2 dataframe with this new information by merging back by category, and then sorting the dataframe by the rank of the coefficient. This will allow us to use the rank as a proxy for trend.
375 | x2x <- x2 %>% left_join(xx4, by = "cat") %>% arrange(coef.rank, cat)
376 |
377 |
378 | # exclude some columns so as to fit on one page
379 | head(x2x[, c(-2, -3, -4, -8)])
380 | # > Source: local data frame [6 x 7]
381 | # >
382 | # > cat Year.1 Total.People Total Not.Covered.Pct
383 | # > (fctr) (int) (dbl) (dbl) (dbl)
384 | # > 1 MALE 18 to 24 YEARS 2012 15142.04 11091.86 0.2674787
385 | # > 2 MALE 18 to 24 YEARS 2011 15159.87 11028.75 0.2725034
386 | # > 3 MALE 18 to 24 YEARS 2010 14986.02 10646.88 0.2895460
387 | # > 4 MALE 18 to 24 YEARS 2010 14837.14 10109.82 0.3186139
388 | # > 5 MALE 18 to 24 YEARS 2008 14508.04 10021.66 0.3092339
389 | # > 6 MALE 18 to 24 YEARS 2007 14391.92 10230.61 0.2891425
390 | # > Variables not shown: lm.coef (dbl), coef.rank (fctr)
391 | # Plotting the data with the trend lines
392 | # Now that we have the trend coefficients, we will use ggplot to first plot enrollment for all of the 24 categories, and then create a 2nd set of plots which add the trend line based upon the linear coefficients we have just calculated.
393 | # Code Notes: Facet_wrap will order the plots by the value of variable 'z', which was assigned to the coefficient rank. Thus, we get to see the categories with declining enrollment first, ending with the categories having the highest trend in enrollment for the period 1999-2012.
394 | # Tip: I like to assign the variables that I will be changing to standard variable names, like x, y, and z, so that I can remember their usage (e.g. variable x is always the x variable, and y is always the y variable). But you can supply the variable names directly in the call to ggplot, or set up your own function to do the same thing, as sketched below.
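# To illustrate the last point in the tip above, here is a small helper function which wraps the
# repeated faceted-plot code. This is a hypothetical sketch, not part of the original script: the
# function name and arguments are invented, and aes_string() is used so that the column names can be
# passed as character strings.
library(ggplot2)
plot_by_group <- function(df, xvar, yvar, groupvar) {
    # scatter the y variable against the x variable, one panel per level of the grouping variable
    ggplot(data = df, aes_string(x = xvar, y = yvar)) +
        geom_point() +
        facet_wrap(as.formula(paste("~", groupvar))) +
        theme(legend.position = "none")
}
# A call equivalent in spirit to the plot built below would be:
# plot_by_group(x2x, "Year.1", "Not.Covered.Pct", "coef.rank")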
395 |
396 |
397 |
398 | library(ggplot2)
399 | .df <- data.frame(x = x2x$Year.1, y = x2x$Not.Covered.Pct, z = x2x$coef.rank,
400 | slope = x2x$lm.coef)
401 | .plot <- ggplot(data = .df, aes(x = x, y = y, colour = 1, shape = z)) + geom_point() +
402 | scale_shape_manual(values = seq(0, 24)) + scale_y_continuous(expand = c(0.1,
403 | 0)) + scale_x_continuous(expand = c(0.1, 1)) + facet_wrap(~z) + xlab("Year.1") +
404 | ylab("Not.Covered.Pct") + labs(colour = "cat", shape = "cat") + theme(panel.margin = unit(0.3,
405 | "points"), legend.position = "none") + theme(strip.text.x = element_text(size = 6))
406 | #RW?
407 | print(.plot)
408 |
409 | # As you can see, for the "ALL AGES" category, the overall Not Covered percentage is seen to be increasing over the time period until 2010 (when the Affordable Care Act was enacted), and then it begins to decrease. In contrast, the under 18 age group shows a decrease in the proportion of non-insured relative to the population size.
410 | # We can take a closer look at the top and bottom four categories, along with the "ALL AGES" category, to examine this more closely. This time we will add our own trend line using the geom_smooth parameter, which will add a linear regression trendline.
411 | # .df <- data.frame(x = x2x$Year.1, y = x2x$Not.Covered.Pct, z =
412 | # x2x$coef.rank, slope=x2x$lm.coef)
413 |
414 | # declining enrollment
415 | .df2 <- rbind(head(.df,(4*14)), tail(.df,(4*14)), .df[.df$z == "12 - ALL AGES", ])
416 |
417 | .plot2 <- ggplot(data = .df2, aes(x = x, y = y, colour = 1, shape = z)) + geom_point() +
418 | scale_shape_manual(values = seq(0, nrow(.df2))) + scale_y_continuous(expand = c(0.1,0)) +
419 | scale_x_continuous(expand = c(0.1, 1)) + facet_wrap(~z) + xlab("Year.1") +
420 | ylab("Not.Covered.Pct") + labs(colour = "cat", shape = "cat") + theme(panel.margin = unit(0.3,"points"),
421 | legend.position = "none") + geom_smooth(method = "lm", se = FALSE,
422 | colour = "red")
423 | print(.plot2)
424 |
425 | # Plotting all the categories on 1 graph.
426 | # Sometimes we like to plot all of the lines on one graph rather than having them as separate plots. To achieve this, we will alter the syntax a bit, so that the categories show up as stacked lines. Again we can see how the percentage of uninsured aligns across ages, with the under 18 group having the lowest uninsured rate, and the 25-54 group having the highest.
427 |
428 | library(ggplot2)
429 | ### plot all on one graph
430 |
431 | .df <- data.frame(x = x3$Year.1, y = x3$Not.Covered.Pct, z = x3$cat)
432 | rm(.df)
433 | rm(.plot)
434 | .df <- x3[order(x3$Year.1), ]
435 | # str(.df)
436 | .plot <- ggplot(data = .df, aes(x = Year.1, y = Not.Covered.Pct, colour = cat,shape = cat)) +
437 | geom_point() + geom_line(size = 1) + scale_shape_manual(values = seq(0,15)) +
438 | ylab("Not.Covered.Pct") + labs(colour = "cat", shape = "cat") + theme_bw(base_size = 14,base_family = "serif")
439 |
440 |
441 | print(.plot)
442 |
443 |
444 | # Adding labels
445 | # It can sometimes be a bit difficult to discern the differences among the categories based on the legend, especially as the number of categories increases, so we can use the directlabels library to mark each line with the category name. Now we can see, for example, that ALL MALES have a higher uninsured rate than ALL FEMALES over all time periods.
446 | library(directlabels) 447 | direct.label(.plot, list(last.points, hjust = 0.75, cex = 0.75, vjust = 0)) 448 | 449 | # Performing some automated forecasting using the ets function 450 | # So far we have looked at ways in which we can explore any linear trends which may be inherent in our data. That provided a solid foundation for the next step, prediction. Now we will begin to start to look at how we can perform some actual forecasting. 451 | # Converting the dataframe to a time series object 452 | # As a preparation step, We will use the ts function to convert our dataframe to a time series object. It is important that time series be equally spaced before converting to a ts object. At minimum, you supply the time series variable, and start and end dates as argument to the ts function. 453 | # After creating a new object "x", run a str() function to verify that all of the 14 time series from 1999 to 2012 have been created. 454 | # only extract the 'ALL' timeseries 455 | x <- ts(x2$Not.Covered.Pct[1:14], start = c(1999), end = c(2012), frequency = 1) 456 | 457 | str(x) 458 | # > Time-Series [1:14] from 1999 to 2012: 0.154 0.157 0.163 0c.161 0.149 ... 459 | # Smoothing the data using Moving Averages 460 | # One simple technique used to analyze time series are simple and exponential moving averages. Both Simple Moving averages and Exponential Moving Average are ways in which we can smooth out the random noise in the series and observe cycles and trends. 461 | # Simple Moving Average 462 | # A simple Moving Average will simply take the sum of the time series variable for the last k periods and then will divide it by the number of periods. In this sense, it is identical to the calculation for the mean. However what makes it different from a simple mean is that: 463 | # . the average will shift for every additional time period. Moving averages are backward looking, and every time a time period shifts, so will the average. That is why they are called "Moving". Moving Average are sometimes called Rolling Averages. 464 | # . The look backwards period can shift. That is the 2nd characteristic of a moving average. A 10 period moving average will take the average of the last 10 data elements, while a 20 period moving average will take the sum of the last 20 data points, and then divide by 20. 465 | # Computing the SMA using a function 466 | # To compute a rolling 5 period moving average for our data, we will use the Simple Moving Average (SMA) function from the TTR package, and then display the first few rows. 467 | library(TTR) 468 | MA <- SMA(x, n = 5) 469 | cbind(head(x, 14), head(MA, 14)) 470 | # > [,1] [,2] 471 | # > [1,] 0.1541247 NA 472 | # > [2,] 0.1574131 NA 473 | # > [3,] 0.1629424 NA 474 | # > [4,] 0.1609860 NA 475 | # > [5,] 0.1485338 0.1568000 476 | # > [6,] 0.1474000 0.1554551 477 | # > [7,] 0.1523271 0.1544379 478 | # > [8,] 0.1464598 0.1511414 479 | # > [9,] 0.1433967 0.1476235 480 | # > [10,] 0.1455136 0.1470194 481 | # > [11,] 0.1391090 0.1453612 482 | # > [12,] 0.1347953 0.1418549 483 | # > [13,] 0.1308892 0.1387408 484 | # > [14,] 0.1362041 0.1373023 485 | # There are many ways in which you can plot the moving average with the original data. Using base R, use the ts.plot function to do this, which takes the original series and the moving average of the series as arguments. 
486 | ts.plot(x, MA, gpars = list(xlab = "year", ylab = "Percentage of Non-Insured",lty = c(1:2)))
487 | title("Percentage of Non-Insured 1999-2012 - With SMA")
488 |
489 | # You can see how moving averages are helpful in showing the upward and downward movements of the data, and also help smooth the data to help eliminate some of the noise. Also notice that a moving average needs some "starter" data to begin calculations, which is why the dotted moving average line is missing from the first 4 time periods of the graph. Only by the 5th period is it able to determine the calculation, by summing up the values which correspond to the time period 1999-2003 and then dividing by 5. The next point is derived by summing up the values corresponding to the time periods 2000-2004 and then again dividing by 5.
490 | # Verifying the SMA Calculation
491 | # It is always important to be able to verify calculations, to ensure that they have been performed correctly, and to promote understanding.
492 | # In the case of the Simple Moving Average, we can switch to the console, and calculate the value of the SMA for the last 5 data points:
493 | # First we calculate the sum of the elements and then divide by the number of data points in the moving average (5):
494 | sum(x[10:14])/5
495 | # > [1] 0.1373023
496 | # That matches exactly with the column given for the Simple Moving Average (SMA)
497 | # Exponential Moving Average
498 | # For a Simple Moving Average, equal weight is given to all data points, regardless of how old or how recent they are. An Exponential Moving Average (EMA) gives more weight to recent data, under the assumption that the future is more likely to look like the recent past, rather than the "older past".
499 | # The Exponential Moving Average is actually a much simpler calculation. An EMA begins by calculating a simple moving average. When it reaches the specified number of lookback periods (n), it computes the current value by weighting the current data point against the previous value of the EMA.
500 | # This weighting is specified by the smoothing (or ratio) factor. When ratio=1, the predicted value is based entirely upon the last observed value. For ratios close to 0, the prediction is based upon the average of the entire lookback period. Therefore, the closer the smoothing factor is to 1, the more weight it will give to recent data. If you want to give additional weight to older data, decrease the smoothing factor towards 0.
501 | # Generally the formula for an EMA is:
502 | # {Current Data Point - EMA(previous)} x smoothing factor + EMA(previous period)
503 | # To compute the EMA, you can use the EMA function (from the TTR package). You need to specify a smoothing constant (ratio), as well as a lookback period (n).
504 | # Computing the EMA using a function
505 | # The following code computes the EMA, along with the Simple Moving Average which was computed in the last section.
506 | # The plot that follows shows the data in graph form. You can see that each data point is closer to its EMA than to the SMA, which, as mentioned earlier, weights all previous data points equally within the lookback period. In this regard EMAs react more quickly to recent data, while SMAs are slower moving, and have less variability. Of course both are affected by the parameters, especially the lookback period. Longer lookbacks will make for slower moving averages, in both cases.
507 | ExpMA <- EMA(x, n = 5, ratio = 0.8) 508 | cbind(head(x, 15), head(MA, 15), head(ExpMA, 15)) 509 | # > [,1] [,2] [,3] 510 | # > [1,] 0.1541247 NA NA 511 | # > [2,] 0.1574131 NA NA 512 | # > [3,] 0.1629424 NA NA 513 | # > [4,] 0.1609860 NA NA 514 | # > [5,] 0.1485338 0.1568000 0.1568000 515 | # > [6,] 0.1474000 0.1554551 0.1492800 516 | # > [7,] 0.1523271 0.1544379 0.1517177 517 | # > [8,] 0.1464598 0.1511414 0.1475114 518 | # > [9,] 0.1433967 0.1476235 0.1442196 519 | # > [10,] 0.1455136 0.1470194 0.1452548 520 | # > [11,] 0.1391090 0.1453612 0.1403382 521 | # > [12,] 0.1347953 0.1418549 0.1359039 522 | # > [13,] 0.1308892 0.1387408 0.1318922 523 | # > [14,] 0.1362041 0.1373023 0.1353417 524 | ts.plot(x, ExpMA, gpars = list(xlab = "year", ylab = "Percentage of Non-Insured",lty = c(1:2))) 525 | title("Percentage of Non-Insured 1999-2012 - With EMA") 526 | #using 527 | # Using the ets function 528 | # We will now use the ets function (forecast package) to compute an exponentially smoothed model for the "ALL AGES" category. 529 | # The ets function is flexible in that it can also incorporate trend, as well as seasonality for its forecasts. 530 | # We will just be illustrating a simple exponentially smoothed model ("ANN"). However, for completeness, you should know that you specify three letters when calling the ets function, and you should be aware of what each letter represents. Otherwise, it will model based upon the default parameters. 531 | # Here is the description as specified by the package author, Hydman: 532 | # . The first letter denotes the error type ("A", "M" or "Z") 533 | # . The second letter denotes the trend type ("N","A","M" or "Z") 534 | # . The third letter denotes the season type ("N","A","M"or "Z") 535 | # In all cases, "N"=none, "A"=additive, "M"=multiplicative and "Z"=automatically selected. 536 | # So for our example, if we want to model a simple exponentially smoothed model, as we did in our manual calculations, so we would specify model=ANN. 537 | # Forecasting using "ALL AGES" 538 | # The code below will perform the following steps: 539 | # . First in will filter out the "ALL AGES" category 540 | # . Create a time series object 541 | # . Run a simple exponential model, using the ets function 542 | # Note that we did not specify a smoothing factor. The ets functions calculates the optimal smoothing factor, (alpha, shown via the summary function) which in this case is .99, which means that model time series takes about 99% of the previous value to incorporate into the next time series prediction. 
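# As an aside (not in the original code), if you would rather fix the smoothing factor yourself instead
# of letting ets() estimate it, the function also accepts an alpha argument directly. The value 0.2
# below is arbitrary and chosen only for illustration; x is the "ALL AGES" time series created earlier.
library(forecast)
fit.fixed <- ets(x, model = "ANN", alpha = 0.2)
# summary(fit.fixed) would then report alpha = 0.2 rather than an estimated value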
543 | library(dplyr) 544 | # > 545 | # > Attaching package: 'dplyr' 546 | # > The following objects are masked from 'package:stats': 547 | # > 548 | # > filter, lag 549 | # > The following objects are masked from 'package:base': 550 | # > 551 | # > intersect, setdiff, setequal, union 552 | library(forecast) 553 | # > Loading required package: zoo 554 | # > 555 | # > Attaching package: 'zoo' 556 | # > The following objects are masked from 'package:base': 557 | # > 558 | # > as.Date, as.Date.numeric 559 | # > Loading required package: timeDate 560 | # > This is forecast 7.1 561 | x4 <- x2[x2$cat == "ALL AGES", ] 562 | 563 | # set up as a time series object 564 | x <- ts(x4$Not.Covered.Pct, start = c(1999), end = c(2012), frequency = 1) 565 | 566 | fit <- ets(x, model = "ANN") 567 | summary(fit) 568 | # > ETS(A,N,N) 569 | # > 570 | # > Call: 571 | # > ets(y = x, model = "ANN") 572 | # > 573 | # > Smoothing parameters: 574 | # > alpha = 0.9999 575 | # > 576 | # > Initial states: 577 | # > l = 0.1541 578 | # > 579 | # > sigma: 0.0052 580 | # > 581 | # > AIC AICc BIC 582 | # > -106.3560 -105.2651 -105.0779 583 | # > 584 | # > Training set error measures: 585 | # > ME RMSE MAE MPE MAPE 586 | # > Training set -0.001279923 0.005191075 0.00430566 -0.9445532 2.955436 587 | # > MASE ACF1 588 | # > Training set 0.9286549 0.004655079 589 | # Plotting the Predicted and Actual Values 590 | # Next, we can plot the predicted vs actual values. Notice that the predicted values are almost identical to the actual values, however they are always one step ahead. 591 | plot(x) 592 | lines(fit$fitted, col = "red") 593 | # We can 594 | # The forecast(fit) method 595 | # The forecast method contains many objects which you can display, so as the fitted value, original values, confidence intervals and residuals. Use str(forecast(fit)) to see which objects are available. 596 | # We will use cbind to print out the original data point, fitted data point, and model fitting method. 597 | # We can also use View to show some of the forecast object in matrix form. 
598 | cbind(forecast(fit)$method, forecast(fit)$x, forecast(fit)$fitted, forecast(fit)$residuals) 599 | # Time Series: 600 | # Start = 1999 601 | # End = 2012 602 | # Frequency = 1 603 | # forecast(fit)$method forecast(fit)$x forecast(fit)$fitted forecast(fit)$residuals 604 | # 1999 ETS(A,N,N) 0.15412470117969 0.154120663632029 4.03754766081788e-06 605 | # 2000 ETS(A,N,N) 0.157413125646824 0.154124700770241 0.00328842487658335 606 | # 2001 ETS(A,N,N) 0.162942355969924 0.157412792166205 0.00552956380371911 607 | # 2002 ETS(A,N,N) 0.160986044554207 0.162941795214416 -0.001955750660209 608 | # 2003 ETS(A,N,N) 0.148533847659868 0.160986242887746 -0.0124523952278778 609 | # 2004 ETS(A,N,N) 0.147400008880004 0.148535110462768 -0.00113510158276331 610 | # 2005 ETS(A,N,N) 0.152327126236553 0.147400123991157 0.00492700224539561 611 | # 2006 ETS(A,N,N) 0.146459794092561 0.152326626587079 -0.00586683249451758 612 | # 2007 ETS(A,N,N) 0.14339666192983 0.146460389050636 -0.00306372712080566 613 | # 2008 ETS(A,N,N) 0.145513631588618 0.143396972623751 0.00211665896486724 614 | # 2009 ETS(A,N,N) 0.139109023459534 0.145513416937297 -0.00640439347776356 615 | # 2010 ETS(A,N,N) 0.134795323545856 0.139109672931905 -0.00431434938604935 616 | # 2011 ETS(A,N,N) 0.130889234985064 0.134795761065932 -0.00390652608086872 617 | # 2012 ETS(A,N,N) 0.136204104247743 0.130889631147599 0.00531447310014455 618 | 619 | View(forecast(fit)) 620 | 621 | 622 | # Plotting the future values with confidence bands 623 | # Use the plot function to plot the future predictions. Notice that the forecast beyond the last observed value is a horizontal prediction line surrounded by upper and lower confidence bands. But why a horizontal prediction line? Because the simple exponential model has no trend or seasonality, the best prediction is simply the last value of the smoothed average. However, we can see that there is significant variation in the prediction, based upon the confidence bands. The confidence bands also increase in size as the forecast period increases, to reflect the growing uncertainty associated with the forecast. 624 | plot(forecast(fit)) 625 | 626 | 627 | 628 | 629 | # 630 | # Modifying the model to include a trend component 631 | # Earlier, we added a linear trend line to the data. If we wanted to incorporate a linear trend into the forecast as well, we can substitute "A" for the second letter (the trend parameter), which yields an "AAN" model (Holt's linear trend). This type of method allows exponential smoothing with a trend. 632 | fit <- ets(x, model = "AAN") 633 | summary(fit) 634 | # > ETS(A,A,N) 635 | # > 636 | # > Call: 637 | # > ets(y = x, model = "AAN") 638 | # > 639 | # > Smoothing parameters: 640 | # > alpha = 0.0312 641 | # > beta = 0.0312 642 | # > 643 | # > Initial states: 644 | # > l = 0.1641 645 | # > b = -0.0021 646 | # > 647 | # > sigma: 0.0042 648 | # > 649 | # > AIC AICc BIC 650 | # > -108.5711 -104.1267 -106.0149 651 | # > 652 | # > Training set error measures: 653 | # > ME RMSE MAE MPE MAPE 654 | # > Training set -0.000290753 0.004157744 0.003574276 -0.2632899 2.40212 655 | # > MASE ACF1 656 | # > Training set 0.7709083 0.05003007 657 | plot(forecast(fit)) 658 | 659 | # Running the ets function iteratively over all of the categories 660 | # Now that we have run an ets model on one category, we can construct some code to automate model construction over ALL of the categories. 
In the process, we will also save some of the accuracy measures so that we can see how our models performed. 661 | # . First, we will sort the dataframe by category, and then by year. 662 | # . Then we will initialize a new dataframe (onestep.df) that we will use to store the accuracy results for each training/test prediction window 663 | # . Then we will process each of the groups, all of which have 14 time periods, as an iteration in a for loop 664 | # . For each iteration, extract a test and a training dataframe 665 | # . Fit a simple exponentially smoothed model to the training dataset 666 | # . Apply the fitted model to the test dataset 667 | # . Apply the accuracy function in order to extract the validation statistics 668 | # . Store each of them in the onestep.df dataframe which was initialized in the previous step 669 | df <- x2 %>% arrange(cat, Year.1) 670 | 671 | # create results data frame 672 | onestep.df <- data.frame(cat = character(), rmse = numeric(), mae = numeric(), 673 | mape = numeric(), acf1 = numeric(), stringsAsFactors = FALSE) 674 | 675 | # 676 | # 677 | library(forecast) 678 | iterations <- 0 679 | for (i in seq(from = 1, to = 999, by = 14)) { 680 | j <- i + 13 681 | # pull out the next category. It will always be 14 records. 682 | x4 <- df[i:j, ] 683 | x <- ts(x4$Not.Covered.Pct, start = c(1999), end = c(2012), frequency = 1) 684 | 685 | # assign the first 10 records to the training data, and the next 4 to the 686 | # test data. 687 | 688 | 689 | trainingdata <- window(x, start = c(1999), end = c(2008)) 690 | testdata <- window(x, start = c(2009), end = c(2012)) 691 | 692 | 693 | par(mfrow = c(2, 2)) 694 | 695 | # first fit the training data, then the test data. 696 | # Use simple exponential smoothing 697 | fit <- ets(trainingdata, model = "ANN") 698 | # summary(fit) 699 | fit2 <- ets(testdata, model = fit) 700 | onestep <- fitted(fit2) 701 | 702 | iterations <- iterations + 1 703 | onestep.df[iterations, 1] <- paste(x4$cat[1]) 704 | onestep.df[iterations, 2] <- accuracy(onestep, testdata)[, 2] #RMSE 705 | onestep.df[iterations, 3] <- accuracy(onestep, testdata)[, 3] #MAE 706 | onestep.df[iterations, 4] <- accuracy(onestep, testdata)[, 5] #MAPE 707 | onestep.df[iterations, 5] <- accuracy(onestep, testdata)[, 7] #ACF1 708 | 709 | if (iterations == 24) 710 | break 711 | } 712 | 713 | # Viewing the prediction with the original values 714 | # In the code above, the fit object contains the original training data along with the fitted values. We can also print the residuals, which are the actual values minus the fitted values. The residuals are important, since they are usually the basis for the accuracy measures; a short sketch of how those measures derive from the residuals appears just below. 
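# As an illustrative aside (a sketch, not part of the book's codebase), the measures stored in onestep.df can be reproduced by hand from the test residuals left over from the final loop iteration, which is a useful sanity check against the accuracy() output.
err  <- testdata - onestep                  # residuals: actual values minus one-step fitted values
rmse <- sqrt(mean(err^2))                   # Root Mean Squared Error
mae  <- mean(abs(err))                      # Mean Absolute Error
mape <- 100 * mean(abs(err / testdata))     # Mean Absolute Percentage Error
c(RMSE = rmse, MAE = mae, MAPE = mape)      # compare with accuracy(onestep, testdata)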
715 | tail(x4) 716 | # > Source: local data frame [6 x 9] 717 | # > 718 | # > cat Avg.Total.Insured Avg.People Year Year.1 719 | # > (fctr) (dbl) (dbl) (fctr) (int) 720 | # > 1 UNDER 18 YEARS 66200.46 73752.5 2007 2007 721 | # > 2 UNDER 18 YEARS 66200.46 73752.5 2008 2008 722 | # > 3 UNDER 18 YEARS 66200.46 73752.5 2009 2009 723 | # > 4 UNDER 18 YEARS 66200.46 73752.5 2010 2010 724 | # > 5 UNDER 18 YEARS 66200.46 73752.5 2011 2011 725 | # > 6 UNDER 18 YEARS 66200.46 73752.5 2012 2012 726 | # > Variables not shown: Total.People (dbl), Total (dbl), Not.Covered (dbl), 727 | # > Not.Covered.Pct (dbl) 728 | head(onestep.df) 729 | # > cat rmse mae mape acf1 730 | # > 1 18 to 24 YEARS 0.013772470 0.009869590 3.752381 1.0000552 731 | # > 2 25 to 34 YEARS 0.004036661 0.003380938 1.217612 1.0150588 732 | # > 3 35 to 44 YEARS 0.006441549 0.004790155 2.231886 0.9999469 733 | # > 4 45 to 54 YEARS 0.004261185 0.003129072 1.734022 0.9999750 734 | # > 5 55 to 64 YEARS 0.005160212 0.004988592 3.534093 0.7878765 735 | # > 6 65 YEARS AND OVER 0.002487451 0.002096323 12.156875 0.9999937 736 | cbind(fit$x, fit$fitted, fit$residuals) 737 | # > Time Series: 738 | # > Start = 1999 739 | # > End = 2008 740 | # > Frequency = 1 741 | # > fit$x fit$fitted fit$residuals 742 | # > 1999 0.11954420 0.1056241 0.0139200744 743 | # > 2000 0.10725759 0.1056255 0.0016320737 744 | # > 2001 0.10649619 0.1056257 0.0008705016 745 | # > 2002 0.10291788 0.1056258 -0.0027078918 746 | # > 2003 0.10393522 0.1056255 -0.0016902769 747 | # > 2004 0.09942592 0.1056253 -0.0061994140 748 | # > 2005 0.10320781 0.1056247 -0.0024169001 749 | # > 2006 0.11231138 0.1056245 0.0066869135 750 | # > 2007 0.10587369 0.1056251 0.0002485556 751 | # > 2008 0.09527547 0.1056252 -0.0103496921 752 | 753 | mean(fit$residuals) 754 | # > [1] -6.056125e-07 755 | # We should also plot the absolute residuals for this category. 756 | absresid <- abs(fit2$residuals) 757 | plot(absresid) 758 | 759 | # fit2 contains the test data along with the fitted values from the model developed from the training data (fit). 760 | cbind(fit2$x, fit2$fitted, fit2$residuals) 761 | # > Time Series: 762 | # > Start = 2009 763 | # > End = 2012 764 | # > Frequency = 1 765 | # > fit2$x fit2$fitted fit2$residuals 766 | # > 2009 0.09745024 0.09451361 0.0029366252 767 | # > 2010 0.09785322 0.09451391 0.0033393083 768 | # > 2011 0.09397731 0.09451424 -0.0005369347 769 | # > 2012 0.08877369 0.09451419 -0.0057404940 770 | 771 | mean(fit2$residuals) 772 | # > [1] -3.738216e-07 773 | # Accuracy Measures 774 | # Using the residuals, we can measure the error between the predicted and actual values with three popular accuracy measures. 775 | # Mean Absolute Error (MAE): This measure takes the mean of the absolute values of all of the errors (residuals). 776 | # Root Mean Squared Error (RMSE): The root mean squared error measures the error by first taking the mean of all of the squared errors, and then taking the square root of that mean in order to revert back to the original scale. This is a standard statistical method of measuring errors. 777 | # Tip: Both MAE and RMSE are scale-dependent measures, which means that they should only be used to compare models built on similar scales. When comparing accuracy among models with different scales, scale-independent measures such as MAPE should be used. 778 | # Mean Absolute Percentage Error (MAPE): This is the mean of the absolute differences between the actual and forecasted values, expressed as a percentage of the actual values. 
This is intuitively easy to understand and is a very popular measure. 779 | # 780 | # We can look at the worst performing models, in terms of a metric such as MAPE, by simply sorting the onestep.df object by the mape column. 781 | onestep.df %>% arrange(., desc(mape)) %>% head() 782 | # > cat rmse mae mape acf1 783 | # > 1 MALE 65 YEARS AND OVER 0.002647903 0.002226841 17.440671 0.5781697 784 | # > 2 MALE 35 to 44 YEARS 0.039044319 0.024111701 14.218445 0.9999903 785 | # > 3 65 YEARS AND OVER 0.002487451 0.002096323 12.156875 0.9999937 786 | # > 4 MALE 25 to 34 YEARS 0.024195057 0.019901117 6.748294 0.9999310 787 | # > 5 MALE 45 to 54 YEARS 0.017711280 0.010388681 6.380865 0.9999900 788 | # > 6 FEMALE 55 to 64 YEARS 0.006749771 0.005610275 4.021224 0.5038707 789 | # Charting all of the MAPEs at once shows that the exponential model works better for the younger groups and seems to degrade for the older groups, especially for males. 790 | lattice::barchart(cat ~ mape, data = onestep.df) 791 | 792 | 793 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Practical Predictive Analytics 5 | This is the code repository for [Practical Predictive Analytics](https://www.packtpub.com/big-data-and-business-intelligence/practical-predictive-analytics?utm_source=github&utm_medium=repository&utm_campaign=9781785886188), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the book from start to finish. 6 | ## About the Book 7 | This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions with examples from the world of marketing, healthcare, and retail. We'll get started with a brief history of predictive analytics and learn about different roles and functions people play within a predictive analytics project. Then, we will learn about various ways of installing R along with their pros and cons, combined with a step-by-step installation of RStudio, and a description of the best practices for organizing your projects. 
8 | 9 | On completing the installation, we will begin to acquire the skills necessary to input, clean, and prepare your data for modeling. We will learn the six specific steps needed to implement and successfully deploy a predictive model, starting from asking the right questions, through model development, and ending with deploying your predictive model into production. We will learn why collaboration is important and how agile iterative modeling cycles can increase your chances of developing and deploying the most successful model. 10 | 11 | 12 | ## Instructions and Navigation 13 | All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02. 14 | 15 | 16 | 17 | The code will look like the following: 18 | ``` 19 | #run the model 20 | model <- OneR(train_data, frisked ~ ., verbose = TRUE) 21 | #summarize the model 22 | summary(model) 23 | #run the sql function from the SparkR package 24 | SparkR::sql("SELECT sample_bin , count(*) 25 | FROM out_tbl group by sample_bin") 26 | ``` 27 | 28 | This is neither an introductory predictive analytics book, nor an introductory book for 29 | learning R or Spark. Some knowledge of base R data manipulation techniques is expected. 30 | Some prior knowledge of predictive analytics is useful. As mentioned earlier, knowledge of 31 | basic statistical concepts such as hypothesis testing, correlation, means, standard deviations, 32 | and p-values will also help you navigate this book. 33 | 34 | ## Related Products 35 | * [Practical Real-time Data Processing and Analytics](https://www.packtpub.com/big-data-and-business-intelligence/practical-real-time-data-processing-and-analytics?utm_source=github&utm_medium=repository&utm_campaign=9781787281202) 36 | 37 | * [Mastering Predictive Analytics with Python](https://www.packtpub.com/big-data-and-business-intelligence/mastering-predictive-analytics-python?utm_source=github&utm_medium=repository&utm_campaign=9781785882715) 38 | 39 | * [Learning Predictive Analytics with R](https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-r?utm_source=github&utm_medium=repository&utm_campaign=9781782169352) 40 | 41 | ### Suggestions and Feedback 42 | [Click here](https://docs.google.com/forms/d/e/1FAIpQLSe5qwunkGf6PUvzPirPDtuy1Du5Rlzew23UBp2S-P3wB-GcwQ/viewform) if you have any feedback or suggestions. 43 | 44 | ### Download a free PDF 45 | 46 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
47 |

https://packt.link/free-ebook/9781785886188

--------------------------------------------------------------------------------