├── 01_sparkr-basics-1.md ├── 02_sparkr-basics-2.md ├── 03_subsetting.md ├── 04_missing-data.md ├── 05_summary-statistics.md ├── 06_merging.md ├── 07_visualizations.md ├── 08_databases-with-jdbc.md ├── 09_glm.md ├── 10_timeseries-1.md ├── License.md ├── R ├── confint_SparkR.R ├── diff-in-diff.R ├── geom_bivar_histogram_SparkR.R ├── glm.R ├── merging.R ├── missing-data.R ├── ols1.R ├── ols2.R ├── ols2_SparkR2_test.R ├── qqnorm_SparkR.R ├── rbind-fill.R ├── rbind-intersection.R ├── sparkr-basics-1.R ├── sparkr-basics-2.R ├── subsetting.R ├── summary-statistics.R ├── time-series-1.R └── visualizations.R ├── README.md ├── glm_files └── figure-html │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-25-1.png │ ├── unnamed-chunk-27-1.png │ ├── unnamed-chunk-29-1.png │ ├── unnamed-chunk-5-1.png │ ├── unnamed-chunk-7-1.png │ └── unnamed-chunk-9-1.png ├── rmd ├── 01_sparkr-basics-1.rmd ├── 02_sparkr-basics-2.rmd ├── 03_subsetting.rmd ├── 04_missing-data.rmd ├── 05_summary-statistics.rmd ├── 06_merging.rmd ├── 07_visualizations.rmd ├── 09_glm.rmd └── 10_timeseries-1.rmd └── visualizations_files └── figure-html ├── unnamed-chunk-10-1.png ├── unnamed-chunk-11-1.png ├── unnamed-chunk-12-1.png ├── unnamed-chunk-13-1.png ├── unnamed-chunk-15-1.png ├── unnamed-chunk-17-1.png ├── unnamed-chunk-4-1.png ├── unnamed-chunk-5-1.png ├── unnamed-chunk-6-1.png ├── unnamed-chunk-7-1.png ├── unnamed-chunk-8-1.png └── unnamed-chunk-9-1.png /03_subsetting.md: -------------------------------------------------------------------------------- 1 | # Subsetting SparkR DataFrames 2 | Sarah Armstrong, Urban Institute 3 | July 1, 2016 4 | 5 | 6 | 7 | **Last Updated**: May 23, 2017 8 | 9 | 10 | **Objective**: Now that we understand what a SparkR DataFrame (DF) really is (remember, it's not actually data!) and can write expressions using essential DataFrame operations, such as `agg`, we are ready to start subsetting DFs using more advanced transformation operations. This tutorial discusses various ways of subsetting DFs, as well as how to work with a randomly sampled subset as a local data.frame in RStudio: 11 | 12 | * Subset a DF by row 13 | * Subset a DF by a list of columns 14 | * Subset a DF by column expressions 15 | * Drop a column from a DF 16 | * Subset a DF by taking a random sample 17 | * Collect a random sample as a local R data.frame 18 | * Export a DF sample as a single .csv file to S3 19 | 20 | **SparkR/R Operations Discussed**: `filter`, `where`, `select`, `sample`, `collect`, `write.table` 21 | 22 | *** 23 | 24 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 25 | 26 | 27 | 28 | The following error indicates that you have not initiated a SparkR session: 29 | 30 | 31 | ```r 32 | Error in getSparkSession() : SparkSession not initialized 33 | ``` 34 | 35 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 36 | 37 | *** 38 | 39 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 
Note that we are __persisting__ the DataFrame since we will use it throughout this tutorial. 40 | 41 | 42 | ```r 43 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 44 | header = "false", 45 | inferSchema = "true") 46 | cache(df) 47 | ``` 48 | 49 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 50 | 51 | Let's check the dimensions our DF `df` and its column names so that we can compare the dimension sizes of `df` with those of the subsets that we will define throughout this tutorial: 52 | 53 | 54 | ```r 55 | nrow(df) 56 | ## [1] 13216516 57 | ncol(df) 58 | ## [1] 14 59 | columns(df) 60 | ## [1] "loan_id" "period" "servicer_name" "new_int_rt" 61 | ## [5] "act_endg_upb" "loan_age" "mths_remng" "aj_mths_remng" 62 | ## [9] "dt_matr" "cd_msa" "delq_sts" "flag_mod" 63 | ## [13] "cd_zero_bal" "dt_zero_bal" 64 | ``` 65 | 66 | *** 67 | 68 | 69 | ### Subset DataFrame by row: 70 | 71 | The SparkR operation `filter` allows us to subset the rows of a DF according to specified conditions. Before we begin working with `filter` to see how it works, let's print the schema of `df` since the types of subsetting conditions we are able to specify depend on the datatype of each column in the DF: 72 | 73 | 74 | ```r 75 | printSchema(df) 76 | ## root 77 | ## |-- loan_id: long (nullable = true) 78 | ## |-- period: string (nullable = true) 79 | ## |-- servicer_name: string (nullable = true) 80 | ## |-- new_int_rt: double (nullable = true) 81 | ## |-- act_endg_upb: double (nullable = true) 82 | ## |-- loan_age: integer (nullable = true) 83 | ## |-- mths_remng: integer (nullable = true) 84 | ## |-- aj_mths_remng: integer (nullable = true) 85 | ## |-- dt_matr: string (nullable = true) 86 | ## |-- cd_msa: integer (nullable = true) 87 | ## |-- delq_sts: string (nullable = true) 88 | ## |-- flag_mod: string (nullable = true) 89 | ## |-- cd_zero_bal: integer (nullable = true) 90 | ## |-- dt_zero_bal: string (nullable = true) 91 | ``` 92 | 93 | We can subset `df` into a new DF, `f1`, that includes only those loans for which JPMorgan Chase is the servicer with the expression: 94 | 95 | 96 | ```r 97 | f1 <- filter(df, df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 98 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION") 99 | nrow(f1) 100 | ## [1] 102733 101 | ``` 102 | 103 | Notice that the `filter` considers normal logical syntax (e.g. logical conditions and operations), making working with the operation very straightforward. We can specify `filter` with SQL statement strings. For example, here we have the preceding example written in SQL statement format: 104 | 105 | 106 | ```r 107 | filter(df, "servicer_name = 'JP MORGAN CHASE BANK, NA' or servicer_name = 'JPMORGAN CHASE BANK, NA' or 108 | servicer_name = 'JPMORGAN CHASE BANK, NATIONAL ASSOCIATION'") 109 | ``` 110 | 111 | Or, alternatively, in a syntax similar to how we subset data.frames by row in base R: 112 | 113 | 114 | ```r 115 | df[df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 116 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",] 117 | ``` 118 | 119 | Another example of using logical syntax with `filter` is that we can subset `df` such that the new DF only includes those loans for which the servicer name is known, i.e. 
the column `"servicer_name"` is not equal to an empty string or listed as `"OTHER"`: 120 | 121 | 122 | ```r 123 | f2 <- filter(df, df$servicer_name != "OTHER" & df$servicer_name != "") 124 | nrow(f2) 125 | ## [1] 226264 126 | ``` 127 | 128 | Or, if we wanted to only consider observations with a `"loan_age"` value of greater than 60 months (five years), we would evaluate: 129 | 130 | 131 | ```r 132 | f3 <- filter(df, df$loan_age > 60) 133 | nrow(f3) 134 | ## [1] 1714413 135 | ``` 136 | 137 | An alias for `filter` is `where`, which reads much more intuitively, particularly when `where` is embedded in a complex statement. For example, the following expression can be read as "__aggregate__ the mean loan age and count values __by__ `"servicer_name"` in `df` __where__ loan age is less than 60 months": 138 | 139 | 140 | ```r 141 | f4 <- agg(groupBy(where(df, df$loan_age < 60), where(df, df$loan_age < 60)$servicer_name), 142 | loan_age_avg = avg(where(df, df$loan_age < 60)$loan_age), 143 | count = n(where(df, df$loan_age < 60)$loan_age)) 144 | head(f4) 145 | ## servicer_name loan_age_avg count 146 | ## 1 FIRST TENNESSEE BANK, NATIONAL ASSOCIATION 23.45820 12774 147 | ## 2 BANK OF AMERICA, N.A. 20.95203 34688 148 | ## 3 WELLS FARGO BANK, N.A. 47.94743 799 149 | ## 4 GMAC MORTGAGE, LLC 21.17096 16554 150 | ## 5 FLAGSTAR BANK, FSB 42.82895 76 151 | ## 6 USAA FEDERAL SAVINGS BANK 20.35909 3080 152 | ``` 153 | 154 | *** 155 | 156 | 157 | ### Subset DataFrame by column: 158 | 159 | The operation `select` allows us to subset a DF by a specified list of columns. In the expression below, for example, we create a subsetted DF that includes only the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full (remaining maturity) and adjusted remaining maturity: 160 | 161 | 162 | ```r 163 | s1 <- select(df, "mths_remng", "aj_mths_remng") 164 | ncol(s1) 165 | ## [1] 2 166 | ``` 167 | 168 | We can also reference the column names through the DF name, i.e. `select(df, df$mths_remng, df$aj_mths_remng)`. Or, we can save a list of columns as a combination of strings. If we wanted to make a list of all columns that relate to remaining maturity, we could evaluate the expression `remng_mat <- c("mths_remng", "aj_mths_remng")` and then easily reference our list of columns later on with `select(df, remng_mat)`. 169 | 170 | 171 | Besides subsetting by a list of columns, we can also subset `df` while introducing a new column using a column expression, as we do in the example below. The DF `s2` includes the columns `"mths_remng"` and `"aj_mths_remng"` as in `s1`, but now with a column that lists the absolute value of the difference between the unadjusted and adjusted remaining maturity: 172 | 173 | 174 | ```r 175 | s2 <- select(df, df$mths_remng, df$aj_mths_remng, abs(df$aj_mths_remng - df$mths_remng)) 176 | ncol(s2) 177 | ## [1] 3 178 | head(s2) 179 | ## mths_remng aj_mths_remng abs((aj_mths_remng - mths_remng)) 180 | ## 1 293 286 7 181 | ## 2 292 283 9 182 | ## 3 291 287 4 183 | ## 4 290 287 3 184 | ## 5 289 277 12 185 | ## 6 288 277 11 186 | ``` 187 | 188 | Note that, just as we can subset by row with syntax similar to that in base R, we can similarly achieve subsetting by column. The following expressions are equivalent: 189 | 190 | 191 | ```r 192 | select(df, df$period) 193 | df[,"period"] 194 | df[,2] 195 | ``` 196 | 197 | To simultaneously subset by column and row specifications, you can simply embed a `where` expression in a `select` operation (or vice versa).
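Schematically, the two orderings look like the sketch below (a minimal example using columns that exist in `df`; note that if you `select` first, any column you then filter on must be among the selected columns):

```r
# Filter rows first, then keep only the columns of interest
sub1 <- select(where(df, df$loan_age > 60), "loan_id", "loan_age")

# Or select the columns first and then filter the smaller DF
sub2 <- select(df, "loan_id", "loan_age")
sub2 <- where(sub2, sub2$loan_age > 60)
```
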
The following expression creates a DF that lists loan age values only for observations in which servicer name is unknown: 198 | 199 | 200 | ```r 201 | s3 <- select(where(df, df$servicer_name == "" | df$servicer_name == "OTHER"), "loan_age") 202 | head(s3) 203 | ## loan_age 204 | ## 1 67 205 | ## 2 68 206 | ## 3 69 207 | ## 4 70 208 | ## 5 71 209 | ## 6 72 210 | ``` 211 | 212 | Note that we could have also written the above expression as `df[df$servicer_name == "" | df$servicer_name == "OTHER", "loan_age"]`. 213 | 214 | 215 | #### Drop a column from a DF: 216 | 217 | We can drop a column from a DF very simply by assigning `NULL` to a DF column. Below, we drop `"aj_mths_remng"` from `s1`: 218 | 219 | 220 | ```r 221 | head(s1) 222 | ## mths_remng aj_mths_remng 223 | ## 1 293 286 224 | ## 2 292 283 225 | ## 3 291 287 226 | ## 4 290 287 227 | ## 5 289 277 228 | ## 6 288 277 229 | s1$aj_mths_remng <- NULL 230 | head(s1) 231 | ## mths_remng 232 | ## 1 293 233 | ## 2 292 234 | ## 3 291 235 | ## 4 290 236 | ## 5 289 237 | ## 6 288 238 | ``` 239 | 240 | *** 241 | 242 | 243 | ### Subset a DF by taking a random sample: 244 | 245 | Perhaps the most useful subsetting operation is `sample`, which returns a randomly sampled subset of a DF. With `sample`, we can specify whether we want to sample with or without replacement, the approximate fraction of the original DF's rows that the sample should contain, and whether or not we want to define a random seed. If our initial DF is so massive that performing analysis on the entire dataset would require a more expensive cluster, we can sample the massive dataset, interactively develop our analysis in SparkR using the sample, and then run the resulting program on the initial DF, which references the entire massive dataset, only when required. This strategy helps us avoid wasting resources. 246 | 247 | Below, we take a random sample of `df`, without replacement, that is approximately equal in size to 1% of `df`. Notice that we must define a random seed in order to be able to reproduce our random sample. 248 | 249 | 250 | ```r 251 | df_samp1 <- sample(df, withReplacement = FALSE, fraction = 0.01) # Without set seed 252 | df_samp2 <- sample(df, withReplacement = FALSE, fraction = 0.01) 253 | count(df_samp1) 254 | ## [1] 132479 255 | count(df_samp2) 256 | ## [1] 132507 257 | # The row counts are different and, obviously, the DFs are not equivalent 258 | 259 | df_samp3 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) # With set seed 260 | df_samp4 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) 261 | count(df_samp3) 262 | ## [1] 131997 263 | count(df_samp4) 264 | ## [1] 131997 265 | # The row counts are equal and the DFs are equivalent 266 | ``` 267 | 268 | 269 | #### Collect a random sample as a local data.frame: 270 | 271 | An additional use of `sample` is to collect a random sample of a massive dataset as a local data.frame in R. This allows us to work with a sample in a traditional analysis environment, and because the sample is drawn from a much larger set of observations than we could normally work with, it is likely to be more representative of the population. This can be achieved by simply using `collect` to create a local data.frame: 272 | 273 | 274 | ```r 275 | typeof(df_samp4) # DFs are of class S4 276 | ## [1] "S4" 277 | dat <- collect(df_samp4) 278 | typeof(dat) 279 | ## [1] "list" 280 | ``` 281 | 282 | Note that this data.frame is _not_ local to _your_ personal computer, but rather it was gathered locally to a single node in our AWS cluster.
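Once collected, `dat` is an ordinary R data.frame, so the usual base R (or package) workflow applies directly. A minimal sketch, assuming the sample fits comfortably in memory on that node:

```r
class(dat)             # returns "data.frame"
dim(dat)               # roughly 1% of the rows of df, with all of its columns
summary(dat$loan_age)  # standard base R summaries now work without SparkR
```
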
283 | 284 | #### Export DF sample as a single .csv file to S3: 285 | 286 | If we want to export the sampled DF from RStudio as a single .csv file that we can work with in any environment, we must first coalesce the rows of `df_samp4` to a single node in our cluster using the `repartition` operation. Then, we can use the `write.df` operation as we did in the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial: 287 | 288 | 289 | ```r 290 | df_samp4_1 <- repartition(df_samp4, numPartitions = 1) 291 | write.df(df_samp4_1, path = "s3://ui-spark-social-science-public/data/hfpc_samp.csv", 292 | source = "csv", 293 | mode = "overwrite") 294 | ``` 295 | 296 | :heavy_exclamation_mark: __Warning__: We cannot collect a DF as a data.frame, nor can we repartition it to a single node, unless the DF is sufficiently small in size since it must fit onto a _single_ node! 297 | 298 | __End of tutorial__ - Next up is [Dealing with Missing Data in SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/04_missing-data.md) 299 | -------------------------------------------------------------------------------- /R/confint_SparkR.R: -------------------------------------------------------------------------------- 1 | ############################################################# 2 | ## confint.SparkR: Normal Distribution Confidence Interval ## 3 | ############################################################# 4 | # Sarah Armstrong, Urban Institute 5 | # August 31, 2016 6 | 7 | # Summary: Function that returns a confidence intervals for parameter estimates of a GLM (Gaussian distribution family, identity link function) model. 8 | 9 | # Inputs: 10 | 11 | # (*) object: a SparkR GLM model, fit with `spark.glm` operation 12 | # (*) level: level of confidence for CI 13 | 14 | # Returns: a local data.frame, detailing the CIs for each parameter estimate 15 | 16 | # ci <- confint.SparkR(object = lm, level = 0.975) 17 | # ci 18 | 19 | 20 | confint.SparkR <- function(object, level){ 21 | 22 | coef <- unname(unlist(summary(object)$coefficients[,1])) 23 | 24 | err <- unname(unlist(summary(object)$coefficients[,2])) 25 | 26 | ci <- as.data.frame(cbind(names(unlist(summary(object)$coefficients[,1])), coef - err*qt(level, summary(object)$df.null), coef + err*qt(0.975, summary(object)$df.null))) 27 | 28 | colnames(ci) <- c("","Lower Bound", "Upper Bound") 29 | 30 | return(ci) 31 | 32 | } -------------------------------------------------------------------------------- /R/diff-in-diff.R: -------------------------------------------------------------------------------- 1 | ################################################################################### 2 | ## Social Science Methodologies: Difference-in-differences (Diff-in-diff) Module ## 3 | ################################################################################### 4 | ## Objective: 5 | ## Operations discussed: 6 | 7 | ## Notes: ann overview of the Differences-in-differences method can be found at http://www.nber.org/WNE/lect_10_diffindiffs.pdf and at 8 | ## http://eml.berkeley.edu/~webfac/saez/e131_s04/diff.pdf. 9 | ## References: the SparkR code outlined below is adapted from the introductory Diff-in-Diff R code posted by Dr. Torres-Reyna at Princeton Unviersity, posted at 10 | ## http://www.princeton.edu/~otorres/DID101R.pdf. The data used in this module may be found at http://dss.princeton.edu/training/Panel101.dta. 
11 | 12 | library(foreign) 13 | library(magrittr) 14 | library(SparkR) 15 | 16 | ## Initiate SparkContext: 17 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", 18 | spark.driver.memory="1g", 19 | spark.driver.maxResultSize="1g") 20 | ,sparkPackages="com.databricks:spark-csv_2.11:1.4.0") ## Load CSV Spark Package 21 | ## AWS EMR is using Spark 2.11 so we need the associated version of spark-csv: http://spark-packages.org/package/databricks/spark-csv 22 | ## Define Spark executor memory, as well as driver memory and maxResultSize according to cluster configuration 23 | 24 | ## Initiate SparkRSQL: 25 | sqlContext <- sparkRSQL.init(sc) 26 | 27 | ## Read in example panel data from AWS S3 as a DataFrame (DF): 28 | data <- read.df(sqlContext, "s3://sparkr-tutorials/DinD_R_ex.csv", header='true', delimiter=",", source="csv", inferSchema='true') 29 | cache(data) 30 | head(data) 31 | 32 | ########################################################################################### 33 | ## (1) Create indicators for countries receiving treatment & time periods for treatment: ## 34 | ########################################################################################### 35 | 36 | ## Create an indicator variable, 'time', identifying the unit of time at which treatment began (here, the unit of time is years and the year in which treatment began is 37 | ## 1994). Therefore, 'time' at year 1994, and at subsequent years, is assigned a value of 1 and, for years preceding 1994, is given a value of 0. This indicator variable 38 | ## represents the 39 | data. <- withColumn(data, "trt_time", ifelse(data$year >= 1994, 1, 0)) # Create a new DF, 'data_', with the variable 'time' appended; note that function format given by ifelse(test, yes, no) 40 | ## Stata: gen trt_time = (year >= 1994) & !missing(year) 41 | 42 | 43 | ## Create another indicator variable, 'treatment', indicating the within sample group exposed to the treatment. Here, countries E, F and G were received the treatment, so the 44 | ## 'treatment' variable value for observations in these countries is set equal to 1, while the 'treatment' value for observations within countries A, B, C and D is set equal 45 | ## to 0. 46 | data_ <- withColumn(data., "trt_region", ifelse(data$country == "E" | data$country == "F" | data$country == "G", 1, 0)) 47 | cache(data_) 48 | unpersist(data) 49 | ## Stata: gen trt_region = (country > 4) & !missing(country) 50 | 51 | 52 | ## Rename updated DFs to 'data': 53 | head(data_) # Check the columns of updated DF to confirm DF updated properly 54 | data <- data_ # Rename 'data.' to 'data' 55 | rm(data.) 56 | rm(data_) 57 | head(data) 58 | ## Stata: rename data_ data 59 | ## Stata: drop data_ data. 60 | ## Stata: list _all in 1/5 61 | 62 | 63 | ############################################################################################## 64 | ## (2) Manually compute the treatment effect (i.e. take the difference of the differences): ## 65 | ############################################################################################## 66 | 67 | ## To mimic an experimental design with observational data, exploiting an observed natural experiment, the diff-in-diff method measures the effect of a treatment on an 68 | ## outcome by comparing the average change over time in the response variable for the treatment group and compares this to the average change over time for the control group. 
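## In symbols, this is the standard estimator (ybar denotes a group-period mean of the outcome y):
##   DiD = (ybar_treated,post - ybar_treated,pre) - (ybar_control,post - ybar_control,pre)
## which is algebraically the same as the (d - c) - (b - a) form computed manually below.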
69 | ## This diff-in-diff estimator can be computed manually, as we do immediately below, or it can be computed as the parameter estimate of the treatment indicator in a linear 70 | ## model, which we outline futher below. 71 | 72 | ## Compute the four (4) measurements required to calculate the diff-in-diff estimator: 73 | a <- collect(select(data[data$trt_time == 0 & data$trt_region == 0], mean(data$y))) 74 | b <- collect(select(data[data$trt_time == 0 & data$trt_region == 1], mean(data$y))) 75 | c <- collect(select(data[data$trt_time == 1 & data$trt_region == 0], mean(data$y))) 76 | d <- collect(select(data[data$trt_time == 1 & data$trt_region == 1], mean(data$y))) 77 | 78 | ## Now, manually calculate the diff-in-diff estimator; as you can see, we are literally calculating the difference between the differences: 79 | did_est <- (d-c)-(b-a) 80 | did_est 81 | 82 | 83 | ############################################################ 84 | ## (3) Run a simple difference-in-differences regression: ## 85 | ############################################################ 86 | 87 | ## As previously stated, the parameter estimation of the interaction term 'trt_time:trt_region' included in the below linear model is the diff-in-diff estimator. This can be 88 | ## verified by comparing the 'did_est' value, which we calculated in Section (2), with the parameter estimation for 'trt_time:trt_region'. Note that the values are equal, 89 | ## and that the interaction term 'trt_time:trt_region' is a binary variable that indicates treatment status. 90 | m1 <- glm(y ~ trt_region + trt_time + trt_region:trt_time, data = data, family = "gaussian") 91 | summary(m1) 92 | 93 | 94 | ######################################### 95 | ## (4) Check diff-in-diff assumptions: ## 96 | ######################################### 97 | 98 | ## Is a line graph possible in SparkR? Would be nice to be able to provide visualization of parallel trend assumption - traditionally necessary for causality justification! 99 | 100 | ## Include leads in regression to measure whether or not there is any evidence of an anticipatory effect (if there is no effect, then leads should be approx 0 - this supports parallel trend assumption) 101 | ## Include lags to measure direction and maginitude of effect following initatial treatment exposure 102 | ## >>> Create lead and lag and then re-run glm 103 | 104 | ## Could perhaps run an F-test on the difference in mean(y) across the treatment and control groups (here, countries) for the pre-treatment years - if parallel trend 105 | ## asumption is valid, this F-test should yield stat. insignificant result; Note: this is a necessary condition, but not a sufficient condition for validation of parallel 106 | ## trend assumption since statistical insignificance of F-test results could be due to low test power -------------------------------------------------------------------------------- /R/geom_bivar_histogram_SparkR.R: -------------------------------------------------------------------------------- 1 | ########################################## 2 | ## geom_bivar_histogram.SparkR Function ## 3 | ########################################## 4 | # Sarah Armstrong & Alex Engler, Urban Institute 5 | # July 21, 2016 6 | 7 | # Summary: Plots a two-dimensional (2-D) histogram of frequency counts for two numerical DataFrame columns over a `nbin`-by-`nbin` grid of bins. 
8 | 9 | # Inputs: 10 | # (*) df: SparkR DataFrame 11 | # (*) x, y (string): The names of two numerical-valued columns in the SparkR DataFrame df 12 | # (*) nbins (integer): The square root of the total number of bins that the frequency counts for x and y are aggregated over 13 | # (*) title (string): A string specifying the input for `ggtitle` input in `ggplot` 14 | # (*) xlab, ylab (string): A string specifying the input for `xlab` and `ylab` input in `ggplot`, respectively 15 | 16 | # Returns: 2-D histogram of frequency counts (using `geom_tile` from ggplot2 package) 17 | 18 | # Example: 19 | # p1 <- geom_bivar_histogram.SparkR(df = df, x = "carat", y = "price", nbins = 250) 20 | # p1 + scale_colour_brewer() + ggtitle("This is a title") + xlab("Carat") + ylab("Price") 21 | 22 | geom_bivar_histogram.SparkR <- function(df, x, y, nbins){ 23 | 24 | library(ggplot2) 25 | 26 | x_min <- collect(agg(df, min(df[[x]]))) 27 | x_max <- collect(agg(df, max(df[[x]]))) 28 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 29 | 30 | y_min <- collect(agg(df, min(df[[y]]))) 31 | y_max <- collect(agg(df, max(df[[y]]))) 32 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 33 | 34 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 35 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 36 | 37 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 38 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 39 | 40 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 41 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 42 | 43 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) 44 | 45 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = count)) + geom_tile() 46 | 47 | return(p) 48 | } -------------------------------------------------------------------------------- /R/merging.R: -------------------------------------------------------------------------------- 1 | ############################### 2 | ## Merging SparkR DataFrames ## 3 | ############################### 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 7, 2016 7 | ## Last Updated: August 18, 2016 8 | 9 | 10 | ## Objective: The following tutorial provides an overview of how to join SparkR DataFrames by column and by row. 
In particular, we discuss how to: 11 | 12 | ## * Merge two DFs by column condition(s) (join by row) 13 | ## * Append rows of data to a DataFrame (join by column) 14 | ## + When column name lists are equal across DFs 15 | ## + When column name lists are not equal 16 | 17 | ## **SparkR/R Operations Discussed**: `join`, `merge`, `sample`, `except`, `intersect`, `rbind`, `rbind.intersect` (defined function), `rbind.fill` (defined function) 18 | 19 | 20 | ## Initiate SparkR session: 21 | 22 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 23 | Sys.setenv(SPARK_HOME = "/home/spark") 24 | } 25 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 26 | sparkR.session() 27 | 28 | ## Read in initial data as DataFrame (DF): 29 | 30 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true", na.strings = "") 31 | cache(df) 32 | 33 | 34 | ############################################################# 35 | ## (1) Join (merge) two DataFrames by column condition(s): ## 36 | ############################################################# 37 | 38 | ## We begin by subsetting `df` by column, resulting in two (2) DataFrames that are disjoint, except for them both including the loan identification variable, `"loan_id"`: 39 | 40 | # Print the column names of df: 41 | columns(df) 42 | 43 | # Specify column lists to fit `a` and `b` on - these are disjoint sets (except for "loan_id"): 44 | cols_a <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng") 45 | cols_b <- c("loan_id", "aj_mths_remng", "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal") 46 | 47 | # Create `a` and `b` DFs with the `select` operation: 48 | a <- select(df, cols_a) 49 | b <- select(df, cols_b) 50 | 51 | # Print several rows from each subsetted DF: 52 | str(a) 53 | str(b) 54 | 55 | ## We can use the SparkR operation `join` to merge `a` and `b` by row, returning a DataFrame equivalent to `df`. The `join` operation allows us to perform most SQL join types on SparkR DFs, including: 56 | 57 | ## * `"inner"` (default): Returns rows where there is a match in both DFs 58 | ## * `"outer"`: Returns rows where there is a match in both DFs, as well as rows in both the right and left DF where there was no match 59 | ## * `"full"`, `"fullouter"`: Returns rows where there is a match in one of the DFs 60 | ## * `"left"`, `"leftouter"`, `"left_outer"`: Returns all rows from the left DF, even if there are no matches in the right DF 61 | ## * `"right"`, `"rightouter"`, `"right_outer"`: Returns all rows from the right DF, even if there are no matches in the left DF 62 | ## * Cartesian: Returns the Cartesian product of the sets of records from the two or more joined DFs - `join` will return this DF when we _do not_ specify a `joinType` _nor_ a `joinExpr` (discussed below) 63 | 64 | ## We communicate to SparkR what condition we want to join DFs on with the `joinExpr` specification in `join`. Below, we perform an `"inner"` (default) join on the DFs `a` and `b` on the condition that their `"loan_id"` values be equal: 65 | 66 | ab1 <- join(a, b, a$loan_id == b$loan_id) 67 | str(ab1) 68 | 69 | ## Note that the resulting DF includes two (2) `"loan_id"` columns. 
Unfortunately, we cannot direct SparkR to keep only one of these columns when using `join` to merge by row, and the following command (which we introduced in the subsetting tutorial) drops both `"loan_id"` columns: 70 | 71 | ab1$loan_id <- NULL 72 | 73 | ## We can avoid this by renaming one of the columns before performing `join` and then, utilizing that the columns have distinct names, tell SparkR to drop only one of the columns. For example, we could rename `"loan_id"` in `a` with the expression `a <- withColumnRenamed(a, "loan_id", "loan_id_")`, then drop this column with `ab1$loan_id_ <- NULL` after performing `join` on `a` and `b` to return `ab1`. 74 | 75 | ## The `merge` operation, alternatively, allows us to join DFs and produces two (2) _distinct_ merge columns. We can use this feature to retain the column on which we joined the DFs, but we must still perform a `withColumnRenamed` step if we want our merge column to retain its original column name. 76 | 77 | ## Rather than defining a `joinExpr`, we explictly specify the column(s) that SparkR should `merge` the DFs on with the operation parameters `by` and `by.x`/`by.y` (if the merging column is named differently across the DFs). Note that, if we do not specify `by`, SparkR will merge the DFs on the list of common column names shared by the DFs. Rather than specifying a type of join, `merge` determines how SparkR should merge DFs based on boolean values, `all.x` and `all.y`, which indicate which rows in `x` and `y` should be included in the join, respectively. We can specify `merge` type with the following parameter values: 78 | 79 | ## * `all.x = FALSE`, `all.y = FALSE`: Returns an inner join (this is the default and can be achieved by not specifying values for all.x and all.y) 80 | ## * `all.x = TRUE`, `all.y = FALSE`: Returns a left outer join 81 | ## * `all.x = FALSE`, `all.y = TRUE`: Returns a right outer join 82 | ## * `all.x = TRUE`, `all.y = TRUE`: Returns a full outer join 83 | 84 | ## The following `merge` expression is equivalent to the `join` expression in the preceding example: 85 | 86 | ab2 <- merge(a, b, by = "loan_id") 87 | str(ab2) 88 | 89 | ## Note that the two merging columns are distinct as indicated by the `_x` and `_y` name assignments performed by `merge`. We utilize this distinction in the expressions below to retain a single merge column: 90 | 91 | # Drop "loan_id" column from `b`: 92 | ab2$loan_id_y <- NULL 93 | 94 | # Rename "loan_id" column from `a`: 95 | ab2 <- withColumnRenamed(ab2, "loan_id_x", "loan_id") 96 | 97 | # Final DF with single "loan_id" column: 98 | str(ab2) 99 | 100 | rm(a) 101 | rm(b) 102 | rm(ab1) 103 | rm(ab2) 104 | rm(cols_a) 105 | rm(cols_b) 106 | 107 | ############################################# 108 | ## (2) Append rows of data to a DataFrame: ## 109 | ############################################# 110 | 111 | ## In order to discuss how we can append the rows of one DF to those of another in SparkR, we must first subset `df` into two (2) distinct DataFrames, `A` and `B`. Below, we define `A` as a random subset of `df` with a row count that is approximately equal to half the size of `nrow(df)`. We use the DF operation `except` to create `B`, which includes every row of `df`, `except` for those included in `A`: 112 | 113 | A <- sample(df, withReplacement = FALSE, fraction = 0.5, seed = 1) 114 | B <- except(df, A) 115 | 116 | ## Let's also examine the row count for each subsetted row and confirm that `A` and `B` do not share common rows. 
We can check this with the SparkR operation `intersect`, which performs the intersection set operation on two DFs: 117 | 118 | (nA <- nrow(A)) 119 | (nB <- nrow(B)) 120 | 121 | nA + nB # Equal to nrow(df) 122 | 123 | AintB <- intersect(A, B) 124 | nrow(AintB) 125 | 126 | ################################################################### 127 | ## (2i) Append rows when column name lists are equal across DFs: ## 128 | ################################################################### 129 | 130 | ## If we are certain that the two DFs have equivalent column name lists (with respect to both string values and column ordering), then appending the rows of one DF to another is straightforward. Here, we append the rows of `B` to `A` with the `rbind` operation: 131 | 132 | df1 <- rbind(A, B) 133 | 134 | nrow(df1) 135 | nrow(df) 136 | 137 | ## We can see in the results above that `df1` is equivalent to `df`. We could, alternatively, accomplish this with the `unionALL` operation (e.g. `df1 <- unionAll(A, B)`. Note that `unionAll` is not an alias for `rbind` - we can combine any number of DFs with `rbind` while `unionAll` can only consider two (2) DataFrames at a time. 138 | 139 | unpersist(df1) 140 | rm(df1) 141 | 142 | ############################################################### 143 | ## (2i) Append rows when DF column name lists are not equal: ## 144 | ############################################################### 145 | 146 | ## Before we can discuss appending rows when we do not have column name equivalency, we must first create two DataFrames that have different column names. Let's define a new DataFrame, `B_` that includes every column in `A` and `B`, excluding the column `"loan_age"`: 147 | 148 | columns(B) 149 | 150 | # Define column name list that has every column in `A` and `B`, except "loan_age": 151 | cols_ <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "mths_remng", "aj_mths_remng", 152 | "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal" ) 153 | 154 | # Define subsetted DF: 155 | B_ <- select(B, cols_) 156 | 157 | unpersist(B) 158 | rm(B) 159 | rm(cols_) 160 | 161 | ## We can try to apply SparkR `rbind` operation to append `B_` to `A`, but the expression given below will result in the error: `"Union can only be performed on tables with the same number of columns, but the left table has 14 columns and" "the right has 13"` 162 | 163 | df2 <- rbind(A, B_) 164 | 165 | ## Two strategies to force SparkR to merge DataFrames with different column name lists are to: 166 | 167 | ## 1. Append by an intersection of the two sets of column names, or 168 | ## 2. Use `withColumn` to add columns to DF where they are missing and set each entry in the appended rows of these columns equal to `NA`. 169 | 170 | ## Below is a function, `rbind.intersect`, that accomplishes the first approach. Notice that, in this function, we simply take an intesection of the column names and ask SparkR to perform `rbind`, considering only this subset of (sorted) column names. 171 | 172 | rbind.intersect <- function(x, y) { 173 | cols <- base::intersect(colnames(x), colnames(y)) 174 | return(SparkR::rbind(x[, sort(cols)], y[, sort(cols)])) 175 | } 176 | 177 | ## Here, we append `B_` to `A` using this function and then examine the dimensions of the resulting DF, `df2`, as well as its column names. We can see that, while the row count for `df2` is equal to that for `df`, the DF does not include the `"loan_age"` column (just as we expected!). 
178 | 179 | df2 <- rbind.intersect(A, B_) 180 | dim(df2) 181 | colnames(df2) 182 | 183 | unpersist(df2) 184 | rm(df2) 185 | 186 | ## Accomplishing the second approach is somewhat more involved. The `rbind.fill` function, given below, identifies the outersection of the list of column names for two (2) DataFrames and adds them onto one (1) or both of the DataFrames as needed using `withColumn`. The function appends these columns as string dtype, and we can later recast columns as needed: 187 | 188 | rbind.fill <- function(x, y) { 189 | 190 | m1 <- ncol(x) 191 | m2 <- ncol(y) 192 | col_x <- colnames(x) 193 | col_y <- colnames(y) 194 | outersect <- function(x, y) {setdiff(union(x, y), intersect(x, y))} 195 | col_outer <- outersect(col_x, col_y) 196 | len <- length(col_outer) 197 | 198 | if (m2 < m1) { 199 | for (j in 1:len){ 200 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 201 | } 202 | } else { 203 | if (m2 > m1) { 204 | for (j in 1:len){ 205 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 206 | } 207 | } 208 | if (m2 == m1 & col_x != col_y) { 209 | for (j in 1:len){ 210 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 211 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 212 | } 213 | } else { } 214 | } 215 | x_sort <- x[,sort(colnames(x))] 216 | y_sort <- y[,sort(colnames(y))] 217 | return(SparkR::rbind(x_sort, y_sort)) 218 | } 219 | 220 | ## We again append `B_` to `A`, this time using the `rbind.fill` function: 221 | 222 | df3 <- rbind.fill(A, B_) 223 | 224 | ## Now, the row count for `df3` is equal to that for `df` _and_ it includes all fourteen (14) columns included in `df`: 225 | 226 | dim(df3) 227 | colnames(df3) 228 | 229 | ## We know from the missing data tutorial that `df$loan_age` does not contain any `NA` or `NaN` values. By appending `B_` to `A` with the `rbind.fill` function, therefore, we should have inserted exactly `nrow(B)` many empty string entries in `df3`. Note that `"loan_age"` is currently cast as string dtype and, therefore, the column does not contain any null values and we will need to recast the column to a numerical dtype. 230 | 231 | df3_laEmpty <- where(df3, df3$loan_age == "") 232 | nrow(df3_laEmpty) 233 | 234 | # There are no "loan_age" null values since it is string dtype 235 | df3_laNull <- where(df3, isNull(df3$loan_age)) 236 | nrow(df3_laNull) 237 | 238 | ## Below, we recast `"loan_age"` as integer dtype and check that the number of `"loan_age"` null values in `df3` now matches the number of entry string values in `df3` prior to recasting, as well as the number of rows in `B`: 239 | 240 | # Recast 241 | df3$loan_age <- cast(df3$loan_age, dataType = "integer") 242 | str(df3) 243 | 244 | # Check that values are equal 245 | 246 | df3_laNull_ <- where(df3, isNull(df3$loan_age)) 247 | nrow(df3_laEmpty) # No. of empty strings 248 | 249 | nrow(df3_laNull_) # No. of null entries 250 | 251 | nB # No. of rows in DF `B` 252 | 253 | ## Documentation for rbind.intersection can be found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-intersection.R), and [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-fill.R) for rbind.fill. 
-------------------------------------------------------------------------------- /R/missing-data.R: -------------------------------------------------------------------------------- 1 | ######################################### 2 | ## Dealing with Missing Data in SparkR ## 3 | ######################################### 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 6, 2016 7 | ## Last Updated: August 17, 2016 8 | 9 | 10 | ## Objective: In this tutorial, we discuss general strategies for dealing with missing data in the SparkR environment. While we do not consider conceptually how and why we might impute missing values in a dataset, we do discuss logistically how we could drop rows with missing data and impute missing data with replacement values. We specifically consider the following during this tutorial: 11 | 12 | ## * Specify null values when loading data in as a DF 13 | ## * Conditional expressions on empty DF entries 14 | ## + Null and NaN indicator operations 15 | ## + Conditioning on empty string entries 16 | ## + Distribution of missing data across grouped data 17 | ## * Drop rows with missing data 18 | ## + Null value entries 19 | ## + Empty string entries 20 | ## * Fill missing data entries 21 | ## + Null value entries 22 | ## + Empty string entries 23 | 24 | ## SparkR/R Operations Discussed: `read.df` (`nullValue = ""`), `printSchema`, `nrow`, `isNull`, `isNotNull`, `isNaN`, `count`, `where`, `agg`, `groupBy`, `n`, `collect`, `dropna`, `na.omit`, `list`, `fillna` 25 | 26 | 27 | ## Initiate SparkR session: 28 | 29 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 30 | Sys.setenv(SPARK_HOME = "/home/spark") 31 | } 32 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 33 | sparkR.session() 34 | 35 | ############################################################################## 36 | ## (1) Specify null values when loading data in as a SparkR DataFrame (DF): ## 37 | ############################################################################## 38 | 39 | ## Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-1.md) tutorial. Note that we now include the `na.strings` option in the `read.df` transformation below. By setting `na.strings` equal to an empty string in `read.df`, we direct SparkR to interpret empty entries in the dataset as being equal to nulls in `df`. Therefore, any DF entries matching this string (here, set to equal an empty entry) will be set equal to a null value in `df`. 40 | 41 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true", na.strings = "") 42 | cache(df) 43 | 44 | ## We can replace this empty string with any string that we know indicates a null entry in the dataset, i.e. with `na.strings=""`. Note that SparkR only reads empty entries as null values in numerical and integer datatype (dtype) DF columns, meaning that empty entries in DF columns of string dtype will simply equal an empty string. We consider how to work with this type of observation throughout this tutorial alongside our treatment of null values. 45 | 46 | ## With `printSchema`, we can see the dtype of each column in `df` and, noting which columns are of a numerical and integer dtypes and which are string, use this to determine how we should examine missing data in each column of `df`. 
We also count the number of rows in `df` so that we can compare this value to row counts that we compute throughout this tutorial: 47 | 48 | printSchema(df) 49 | (n <- nrow(df)) 50 | 51 | ###################################################### 52 | ## (2) Conditional expressions on empty DF entries: ## 53 | ###################################################### 54 | 55 | ############################################# 56 | ## (2i) Null and NaN indicator operations: ## 57 | ############################################# 58 | 59 | ## We saw in the subsetting tutorial how to subset a DF by some conditional statement. We can extend this reasoning in order to identify missing data in a DF and to explore the distribution of missing data within a DF. SparkR operations indicating null and NaN entries in a DF are `isNull`, `isNaN` and `isNotNull`, and these can be used in conditional statements to locate or to remove DF rows with null and NaN entries. 60 | 61 | ## Below, we count the number of missing entries in `"loan_age"` and in `"mths_remng"`, which are both of integer dtype. We can see below that there are no missing or NaN entries in `"loan_age"`. Note that the `isNull` and `isNaN` count results differ for `"mths_remng"` - while there are missing values in `"mths_remng"`, there are no NaN entries (entires that are "not a number"). 62 | 63 | df_laNull <- where(df, isNull(df$loan_age)) 64 | count(df_laNull) 65 | df_laNaN <- where(df, isNaN(df$loan_age)) 66 | count(df_laNaN) 67 | 68 | df_mrNull <- where(df, isNull(df$mths_remng)) 69 | count(df_mrNull) 70 | df_mrNaN <- where(df, isNaN(df$mths_remng)) 71 | count(df_mrNaN) 72 | 73 | ################################# 74 | ## (2ii) Empty string entries: ## 75 | ################################# 76 | 77 | ## If we want to count the number of rows with missing entries for `"servicer_name"` (string dtype) we can simply use the equality logical condition (==) to direct SparkR to `count` the number of rows `where` the entries in the `"servicer_name"` column are equal to an empty string: 78 | 79 | df_snEmpty <- where(df, df$servicer_name == "") 80 | count(df_snEmpty) 81 | 82 | ############################################################## 83 | ## (2iii) Distribution of missing data across grouped data: ## 84 | ############################################################## 85 | 86 | ## We can also condition on missing data when aggregating over grouped data in order to see how missing data is distributed over a categorical variable within our data. In order to view the distribution of `"mths_remng"` observations with null values over distinct entries of `"servicer_name"`, we (1) group the entries of the DF `df_mrNull` that we created in the preceding example over `"servicer_name"` entries, (2) create the DF `mrNull_by_sn` which consists of the number of observations in `df_mrNull` by `"servicer_name"` entries and (3) collect `mrNull_by_sn` into a nicely formatted table as a local data.frame: 87 | 88 | gb_sn_mrNull <- groupBy(df_mrNull, df_mrNull$servicer_name) 89 | mrNull_by_sn <- agg(gb_sn_mrNull, Nulls = n(df_mrNull$servicer_name)) 90 | 91 | mrNull_by_sn.dat <- collect(mrNull_by_sn) 92 | mrNull_by_sn.dat 93 | # Alternatively, we could have evaluated showDF(mrNull_by_sn) to print DF 94 | 95 | ## Note that the resulting data.frame lists only nine (9) distinct string values for `"servicer_name"`. So, any row in `df` with a null entry for `"mths_remng"` has one of these strings as its corresponding `"servicer_name"` value. 
We could similarly examine the distribution of missing entries for some string dtype column across grouped data by first filtering a DF on the condition that the string column is equal to an empty string, rather than filtering with a null indicator operation (e.g. `isNull`), then performing the `groupBy` operation. 96 | 97 | ###################################### 98 | ## (3) Drop rows with missing data: ## 99 | ###################################### 100 | 101 | ############################## 102 | ## (3i) Null value entries: ## 103 | ############################## 104 | 105 | ## The SparkR operation `dropna` (or its alias `na.omit`) creates a new DF that omits rows with null value entries. We can configure `dropna` in a number of ways, including whether we want to omit rows with nulls in a specified list of DF columns or across all columns within a DF. 106 | 107 | ## If we want to drop rows with nulls for a list of columns in `df`, we can define a list of column names and then include this in `dropna` or we could embed this list directly in the operation. Below, we explicitly define a list of column names on which we condition `dropna`: 108 | 109 | mrlist <- list("mths_remng", "aj_mths_remng") 110 | df_mrNoNulls <- dropna(df, cols = mrlist) 111 | nrow(df_mrNoNulls) 112 | 113 | ## Alternatively, we could `filter` the DF using the `isNotNull` condition as follows: 114 | 115 | df_mrNoNulls_ <- filter(df, isNotNull(df$mths_remng) & isNotNull(df$aj_mths_remng)) 116 | nrow(df_mrNoNulls_) 117 | 118 | ## If we want to consider all columns in a DF when omitting rows with null values, we can use either the `how` or `minNonNulls` paramters of `dropna`. 119 | 120 | ## The parameter `how` allows us to decide whether we want to drop a row if it contains `"any"` nulls or if we want to drop a row only if `"all"` of its entries are nulls. We can see below that there are no rows in `df` in which all of its values are null, but only a small percentage of the rows in `df` have no null value entries: 121 | 122 | df_all <- dropna(df, how = "all") 123 | nrow(df_all) # Equal in value to n 124 | 125 | df_any <- dropna(df, how = "any") 126 | (n_any <- nrow(df_any)) 127 | (n_any/n)*100 128 | 129 | ## We can set a minimum number of non-null entries required for a row to remain in the DF by specifying a `minNonNulls` value. If included in `dropna`, this specification directs SparkR to drop rows that have less than `minNonNulls = ` non-null entries. Note that including `minNonNulls` overwrites the `how` specification. Below, we omit rows with that have less than 5 and 12 entries that are _not_ nulls. Note that there are no rows in `df` that have less than 5 non-null entries, and there are only approximately 8,000 rows with less than 12 non-null entries. 130 | 131 | df_5 <- dropna(df, minNonNulls = 5) 132 | nrow(df_5) # Equal in value to n 133 | 134 | df_12 <- dropna(df, minNonNulls = 12) 135 | (n_12 <- nrow(df_12)) 136 | n - n_12 137 | 138 | ################################# 139 | ## (3ii) Empty string entries: ## 140 | ################################# 141 | 142 | ## If we want to create a new DF that does not include any row with missing entries for a column of string dtype, we could also use `filter` to accomplish this. 
In order to remove observations with a missing `"servicer_name"` value, we simply filter `df` on the condition that `"servicer_name"` does not equal an empty string entry: 143 | 144 | df_snNoEmpty <- filter(df, df$servicer_name != "") 145 | nrow(df_snNoEmpty) 146 | 147 | #################################### 148 | ## (4) Fill missing data entries: ## 149 | #################################### 150 | 151 | ############################## 152 | ## (4i) Null value entries: ## 153 | ############################## 154 | 155 | ## The `fillna` operation allows us to replace null entries with some specified value. In order to replace null entries in every numerical and integer column in `df` with a value, we simply evaluate the expression `fillna(df, )`. We replace every null entry in `df` with the value 12345 below: 156 | 157 | str(df) 158 | 159 | df_ <- fillna(df, value = 12345) 160 | str(df_) 161 | rm(df_) 162 | 163 | ## If we want to replace null values within a list of DF columns, we can specify a column list just as we did in `dropna`. Here, we replace the null values in only `"act_endg_upb"` with 12345: 164 | 165 | str(df) 166 | 167 | df_ <- fillna(df, list("act_endg_upb" = 12345)) 168 | str(df_) 169 | rm(df_) 170 | 171 | ################################# 172 | ## (4ii) Empty string entries: ## 173 | ################################# 174 | 175 | ## Finally, we can replace the empty entries in string dtype columns with the `ifelse` operation, which follows the syntax `ifelse(, , )`. Here, we replace the empty entries in `"servicer_name"` with the string `"Unknown"`: 176 | 177 | str(df) 178 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 179 | str(df) -------------------------------------------------------------------------------- /R/ols1.R: -------------------------------------------------------------------------------- 1 | ############################################################################ 2 | ## Social Science Methodologies: Generalized Linear Models (GLM) Module 1 ## 3 | ############################################################################ 4 | ## Objective: 5 | ## Operations discussed: glm 6 | 7 | library(SparkR) 8 | library(ggplot2) 9 | library(reshape2) 10 | 11 | ## Initiate SparkContext: 12 | 13 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", 14 | spark.driver.memory="1g", 15 | spark.driver.maxResultSize="1g") 16 | ,sparkPackages="com.databricks:spark-csv_2.11:1.4.0") # Load CSV Spark Package 17 | 18 | ## AWS EMR is using Spark 2.11 so we need the associated version of spark-csv: http://spark-packages.org/package/databricks/spark-csv 19 | ## Define Spark executor memory, as well as driver memory and maxResultSize according to cluster configuration 20 | 21 | ## Initiate SparkRSQL: 22 | 23 | sqlContext <- sparkRSQL.init(sc) 24 | 25 | ## Create a local R data.frame: 26 | 27 | x1 <- rnorm(n=200, mean=10, sd=2) 28 | x2 <- rnorm(n=200, mean=17, sd=3) 29 | x3 <- rnorm(n=200, mean=8, sd=1) 30 | y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1) # Can see what the true values of the model parameters are 31 | dat <- cbind.data.frame(y, x1, x2, x3) 32 | 33 | ## Ordinary linear regression (OLR) model with local data.frame and print model summary: 34 | 35 | m1 <- stats::lm(y ~ x1 + x2 + x3, data = dat) # Include `stats::` to require SparkR to estimate `m1` with base R `lm` operation 36 | summary(m1) 37 | 38 | ## Compute OLR model statistics: 39 | 40 | output1 <- summary(m1) 41 | yavg1 <- mean(dat$y) 42 | yhat1 <- 
m1$fitted.values 43 | coeffs1 <- m1$coefficients 44 | r1 <- m1$resid 45 | SSR1 <- deviance(m1) 46 | Rsq1 <- output1$r.squared 47 | aRsq1 <- output1$adj.r.squared 48 | s1 <- output1$sigma 49 | covmatr1 <- s1^2*output1$cov 50 | 51 | 52 | ## Note: use `lm` function from `stats` R package to estimate ordinary linear regression model for local data.frame to easily compute Rsq and aRsq 53 | ## The `glm` operation of neither `stats` nor `SparkR` yields Rsq/aRsq, which makes sense since Rsq/aRsq are widely-accepted measures of goodness-of-fit (GOF) for ordinary 54 | ## linear regression, but not for generalized linear models. Other GOF measures are typically used when assessing GLMs since the meaning of the Rsq/aRsq values for a GLM 55 | ## become convoluted when fitting a GLM of a family and with a link function different from Gaussian and identity, respectively (in fact, there are several types of 56 | ## residuals that can be computed for GLMs!). The `glm` function in R usually prints AIC, deviance residuals and null deviance in its model summary function. Below, we 57 | ## fit an OLR model using the SparkR `glm` operation since g(Y) = Y = XB + e for the identity link function, g(Y) = Y. 58 | 59 | 60 | 61 | ## Create SparkR DataFrame (DF) from local data.frame: 62 | 63 | df <- as.DataFrame(sqlContext, dat) 64 | 65 | ## Perform OLS estimation on DF with the same specifications as our data.frame OLS estimation: 66 | 67 | m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs") 68 | summary(m2) 69 | 70 | ## Compute OLR model statistics: 71 | 72 | output2 <- summary(m2) 73 | coeffs2 <- output2$coefficients[,1] 74 | 75 | # Calculate average y value: 76 | yavg2 <- collect(agg(df, yavg_df = mean(df$y)))$yavg_df 77 | # Predict fitted values using the DF OLS model -> yields new DF 78 | yhat2_df <- predict(m2, df) 79 | head(yhat2_df) # so you can see what the prediction DF looks like 80 | # Transform the SparkR fitted values DF (yhat2_df) so that it is easier to read and includes squared residuals and squared totals & extract yhat vector (as new DF) 81 | yhat2_df <- transform(yhat2_df, sq_res2 = (yhat2_df$y - yhat2_df$prediction)^2, sq_tot2 = (yhat2_df$y - yavg2)^2) 82 | yhat2_df <- transform(yhat2_df, yhat = yhat2_df$prediction) 83 | head(select(yhat2_df, "y", "yhat", "sq_res2", "sq_tot2")) 84 | head(yhat2 <- select(yhat2_df, "yhat")) 85 | # Compute sum of squared residuals and totals, then use these values to calculate R-squared: 86 | SSR2 <- collect(agg(yhat2_df, SSR2=sum(yhat2_df$sq_res2)))$SSR2 ##### Note: collect produces a data.frame - get values out of d.f's in order to calculate aRsq and Rsq 87 | SST2 <- collect(agg(yhat2_df, SST2=sum(yhat2_df$sq_tot2)))$SST2 88 | Rsq2 <- 1-(SSR2/SST2) 89 | p <- 3 90 | N <- nrow(df) 91 | aRsq2 <- 1-(((1-Rsq2)*(N-1))/(N-p-1)) 92 | 93 | ## Iteratively fit linear regression models using SparkR `glm`, using l-bfgs for optimization, and plot resulting coefficient estimations with `lm` estimate values 94 | 95 | n <- 10 96 | b0 <- rep(0,n) 97 | b1 <- rep(0,n) 98 | b2 <- rep(0,n) 99 | b3 <- rep(0,n) 100 | for(i in 1:n){ 101 | model <- SparkR::glm(y ~ x1 + x2 + x3, data = df) 102 | b0[i] <- unname(summary(model)$coefficients[,1]["(Intercept)"]) 103 | b1[i] <- unname(summary(model)$coefficients[,1]["x1"]) 104 | b2[i] <- unname(summary(model)$coefficients[,1]["x2"]) 105 | b3[i] <- unname(summary(model)$coefficients[,1]["x3"]) 106 | } 107 | 108 | # Prepare parameter estimate lists above as data.frames to pass into ggplot: 109 | b_ests_ <- data.frame(cbind(b0 = unlist(b0), b1 = 
unlist(b1), b2 = unlist(b2), b3 = unlist(b3), Iteration = seq(1, n, by = 1))) 110 | b_ests <- melt(b_ests_, id.vars ="Iteration", measure.vars = c("b0", "b1", "b2", "b3")) 111 | names(b_ests) <- cbind("Iteration", "Variable", "Value") 112 | 113 | 114 | p <- ggplot(data = b_ests, aes(x = Iteration, y = Value, col = Variable), size = 5) + geom_point() + geom_hline(yintercept = unname(coeffs1["(Intercept)"]), linetype = 2) + geom_hline(yintercept = unname(coeffs1["x1"]), linetype = 2) + geom_hline(yintercept = unname(coeffs1["x2"]), linetype = 2) + geom_hline(yintercept = unname(coeffs1["x3"]), linetype = 2) + labs(title = "L.R. Parameters Estimated via L-BFGS") -------------------------------------------------------------------------------- /R/ols2.R: -------------------------------------------------------------------------------- 1 | ############################################################################ 2 | ## Social Science Methodologies: Generalized Linear Models (GLM) Module 2 ## 3 | ############################################################################ 4 | ## Objective: 5 | ## Operations discussed: glm 6 | 7 | library(SparkR) 8 | 9 | ## Initiate SparkContext: 10 | 11 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", 12 | spark.driver.memory="1g", 13 | spark.driver.maxResultSize="1g") 14 | ,sparkPackages="com.databricks:spark-csv_2.11:1.4.0") # Load CSV Spark Package 15 | 16 | ## AWS EMR is using Spark 2.11 so we need the associated version of spark-csv: http://spark-packages.org/package/databricks/spark-csv 17 | ## Define Spark executor memory, as well as driver memory and maxResultSize according to cluster configuration 18 | 19 | ## Initiate SparkRSQL: 20 | 21 | sqlContext <- sparkRSQL.init(sc) 22 | 23 | ## Read in loan performance example data as DataFrame (DF) 'dat': 24 | 25 | dat <- read.df(sqlContext, "s3://sparkr-tutorials/hfpc_ex", header='false', inferSchema='true') 26 | cache(dat) 27 | columns(dat) 28 | ## > columns(dat) 29 | ## [1] "loan_id" "period" "servicer_name" "new_int_rt" "act_endg_upb" "loan_age" 30 | ## [7] "mths_remng" "aj_mths_remng" "dt_matr" "cd_msa" "delq_sts" "flag_mod" 31 | ## [13] "cd_zero_bal" "dt_zero_bal" 32 | 33 | ## 'loan_id' (Loan Identifier): A unique identifier for the mortgage loan 34 | ## 'period' (Monthly Reporting Period): The month and year that pertain to the servicer’s cut-off period for mortgage loan information 35 | ## 'servicer_name' (Servicer Name): the name of the entity that serves as the primary servicer of the mortgage loan 36 | ## 'new_int_rt' (Current Interest Rate): The interest rate on a mortgage loan in effect for the periodic installment due 37 | ## 'act_endg_upb' (Current Actual Unpaid Principal Balance (UPB)): The actual outstanding unpaid principal balance of the mortgage loan (for liquidated loans, the unpaid 38 | ## principal balance of the mortgage loan at the time of liquidation) 39 | ## 'loan_age' (Loan Age): The number of calendar months since the first full month the mortgage loan accrues interest 40 | ## 'mths_remng' (Remaining Months to Maturity): The number of calendar months remaining until the borrower is expected to pay the mortgage loan in full 41 | ## 'aj_mths_remng' (Adjusted Remaining Months To Maturity): the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full 42 | ## 'dt_matr' (Maturity Date): The month and year in which a mortgage loan is scheduled to be paid in full as defined in the mortgage loan documents 43 | ## 'cd_msa' 
(Metropolitan Statistical Area (MSA)): The numeric Metropolitan Statistical Area Code for the property securing the mortgage loan 44 | ## 'delq_sts' (Current Loan Delinquent Status): The number of days, represented in months, the obligor is delinquent as determined by the governing mortgage documents 45 | ## 'flag_mod' (Modification Flag): An indicator that denotes if the mortgage loan has been modified 46 | ## 'cd_zero_bal' (Zero Balance Code): A code indicating the reason the mortgage loan's balance was reduced to zero 47 | ## 'dt_zero_bal' (Zero Balance Effective Date): Date on which the mortgage loan balance was reduced to zero 48 | 49 | ## Print the schema for the DF to see if the data types specifications in the schema make sense, given the variable descriptions above: 50 | 51 | printSchema(dat) 52 | ## > printSchema(dat) 53 | ## root 54 | ## |-- loan_id: long (nullable = true) 55 | ## |-- period: string (nullable = true) # Should be recast as a 'date' 56 | ## |-- servicer_name: string (nullable = true) 57 | ## |-- new_int_rt: double (nullable = true) 58 | ## |-- act_endg_upb: double (nullable = true) 59 | ## |-- loan_age: integer (nullable = true) 60 | ## |-- mths_remng: integer (nullable = true) 61 | ## |-- aj_mths_remng: integer (nullable = true) 62 | ## |-- dt_matr: string (nullable = true) # Should be recast as a 'date' 63 | ## |-- cd_msa: integer (nullable = true) # Should be recast as a 'string' 64 | ## |-- delq_sts: string (nullable = true) 65 | ## |-- flag_mod: string (nullable = true) 66 | ## |-- cd_zero_bal: integer (nullable = true) # Should be recast as a 'string' 67 | ## |-- dt_zero_bal: string (nullable = true) # Should be recast as a 'date' 68 | 69 | ## Preprocessing data: 70 | 71 | # Cast each of the columns noted above into the correct dtype before proceeding with specifying glms 72 | 73 | period_dt <- cast(cast(unix_timestamp(dat$period, 'dd/MM/yyyy'), 'timestamp'), 'date') 74 | dat <- withColumn(dat, 'period_dt', period_dt) # Note that we collapse this into a single step for subsequent casts to date dtype 75 | dat$period <- NULL # Drop string form of period; below, we continue to drop string forms of date dtype columns 76 | 77 | dat <- withColumn(dat, 'matr_dt', cast(cast(unix_timestamp(dat$dt_matr, 'MM/yyyy'), 'timestamp'), 'date')) 78 | dat$dt_matr <- NULL 79 | 80 | dat$cd_msa <- cast(dat$cd_msa, 'string') # We do not need to drop `cd_msa` since we can directly recast this column as a string 81 | 82 | dat$cd_zero_bal <- cast(dat$cd_zero_bal, 'string') 83 | 84 | dat <- withColumn(dat, 'zero_bal_dt', cast(cast(unix_timestamp(dat$dt_zero_bal, 'MM/yyyy'), 'timestamp'), 'date')) 85 | dat$dt_zero_bal <- NULL 86 | 87 | dat$matr_yr <- year(dat$matr_dt) # Extract year of maturity date of loan as an integer in dat DF 88 | dat$zero_bal_yr <- year(dat$zero_bal_dt) # Extract year loan set to 0 as an integer in dat DF 89 | 90 | 91 | 92 | head(dat) 93 | printSchema(dat) # We now have each DF column in the appropriate dtype 94 | 95 | # Drop rows with NAs: 96 | 97 | nrow(dat) 98 | dat_ <- dropna(dat) 99 | nrow(dat_) 100 | dat <- dat_ 101 | rm(dat_) 102 | cache(dat) 103 | 104 | ################################### 105 | ## (1) Fit a Gaussian GLM model: ## 106 | ################################### 107 | 108 | # Fit ordinary linear regression 109 | m1 <- SparkR::glm(act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, data = dat, family = "gaussian") 110 | 111 | 112 | 113 | output <- summary(m1) 114 | coeffs <- output$coefficients[,1] 115 | 116 | # Calculate 
average y value: 117 | act_endg_upb_avg <- collect(agg(dat, act_endg_upb_avg = mean(dat$act_endg_upb)))$act_endg_upb_avg 118 | # Predict fitted values using the DF OLS model -> yields new DF 119 | act_endg_upb_hat <- predict(m1, dat) 120 | cache(act_endg_upb_hat) 121 | head(act_endg_upb_hat) # so you can see what the prediction DF looks like 122 | # Transform the SparkR fitted values DF (act_endg_upb_hat) so that it is easier to read and includes squared residuals and squared totals & extract yhat vector (as new DF) 123 | act_endg_upb_hat <- transform(act_endg_upb_hat, sq_res = (act_endg_upb_hat$act_endg_upb - act_endg_upb_hat$prediction)^2, sq_tot = (act_endg_upb_hat$act_endg_upb - act_endg_upb_avg)^2) 124 | act_endg_upb_hat <- transform(act_endg_upb_hat, act_endg_upb_hat = act_endg_upb_hat$prediction) 125 | head(select(act_endg_upb_hat, "act_endg_upb", "act_endg_upb_hat", "sq_res", "sq_tot")) 126 | head(act_endg_upb_yhat <- select(act_endg_upb_hat, "act_endg_upb_hat")) # Keep the full prediction DF intact so the squared residual and total columns remain available below 127 | 128 | # Compute sum of squared residuals and totals, then use these values to calculate R-squared: 129 | SSR2 <- collect(agg(act_endg_upb_hat, SSR2 = sum(act_endg_upb_hat$sq_res))) ##### Note: produces data.frame - get values out of d.f's in order to calculate aRsq and Rsq 130 | SST2 <- collect(agg(act_endg_upb_hat, SST2 = sum(act_endg_upb_hat$sq_tot))) 131 | Rsq2 <- 1-(SSR2[[1]]/SST2[[1]]) 132 | p <- 5 # Number of regressors in the model 133 | N <- nrow(dat) 134 | aRsq2 <- 1-(((1-Rsq2)*(N-1))/(N-p-1)) 135 | 136 | 137 | n <- 10 138 | b0 <- rep(0,n) 139 | b1 <- rep(0,n) 140 | b2 <- rep(0,n) 141 | b3 <- rep(0,n) 142 | b4 <- rep(0,n) 143 | b5 <- rep(0,n) 144 | for(i in 1:n){ 145 | model <- SparkR::glm(act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, data = dat, family = "gaussian") 146 | b0[i] <- unname(summary(model)$coefficients[,1]["(Intercept)"]) 147 | b1[i] <- unname(summary(model)$coefficients[,1]["new_int_rt"]) 148 | b2[i] <- unname(summary(model)$coefficients[,1]["loan_age"]) 149 | b3[i] <- unname(summary(model)$coefficients[,1]["mths_remng"]) 150 | b4[i] <- unname(summary(model)$coefficients[,1]["matr_yr"]) 151 | b5[i] <- unname(summary(model)$coefficients[,1]["zero_bal_yr"]) 152 | } 153 | 154 | 155 | # Prepare parameter estimate lists above as data.frames to pass into ggplot: 156 | b_ests_ <- data.frame(cbind(b0 = unlist(b0), b1 = unlist(b1), b2 = unlist(b2), b3 = unlist(b3), b4 = unlist(b4), b5 = unlist(b5), Iteration = seq(1, n, by = 1))) 157 | b_ests <- melt(b_ests_, id.vars ="Iteration", measure.vars = c("b0", "b1", "b2", "b3", "b4", "b5")) 158 | names(b_ests) <- cbind("Iteration", "Variable", "Value") 159 | 160 | 161 | p <- ggplot(data = b_ests, aes(x = Iteration, y = Value, col = Variable), size = 5) + geom_point() + geom_hline(yintercept = unname(coeffs["(Intercept)"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["new_int_rt"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["loan_age"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["mths_remng"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["matr_yr"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["zero_bal_yr"]), linetype = 2) + labs(title = "L.R. 
Parameters Estimated via L-BFGS") -------------------------------------------------------------------------------- /R/ols2_SparkR2_test.R: -------------------------------------------------------------------------------- 1 | # Confirm that SPARK_HOME is set in environment: set SPARK_HOME to be equal to "/home/spark" 2 | # if the size of the elements of SPARK_HOME are less than 1: 3 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 4 | Sys.setenv(SPARK_HOME = "/home/spark") 5 | } 6 | 7 | # Load the SparkR package 8 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 9 | 10 | # Call the SparkR session 11 | sparkR.session(sparkPackages="com.databricks:spark-csv_2.10:1.4.0") 12 | 13 | # Load data as DataFrame 14 | dat <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 15 | cache(dat) 16 | 17 | # Recast variables as needed 18 | period_dt <- cast(cast(unix_timestamp(dat$period, 'dd/MM/yyyy'), 'timestamp'), 'date') 19 | dat <- withColumn(dat, 'period_dt', period_dt) # Note that we collapse this into a single step for subsequent casts to date dtype 20 | dat$period <- NULL # Drop string form of period; below, we continue to drop string forms of date dtype columns 21 | 22 | dat <- withColumn(dat, 'matr_dt', cast(cast(unix_timestamp(dat$dt_matr, 'MM/yyyy'), 'timestamp'), 'date')) 23 | dat$dt_matr <- NULL 24 | 25 | dat$cd_msa <- cast(dat$cd_msa, 'string') # We do not need to drop `cd_msa` since we can directly recast this column as a string 26 | 27 | dat$cd_zero_bal <- cast(dat$cd_zero_bal, 'string') 28 | 29 | dat <- withColumn(dat, 'zero_bal_dt', cast(cast(unix_timestamp(dat$dt_zero_bal, 'MM/yyyy'), 'timestamp'), 'date')) 30 | dat$dt_zero_bal <- NULL 31 | 32 | dat$matr_yr <- year(dat$matr_dt) # Extract year of maturity date of loan as an integer in dat DF 33 | dat$zero_bal_yr <- year(dat$zero_bal_dt) 34 | 35 | # Drop nulls 36 | list <- list("act_endg_upb", "new_int_rt", "loan_age", "mths_remng", "matr_yr", "zero_bal_yr") 37 | dat <- dropna(dat, cols = list) 38 | nrow(dat) 39 | 40 | 41 | # Fit Gaussian family GLM with identity link 42 | glm.gauss <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "gaussian") 43 | 44 | # Save model summary and outputs 45 | output <- summary(glm.gauss) 46 | coeffs <- output$coefficients[,1] 47 | 48 | # Calculate average y value: 49 | act_endg_upb_avg <- collect(agg(dat, act_endg_upb_avg = mean(dat$act_endg_upb)))$act_endg_upb_avg 50 | 51 | # Predict fitted values using the DF OLS model -> yields new DF 52 | dat_pred <- predict(glm.gauss, dat) 53 | head(dat_pred) # so you can see what the prediction DF looks like 54 | 55 | # Transform the SparkR fitted values DF (dat_pred) so that it is easier to read and includes squared residuals and squared totals & extract yhat vector (as new DF) 56 | dat_pred <- transform(dat_pred, sq_res = (dat_pred$act_endg_upb - dat_pred$prediction)^2, sq_tot = (dat_pred$act_endg_upb - act_endg_upb_avg)^2) 57 | dat_pred <- transform(dat_pred, act_endg_upb_hat = dat_pred$prediction) 58 | head(dat_pred2 <- select(dat_pred, "act_endg_upb", "act_endg_upb_hat", "sq_res", "sq_tot")) 59 | head(act_endg_upb_hat <- select(dat_pred2, "act_endg_upb_hat")) 60 | 61 | # Compute sum of squared residuals and totals, then use these values to calculate R-squared: 62 | SSR <- collect(agg(dat_pred2, SSR = sum(dat_pred2$sq_res))) ##### Note: produces data.frame - get values out of d.f's in order to calculate aRsq and Rsq 63 | SST <- collect(agg(dat_pred2, SST = 
sum(dat_pred2$sq_tot))) 64 | Rsq2 <- 1-(SSR[[1]]/SST[[1]]) 65 | p <- 5 66 | N <- nrow(dat) 67 | aRsq2 <- Rsq2 - (1 - Rsq2)*((p - 1)/(N - p)) 68 | 69 | # Compare iterations of spark.glm outputs 70 | 71 | n <- 10 72 | b0 <- rep(0,n) 73 | b1 <- rep(0,n) 74 | b2 <- rep(0,n) 75 | b3 <- rep(0,n) 76 | b4 <- rep(0,n) 77 | b5 <- rep(0,n) 78 | for(i in 1:n){ 79 | model <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "gaussian") 80 | b0[i] <- unname(summary(model)$coefficients[,1]["(Intercept)"]) 81 | b1[i] <- unname(summary(model)$coefficients[,1]["new_int_rt"]) 82 | b2[i] <- unname(summary(model)$coefficients[,1]["loan_age"]) 83 | b3[i] <- unname(summary(model)$coefficients[,1]["mths_remng"]) 84 | b4[i] <- unname(summary(model)$coefficients[,1]["matr_yr"]) 85 | b5[i] <- unname(summary(model)$coefficients[,1]["zero_bal_yr"]) 86 | } 87 | 88 | # Prepare parameter estimate lists above as data.frames to pass into ggplot: 89 | library(reshape2) 90 | library(ggplot2) 91 | b_ests_ <- data.frame(cbind(b0 = unlist(b0), b1 = unlist(b1), b2 = unlist(b2), b3 = unlist(b3), b4 = unlist(b4), b5 = unlist(b5), 92 | Iteration = seq(1, n, by = 1))) 93 | b_ests <- melt(b_ests_, id.vars ="Iteration", measure.vars = c("b0", "b1", "b2", "b3", "b4", "b5")) 94 | names(b_ests) <- cbind("Iteration", "Parameter", "Value") 95 | 96 | p <- ggplot(data = b_ests, aes(x = Iteration, y = Value, col = Parameter), size = 5) 97 | p + geom_point() + labs(title = "L.R. Parameters Estimated via OWLQN") 98 | 99 | # Check functionality of other GLM families 100 | 101 | # Create binary response variable: 102 | dat <- mutate(dat, act_endg_upb_large = ifelse(dat$act_endg_upb > 122640, lit(1), lit(0))) 103 | # Create non-negative loan_age column to use as count data for Poisson 104 | dat <- mutate(dat, loan_age_pos = abs(dat$loan_age)) 105 | 106 | # binomial(link = "logit") 107 | glm.logit <- spark.glm(dat, act_endg_upb_large ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "binomial") 108 | # Gamma(link = "inverse") 109 | glm.gamma <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "Gamma") 110 | # inverse.gaussian(link = "1/mu^2") 111 | glm.invgauss <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "inverse.gaussian") 112 | # poisson(link = "log") 113 | glm.poisson <- spark.glm(dat, loan_age_pos ~ act_endg_upb + new_int_rt + mths_remng + matr_yr + zero_bal_yr, family = "poisson") 114 | # quasi(link = "identity", variance = "constant") 115 | glm.quasi <- spark.glm(dat, loan_age_pos ~ act_endg_upb + new_int_rt + mths_remng + matr_yr + zero_bal_yr, family = "quasi") 116 | # quasibinomial(link = "logit") 117 | glm.quasibin <- spark.glm(dat, act_endg_upb_large ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "quasibinomial") 118 | # quasipoisson(link = "log") 119 | glm.quasipoiss <- spark.glm(dat, loan_age_pos ~ act_endg_upb + new_int_rt + mths_remng + matr_yr + zero_bal_yr, family = "quasipoisson") 120 | 121 | summary(glm.logit) 122 | summary(glm.invgamma) 123 | summary(glm.invgauss) 124 | summary(glm.poisson) 125 | summary(glm.quasi) 126 | summary(glm.quasibin) 127 | summary(glm.quasipoiss) 128 | 129 | ## Diamonds data 130 | 131 | dat <- read.df("s3://sparkr-tutorials/diamonds.csv", header = "true", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "") 132 | cache(dat) 133 | 134 | glm.gauss <- spark.glm(dat, act_endg_upb ~ 
new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "gaussian") 135 | 136 | ct1 <- crosstab(dat, "clarity", "color") 137 | ct2 <- crosstab(dat, "clarity", "cut") 138 | ct3 <- crosstab(dat, "color", "cut") 139 | 140 | glm1 <- spark.glm(dat, price ~ carat + clarity, family = "gaussian") 141 | op1 <- summary(glm1) 142 | 143 | glm2 <- spark.glm(dat, price ~ carat + cut, family = "gaussian") 144 | op2 <- summary(glm2) -------------------------------------------------------------------------------- /R/qqnorm_SparkR.R: -------------------------------------------------------------------------------- 1 | ############################################################# 2 | ## qqnorm.SparkR: Normal Probability Plot of the Residuals ## 3 | ############################################################# 4 | # Sarah Armstrong, Urban Institute 5 | # August 23, 2016 6 | # Last Updated: August 24, 2016 7 | 8 | # Summary: Function that returns a quantile-quantile plot of the residual values from linear model, i.e. a plot that fits quantile values of the standardized residuals against those of a standard normal distribution. 9 | 10 | # Inputs: 11 | 12 | # (*) df: a SparkR DF 13 | # (*) residuals: the column name assigned to the residual values (a string); note: the function will standardize these during execution 14 | # (*) qn: the number of quantiles plotted (default is 100) 15 | # (*) error: relativeError value used in the `approxQuantile` SparkR operation 16 | 17 | # Returns: a ggplot object displaying the Q-Q plot, including axis labels and horizontal dashed lines, annotating the extremum values of the standardized residuals 18 | 19 | # p <- qqnorm.SparkR(df = df, residuals = "res", qn = 100, error = 0.0001) 20 | # p + ggtitle("This is a title") 21 | 22 | qqnorm.SparkR <- function(df, residuals, qn = 100, error){ 23 | 24 | resdf <- select(df, residuals) 25 | 26 | sd.res <- collect(agg(resdf, stddev(resdf[[residuals]])))[[1]] 27 | 28 | resdf <- withColumn(resdf, "stdres", resdf[[residuals]] / sd.res) 29 | 30 | probs <- seq(0, 1, length = qn) 31 | 32 | norm_quantiles <- qnorm(probs, mean = 0, sd = 1) 33 | stdres_quantiles <- unlist(approxQuantile(resdf, col = "stdres", probabilities = probs, relativeError = error)) 34 | 35 | dat <- data.frame(sort(norm_quantiles), sort(stdres_quantiles)) 36 | 37 | p_ <- ggplot(dat, aes(norm_quantiles, stdres_quantiles)) 38 | 39 | p <- p_ + geom_point(color = "#FF3333") + geom_abline(intercept = 0, slope = 1) + xlab("Normal Scores") + ylab("Standardized Residuals") + geom_hline(aes(yintercept = min(dat$sort.stdres_quantiles.), linetype = "1st & qnth Quantile Values"), show.legend = TRUE) + geom_hline(yintercept = max(dat$sort.stdres_quantiles.), linetype = "dotted") + scale_linetype_manual(values = c(name = "none", "1st & qnth Quantile Values" = "dotted")) + guides(linetype = guide_legend("")) + theme(legend.position = "bottom") 40 | 41 | return(p) 42 | 43 | } -------------------------------------------------------------------------------- /R/rbind-fill.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | ## rbind.fill Function ## 3 | ######################### 4 | # Sarah Armstrong, Urban Institute 5 | # July 14, 2016 6 | 7 | # Updated: July 28, 2016 8 | 9 | # Summary: Function that allows us to append rows of one SparkR DataFrame (DF) to another, regardless of the column names for each DF. 
The function identifies the outersection of the lists of column names for two (2) DataFrames and adds the missing columns onto one (1) or both of the DataFrames as needed using `withColumn`. The function appends these columns as string dtype, and we can later recast columns as needed. 10 | 11 | # Inputs: x (a DF) and y (another DF) 12 | # Returns: DataFrame 13 | 14 | # Example: 15 | # df3 <- rbind.fill(df1, df2) 16 | # df3$col <- cast(df3$col, dataType = "integer") 17 | 18 | 19 | rbind.fill <- function(x, y) { 20 | 21 | m1 <- ncol(x) 22 | m2 <- ncol(y) 23 | col_x <- colnames(x) 24 | col_y <- colnames(y) 25 | outersect <- function(x, y) {setdiff(union(x, y), intersect(x, y))} 26 | col_outer <- outersect(col_x, col_y) 27 | len <- length(col_outer) 28 | 29 | if (m2 < m1) { 30 | for (j in 1:len){ 31 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 32 | } 33 | } else { 34 | if (m2 > m1) { 35 | for (j in 1:len){ 36 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 37 | } 38 | } 39 | if (m2 == m1 & !identical(col_x, col_y)) { 40 | for (j in 1:len){ 41 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 42 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 43 | } 44 | } else { } 45 | } 46 | x_sort <- x[,sort(colnames(x))] 47 | y_sort <- y[,sort(colnames(y))] 48 | return(SparkR::rbind(x_sort, y_sort)) 49 | } -------------------------------------------------------------------------------- /R/rbind-intersection.R: -------------------------------------------------------------------------------- 1 | ############################## 2 | ## rbind.intersect Function ## 3 | ############################## 4 | # Sarah Armstrong, Urban Institute 5 | # July 14, 2016 6 | 7 | # Summary: Function that allows us to append rows of one SparkR DataFrame (DF) to another, regardless of the column names for each DF. Takes the simple intersection of the lists of column names and performs the `rbind` SparkR operation on two (2) DFs, considering only the column names included in the intersected list of names. 
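# A brief usage sketch (df1 and df2 are hypothetical DFs that share only some of their column names):
# df3 <- rbind.intersect(df1, df2)
# ncol(df3)   # at most min(ncol(df1), ncol(df2)), since only the shared columns are kept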
8 | 9 | # Inputs: x (a DF) and y (another DF) 10 | # Returns: DataFrame 11 | 12 | rbind.intersect <- function(x, y) { 13 | cols <- base::intersect(colnames(x), colnames(y)) 14 | return(SparkR::rbind(x[, sort(cols)], y[, sort(cols)])) 15 | } -------------------------------------------------------------------------------- /R/sparkr-basics-1.R: -------------------------------------------------------------------------------- 1 | ################################################### 2 | ## SparkR Basics I: From CSV to SparkR DataFrame ## 3 | ################################################### 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## June 23, 2016 7 | ## Last Updated: August 15, 2016 8 | 9 | 10 | ## Objective: Become comfortable working with the SparkR DataFrame (DF) API; particularly, understand how to: 11 | 12 | ## * Read a .csv file into SparkR as a DF 13 | ## * Measure dimensions of a DF 14 | ## * Append a DF with additional rows 15 | ## * Rename columns of a DF 16 | ## * Print column names of a DF 17 | ## * Print a specified number of rows from a DF 18 | ## * Print the SparkR schema 19 | ## * Specify schema in `read.df` operation 20 | ## * Manually specify a schema 21 | ## * Change the data type of a column in a DF 22 | ## * Export a DF to AWS S3 as a folder of partitioned parquet files 23 | ## * Export a DF to AWS S3 as a folder of partitioned .csv files 24 | ## * Read a partitioned file from S3 into SparkR 25 | 26 | ## SparkR Operations Discussed: `read.df`, `nrow`, `ncol`, `dim`, `withColumnRenamed`, `columns`, `head`, `str`, `dtypes`, `schema`, `printSchema`, `cast`, `write.df` 27 | 28 | 29 | ## Initiate Spark session: 30 | 31 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 32 | Sys.setenv(SPARK_HOME = "/home/spark") 33 | } 34 | # Load the SparkR library 35 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 36 | # Initiate a SparkR session 37 | sparkR.session() 38 | 39 | 40 | ###################################### 41 | ## (1) Load a csv file into SparkR: ## 42 | ###################################### 43 | 44 | ## Use the operation `read.df` to load in quarterly Fannie Mae single-family loan performance data from the AWS S3 folder `"s3://sparkr-tutorials/"` as a Spark DataFrame (DF). Below, we load a single quarter (2000, Q1) into SparkR, and save it as the DF `perf`: 45 | 46 | perf <- read.df("s3://sparkr-tutorials/Performance_2000Q1.txt", header = "false", delimiter = "|", source = "csv", inferSchema = "true", na.strings = "") 47 | 48 | ## In the `read.df` operation, we give specifications typically included when reading data into Stata and SAS, such as the delimiter character for .csv files. However, we also include SparkR-specific input including `inferSchema`, which Spark uses to interpet data types for each column in the DF. We discuss this in more detail later on in this tutorial. An additional detail is that `read.df` includes the `na.strings = ""` specification because we want `read.df` to read entries of empty strings in our .csv dataset as NA in the SparkR DF, i.e. we are telling read.df to read entries equal to `""` as `NA` in the DF. We will discuss how SparkR handles empty and null entries in further detail in a subsequent tutorial. 49 | 50 | ## Note: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 51 | 52 | ## We can save the dimensions of the 'perf' DF through the following operations. 
Note that wrapping the computation with () forces SparkR/R to print the computed value: 53 | 54 | (n1 <- nrow(perf)) # Save the number of rows in 'perf' 55 | (m1 <- ncol(perf)) # Save the number of columns in 'perf' 56 | 57 | ## Update a DataFrame with new rows of data: 58 | 59 | ## Since we'll want to analyze loan performance data beyond 2000 Q1, we append the `perf` DF below with the data from subsequent quarters of the same single-family loan performance dataset. Here, we're only appending one subsequent quarter (2000 Q2) to the DF so that our analysis in these tutorials runs quickly, but the following code can be easily adapted by specifying the `a` and `b` values to reflect the quarters that we want to append to our DF. Note that the for-loop below also uses the `read.df` operation, specified here just as when we loaded the initial .csv file as a DF: 60 | 61 | a <- 2 62 | b <- 2 63 | 64 | for(q in a:b){ 65 | 66 | filename <- paste0("Performance_2000Q", q) 67 | filepath <- paste0("s3://sparkr-tutorials/", filename, ".txt") 68 | .perf <- read.df(filepath, header = "false", delimiter = "|", 69 | source = "csv", inferSchema = "true", na.strings = "") 70 | 71 | perf <- rbind(perf, .perf) 72 | } 73 | 74 | ## The result of the for-loop is an appended `perf` DF that consists of the same columns as the initial `perf` DF that we read in from S3, but now with many appended rows. We can confirm this by taking the dimensions of the new DF: 75 | 76 | (n2 <- nrow(perf)) 77 | (m2 <- ncol(perf)) 78 | 79 | 80 | ##################################### 81 | ## (2) Rename DataFrame column(s): ## 82 | ##################################### 83 | 84 | ## The `select` operation performs a by column subset of an existing DF. The columns to be returned in the new DF are specified as a list of column name strings in the `select` operation. Here, we create a new DF called `perf_lim` that includes only the first 14 columns in the `perf` DF, i.e. the DF `perf_lim` is a subset of `perf`: 85 | 86 | cols <- c("_C0","_C1","_C2","_C3","_C4","_C5","_C6","_C7","_C8","_C9","_C10","_C11","_C12","_C13") 87 | perf_lim <- select(perf, col = cols) 88 | 89 | ## We will discuss subsetting DataFrames in further detail in the "Subsetting" tutorial. For now, we will use this subsetted DF to learn how to change column names of DataFrames. 90 | 91 | ## Using a for-loop and the SparkR operation `withColumnRenamed`, we rename the columns of `perf_lim`. The operation `withColumnRenamed` renames an existing column, or columns, in a DF and returns a new DF. 
By specifying the "new" DF name as `perf_lim`, however, we simply rename the columns of `perf_lim` (we could create an entirely separate DF with new column names by specifying a different DF name for `withColumnRenamed`): 92 | 93 | old_colnames <- c("_C0","_C1","_C2","_C3","_C4","_C5","_C6","_C7","_C8","_C9","_C10","_C11","_C12","_C13") 94 | new_colnames <- c("loan_id","period","servicer_name","new_int_rt","act_endg_upb","loan_age","mths_remng", 95 | "aj_mths_remng","dt_matr","cd_msa","delq_sts","flag_mod","cd_zero_bal","dt_zero_bal") 96 | 97 | for(i in 1:14){ 98 | perf_lim <- withColumnRenamed(perf_lim, existingCol = old_colnames[i], newCol = new_colnames[i] ) 99 | } 100 | 101 | ## We can check the column names of `perf_lim` with the `columns` operation or with its alias `colnames`: 102 | 103 | columns(perf_lim) 104 | 105 | ## Additionally, we can use the `head` operation to display the first n-many rows of `perf_lim` (here, we'll take the first five (5) rows of the DF): 106 | 107 | head(perf_lim, num = 5) 108 | 109 | ## We can also use the `str` operation to return a compact visualization of the first several rows of a DF: 110 | 111 | str(perf_lim) 112 | 113 | ############################################ 114 | ## (3) Understanding data-types & schema: ## 115 | ############################################ 116 | 117 | ## We can see in the output for the command `head(perf_lim, num = 5)` that we have what appears to be several different data types (dtypes) in our DF. There are three (3) different ways to explicitly view dtype in SparkR - the operations `dtypes`, `schema` and `printSchema`. As stated above, Spark relies on a "schema" to determine what dtype to assign to each column in a DF (which is easy to remember since the English schema comes from the Greek word for shape or plan!). We can print a visual representation of the schema for a DF with the operations `schema` and `printSchema` while the `dtypes` operation prints a list of DF column names and their corresponding dtypes: 118 | 119 | dtypes(perf_lim) # Prints a list of DF column names and corresponding dtypes 120 | schema(perf_lim) # Prints the schema of the DF 121 | printSchema(perf_lim) # Prints the schema of the DF in a concise tree format 122 | 123 | ## Specifying schema in `read.df` operation & defining a custom schema: 124 | 125 | ## Remember that, when we read in our DF from the S3-hosted .csv file, we included the condition `inferSchema = "true"`. This is just one of three (3) ways to communicate to Spark how the dtypes of the DF columns should be assigned. By specifying `inferSchema = "true"` in `read.df`, we allow Spark to infer the dtype of each column in the DF. Conversely, we could specify our own schema and pass this into the load call, forcing Spark to adopt our dtype specifications for each column. Each of these approaches have their pros and cons, which determine when it is appropriate to prefer one over the other: 126 | 127 | ## * `inferSchema = "true"`: This approach minimizes programmer-driven error since we aren't required to make assertions about the dtypes of each column; however, it is comparatively computationally expensive 128 | 129 | ## * `customSchema`: While computationally more efficient, manually specifying a schema will lead to errors if incorrect dtypes are assigned to columns - if Spark is not able to interpret a column as the specified dtype, `read.df` will fill that column in the DF with NA 130 | 131 | ## Clearly, the situations in which these approaches would be helpful are starkly different. 
In the context of this tutorial, an efficient use of both approaches would be to use `inferSchema = "true"` when reading in `perf`. At this point, we could print the schema with `schema` or `printSchema`, note the dtype for each column (all 28 of them), and then write a `customSchema` with the corresponding specifications (or change them from the inferred schema as needed). We could then use this `customSchema` when appending the subsequent quarters to `perf`. While writing the customSchema may be tedious, including it in the appending for-loop would make that process much more efficient - this would be especially useful if we were appending, for example, 20 years' worth of quarterly data together. The third way to communicate to Spark how to define dtypes is to not specify any schema, i.e. to not include `inferSchema` in `read.df`. Under this condition, every column in the DF is read in as a string dtype. Below is an example of how we could specify a customSchema (here, however, we just use the same dtypes as interpreted for `inferSchema = "true"`): 132 | 133 | customSchema <- structType( 134 | structField("loan_id", type = "long"), 135 | structField("period", type = "string"), 136 | structField("servicer_name", type = "string"), 137 | structField("new_int_rt", type = "double"), 138 | structField("act_endg_upb", type = "double"), 139 | structField("loan_age", type = "integer"), 140 | structField("mths_remng", type = "integer"), 141 | structField("aj_mths_remng", type = "integer"), 142 | structField("dt_matr", type = "string"), 143 | structField("cd_msa", type = "integer"), 144 | structField("delq_sts", type = "string"), 145 | structField("flag_mod", type = "string"), 146 | structField("cd_zero_bal", type = "integer"), 147 | structField("dt_zero_bal", type = "string") 148 | ) 149 | 150 | ## Finally, dtypes can be changed after the DF has been created, using the `cast` operation. However, it is clearly more efficient to properly specify dtypes when creating the DF. A quick example of using the `cast` operation is given below: 151 | 152 | # We can see in the results from the previous printSchema output that `loan_id` is a `long` dtype; here we `cast` it 153 | # as a `string` and then call `printSchema` on this new DF 154 | perf_lim$loan_id <- cast(perf_lim$loan_id, dataType = "string") 155 | printSchema(perf_lim) 156 | 157 | # If we want our original `perf_lim` DF, we can simply recast `loan_id` as a `long` dtype 158 | perf_lim$loan_id <- cast(perf_lim$loan_id, dataType = "long") 159 | printSchema(perf_lim) 160 | 161 | 162 | ####################################### 163 | ## (4) Export DF as data file to S3: ## 164 | ####################################### 165 | 166 | ## Throughout this tutorial, we've built the Spark DataFrame `perf_lim` of quarterly loan performance data, which we'll use in several subsequent tutorials. In order to use this DF later on, we must first export it to a location that can handle large data sizes and in a data structure that works with the SparkR environment. We'll save this example data to an AWS S3 folder (`"sparkr-tutorials"`) from which we'll access other example datasets. Below, we save `perf_lim` as a collection of parquet type files into the folder `"hfpc_ex"` using the `write.df` operation: 167 | 168 | write.df(perf_lim, path = "s3://sparkr-tutorials/hfpc_ex", source = "parquet", mode = "overwrite") 169 | 170 | ## When working with the DF `perf_lim` in the analysis above, we were really accessing data that was partitioned across our cluster. 
In order to export this partitioned data, we export each partition from its node (computer) and then collect them into the folder `"hfpc_ex"`. This "file" of indiviudal, partitioned files should be treated like an indiviudal file when organizing an S3 folder, i.e. __do not__ attempt to save other DataFrames or files to this file. SparkR saves the DF in this partitioned structure to accomodate massive data. 171 | 172 | ## Consider the conditions required for us to be able to save a DataFrame as a single .csv file: the given DF would need to be able to fit onto a single node of our cluster, i.e. it would need to be able to fit onto a single computer. Any data that would necessitate using SparkR in analysis will likely not fit onto a single computer. Note that we have specified `mode = "overwrite"`, indicating that existing data in this folder is expected to be overwritten by the contents of this DF (additional mode specifications include `"error"`, `"ignore"` and `"append"`). 173 | 174 | ## The partitioned nature of `"hfpc_ex"` does not affect our ability to load it back into SparkR and perform further analysis. Below, we use the `read.df` to read in the partitioned parquet file from S3 as the DF `dat`: 175 | 176 | dat <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 177 | 178 | ## Below, we confirm that the dimensions and column names of `dat` and `perf_lim` are equal. When comparing DFs, each with a large number of columns, the following if-else statement can be adapted to check equal dimensions and column names across DFs: 179 | 180 | dim1 <- dim(perf_lim) 181 | dim2 <- dim(dat) 182 | if (dim1[1]!=dim2[1] | dim1[2]!=dim2[2]) { 183 | "Error: dimension values not equal; DataFrame did not export correctly" 184 | } else { 185 | "Dimension values are equal" 186 | } 187 | 188 | ## We can also save the DF as a folder of partitioned .csv files with syntax similar to that which we used to export the DF as partitioned parquet files. Note, however, that this does not retain the column names like saving as partitioned parquet files does. The `write.df` expression for exporting the DF as a folder of partitioned .csv files is given below: 189 | 190 | write.df(perf_lim, path = "s3://sparkr-tutorials/hfpc_ex_csv", source = "csv", mode = "overwrite") 191 | 192 | ## We can read in the .csv files as a DF with the following expression: 193 | 194 | dat2 <- read.df("s3://sparkr-tutorials/hfpc_ex_csv", source = "csv", inferSchema = "true") 195 | 196 | ## Note that the DF columns are now given generic names, but we can use the same for-loop from a previous section in this tutorial to rename the columns in our new DF: 197 | 198 | colnames(dat2) 199 | 200 | for(i in 1:14){ 201 | dat2 <- withColumnRenamed(dat2, existingCol = old_colnames[i], newCol = new_colnames[i]) 202 | } 203 | 204 | colnames(dat2) -------------------------------------------------------------------------------- /R/subsetting.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## Subsetting SparkR DataFrames ## 3 | ################################## 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 1, 2016 7 | ## Last Updated: August 17, 2016 8 | 9 | 10 | ## Objective: Now that we understand what a SparkR DataFrame (DF) really is (remember, it's not actually data!) and can write expressions using essential DataFrame operations, such as `agg`, we are ready to start subsetting DFs using more advanced transformation operations. 
This tutorial discusses various ways of subsetting DFs, as well as how to work with a randomly sampled subset as a local data.frame in RStudio: 11 | 12 | ## * Subset a DF by row 13 | ## * Subset a DF by a list of columns 14 | ## * Subset a DF by column expressions 15 | ## * Drop a column from a DF 16 | ## * Subset a DF by taking a random sample 17 | ## * Collect a random sample as a local R data.frame 18 | ## * Export a DF sample as a single .csv file to S3 19 | 20 | ## SparkR/R Operations Discussed: `filter`, `where`, `select`, `sample`, `collect`, `write.table` 21 | 22 | 23 | ## Initiate SparkR session: 24 | 25 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 26 | Sys.setenv(SPARK_HOME = "/home/spark") 27 | } 28 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 29 | sparkR.session() 30 | 31 | ## Read in example HFPC data from AWS S3 as a DataFrame (DF): 32 | 33 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 34 | cache(df) 35 | 36 | 37 | ## Let's check the dimensions our DF `df` and its column names so that we can compare the dimension sizes of `df` with those of the subsets that we will define throughout this tutorial: 38 | nrow(df) 39 | ncol(df) 40 | columns(df) 41 | 42 | ################################## 43 | ## (1) Subset DataFrame by row: ## 44 | ################################## 45 | 46 | ## The SparkR operation `filter` allows us to subset the rows of a DF according to specified conditions. Before we begin working with `filter` to see how it works, let's print the schema of `df` since the types of subsetting conditions we are able to specify depend on the datatype of each column in the DF: 47 | 48 | printSchema(df) 49 | 50 | ## We can subset `df` into a new DF, `f1`, that includes only those loans for which JPMorgan Chase is the servicer with the expression: 51 | 52 | f1 <- filter(df, df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 53 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION") 54 | nrow(f1) 55 | 56 | ## Notice that the `filter` considers normal logical syntax (e.g. logical conditions and operations), making working with the operation very straightforward. We can specify `filter` with SQL statement strings. For example, here we have the preceding example written in SQL statement format: 57 | 58 | filter(df, "servicer_name = 'JP MORGAN CHASE BANK, NA' or servicer_name = 'JPMORGAN CHASE BANK, NA' or 59 | servicer_name = 'JPMORGAN CHASE BANK, NATIONAL ASSOCIATION'") 60 | 61 | ## Or, alternatively, in a syntax similar to how we subset data.frames by row in base R: 62 | 63 | df[df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 64 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",] 65 | 66 | ## Another example of using logical syntax with `filter` is that we can subset `df` such that the new DF only includes those loans for which the servicer name is known, i.e. the column `"servicer_name"` is not equa to an empty string or listed as `"OTHER"`: 67 | 68 | f2 <- filter(df, df$servicer_name != "OTHER" & df$servicer_name != "") 69 | nrow(f2) 70 | 71 | ## Or, if we wanted to only consider observations with a `"loan_age"` value of greater than 60 months (five years), we would evaluate: 72 | 73 | f3 <- filter(df, df$loan_age > 60) 74 | nrow(f3) 75 | 76 | ## An alias for `filter` is `where`, which reads much more intuitively, particularly when `where` is embedded in a complex statement. 
For example, the following expression can be read as "__aggregate__ the mean loan age and count values __by__ `"servicer_name"` in `df` __where__ loan age is less than 60 months": 77 | 78 | f4 <- agg(groupBy(where(df, df$loan_age < 60), where(df, df$loan_age < 60)$servicer_name), 79 | loan_age_avg = avg(where(df, df$loan_age < 60)$loan_age), 80 | count = n(where(df, df$loan_age < 60)$loan_age)) 81 | head(f4) 82 | 83 | ##################################### 84 | ## (2) Subset DataFrame by column: ## 85 | ##################################### 86 | 87 | ## The operation `select` allows us to subset a DF by a specified list of columns. In the expression below, for example, we create a subsetted DF that includes only the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full (remaining maturity) and adjusted remaining maturity: 88 | 89 | s1 <- select(df, "mths_remng", "aj_mths_remng") 90 | ncol(s1) 91 | 92 | ## We can also reference the column names through the DF name, i.e. `select(df, df$mths_remng, df$aj_mths_remng)`. Or, we can save a list of columns as a vector of strings. If we wanted to make a list of all columns that relate to remaining maturity, we could evaluate the expression `remng_mat <- c("mths_remng", "aj_mths_remng")` and then easily reference our list of columns later on with `select(df, remng_mat)`. 93 | 94 | ## Besides subsetting by a list of columns, we can also subset `df` while introducing a new column using a column expression, as we do in the example below. The DF `s2` includes the columns `"mths_remng"` and `"aj_mths_remng"` as in `s1`, but now with a column that lists the absolute value of the difference between the unadjusted and adjusted remaining maturity: 95 | 96 | s2 <- select(df, df$mths_remng, df$aj_mths_remng, abs(df$aj_mths_remng - df$mths_remng)) 97 | ncol(s2) 98 | head(s2) 99 | 100 | ## Note that, just as we can subset by row with syntax similar to that in base R, we can similarly achieve subsetting by column. The following expressions are equivalent: 101 | 102 | select(df, df$period) 103 | df[,"period"] 104 | df[,2] 105 | 106 | ## To simultaneously subset by column and row specifications, you can simply embed a `where` expression in a `select` operation (or vice versa). The following expression creates a DF that lists loan age values only for observations in which servicer name is unknown: 107 | 108 | s3 <- select(where(df, df$servicer_name == "" | df$servicer_name == "OTHER"), "loan_age") 109 | head(s3) 110 | 111 | ## Note that we could have also written the above expression as: 112 | 113 | df[df$servicer_name == "" | df$servicer_name == "OTHER", "loan_age"] 114 | 115 | ################################### 116 | ## (2i) Drop a column from a DF: ## 117 | ################################### 118 | 119 | ## We can drop a column from a DF very simply by assigning `NULL` to a DF column. Below, we drop `"aj_mths_remng"` from `s1`: 120 | 121 | head(s1) 122 | s1$aj_mths_remng <- NULL 123 | head(s1) 124 | 125 | ################################################# 126 | ## (3) Subset a DF by taking a random sample: ### 127 | ################################################# 128 | 129 | ## Perhaps the most useful subsetting operation is `sample`, which returns a randomly sampled subset of a DF. With `sample`, we can specify whether we want to sample with or without replacement, the approximate size of the sample that we want the new DF to call, and whether or not we want to define a random seed. 
If our initial DF is so massive that performing analysis on the entire dataset requires a more expensive cluster, we can: sample the massive dataset, interactively develop our analysis in SparkR using our sample and then evaluate the resulting program using our initial DF, which calls the entire massive dataset, only as is required. This strategy will help us to minimize wasting resources. 130 | 131 | ## Below, we take a random sample of `df` without replacement that is, in size, approximately equal to 1% of `df`. Notice that we must define a random seed in order to be able to reproduce our random sample. 132 | 133 | df_samp1 <- sample(df, withReplacement = FALSE, fraction = 0.01) # Without set seed 134 | df_samp2 <- sample(df, withReplacement = FALSE, fraction = 0.01) 135 | count(df_samp1) 136 | count(df_samp2) 137 | # The row counts are different and, obviously, the DFs are not equivalent 138 | 139 | df_samp3 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) # With set seed 140 | df_samp4 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) 141 | count(df_samp3) 142 | count(df_samp4) 143 | # The row counts are equal and the DFs are equivalent 144 | 145 | ########################################################## 146 | ## (3i) Collect a random sample as a local data.frame: ### 147 | ########################################################## 148 | 149 | ## An additional use of `sample` is to collect a random sample of a massive dataset as a local data.frame in R. This would allow us to work with a sample dataset in a traditional analysis environment that is likely more representative of the population since we are sampling from a larger set of observations than we are normally doing so. This can be achieved by simply using `collect` to create a local data.frame: 150 | 151 | typeof(df_samp4) # DFs are of class S4 152 | dat <- collect(df_samp4) 153 | typeof(dat) 154 | 155 | ## Note that this data.frame is _not_ local to _your_ personal computer, but rather it was gathered locally to a single node in our AWS cluster. 156 | 157 | ######################################################### 158 | ## (3ii) Export DF sample as a single .csv file to S3: ## 159 | ######################################################### 160 | 161 | ## If we want to export the sampled DF from RStudio as a single .csv file that we can work with in any environment, we must first coalesce the rows of `df_samp4` to a single node in our cluster using the `repartition` operation. Then, we can use the `write.df` operation as we did in the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-1.md) tutorial: 162 | 163 | df_samp4_1 <- repartition(df_samp4, numPartitions = 1) 164 | write.df(df_samp4_1, path = "s3://sparkr-tutorials/hfpc_samp.csv", source = "csv", 165 | mode = "overwrite") 166 | 167 | ## __Warning__: We cannot collect a DF as a data.frame, nor can we repartition it to a single node, unless the DF is sufficiently small in size since it must fit onto a _single_ node! 
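## As a rough guard before gathering a sample onto a single node, we can check its row count first (an illustrative sketch; the 1e6-row threshold and the object names `n_samp` and `dat_local` are arbitrary placeholders):

n_samp <- count(df_samp4)
if (n_samp < 1e6) {
  dat_local <- collect(df_samp4)   # small enough to gather onto a single node
} else {
  print("Sample is still large; reduce `fraction` in `sample` before collecting")
}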
-------------------------------------------------------------------------------- /R/summary-statistics.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## Summary Statistics in SparkR ## 3 | ################################## 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 8, 2016 7 | ## Last Updated: August 18, 2016 8 | 9 | 10 | ## Objective: Summary statistics and aggregations are essential means of summarizing a set of observations. In this tutorial, we discuss how to compute location, statistical dispersion, distribution and dependence measures of numerical variables in SparkR, as well as methods for examining categorical variables. In particular, we consider how to compute the following measurements and aggregations in SparkR: 11 | 12 | ## Numerical Data 13 | 14 | ## * Measures of location: 15 | ## + Mean 16 | ## + Extract summary statistics as local value 17 | ## * Measures of dispersion: 18 | ## + Range width & limits 19 | ## + Variance 20 | ## + Standard deviation 21 | ## + Quantiles 22 | ## * Measures of distribution shape: 23 | ## + Skewness 24 | ## + Kurtosis 25 | ## * Measures of Dependence: 26 | ## + Covariance 27 | ## + Correlation 28 | 29 | ## Categorical Data 30 | 31 | ## * Frequency table 32 | ## * Relative frequency table 33 | ## * Contingency table 34 | 35 | ## SparkR/R Operations Discussed: `describe`, `collect`, `showDF`, `agg`, `mean`, `typeof`, `min`, `max`, `abs`, `var`, `sd`, `skewness`, `kurtosis`, `cov`, `corr`, `count`, `n`, `groupBy`, `nrow`, `crosstab` 36 | 37 | 38 | ## Initiate SparkR session: 39 | 40 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 41 | Sys.setenv(SPARK_HOME = "/home/spark") 42 | } 43 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 44 | sparkR.session() 45 | 46 | ## Read in example HFPC data from AWS S3 as a DataFrame (DF): 47 | 48 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 49 | cache(df) 50 | 51 | #################### 52 | ## NUMERICAL DATA ## 53 | #################### 54 | #################### 55 | 56 | ## The operation `describe` (or its alias `summary`) creates a new DF that consists of several key aggregations (count, mean, max, min, standard deviation) for a specified DF or list of DF columns (note that columns must be of a numerical datatype). We can either (1) use the action operation `showDF` to print this aggregation DF or (2) save it as a local data.frame with `collect`. Here, we perform both of these actions on the aggregation DF `sumstats_mthsremng`, which returns the aggregations listed above for the column `"mths_remng"` in `df`: 57 | 58 | sumstats_mthsremng <- describe(df, "mths_remng") # Specified list of columns here consists only of "mths_remng" 59 | 60 | showDF(sumstats_mthsremng) # Print the aggregation DF 61 | 62 | sumstats_mthsremng.l <- collect(sumstats_mthsremng) # Collect aggregation DF as a local data.frame 63 | sumstats_mthsremng.l 64 | 65 | ## Note that measuring all five (5) of these aggregations at once can be computationally expensive with a massive data set, particularly if we are interested in only a subset of these measurements. Below, we outline ways to measure these aggregations individually, as well as several other key summary statistics for numerical data. 
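## For instance, a single `agg` pass can return just the measurements we need for `"mths_remng"` (a brief sketch using the individual operations covered in the sections below):

mr_stats <- agg(df, mean = mean(df$mths_remng), std_dev = sd(df$mths_remng), minimum = min(df$mths_remng), maximum = max(df$mths_remng))
showDF(mr_stats)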
66 | 67 | ############################### 68 | ## (1) Measures of Location: ## 69 | ############################### 70 | 71 | ################ 72 | ## (1i) Mean: ## 73 | ################ 74 | 75 | ## The mean is the only measure of central tendency currently supported by SparkR. The operations `mean` and `avg` can be used with the `agg` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial to measure the average of a numerical DF column. Remember that `agg` returns another DF. Therefore, we can either print the DF with `showDF` or we can save the aggregation as a local data.frame. Collecting the DF may be preferred if we want to work with the mean `"mths_remng"` value as a single value in RStudio. 76 | 77 | mths_remng.avg <- agg(df, mean = mean(df$mths_remng)) # Create an aggregation DF 78 | 79 | # DataFrame 80 | showDF(mths_remng.avg) # Print this DF 81 | typeof(mths_remng.avg) # Aggregation DF is of class S4 82 | 83 | # data.frame 84 | mths_remng.avg.l <- collect(mths_remng.avg) # Collect the DF as a local data.frame 85 | (mths_remng.avg.l <- mths_remng.avg.l[,1]) # Overwrite data.frame with numerical mean value (was entry in d.f) 86 | typeof(mths_remng.avg.l) # Object is now of a numerical dtype 87 | 88 | ################################# 89 | ## (2) Measures of dispersion: ## 90 | ################################# 91 | 92 | ################################ 93 | ## (2i) Range width & limits: ## 94 | ################################ 95 | 96 | ## We can also use `agg` to create a DF that lists the minimum and maximum values within a numerical DF column (i.e. the limits of the range of values in the column) and the width of the range. Here, we create compute these values for `"mths_remng"` and print the resulting DF with `showDF`: 97 | 98 | mr_range <- agg(df, minimum = min(df$mths_remng), maximum = max(df$mths_remng), 99 | range_width = abs(max(df$mths_remng) - min(df$mths_remng))) 100 | showDF(mr_range) 101 | 102 | ########################################## 103 | ## (2ii) Variance & standard deviation: ## 104 | ########################################## 105 | 106 | ## Again using `agg`, we compute the variance and standard deviation of `"mths_remng"` with the expressions below. Note that, here, we are computing sample variance and standard deviation (which we could also measure with their respective aliases, `variance` and `stddev`). To measure population variance and standard deviation, we would use `var_pop` and `stddev_pop`, respectively. 107 | 108 | mr_var <- agg(df, variance = var(df$mths_remng)) # Sample variance 109 | showDF(mr_var) 110 | 111 | mr_sd <- agg(df, std_dev = sd(df$mths_remng)) # Sample standard deviation 112 | showDF(mr_sd) 113 | 114 | ################################### 115 | ## (2iii) Approximate Quantiles: ## 116 | ################################### 117 | 118 | ## The operation `approxQuantile` returns approximate quantiles for a DF column. We specify the quantiles to be approximated by the operation as a vector set equal to the `probabilities` parameter, and the acceptable level of error by the `relativeError` paramter. 119 | 120 | ## If the column includes `n` rows, then `approxQuantile` will return a list of quantile values with rank values that are acceptably close to those exact values specified by `probabilities`. 
In particular, the operation assigns approximate rank values such that the computed rank, (`probabilities * n`), falls within the inequality `floor((probabilities - relativeError) * n) <= rank(x) <= ceiling((probabilities + relativeError) * n)`. 121 | 122 | ## Below, we define a new DF, `df_`, that includes only nonmissing values for `"mths_remng"` and then compute approximate Q1, Q2 and Q3 values for `"mths_remng"`: 123 | 124 | df_ <- dropna(df, cols = "mths_remng") 125 | 126 | quartiles_mr <- approxQuantile(x = df_, col = "mths_remng", probabilities = c(0.25, 0.5, 0.75), 127 | relativeError = 0.001) 128 | quartiles_mr 129 | 130 | 131 | ######################################### 132 | ## (3) Measures of distribution shape: ## 133 | ######################################### 134 | 135 | #################### 136 | ## (3i) Skewness: ## 137 | #################### 138 | 139 | ## We can measure the magnitude and direction of skew in the distribution of a numerical DF column by using the operation `skewness` with `agg`, just as we did to measure the `mean`, `variance` and `stddev` of a numerical variable. Below, we measure the `skewness` of `"mths_remng"`: 140 | 141 | mr_sk <- agg(df, skewness = skewness(df$mths_remng)) 142 | showDF(mr_sk) 143 | 144 | ##################### 145 | ## (3ii) Kurtosis: ## 146 | ##################### 147 | 148 | ## Similarly, we can meaure the magnitude of, and how sharp is, the central peak of the distribution of a numerical variable, i.e. the "peakedness" of the distribution, (relative to a standard bell curve) with the `kurtosis` operation. Here, we measure the `kurtosis` of `"mths_remng"`: 149 | 150 | mr_kr <- agg(df, kurtosis = kurtosis(df$mths_remng)) 151 | showDF(mr_kr) 152 | 153 | ################################# 154 | ## (4) Measures of dependence: ## 155 | ################################# 156 | 157 | #################################### 158 | ## (4i) Covariance & correlation: ## 159 | #################################### 160 | 161 | ## The actions `cov` and `corr` return the sample covariance and correlation measures of dependency between two DF columns, respectively. Currently, Pearson is the only supported method for calculating correlation. Here we compute the covariance and correlation of `"loan_age"` and `"mths_remng"`. Note that, in saving the covariance and correlation measures, we are not required to first `collect` locally since `cov` and `corr` return values, rather than DFs: 162 | 163 | cov_la.mr <- cov(df, "loan_age", "mths_remng") 164 | corr_la.mr <- corr(df, "loan_age", "mths_remng", method = "pearson") 165 | cov_la.mr 166 | corr_la.mr 167 | 168 | typeof(cov_la.mr) 169 | typeof(corr_la.mr) 170 | 171 | ###################### 172 | ## CATEGORICAL DATA ## 173 | ###################### 174 | 175 | ## We can compute descriptive statistics for categorical data using (1) the `groupBy` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial and (2) operations native to SparkR for this purpose. 176 | 177 | df$cd_zero_bal <- ifelse(isNull(df$cd_zero_bal), "Unknown", df$cd_zero_bal) 178 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 179 | 180 | ########################## 181 | ## (1) Frequency table: ## 182 | ########################## 183 | 184 | ## To create a frequency table for a categorical variable in SparkR, i.e. 
list the number of observations for each distinct value in a column of strings, we can simply use the `count` transformation with grouped data. Group the data by the categorical variable for which we want to return a frequency table. Here, we create a frequency table for using this approach `"cd_zero_bal"`: 185 | 186 | zb_f <- count(groupBy(df, "cd_zero_bal")) 187 | showDF(zb_f) 188 | 189 | ## We could also embed a grouping into an `agg` operation as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial to achieve the same frequency table DF, i.e. we could evaluate the expression `agg(groupBy(df, df$cd_zero_bal), count = n(df$cd_zero_bal))`. 190 | 191 | ################################### 192 | ## (2) Relative frequency table: ## 193 | ################################### 194 | 195 | ## We could similarly create a DF that consists of a relative frequency table. Here, we reproduce the frequency table from the preceding section, but now including the relative frequency for each distinct string value, labeled `"Percentage"`: 196 | 197 | n <- nrow(df) 198 | zb_rf <- agg(groupBy(df, df$cd_zero_bal), Count = n(df$cd_zero_bal), Percentage = n(df$cd_zero_bal) * (100/n)) 199 | showDF(zb_rf) 200 | 201 | ############################ 202 | ## (3) Contingency table: ## 203 | ############################ 204 | 205 | ## Finally, we can create a contingency table with the operation `crosstab`, which returns a data.frame that consists of a contingency table between two categorical DF columns. Here, we create and print a contingency table for `"servicer_name"` and `"cd_zero_bal"`: 206 | 207 | conting_sn.zb <- crosstab(df, "servicer_name", "cd_zero_bal") 208 | conting_sn.zb -------------------------------------------------------------------------------- /R/time-series-1.R: -------------------------------------------------------------------------------- 1 | ############################################################################ 2 | ## Time Series I: Working with the Date Datatype & Resampling a DataFrame ## 3 | ############################################################################ 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 8, 2016 7 | ## Last Updated: August 23, 2016 8 | 9 | 10 | ## Objective: In this tutorial, we discuss how to perform several essential time series operations with SparkR. 
In particular, we discuss how to: 11 | 12 | ## * Identify and parse date datatype (dtype) DF columns, 13 | ## * Compute relative dates based on a specified increment of time, 14 | ## * Extract and modify components of a date dtype column and 15 | ## * Resample a time series DF to a particular unit of time frequency 16 | 17 | ## SparkR/R Operations Discussed: `unix_timestamp`, `cast`, `withColumn`, `to_date`, `last_day`, `next_day`, `add_months`, `date_add`, `date_sub`, `weekofyear`, `dayofyear`, `dayofmonth`, `datediff`, `months_between`, `year`, `month`, `hour`, `minute`, `second`, `agg`, `groupBy`, `mean` 18 | 19 | ## Initiate SparkR session: 20 | 21 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 22 | Sys.setenv(SPARK_HOME = "/home/spark") 23 | } 24 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 25 | sparkR.session() 26 | 27 | ## Read in example HFPC data from AWS S3 as a DataFrame (DF): 28 | 29 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 30 | cache(df) 31 | 32 | ######################################################## 33 | ## (1) Converting a DataFrame column to 'date' dtype: ## 34 | ######################################################## 35 | 36 | ## As we saw in previous tutorials, there are several columns in our dataset that list dates which are helpful in determining loan performance. We will specifically consider the following columns throughout this tutorial: 37 | 38 | ## * `"period"` (Monthly Reporting Period): The month and year that pertain to the servicer’s cut-off period for mortgage loan information 39 | ## * `"dt_matr"`(Maturity Date): The month and year in which a mortgage loan is scheduled to be paid in full as defined in the mortgage loan documents 40 | ## * `"dt_zero_bal"`(Zero Balance Effective Date): Date on which the mortgage loan balance was reduced to zero 41 | 42 | ## Let's begin by reviewing the dytypes that `read.df` infers our date columns as. Note that each of our three (3) date columns were read in as strings: 43 | 44 | str(df) 45 | 46 | ## While we could parse the date strings into separate year, month and day integer dtype columns, converting the columns to date dtype allows us to utilize the datetime functions available in SparkR. 47 | 48 | ## We can convert `"period"`, `"matr_dt"` and `"dt_zero_bal"` to date dtype with the following expressions: 49 | 50 | # `period` 51 | period_uts <- unix_timestamp(df$period, 'MM/dd/yyyy') # 1. Gets current Unix timestamp in seconds 52 | period_ts <- cast(period_uts, 'timestamp') # 2. Casts Unix timestamp `period_uts` as timestamp 53 | period_dt <- cast(period_ts, 'date') # 3. Casts timestamp `period_ts` as date dtype 54 | df <- withColumn(df, 'p_dt', period_dt) # 4. Add date dtype column `period_dt` to `df` 55 | 56 | # `dt_matr` 57 | matr_uts <- unix_timestamp(df$dt_matr, 'MM/yyyy') 58 | matr_ts <- cast(matr_uts, 'timestamp') 59 | matr_dt <- cast(matr_ts, 'date') 60 | df <- withColumn(df, 'mtr_dt', matr_dt) 61 | 62 | # `dt_zero_bal` 63 | zero_bal_uts <- unix_timestamp(df$dt_zero_bal, 'MM/yyyy') 64 | zero_bal_ts <- cast(zero_bal_uts, 'timestamp') 65 | zero_bal_dt <- cast(zero_bal_ts, 'date') 66 | df <- withColumn(df, 'zb_dt', zero_bal_dt) 67 | 68 | ## Note that the string entries of these date DF columns are written in the formats `'MM/dd/yyyy'` and `'MM/yyyy'`. 
While SparkR is able to easily read a date string when it is in the default format, `'yyyy-mm-dd'`, additional steps are required for string to date conversions when the DF column entries are in a format other than the default. In order to create `"p_dt"` from `"period"`, for example, we must:
 69 | 
 70 | ## 1. Define the Unix timestamp for the date string, specifying the date format that the string assumes (here, we specify `'MM/dd/yyyy'`),
 71 | ## 2. Use the `cast` operation to convert the Unix timestamp of the string to `'timestamp'` dtype,
 72 | ## 3. Similarly recast the `'timestamp'` form to `'date'` dtype and
 73 | ## 4. Append the new date dtype `"p_dt"` column to `df` using the `withColumn` operation.
 74 | 
 75 | ## We similarly create date dtype columns using `"dt_matr"` and `"dt_zero_bal"`. If the date string entries of these columns were in the default format, converting to date dtype would be straightforward. If `"period"` was in the format `'yyyy-mm-dd'`, for example, we would be able to append `df` with a date dtype column using a simple `withColumn`/`cast` expression: `df <- withColumn(df, 'p_dt', cast(df$period, 'date'))`. We could also directly convert `"period"` to date dtype using the `to_date` operation: `df$period <- to_date(df$period)`.
 76 | 
 77 | ## If we are lucky enough that our date entries are in the default format, then dtype conversion is simple and we should use either the `withColumn`/`cast` or `to_date` expressions given above. Otherwise, the longer conversion process is required. Note that, if we are maintaining our own dataset that we will use SparkR to analyze, adopting the default date format at the start will make working with date values during analysis much easier.
 78 | 
 79 | ## Now that we've appended our date dtype columns to `df`, let's again look at the DF and compare the date dtype values with their associated date string values:
 80 | 
 81 | str(df)
 82 | 
 83 | ## Note that the `"zb_dt"` entries corresponding to the missing date entries in `"dt_zero_bal"`, which were empty strings, are now nulls.
 84 | 
 85 | ################################################################################
 86 | ## (2) Compute relative dates and measures based on a specified unit of time: ##
 87 | ################################################################################
 88 | 
 89 | ## As we mentioned earlier, converting date strings to date dtype allows us to utilize SparkR datetime operations. In this section, we'll discuss several SparkR operations that return:
 90 | 
 91 | ## * Date dtype columns, which list dates relative to a preexisting date column in the DF, and
 92 | ## * Integer or numerical dtype columns, which list measures of time relative to a preexisting date column.
 93 | 
 94 | ## For convenience, we will review these operations using the `df_dt` DF, which includes only the date columns `"p_dt"` and `"mtr_dt"`, which we created in the preceding section:
 95 | 
 96 | cols_dt <- c("p_dt", "mtr_dt")
 97 | df_dt <- select(df, cols_dt)
 98 | 
 99 | ##########################
100 | ## (2i) Relative dates: ##
101 | ##########################
102 | 
103 | ## SparkR datetime operations that return a new date dtype column include:
104 | 
105 | ## * `last_day`: Returns the _last_ day of the month which the given date belongs to (e.g.
inputting "2013-07-27" returns "2013-07-31")
106 | ## * `next_day`: Returns the _first_ date which is later than the value of the date column that is on the specified day of the week
107 | ## * `add_months`: Returns the date that is `'numMonths'` _after_ `'startDate'`
108 | ## * `date_add`: Returns the date that is `'days'` days _after_ `'start'`
109 | ## * `date_sub`: Returns the date that is `'days'` days _before_ `'start'`
110 | 
111 | ## Below, we create relative date columns (defining `"p_dt"` as the input date) using each of these operations and `withColumn`:
112 | 
113 | df_dt1 <- withColumn(df_dt, 'p_ld', last_day(df_dt$p_dt))
114 | df_dt1 <- withColumn(df_dt1, 'p_nd', next_day(df_dt$p_dt, "Sunday"))
115 | df_dt1 <- withColumn(df_dt1, 'p_addm', add_months(df_dt$p_dt, 1)) # 'startDate'="p_dt", 'numMonths'=1
116 | df_dt1 <- withColumn(df_dt1, 'p_dtadd', date_add(df_dt$p_dt, 1)) # 'start'="p_dt", 'days'=1
117 | df_dt1 <- withColumn(df_dt1, 'p_dtsub', date_sub(df_dt$p_dt, 1)) # 'start'="p_dt", 'days'=1
118 | str(df_dt1)
119 | 
120 | ######################################
121 | ## (2ii) Relative measures of time: ##
122 | ######################################
123 | 
124 | ## SparkR datetime operations that return integer or numerical dtype columns include:
125 | 
126 | ## * `weekofyear`: Extracts the week number as an integer from a given date
127 | ## * `dayofyear`: Extracts the day of the year as an integer from a given date
128 | ## * `dayofmonth`: Extracts the day of the month as an integer from a given date
129 | ## * `datediff`: Returns the number of days from 'start' to 'end'
130 | ## * `months_between`: Returns the number of months between dates 'date1' and 'date2'
131 | 
132 | ## Here, we use `"p_dt"` and `"mtr_dt"` as inputs in the above operations. We again use `withColumn` to append the new columns to a DF:
133 | 
134 | df_dt2 <- withColumn(df_dt, 'p_woy', weekofyear(df_dt$p_dt))
135 | df_dt2 <- withColumn(df_dt2, 'p_doy', dayofyear(df_dt$p_dt))
136 | df_dt2 <- withColumn(df_dt2, 'p_dom', dayofmonth(df_dt$p_dt))
137 | df_dt2 <- withColumn(df_dt2, 'mbtw_p.mtr', months_between(df_dt$mtr_dt, df_dt$p_dt)) # 'date1'=mtr_dt, 'date2'=p_dt
138 | df_dt2 <- withColumn(df_dt2, 'dbtw_p.mtr', datediff(df_dt$mtr_dt, df_dt$p_dt)) # 'start'=p_dt, 'end'=mtr_dt
139 | str(df_dt2)
140 | 
141 | ## Note that operations that consider two different dates are sensitive to how we specify column ordering in the operation expression. For example, if we incorrectly define `"p_dt"` as `date1` and `"mtr_dt"` as `date2`, `"mbtw_p.mtr"` will consist of negative values. Similarly, `datediff` will return negative values if `start` and `end` are misspecified.
142 | 
143 | ######################################################################
144 | ## (3) Extract components of a date dtype column as integer values: ##
145 | ######################################################################
146 | 
147 | ## There are also datetime operations supported by SparkR that allow us to extract individual components of a date dtype column and return these as integers. Below, we use the `year` and `month` operations to create integer dtype columns for each of our date columns. Similar functions include `hour`, `minute` and `second`.
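## The `hour`, `minute` and `second` operations follow the same pattern but are meant for timestamp dtype columns. (The commented snippet below is an added illustrative sketch, not part of the original script; the names `p_ts`, `p_hour` and `p_min` are ours.) Because our date columns carry no time-of-day component, the extracted values would simply be zero, so the snippet only shows the syntax:

# p_ts <- cast(unix_timestamp(df$period, 'MM/dd/yyyy'), 'timestamp')
# df <- withColumn(df, 'p_hour', hour(p_ts))
# df <- withColumn(df, 'p_min', minute(p_ts))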
148 | 
149 | # Year and month values for `"period_dt"`
150 | df <- withColumn(df, 'p_yr', year(df$p_dt))
151 | df <- withColumn(df, "p_m", month(df$p_dt))
152 | 
153 | # Year and month values for `"matr_dt"`
154 | df <- withColumn(df, 'mtr_yr', year(df$mtr_dt))
155 | df <- withColumn(df, "mtr_m", month(df$mtr_dt))
156 | 
157 | # Year and month values for `"zero_bal_dt"`
158 | df <- withColumn(df, 'zb_yr', year(df$zb_dt))
159 | df <- withColumn(df, "zb_m", month(df$zb_dt))
160 | 
161 | ## We can see that each of the above expressions returns a column of integer values representing the requested date value:
162 | 
163 | str(df)
164 | 
165 | ## Note that the `NA` entries of `"zb_dt"` result in `NA` values for `"zb_yr"` and `"zb_m"`.
166 | 
167 | ###########################################################################
168 | ## (4) Resample a time series DF to a particular unit of time frequency: ##
169 | ###########################################################################
170 | 
171 | ## When working with time series data, we are frequently required to resample data to a different time frequency. Combining the `agg` and `groupBy` operations, as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial, is a convenient strategy for accomplishing this in SparkR. We create a new DF, `dat`, that only includes columns of numerical, integer and date dtype to use in our resampling examples:
172 | 
173 | rm(df_dt)
174 | rm(df_dt1)
175 | rm(df_dt2)
176 | 
177 | cols <- c("p_yr", "p_m", "mtr_yr", "mtr_m", "zb_yr", "zb_m", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng", "aj_mths_remng")
178 | dat <- select(df, cols)
179 | 
180 | unpersist(df)
181 | cache(dat)
182 | 
183 | head(dat)
184 | 
185 | ## Note that, in our loan-level data, each row represents a unique loan (each made distinct by the `"loan_id"` column in `df`) and its corresponding characteristics such as `"loan_age"` and `"mths_remng"`. Note that `dat` is simply a subset of `df` and, therefore, also refers to loan-level data.
186 | 
187 | ## While we can resample the data over distinct values of any of the columns in `dat`, we will resample the loan-level data as aggregations of the DF columns by units of time since we are working with time series data. Below, we aggregate the columns of `dat` (taking the mean of the column entries) by `"p_yr"`, and then by `"p_yr"` and `"p_m"`:
188 | 
189 | # Resample by "p_yr"
190 | dat1 <- agg(groupBy(dat, dat$p_yr), p_m = mean(dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 
191 |             new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 
192 |             mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng))
193 | head(dat1)
194 | 
195 | # Resample by "p_yr" and "p_m"
196 | dat2 <- agg(groupBy(dat, dat$p_yr, dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 
197 |             new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 
198 |             mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng))
199 | head(arrange(dat2, dat2$p_yr, dat2$p_m), 15) # Arrange the first 15 rows of `dat2` by ascending `p_yr` and `p_m` values
200 | 
201 | ## Note that we specify the list of DF columns that we want to resample on by including it in `groupBy`. Here, we aggregated by taking the mean of each column.
However, we could use any of the aggregation functions that `agg` is able to interpret (listed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial) and that is in line with the resampling that we are trying to achieve.
202 | 
203 | ## We could resample to any unit of time that we can extract from a date column, e.g. `year`, `month`, `day`, `hour`, `minute`, `second`. Furthermore, we could have skipped the step of creating separate year- and month-level date columns - instead, we could have embedded the datetime functions directly in the `agg` expression. The following expression creates a DF that is equivalent to `dat1` in the preceding example:
204 | 
205 | df2 <- agg(groupBy(df, year(df$p_dt)), p_m = mean(month(df$p_dt)), mtr_yr = mean(year(df$mtr_dt)), 
206 |            zb_yr = mean(year(df$zb_dt)), new_int_rt = mean(df$new_int_rt), act_endg_upb = mean(df$act_endg_upb), 
207 |            loan_age = mean(df$loan_age), mths_remng = mean(df$mths_remng), aj_mths_remng = mean(df$aj_mths_remng))
--------------------------------------------------------------------------------
/R/visualizations.R:
--------------------------------------------------------------------------------
 1 | install.packages('devtools')
 2 | library(devtools)
 3 | library(SparkR)
 4 | devtools::install_github("SKKU-SKT/ggplot2.SparkR")
 5 | library(ggplot2.SparkR)
 6 | library(ggplot2)
 7 | 
 8 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", spark.driver.memory="1g", spark.driver.maxResultSize="1g"), sparkPackages="com.databricks:spark-csv_2.11:1.4.0")
 9 | sqlContext <- sparkRSQL.init(sc)
10 | 
11 | # Throughout this tutorial, we will use the diamonds data that is included in the `ggplot2` package and is frequently used in `ggplot2` examples. The data consists of prices and quality information about 54,000 diamonds. The data contains the four C's of diamond quality, carat, cut, colour and clarity; and five physical measurements, depth, table, x, y and z.
12 | 
13 | df <- read.df(sqlContext, "s3://ui-spark-data/diamonds.csv", header='true', delimiter=",", source="com.databricks.spark.csv", inferSchema='true', nullValue="")
14 | cache(df)
15 | 
16 | # We can see what the data set looks like using the `str` operation:
17 | 
18 | str(df)
19 | 
20 | # Introduced in the spring of 2016, the SparkR extension of Hadley Wickham's `ggplot2` package, `ggplot2.SparkR`, allows SparkR users to build ggplot-type visualizations by specifying a SparkR DataFrame and DF columns in ggplot expressions identical to how we would specify R data.frame components when using the `ggplot2` package, i.e. the extension package allows SparkR users to implement ggplot without having to modify the SparkR DataFrame API.
21 | 
22 | 
23 | # As of the publication date of this tutorial (first version), the `ggplot2.SparkR` package is still nascent and has identifiable bugs. However, we provide `ggplot2.SparkR` in this example for its ease of use, particularly for SparkR users wanting to build basic plots. We alternatively discuss how a SparkR user may develop their own plotting function and provide an example in which we plot a bivariate histogram.
24 | 
25 | # The description of the `diamonds` data given above was taken from http://ggplot2.org/book/qplot.pdf.
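# (Added sketch, not part of the original tutorial.) When `ggplot2.SparkR` is unavailable, or a plot type is unsupported, a simple fallback is to aggregate in SparkR and pass the small collected result to plain `ggplot2`. The object name `cut_counts` below is our own; the `count` column is produced by SparkR's `count`/`groupBy`:

cut_counts <- collect(count(groupBy(df, "cut")))   # tiny data.frame: one row per cut level
ggplot(cut_counts, aes(x = cut, y = count)) + geom_bar(stat = "identity")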
26 | 27 | ################## 28 | ### Bar graph: ### 29 | ################## 30 | 31 | # geom_bar(mapping = NULL, data = NULL, stat = "count", position = "stack", ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 32 | 33 | # Just as we would when using `ggplot2`, the following expression plots a basic bar graph that gives frequency counts across the different levels of `"cut"` quality in the data: 34 | 35 | p1 <- ggplot(df, aes(x = cut)) 36 | p1 + geom_bar() 37 | 38 | ##### Stacked & proportional bar graphs 39 | 40 | # One recognized bug within `ggplot2.SparkR` is that, when specifying a `fill` value, none of the `position` specifications--`"stack"`, `"fill"` nor `"dodge"`--necessarily return plots with constant factor-level ordering across groups. For example, the following expression successfully returns a bar graph that gives frequency counts of `"clarity"` levels (string dtype), grouped over diamond `"cut"` types (also string dtype). Note, however, that the varied color blocks representing `"clarity"` levels are not ordered similarly across different levels of `"cut"`. The same issue results when we specify either of the other two (2) `position` specifications: 41 | 42 | p2 <- ggplot(df, aes(x = cut, fill = clarity)) 43 | p2 + geom_bar(position = "stack") 44 | 45 | ################## 46 | ### Histogram: ### 47 | ################## 48 | 49 | # geom_histogram(mapping = NULL, data = NULL, stat = "bin", position = "stack", ..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 50 | 51 | # Just as we would when using `ggplot2`, the following expression plots a histogram that gives frequency counts across binned `"price"` values in the data: 52 | 53 | p3 <- ggplot(df, aes(price)) 54 | p3 + geom_histogram() 55 | 56 | # The preceding histogram plot assumes the `ggplot2` default, `bins = 30`, but we can change this value or override the `bins` specification by setting a `binwidth` value as we do in the following examples: 57 | 58 | p3 + geom_histogram(binwidth = 250) 59 | p3 + geom_histogram(bins = 50) 60 | 61 | # Weighted histogram: 62 | 63 | # ggplot(df, aes(cut)) + geom_histogram(aes(weight = price)) + ylab("total value") NOT available in `ggplot2.SparkR` 64 | 65 | # Stacked histograms: 66 | 67 | # ggplot(df, aes(price, fill = cut)) + geom_histogram() # NOT available in `ggplot2.SparkR` 68 | # ggplot(df, aes(price, fill = cut)) + geom_histogram(position = "fill") 69 | 70 | 71 | ########################### 72 | ### Frequency Polygons: ### 73 | ########################### 74 | 75 | # geom_freqpoly(mapping = NULL, data = NULL, stat = "bin", position = "identity", ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 76 | 77 | # Frequency polygons provide a visual alternative to histogram plots (note that they describe equivalent aggregations). We can also fit frequency polygons with `ggplot2` syntax - the following expression returns a frequency polygon that is equivalent to the first histogram plotted in the preceding section: 78 | 79 | p3 + geom_freqpoly() 80 | 81 | # Again, we can change the class intervals by specifying `binwidth` or the number of `bins` for the frequency polygon: 82 | 83 | p3 + geom_freqpoly(binwidth = 250) 84 | p3 + geom_freqpoly(bins = 50) 85 | 86 | # Frequency polygons over grouped data are perhaps more easily interpreted than stacked histograms; the following is equivalent to the preceding stacked histogram. 
Note that we specify `"cut"` as `colour`, rather than `fill` as we did when using `geom_histogram`: 87 | 88 | # ggplot(df, aes(price, colour = cut)) + geom_freqpoly() NOT currently supported by `ggplot2.SparkR` 89 | 90 | ################################################################# 91 | ### Dealing with overplotting in scatterplot using `stat_sum` ### 92 | ################################################################# 93 | 94 | # stat_sum(mapping = NULL, data = NULL, geom = "point", position = "identity", ...) 95 | 96 | # NOT supported by `ggplot2.SparkR` 97 | 98 | ################ 99 | ### Boxplot: ### 100 | ################ 101 | 102 | # Finally, we can create boxplots just as we would in `ggplot2`. The following expression gives a boxplot of `"price"` values across levels of `"clarity"`: 103 | 104 | p4 <- ggplot(df, aes(x = clarity, y = price)) 105 | p4 + geom_boxplot() 106 | 107 | ################################################## 108 | ### Additional `ggplot2.SparkR` functionality: ### 109 | ################################################## 110 | 111 | # We can adapt the plot types discussed in the previous sections with the specifications given below: 112 | 113 | #+ Facets: `facet_grid`, `facet_wrap` and `facet_null` (default) 114 | #+ Coordinate systems: `coord_cartesian` and `coord_flip` 115 | #+ Position adjustments: `position_dodge`, `position_fill`, `position_stack` (as seen in previous example) 116 | #+ Scales: `scale_x_log10`, `scale_y_log10`, `labs`, `xlab`, `ylab`, `xlim` and `ylim` 117 | 118 | # For example, the following expression facets our previous histogram example across the different levels of `"cut"` quality: 119 | 120 | p3 + geom_histogram() + facet_wrap(~cut) 121 | 122 | ################################################################## 123 | ### Functionality gaps between `ggplot2` and SparkR extension: ### 124 | ################################################################## 125 | 126 | # Below, we list several operations supported by `ggplot2` that are not currently supported by its SparkR extension package. The list is not exhaustive and is subject to change as the package continues to be developed: 127 | 128 | #+ Weighted bar graph (i.e. specify `weight` in aesthetic) 129 | #+ Weighted histogram 130 | #+ Strictly ordered layers for filled and stacked bar graphs (as we saw in an earlier example) 131 | #+ Stacked or filled histograms 132 | #+ Layer frequency polygon (i.e specify `colour` in aesthetic) 133 | #+ Density plot using `geom_freqpoly` by specifying `y = ..density..` in aesthetic (note that extension package does not support `geom_density`) 134 | 135 | ############################ 136 | ### Bivariate histogram: ### 137 | ############################ 138 | 139 | # In the previous examples, we relied on the `ggplot2.SparkR` package to build plots from DataFrames using syntax identical to that which we would use in a normal application of `ggplot2` on R data.frames. Given the current limitations of the extension package, we may need to develop our own function if we are interested in building a plot type that is not currently supported by `ggplot2.SparkR`. Here, we provide an example of a function that returns a bivariate histogram of two numerical DataFrame columns. 140 | 141 | # When building a function in SparkR, we want to avoid operations that are computationally expensive and building one that returns a plot is no different. 
One of the most expensive operations in SparkR, `collect`, is of particular interest when building functions that return plots since collecting data locally allows us to leverage graphing tools that we use in traditional frameworks, e.g. `ggplot2`. We should `collect` data as infrequently as possible since the operation is highly memory-intensive. In the following function, we `collect` data five (5) times. Four of the times, we are collecting single values (two minimum and two maximum values), which does not use up a huge amount of memory. The last `collect` that we perform, collects a data.frame with three (3) columns and a row for each bin assignment pairing, which can fit in-memory on a single node (assuming we don't specify a massive value for `nbins`). When developing SparkR functions, we should only perform minor collections like the ones discussed. 142 | 143 | geom_bivar_histogram.SparkR <- function(df, x, y, nbins){ 144 | 145 | library(ggplot2) 146 | 147 | x_min <- collect(agg(df, min(df[[x]]))) # Collect 148 | x_max <- collect(agg(df, max(df[[x]]))) # Collect 149 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 150 | 151 | y_min <- collect(agg(df, min(df[[y]]))) # Collect 152 | y_max <- collect(agg(df, max(df[[y]]))) # Collect 153 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 154 | 155 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 156 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 157 | 158 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 159 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 160 | 161 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 162 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 163 | 164 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) # Collect 165 | 166 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = count)) + geom_tile() 167 | 168 | return(p) 169 | } 170 | 171 | # Here, we evaluate the `geom_bivar_histogram.SparkR` function using `"carat"` and `"price"`: 172 | 173 | p5 <- geom_bivar_histogram.SparkR(df = df, x = "carat", y = "price", nbins = 100) 174 | p5 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + ylab("Price") 175 | 176 | # _Note_: Documentation for the `geom_bivar_histogram.SparkR` function is given here: 177 | 178 | # Note that the plot closely resembles a scatterplot. Bivariate histograms are one strategy for mitigating overplotting that often occurs when attempting to visualize massive data sets. Furthermore, it is sometimes impossible to gather the data necessary to map individual points to a scatterplot onto a single node within our cluster - this is when aggregation becomes necessary rather than simply preferable. Just like plotting a univariate histogram, binning data reduces the number of points to plot and, with the appropriate choice of bin number and color scale, bivariate histograms can provide an intuitive alternative to scatterplots when working with massive data sets. 179 | 180 | # For example, the following function is equivalent to our previous one, but we have changed the `fill` specification that partially determines the color scale from `count` to `log10(count)`. Then, we evaluate the new function with a larger `nbins` value, returning a new plot with more granular binning and a more nuanced color scale (since the breaks in the color scale are now log10-spaced). 
181 | 182 | geom_bivar_histogram.SparkR.log10 <- function(df, x, y, nbins){ 183 | 184 | library(ggplot2) 185 | 186 | x_min <- collect(agg(df, min(df[[x]]))) 187 | x_max <- collect(agg(df, max(df[[x]]))) 188 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 189 | 190 | y_min <- collect(agg(df, min(df[[y]]))) 191 | y_max <- collect(agg(df, max(df[[y]]))) 192 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 193 | 194 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 195 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 196 | 197 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 198 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 199 | 200 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 201 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 202 | 203 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) 204 | 205 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = log10(count))) + geom_tile() 206 | 207 | return(p) 208 | } 209 | 210 | p6 <- geom_bivar_histogram.SparkR.log10(df = df, x = "carat", y = "price", nbins = 250) 211 | p6 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + ylab("Price") -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-25-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-25-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-27-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-27-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-29-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-29-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- 
/glm_files/figure-html/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /rmd/03_subsetting.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Subsetting SparkR DataFrames' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 1, 2016" 5 | output: 6 | html_document: 7 | keep_md: TRUE 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 15 | 16 | 17 | **Objective**: Now that we understand what a SparkR DataFrame (DF) really is (remember, it's not actually data!) and can write expressions using essential DataFrame operations, such as `agg`, we are ready to start subsetting DFs using more advanced transformation operations. This tutorial discusses various ways of subsetting DFs, as well as how to work with a randomly sampled subset as a local data.frame in RStudio: 18 | 19 | * Subset a DF by row 20 | * Subset a DF by a list of columns 21 | * Subset a DF by column expressions 22 | * Drop a column from a DF 23 | * Subset a DF by taking a random sample 24 | * Collect a random sample as a local R data.frame 25 | * Export a DF sample as a single .csv file to S3 26 | 27 | **SparkR/R Operations Discussed**: `filter`, `where`, `select`, `sample`, `collect`, `write.table` 28 | 29 | *** 30 | 31 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 32 | 33 | ```{r, include=FALSE} 34 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 35 | Sys.setenv(SPARK_HOME = "/home/spark") 36 | } 37 | 38 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 39 | 40 | sparkR.session() 41 | ``` 42 | 43 | The following error indicates that you have not initiated a SparkR session: 44 | 45 | ```{r, eval=FALSE} 46 | Error in getSparkSession() : SparkSession not initialized 47 | ``` 48 | 49 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 50 | 51 | *** 52 | 53 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. Note that we are __persisting__ the DataFrame since we will use it throughout this tutorial. 
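As a side note (a sketch added here, not required for the tutorial): `cache(df)`, used in the next chunk, persists the DF at a default storage level. If memory is constrained, the storage level can be set explicitly with `persist` and released later with `unpersist`:

```{r, eval=FALSE}
persist(df, "MEMORY_AND_DISK")  # assumes `df` has been created, as in the chunk below
# ... work with df ...
unpersist(df)
```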
54 | 55 | ```{r, message=F, warning=F, results='hide'} 56 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 57 | header = "false", 58 | inferSchema = "true", 59 | na.strings = "") 60 | cache(df) 61 | ``` 62 | 63 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 64 | 65 | Let's check the dimensions our DF `df` and its column names so that we can compare the dimension sizes of `df` with those of the subsets that we will define throughout this tutorial: 66 | 67 | ```{r, collapse=TRUE} 68 | nrow(df) 69 | ncol(df) 70 | columns(df) 71 | ``` 72 | 73 | *** 74 | 75 | 76 | ### Subset DataFrame by row: 77 | 78 | The SparkR operation `filter` allows us to subset the rows of a DF according to specified conditions. Before we begin working with `filter` to see how it works, let's print the schema of `df` since the types of subsetting conditions we are able to specify depend on the datatype of each column in the DF: 79 | 80 | ```{r, collapse=TRUE} 81 | printSchema(df) 82 | ``` 83 | 84 | We can subset `df` into a new DF, `f1`, that includes only those loans for which JPMorgan Chase is the servicer with the expression: 85 | 86 | ```{r, collapse=TRUE} 87 | f1 <- filter(df, df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 88 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION") 89 | nrow(f1) 90 | ``` 91 | 92 | Notice that the `filter` considers normal logical syntax (e.g. logical conditions and operations), making working with the operation very straightforward. We can specify `filter` with SQL statement strings. For example, here we have the preceding example written in SQL statement format: 93 | 94 | ```{r, collapse=TRUE, eval=FALSE} 95 | filter(df, "servicer_name = 'JP MORGAN CHASE BANK, NA' or servicer_name = 'JPMORGAN CHASE BANK, NA' or 96 | servicer_name = 'JPMORGAN CHASE BANK, NATIONAL ASSOCIATION'") 97 | ``` 98 | 99 | Or, alternatively, in a syntax similar to how we subset data.frames by row in base R: 100 | 101 | ```{r, collapse=TRUE, eval=FALSE} 102 | df[df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 103 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",] 104 | ``` 105 | 106 | Another example of using logical syntax with `filter` is that we can subset `df` such that the new DF only includes those loans for which the servicer name is known, i.e. the column `"servicer_name"` is not equa to an empty string or listed as `"OTHER"`: 107 | 108 | ```{r, collapse=TRUE} 109 | f2 <- filter(df, df$servicer_name != "OTHER" & df$servicer_name != "") 110 | nrow(f2) 111 | ``` 112 | 113 | Or, if we wanted to only consider observations with a `"loan_age"` value of greater than 60 months (five years), we would evaluate: 114 | 115 | ```{r, collapse=TRUE} 116 | f3 <- filter(df, df$loan_age > 60) 117 | nrow(f3) 118 | ``` 119 | 120 | An alias for `filter` is `where`, which reads much more intuitively, particularly when `where` is embedded in a complex statement. 
For example, the following expression can be read as "__aggregate__ the mean loan age and count values __by__ `"servicer_name"` in `df` __where__ loan age is less than 60 months": 121 | 122 | ```{r, collapse=TRUE} 123 | f4 <- agg(groupBy(where(df, df$loan_age < 60), where(df, df$loan_age < 60)$servicer_name), 124 | loan_age_avg = avg(where(df, df$loan_age < 60)$loan_age), 125 | count = n(where(df, df$loan_age < 60)$loan_age)) 126 | head(f4) 127 | ``` 128 | 129 | *** 130 | 131 | 132 | ### Subset DataFrame by column: 133 | 134 | The operation `select` allows us to subset a DF by a specified list of columns. In the expression below, for example, we create a subsetted DF that includes only the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full (remaining maturity) and adjusted remaining maturity: 135 | 136 | ```{r, collapse=TRUE} 137 | s1 <- select(df, "mths_remng", "aj_mths_remng") 138 | ncol(s1) 139 | ``` 140 | 141 | We can also reference the column names through the DF name, i.e. `select(df, df$mths_remng, df$aj_mths_remng)`. Or, we can save a list of columns as a combination of strings. If we wanted to make a list of all columns that relate to remaining maturity, we could evaluate the expression `remng_mat <- c("mths_remng", "aj_mths_remng")` and then easily reference our list of columns later on with `select(df, remng_mat)`. 142 | 143 | 144 | Besides subsetting by a list of columns, we can also subset `df` while introducing a new column using a column expression, as we do in the example below. The DF `s2` includes the columns `"mths_remng"` and `"aj_mths_remng"` as in `s1`, but now with a column that lists the absolute value of the difference between the unadjusted and adjusted remaining maturity: 145 | 146 | ```{r, collapse=TRUE} 147 | s2 <- select(df, df$mths_remng, df$aj_mths_remng, abs(df$aj_mths_remng - df$mths_remng)) 148 | ncol(s2) 149 | head(s2) 150 | ``` 151 | 152 | Note that, just as we can subset by row with syntax similar to that in base R, we can similarly acheive subsetting by column. The following expressions are equivalent: 153 | 154 | ```{r, collapse=TRUE, eval=FALSE} 155 | select(df, df$period) 156 | df[,"period"] 157 | df[,2] 158 | ``` 159 | 160 | To simultaneously subset by column and row specifications, you can simply embed a `where` expression in a `select` operation (or vice versa). The following expression creates a DF that lists loan age values only for observations in which servicer name is unknown: 161 | 162 | ```{r, collapse=TRUE} 163 | s3 <- select(where(df, df$servicer_name == "" | df$servicer_name == "OTHER"), "loan_age") 164 | head(s3) 165 | ``` 166 | 167 | Note that we could have also written the above expression as `df[df$servicer_name == "" | df$servicer_name == "OTHER", "loan_age"]`. 168 | 169 | 170 | #### Drop a column from a DF: 171 | 172 | We can drop a column from a DF very simply by assigning `NULL` to a DF column. Below, we drop `"aj_mths_remng"` from `s1`: 173 | 174 | ```{r, collapse=TRUE} 175 | head(s1) 176 | s1$aj_mths_remng <- NULL 177 | head(s1) 178 | ``` 179 | 180 | *** 181 | 182 | 183 | ### Subset a DF by taking a random sample: 184 | 185 | Perhaps the most useful subsetting operation is `sample`, which returns a randomly sampled subset of a DF. With `subset`, we can specify whether we want to sample with or without replace, the approximate size of the sample that we want the new DF to call and whether or not we want to define a random seed. 
If our initial DF is so massive that performing analysis on the entire dataset requires a more expensive cluster, we can: sample the massive dataset, interactively develop our analysis in SparkR using our sample and then evaluate the resulting program using our initial DF, which calls the entire massive dataset, only as is required. This strategy will help us to minimize wasting resources. 186 | 187 | Below, we take a random sample of `df` without replacement that is, in size, approximately equal to 1% of `df`. Notice that we must define a random seed in order to be able to reproduce our random sample. 188 | 189 | ```{r, collapse=TRUE} 190 | df_samp1 <- sample(df, withReplacement = FALSE, fraction = 0.01) # Without set seed 191 | df_samp2 <- sample(df, withReplacement = FALSE, fraction = 0.01) 192 | count(df_samp1) 193 | count(df_samp2) 194 | # The row counts are different and, obviously, the DFs are not equivalent 195 | 196 | df_samp3 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) # With set seed 197 | df_samp4 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) 198 | count(df_samp3) 199 | count(df_samp4) 200 | # The row counts are equal and the DFs are equivalent 201 | ``` 202 | 203 | 204 | #### Collect a random sample as a local data.frame: 205 | 206 | An additional use of `sample` is to collect a random sample of a massive dataset as a local data.frame in R. This would allow us to work with a sample dataset in a traditional analysis environment that is likely more representative of the population since we are sampling from a larger set of observations than we are normally doing so. This can be achieved by simply using `collect` to create a local data.frame: 207 | 208 | ```{r, collapse=TRUE} 209 | typeof(df_samp4) # DFs are of class S4 210 | dat <- collect(df_samp4) 211 | typeof(dat) 212 | ``` 213 | 214 | Note that this data.frame is _not_ local to _your_ personal computer, but rather it was gathered locally to a single node in our AWS cluster. 215 | 216 | #### Export DF sample as a single .csv file to S3: 217 | 218 | If we want to export the sampled DF from RStudio as a single .csv file that we can work with in any environment, we must first coalesce the rows of `df_samp4` to a single node in our cluster using the `repartition` operation. Then, we can use the `write.df` operation as we did in the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial: 219 | 220 | ```{r, eval=FALSE, collapse=TRUE} 221 | df_samp4_1 <- repartition(df_samp4, numPartitions = 1) 222 | write.df(df_samp4_1, path = "s3://ui-spark-social-science-public/data/hfpc_samp.csv", 223 | source = "csv", 224 | mode = "overwrite") 225 | ``` 226 | 227 | :heavy_exclamation_mark: __Warning__: We cannot collect a DF as a data.frame, nor can we repartition it to a single node, unless the DF is sufficiently small in size since it must fit onto a _single_ node! 
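If the sample is too large for either operation, two hedged workarounds (our own sketch, with an arbitrary threshold and an illustrative output path) are to guard the `collect` with a row count check, or to skip `repartition` and let `write.df` emit one part-file per partition:

```{r, eval=FALSE}
# Rough guard before collecting locally (the threshold is illustrative only)
if (count(df_samp4) < 1e6) {
  dat <- collect(df_samp4)
}

# Or write the sample as multiple .csv part-files, one per partition
write.df(df_samp4, path = "s3://ui-spark-social-science-public/data/hfpc_samp",
         source = "csv", mode = "overwrite")
```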
228 | 229 | __End of tutorial__ - Next up is [Dealing with Missing Data in SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/04_missing-data.md) -------------------------------------------------------------------------------- /rmd/04_missing-data.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Dealing with Missing Data in SparkR' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 8, 2016" 5 | output: 6 | html_document: 7 | keep_md: true 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | options(knitr.table.format = 'markdown') 13 | ``` 14 | 15 | **Last Updated**: May 23, 2017 16 | 17 | 18 | **Objective**: In this tutorial, we discuss general strategies for dealing with missing data in the SparkR environment. While we do not consider conceptually how and why we might impute missing values in a dataset, we do discuss logistically how we could drop rows with missing data and impute missing data with replacement values. We specifically consider the following during this tutorial: 19 | 20 | * Specify null values when loading data in as a DF 21 | * Conditional expressions on empty DF entries 22 | + Null and NaN indicator operations 23 | + Conditioning on empty string entries 24 | + Distribution of missing data across grouped data 25 | * Drop rows with missing data 26 | + Null value entries 27 | + Empty string entries 28 | * Fill missing data entries 29 | + Null value entries 30 | + Empty string entries 31 | 32 | **SparkR/R Operations Discussed**: `read.df` (`nullValue = ""`), `printSchema`, `nrow`, `isNull`, `isNotNull`, `isNaN`, `count`, `where`, `agg`, `groupBy`, `n`, `collect`, `dropna`, `na.omit`, `list`, `fillna` 33 | 34 | *** 35 | 36 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 37 | 38 | ```{r, include=FALSE} 39 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 40 | Sys.setenv(SPARK_HOME = "/home/spark") 41 | } 42 | 43 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 44 | 45 | sparkR.session() 46 | ``` 47 | 48 | The following error indicates that you have not initiated a SparkR session: 49 | 50 | ```{r, eval=FALSE} 51 | Error in getSparkSession() : SparkSession not initialized 52 | ``` 53 | 54 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 55 | 56 | *** 57 | 58 | ### Specify null values when loading data in as a SparkR DataFrame (DF) 59 | 60 | Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. Note that we now include the `na.strings` option in the `read.df` transformation below. By setting `na.strings` equal to an empty string in `read.df`, we direct SparkR to interpret empty entries in the dataset as being equal to nulls in `df`. Therefore, any DF entries matching this string (here, set to equal an empty entry) will be set equal to a null value in `df`. 
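If a dataset instead marks missing values with a sentinel string rather than an empty field, the same `na.strings` argument applies. For example (using a hypothetical token, not one that appears in this dataset):

```{r, eval=FALSE}
df_alt <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex",
                  header = "false",
                  inferSchema = "true",
                  na.strings = "NULL")  # read the literal string "NULL" as a null value
```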
61 | 62 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 63 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 64 | header = "false", 65 | inferSchema = "true", 66 | na.strings = "") 67 | cache(df) 68 | ``` 69 | 70 | We can replace this empty string with any string that we know indicates a null entry in the dataset, i.e. with `na.strings=""`. Note that SparkR only reads empty entries as null values in numerical and integer datatype (dtype) DF columns, meaning that empty entries in DF columns of string dtype will simply equal an empty string. We consider how to work with this type of observation throughout this tutorial alongside our treatment of null values. 71 | 72 | 73 | With `printSchema`, we can see the dtype of each column in `df` and, noting which columns are of a numerical and integer dtypes and which are string, use this to determine how we should examine missing data in each column of `df`. We also count the number of rows in `df` so that we can compare this value to row counts that we compute throughout this tutorial: 74 | 75 | ```{r, collapse=TRUE} 76 | printSchema(df) 77 | (n <- nrow(df)) 78 | ``` 79 | 80 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 81 | 82 | *** 83 | 84 | 85 | ### Conditional expressions on empty DF entries 86 | 87 | 88 | #### Null and NaN indicator operations 89 | 90 | We saw in the subsetting tutorial how to subset a DF by some conditional statement. We can extend this reasoning in order to identify missing data in a DF and to explore the distribution of missing data within a DF. SparkR operations indicating null and NaN entries in a DF are `isNull`, `isNaN` and `isNotNull`, and these can be used in conditional statements to locate or to remove DF rows with null and NaN entries. 91 | 92 | 93 | Below, we count the number of missing entries in `"loan_age"` and in `"mths_remng"`, which are both of integer dtype. We can see below that there are no missing or NaN entries in `"loan_age"`. Note that the `isNull` and `isNaN` count results differ for `"mths_remng"` - while there are missing values in `"mths_remng"`, there are no NaN entries (entires that are "not a number"). 94 | 95 | ```{r, collapse=TRUE} 96 | df_laNull <- where(df, isNull(df$loan_age)) 97 | count(df_laNull) 98 | df_laNaN <- where(df, isNaN(df$loan_age)) 99 | count(df_laNaN) 100 | 101 | df_mrNull <- where(df, isNull(df$mths_remng)) 102 | count(df_mrNull) 103 | df_mrNaN <- where(df, isNaN(df$mths_remng)) 104 | count(df_mrNaN) 105 | ``` 106 | 107 | 108 | #### Empty string entries 109 | 110 | If we want to count the number of rows with missing entries for `"servicer_name"` (string dtype) we can simply use the equality logical condition (==) to direct SparkR to `count` the number of rows `where` the entries in the `"servicer_name"` column are equal to an empty string: 111 | 112 | ```{r, collapse=TRUE} 113 | df_snEmpty <- where(df, df$servicer_name == "") 114 | count(df_snEmpty) 115 | ``` 116 | 117 | 118 | #### Distribution of missing data across grouped data 119 | 120 | We can also condition on missing data when aggregating over grouped data in order to see how missing data is distributed over a categorical variable within our data. 
In order to view the distribution of `"mths_remng"` observations with null values over distinct entries of `"servicer_name"`, we (1) group the entries of the DF `df_mrNull` that we created in the preceding example over `"servicer_name"` entries, (2) create the DF `mrNull_by_sn` which consists of the number of observations in `df_mrNull` by `"servicer_name"` entries and (3) collect `mrNull_by_sn` into a nicely formatted table as a local data.frame: 121 | 122 | ```{r, collapse=TRUE} 123 | gb_sn_mrNull <- groupBy(df_mrNull, df_mrNull$servicer_name) 124 | mrNull_by_sn <- agg(gb_sn_mrNull, Nulls = n(df_mrNull$servicer_name)) 125 | 126 | mrNull_by_sn.dat <- collect(mrNull_by_sn) 127 | mrNull_by_sn.dat 128 | # Alternatively, we could have evaluated showDF(mrNull_by_sn) to print DF 129 | ``` 130 | 131 | Note that the resulting data.frame lists only nine (9) distinct string values for `"servicer_name"`. So, any row in `df` with a null entry for `"mths_remng"` has one of these strings as its corresponding `"servicer_name"` value. We could similarly examine the distribution of missing entries for some string dtype column across grouped data by first filtering a DF on the condition that the string column is equal to an empty string, rather than filtering with a null indicator operation (e.g. `isNull`), then performing the `groupBy` operation. 132 | 133 | *** 134 | 135 | 136 | ### Drop rows with missing data 137 | 138 | 139 | #### Null value entries 140 | 141 | The SparkR operation `dropna` (or its alias `na.omit`) creates a new DF that omits rows with null value entries. We can configure `dropna` in a number of ways, including whether we want to omit rows with nulls in a specified list of DF columns or across all columns within a DF. 142 | 143 | 144 | If we want to drop rows with nulls for a list of columns in `df`, we can define a list of column names and then include this in `dropna` or we could embed this list directly in the operation. Below, we explicitly define a list of column names on which we condition `dropna`: 145 | 146 | ```{r, collapse=TRUE} 147 | mrlist <- list("mths_remng", "aj_mths_remng") 148 | df_mrNoNulls <- dropna(df, cols = mrlist) 149 | nrow(df_mrNoNulls) 150 | ``` 151 | 152 | Alternatively, we could `filter` the DF using the `isNotNull` condition as follows: 153 | 154 | ```{r, collapse=TRUE} 155 | df_mrNoNulls_ <- filter(df, isNotNull(df$mths_remng) & isNotNull(df$aj_mths_remng)) 156 | nrow(df_mrNoNulls_) 157 | ``` 158 | 159 | If we want to consider all columns in a DF when omitting rows with null values, we can use either the `how` or `minNonNulls` paramters of `dropna`. 160 | 161 | 162 | The parameter `how` allows us to decide whether we want to drop a row if it contains `"any"` nulls or if we want to drop a row only if `"all"` of its entries are nulls. We can see below that there are no rows in `df` in which all of its values are null, but only a small percentage of the rows in `df` have no null value entries: 163 | 164 | ```{r, collapse=TRUE} 165 | df_all <- dropna(df, how = "all") 166 | nrow(df_all) # Equal in value to n 167 | 168 | df_any <- dropna(df, how = "any") 169 | (n_any <- nrow(df_any)) 170 | (n_any/n)*100 171 | ``` 172 | 173 | We can set a minimum number of non-null entries required for a row to remain in the DF by specifying a `minNonNulls` value. If included in `dropna`, this specification directs SparkR to drop rows that have less than `minNonNulls = ` non-null entries. Note that including `minNonNulls` overwrites the `how` specification. 
Below, we omit rows with that have less than 5 and 12 entries that are _not_ nulls. Note that there are no rows in `df` that have less than 5 non-null entries, and there are only approximately 8,000 rows with less than 12 non-null entries. 174 | 175 | ```{r, collapse=TRUE} 176 | df_5 <- dropna(df, minNonNulls = 5) 177 | nrow(df_5) # Equal in value to n 178 | 179 | df_12 <- dropna(df, minNonNulls = 12) 180 | (n_12 <- nrow(df_12)) 181 | n - n_12 182 | ``` 183 | 184 | 185 | #### Empty string entries 186 | 187 | If we want to create a new DF that does not include any row with missing entries for a column of string dtype, we could also use `filter` to accomplish this. In order to remove observations with a missing `"servicer_name"` value, we simply filter `df` on the condition that `"servicer_name"` does not equal an empty string entry: 188 | 189 | ```{r, collapse=TRUE} 190 | df_snNoEmpty <- filter(df, df$servicer_name != "") 191 | nrow(df_snNoEmpty) 192 | ``` 193 | 194 | *** 195 | 196 | 197 | ### Fill missing data entries 198 | 199 | 200 | #### Null value entries 201 | 202 | The `fillna` operation allows us to replace null entries with some specified value. In order to replace null entries in every numerical and integer column in `df` with a value, we simply evaluate the expression `fillna(df, )`. We replace every null entry in `df` with the value 12345 below: 203 | 204 | ```{r, collapse=TRUE} 205 | str(df) 206 | 207 | df_ <- fillna(df, value = 12345) 208 | str(df_) 209 | rm(df_) 210 | ``` 211 | 212 | If we want to replace null values within a list of DF columns, we can specify a column list just as we did in `dropna`. Here, we replace the null values in only `"act_endg_upb"` with 12345: 213 | 214 | ```{r, collapse=TRUE} 215 | str(df) 216 | 217 | df_ <- fillna(df, list("act_endg_upb" = 12345)) 218 | str(df_) 219 | rm(df_) 220 | ``` 221 | 222 | 223 | #### Empty string entries 224 | 225 | Finally, we can replace the empty entries in string dtype columns with the `ifelse` operation, which follows the syntax `ifelse(, , )`. Here, we replace the empty entries in `"servicer_name"` with the string `"Unknown"`: 226 | 227 | ```{r, collapse=TRUE} 228 | str(df) 229 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 230 | str(df) 231 | ``` 232 | 233 | 234 | __End of tutorial__ - Next up is [Computing Summary Statistics with SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/05_summary-statistics.md) -------------------------------------------------------------------------------- /rmd/05_summary-statistics.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Computing Summary Statistics with SparkR' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 8, 2016" 5 | output: 6 | html_document: 7 | keep_md: true 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | options(knitr.table.format = 'markdown') 13 | ``` 14 | 15 | **Last Updated**: May 23, 2017 16 | 17 | 18 | **Objective**: Summary statistics and aggregations are essential means of summarizing a set of observations. In this tutorial, we discuss how to compute location, statistical dispersion, distribution and dependence measures of numerical variables in SparkR, as well as methods for examining categorical variables. 
In particular, we consider how to compute the following measurements and aggregations in SparkR: 19 | 20 | _Numerical Data_ 21 | 22 | * Measures of location: 23 | + Mean 24 | + Extract summary statistics as local value 25 | * Measures of dispersion: 26 | + Range width & limits 27 | + Variance 28 | + Standard deviation 29 | + Quantiles 30 | * Measures of distribution shape: 31 | + Skewness 32 | + Kurtosis 33 | * Measures of Dependence: 34 | + Covariance 35 | + Correlation 36 | 37 | _Categorical Data_ 38 | 39 | * Frequency table 40 | * Relative frequency table 41 | * Contingency table 42 | 43 | **SparkR/R Operations Discussed**: `describe`, `collect`, `showDF`, `agg`, `mean`, `typeof`, `min`, `max`, `abs`, `var`, `sd`, `skewness`, `kurtosis`, `cov`, `corr`, `count`, `n`, `groupBy`, `nrow`, `crosstab` 44 | 45 | *** 46 | 47 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 48 | 49 | ```{r, include=FALSE} 50 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 51 | Sys.setenv(SPARK_HOME = "/home/spark") 52 | } 53 | 54 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 55 | 56 | sparkR.session() 57 | ``` 58 | 59 | The following error indicates that you have not initiated a SparkR session: 60 | 61 | ```{r, eval=FALSE} 62 | Error in getSparkSession() : SparkSession not initialized 63 | ``` 64 | 65 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 66 | 67 | *** 68 | 69 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 70 | 71 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 72 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 73 | header = "false", 74 | inferSchema = "true", 75 | na.strings = "") 76 | cache(df) 77 | ``` 78 | 79 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 80 | 81 | *** 82 | 83 | 84 | ## Numerical Data 85 | 86 | The operation `describe` (or its alias `summary`) creates a new DF that consists of several key aggregations (count, mean, max, mean, standard deviation) for a specified DF or list of DF columns (note that columns must be of a numerical datatype). We can either (1) use the action operation `showDF` to print this aggregation DF or (2) save it as a local data.frame with `collect`. 
Here, we perform both of these actions on the aggregation DF `sumstats_mthsremng`, which returns the aggregations listed above for the column `"mths_remng"` in `df`: 87 | 88 | ```{r, collapse=TRUE} 89 | sumstats_mthsremng <- describe(df, "mths_remng") # Specified list of columns here consists only of "mths_remng" 90 | 91 | showDF(sumstats_mthsremng) # Print the aggregation DF 92 | 93 | sumstats_mthsremng.l <- collect(sumstats_mthsremng) # Collect aggregation DF as a local data.frame 94 | sumstats_mthsremng.l 95 | ``` 96 | 97 | Note that measuring all five (5) of these aggregations at once can be computationally expensive with a massive data set, particularly if we are interested in only a subset of these measurements. Below, we outline ways to measure these aggregations individually, as well as several other key summary statistics for numerical data. 98 | 99 | *** 100 | 101 | 102 | ### Measures of Location 103 | 104 | 105 | #### Mean 106 | 107 | The mean is the only measure of central tendency currently supported by SparkR. The operations `mean` and `avg` can be used with the `agg` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial to measure the average of a numerical DF column. Remember that `agg` returns another DF. Therefore, we can either print the DF with `showDF` or we can save the aggregation as a local data.frame. Collecting the DF may be preferred if we want to work with the mean `"mths_remng"` value as a single value in RStudio. 108 | 109 | ```{r, collapse=TRUE} 110 | mths_remng.avg <- agg(df, mean = mean(df$mths_remng)) # Create an aggregation DF 111 | 112 | # DataFrame 113 | showDF(mths_remng.avg) # Print this DF 114 | typeof(mths_remng.avg) # Aggregation DF is of class S4 115 | 116 | # data.frame 117 | mths_remng.avg.l <- collect(mths_remng.avg) # Collect the DF as a local data.frame 118 | (mths_remng.avg.l <- mths_remng.avg.l[,1]) # Overwrite data.frame with numerical mean value (was entry in d.f) 119 | typeof(mths_remng.avg.l) # Object is now of a numerical dtype 120 | ``` 121 | 122 | *** 123 | 124 | 125 | ### Measures of dispersion 126 | 127 | 128 | #### Range width & limits 129 | 130 | We can also use `agg` to create a DF that lists the minimum and maximum values within a numerical DF column (i.e. the limits of the range of values in the column) and the width of the range. Here, we create compute these values for `"mths_remng"` and print the resulting DF with `showDF`: 131 | 132 | ```{r, collapse=TRUE} 133 | mr_range <- agg(df, minimum = min(df$mths_remng), maximum = max(df$mths_remng), 134 | range_width = abs(max(df$mths_remng) - min(df$mths_remng))) 135 | showDF(mr_range) 136 | ``` 137 | 138 | 139 | #### Variance & standard deviation 140 | 141 | Again using `agg`, we compute the variance and standard deviation of `"mths_remng"` with the expressions below. Note that, here, we are computing sample variance and standard deviation (which we could also measure with their respective aliases, `variance` and `stddev`). To measure population variance and standard deviation, we would use `var_pop` and `stddev_pop`, respectively. 
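As a quick, non-evaluated sketch, the population versions follow the same `agg` pattern as the sample statistics computed below:

```{r, eval=FALSE}
# Population variance and standard deviation of "mths_remng"
mr_pop <- agg(df, pop_variance = var_pop(df$mths_remng), pop_std_dev = stddev_pop(df$mths_remng))
showDF(mr_pop)
```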
142 | 143 | ```{r, collapse=TRUE} 144 | mr_var <- agg(df, variance = var(df$mths_remng)) # Sample variance 145 | showDF(mr_var) 146 | 147 | mr_sd <- agg(df, std_dev = sd(df$mths_remng)) # Sample standard deviation 148 | showDF(mr_sd) 149 | ``` 150 | 151 | 152 | #### Approximate Quantiles 153 | 154 | The operation `approxQuantile` returns approximate quantiles for a DF column. We specify the quantiles to be approximated by the operation as a vector set equal to the `probabilities` parameter, and the acceptable level of error by the `relativeError` paramter. 155 | 156 | If the column includes `n` rows, then `approxQuantile` will return a list of quantile values with rank values that are acceptably close to those exact values specified by `probabilities`. In particular, the operation assigns approximate rank values such that the computed rank, (`probabilities * n`), falls within the inequality: 157 | 158 | 159 | `floor((probabilities - relativeError) * n) <= rank(x) <= ceiling((probabilities + relativeError) * n)` 160 | 161 | 162 | Below, we define a new DF, `df_`, that includes only nonmissing values for `"mths_remng"` and then compute approximate Q1, Q2 and Q3 values for `"mths_remng"`: 163 | 164 | ```{r, collapse=TRUE} 165 | df_ <- dropna(df, cols = "mths_remng") 166 | 167 | quartiles_mr <- approxQuantile(x = df_, col = "mths_remng", probabilities = c(0.25, 0.5, 0.75), 168 | relativeError = 0.001) 169 | quartiles_mr 170 | ``` 171 | 172 | 173 | *** 174 | 175 | 176 | ### Measures of distribution shape 177 | 178 | 179 | #### Skewness 180 | 181 | We can measure the magnitude and direction of skew in the distribution of a numerical DF column by using the operation `skewness` with `agg`, just as we did to measure the `mean`, `variance` and `stddev` of a numerical variable. Below, we measure the `skewness` of `"mths_remng"`: 182 | 183 | ```{r, collapse=TRUE} 184 | mr_sk <- agg(df, skewness = skewness(df$mths_remng)) 185 | showDF(mr_sk) 186 | ``` 187 | 188 | 189 | #### Kurtosis 190 | 191 | Similarly, we can meaure the magnitude of, and how sharp is, the central peak of the distribution of a numerical variable, i.e. the "peakedness" of the distribution, (relative to a standard bell curve) with the `kurtosis` operation. Here, we measure the `kurtosis` of `"mths_remng"`: 192 | 193 | ```{r, collapse=TRUE} 194 | mr_kr <- agg(df, kurtosis = kurtosis(df$mths_remng)) 195 | showDF(mr_kr) 196 | ``` 197 | 198 | *** 199 | 200 | 201 | ### Measures of dependence 202 | 203 | #### Covariance & correlation 204 | 205 | The actions `cov` and `corr` return the sample covariance and correlation measures of dependency between two DF columns, respectively. Currently, Pearson is the only supported method for calculating correlation. Here we compute the covariance and correlation of `"loan_age"` and `"mths_remng"`. 
Note that, in saving the covariance and correlation measures, we are not required to first `collect` locally since `cov` and `corr` return values, rather than DFs: 206 | 207 | ```{r, collapse=TRUE} 208 | cov_la.mr <- cov(df, "loan_age", "mths_remng") 209 | corr_la.mr <- corr(df, "loan_age", "mths_remng", method = "pearson") 210 | cov_la.mr 211 | corr_la.mr 212 | 213 | typeof(cov_la.mr) 214 | typeof(corr_la.mr) 215 | ``` 216 | 217 | *** 218 | 219 | 220 | 221 | ## Categorical Data 222 | 223 | 224 | We can compute descriptive statistics for categorical data using (1) the `groupBy` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial and (2) operations native to SparkR for this purpose. 225 | 226 | ```{r, include=FALSE} 227 | df$cd_zero_bal <- ifelse(isNull(df$cd_zero_bal), "Unknown", df$cd_zero_bal) 228 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 229 | ``` 230 | 231 | 232 | #### Frequency table 233 | 234 | To create a frequency table for a categorical variable in SparkR, i.e. list the number of observations for each distinct value in a column of strings, we can simply use the `count` transformation with grouped data. Group the data by the categorical variable for which we want to return a frequency table. Here, we create a frequency table for using this approach `"cd_zero_bal"`: 235 | 236 | ```{r, collapse=TRUE} 237 | zb_f <- count(groupBy(df, "cd_zero_bal")) 238 | showDF(zb_f) 239 | ``` 240 | 241 | We could also embed a grouping into an `agg` operation as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial to achieve the same frequency table DF, i.e. we could evaluate the expression `agg(groupBy(df, df$cd_zero_bal), count = n(df$cd_zero_bal))`. 242 | 243 | #### Relative frequency table 244 | 245 | We could similarly create a DF that consists of a relative frequency table. Here, we reproduce the frequency table from the preceding section, but now including the relative frequency for each distinct string value, labeled `"Percentage"`: 246 | 247 | ```{r, collapse=TRUE} 248 | n <- nrow(df) 249 | zb_rf <- agg(groupBy(df, df$cd_zero_bal), Count = n(df$cd_zero_bal), Percentage = n(df$cd_zero_bal) * (100/n)) 250 | showDF(zb_rf) 251 | ``` 252 | 253 | #### Contingency table 254 | 255 | Finally, we can create a contingency table with the operation `crosstab`, which returns a data.frame that consists of a contingency table between two categorical DF columns. 
Here, we create and print a contingency table for `"servicer_name"` and `"cd_zero_bal"`: 256 | 257 | ```{r, eval=FALSE} 258 | conting_sn.zb <- crosstab(df, "servicer_name", "cd_zero_bal") 259 | conting_sn.zb 260 | ``` 261 | 262 | Here, is the contingency table (the output of `crosstab`) in a formatted table: 263 | 264 | ```{r kable, echo=FALSE} 265 | conting_sn.zb <- crosstab(df, "servicer_name", "cd_zero_bal") 266 | library(knitr) 267 | kable(conting_sn.zb) 268 | ``` 269 | 270 | __End of tutorial__ - Next up is [Merging SparkR DataFrames](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/06_merging.md) -------------------------------------------------------------------------------- /rmd/06_merging.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Merging SparkR DataFrames' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 12, 2016" 5 | output: 6 | html_document: 7 | keep_md: yes 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 15 | 16 | 17 | **Objective**: The following tutorial provides an overview of how to join SparkR DataFrames by column and by row. In particular, we discuss how to: 18 | 19 | * Merge two DFs by column condition(s) (join by row) 20 | * Append rows of data to a DataFrame (join by column) 21 | + When column name lists are equal across DFs 22 | + When column name lists are not equal 23 | 24 | **SparkR/R Operations Discussed**: `join`, `merge`, `sample`, `except`, `intersect`, `rbind`, `rbind.intersect` (defined function), `rbind.fill` (defined function) 25 | 26 | *** 27 | 28 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 29 | 30 | ```{r, include=FALSE} 31 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 32 | Sys.setenv(SPARK_HOME = "/home/spark") 33 | } 34 | 35 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 36 | 37 | sparkR.session() 38 | ``` 39 | 40 | The following error indicates that you have not initiated a SparkR session: 41 | 42 | ```{r, eval=FALSE} 43 | Error in getSparkSession() : SparkSession not initialized 44 | ``` 45 | 46 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 47 | 48 | *** 49 | 50 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 51 | 52 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 53 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 54 | header = "false", 55 | inferSchema = "true", 56 | na.strings = "") 57 | cache(df) 58 | ``` 59 | 60 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 
61 | 62 | *** 63 | 64 | 65 | ### Join (merge) two DataFrames by column condition(s) 66 | 67 | We begin by subsetting `df` by column, resulting in two (2) DataFrames that are disjoint, except for them both including the loan identification variable, `"loan_id"`: 68 | 69 | ```{r, collapse=TRUE} 70 | # Print the column names of df: 71 | columns(df) 72 | 73 | # Specify column lists to fit `a` and `b` on - these are disjoint sets (except for "loan_id"): 74 | cols_a <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng") 75 | cols_b <- c("loan_id", "aj_mths_remng", "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal") 76 | 77 | # Create `a` and `b` DFs with the `select` operation: 78 | a <- select(df, cols_a) 79 | b <- select(df, cols_b) 80 | 81 | # Print several rows from each subsetted DF: 82 | str(a) 83 | str(b) 84 | ``` 85 | 86 | We can use the SparkR operation `join` to merge `a` and `b` by row, returning a DataFrame equivalent to `df`. The `join` operation allows us to perform most SQL join types on SparkR DFs, including: 87 | 88 | * `"inner"` (default): Returns rows where there is a match in both DFs 89 | * `"outer"`: Returns rows where there is a match in both DFs, as well as rows in both the right and left DF where there was no match 90 | * `"full"`, `"fullouter"`: Returns rows where there is a match in one of the DFs 91 | * `"left"`, `"leftouter"`, `"left_outer"`: Returns all rows from the left DF, even if there are no matches in the right DF 92 | * `"right"`, `"rightouter"`, `"right_outer"`: Returns all rows from the right DF, even if there are no matches in the left DF 93 | * Cartesian: Returns the Cartesian product of the sets of records from the two or more joined DFs - `join` will return this DF when we _do not_ specify a `joinType` _nor_ a `joinExpr` (discussed below) 94 | 95 | We communicate to SparkR what condition we want to join DFs on with the `joinExpr` specification in `join`. Below, we perform an `"inner"` (default) join on the DFs `a` and `b` on the condition that their `"loan_id"` values be equal: 96 | 97 | ```{r, collapse=TRUE} 98 | ab1 <- join(a, b, a$loan_id == b$loan_id) 99 | str(ab1) 100 | ``` 101 | 102 | Note that the resulting DF includes two (2) `"loan_id"` columns. Unfortunately, we cannot direct SparkR to keep only one of these columns when using `join` to merge by row, and the following command (which we introduced in the subsetting tutorial) drops both `"loan_id"` columns: 103 | 104 | ```{r, collapse=TRUE} 105 | ab1$loan_id <- NULL 106 | ``` 107 | 108 | We can avoid this by renaming one of the columns before performing `join` and then, utilizing that the columns have distinct names, tell SparkR to drop only one of the columns. For example, we could rename `"loan_id"` in `a` with the expression `a <- withColumnRenamed(a, "loan_id", "loan_id_")`, then drop this column with `ab1$loan_id_ <- NULL` after performing `join` on `a` and `b` to return `ab1`. 109 | 110 | 111 | The `merge` operation, alternatively, allows us to join DFs and produces two (2) _distinct_ merge columns. We can use this feature to retain the column on which we joined the DFs, but we must still perform a `withColumnRenamed` step if we want our merge column to retain its original column name. 112 | 113 | 114 | Rather than defining a `joinExpr`, we explictly specify the column(s) that SparkR should `merge` the DFs on with the operation parameters `by` and `by.x`/`by.y` (if the merging column is named differently across the DFs). 
Note that, if we do not specify `by`, SparkR will merge the DFs on the list of common column names shared by the DFs. Rather than specifying a type of join, `merge` determines how SparkR should merge DFs based on boolean values, `all.x` and `all.y`, which indicate which rows in `x` and `y` should be included in the join, respectively. We can specify `merge` type with the following parameter values: 115 | 116 | * `all.x = FALSE`, `all.y = FALSE`: Returns an inner join (this is the default and can be achieved by not specifying values for all.x and all.y) 117 | * `all.x = TRUE`, `all.y = FALSE`: Returns a left outer join 118 | * `all.x = FALSE`, `all.y = TRUE`: Returns a right outer join 119 | * `all.x = TRUE`, `all.y = TRUE`: Returns a full outer join 120 | 121 | The following `merge` expression is equivalent to the `join` expression in the preceding example: 122 | 123 | ```{r, collapse=TRUE} 124 | ab2 <- merge(a, b, by = "loan_id") 125 | str(ab2) 126 | ``` 127 | 128 | Note that the two merging columns are distinct as indicated by the `_x` and `_y` name assignments performed by `merge`. We utilize this distinction in the expressions below to retain a single merge column: 129 | 130 | ```{r, collapse=TRUE} 131 | # Drop "loan_id" column from `b`: 132 | ab2$loan_id_y <- NULL 133 | 134 | # Rename "loan_id" column from `a`: 135 | ab2 <- withColumnRenamed(ab2, "loan_id_x", "loan_id") 136 | 137 | # Final DF with single "loan_id" column: 138 | str(ab2) 139 | ``` 140 | 141 | ```{r, include=FALSE} 142 | rm(a) 143 | rm(b) 144 | rm(ab1) 145 | rm(ab2) 146 | rm(cols_a) 147 | rm(cols_b) 148 | ``` 149 | 150 | 151 | *** 152 | 153 | 154 | ### Append rows of data to a DataFrame 155 | 156 | In order to discuss how we can append the rows of one DF to those of another in SparkR, we must first subset `df` into two (2) distinct DataFrames, `A` and `B`. Below, we define `A` as a random subset of `df` with a row count that is approximately equal to half the size of `nrow(df)`. We use the DF operation `except` to create `B`, which includes every row of `df`, `except` for those included in `A`: 157 | 158 | ```{r, collapse=TRUE} 159 | A <- sample(df, withReplacement = FALSE, fraction = 0.5, seed = 1) 160 | B <- except(df, A) 161 | ``` 162 | 163 | Let's also examine the row count for each subsetted row and confirm that `A` and `B` do not share common rows. We can check this with the SparkR operation `intersect`, which performs the intersection set operation on two DFs: 164 | 165 | ```{r, collapse=TRUE} 166 | (nA <- nrow(A)) 167 | (nB <- nrow(B)) 168 | 169 | nA + nB # Equal to nrow(df) 170 | 171 | AintB <- intersect(A, B) 172 | nrow(AintB) 173 | ``` 174 | 175 | #### Append rows when column name lists are equal across DFs 176 | 177 | If we are certain that the two DFs have equivalent column name lists (with respect to both string values and column ordering), then appending the rows of one DF to another is straightforward. Here, we append the rows of `B` to `A` with the `rbind` operation: 178 | 179 | ```{r, collapse=TRUE} 180 | df1 <- rbind(A, B) 181 | 182 | nrow(df1) 183 | nrow(df) 184 | ``` 185 | 186 | We can see in the results above that `df1` is equivalent to `df`. We could, alternatively, accomplish this with the `unionALL` operation (e.g. `df1 <- unionAll(A, B)`. Note that `unionAll` is not an alias for `rbind` - we can combine any number of DFs with `rbind` while `unionAll` can only consider two (2) DataFrames at a time. 
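To make that distinction concrete, here is a hedged sketch; the third DataFrame `C` is hypothetical and assumed to have the same column names and ordering as `A` and `B`:

```{r, eval=FALSE}
# `rbind` can stack any number of DFs in a single call ...
stacked1 <- rbind(A, B, C)
# ... while `unionAll` must be chained pairwise to combine more than two DFs
stacked2 <- unionAll(unionAll(A, B), C)
```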
187 | 188 | ```{r, include=FALSE} 189 | unpersist(df1) 190 | rm(df1) 191 | ``` 192 | 193 | 194 | #### Append rows when DF column name lists are not equal 195 | 196 | Before we can discuss appending rows when we do not have column name equivalency, we must first create two DataFrames that have different column names. Let's define a new DataFrame, `B_` that includes every column in `A` and `B`, excluding the column `"loan_age"`: 197 | 198 | ```{r, collapse=TRUE} 199 | columns(B) 200 | 201 | # Define column name list that has every column in `A` and `B`, except "loan_age": 202 | cols_ <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "mths_remng", "aj_mths_remng", 203 | "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal" ) 204 | 205 | # Define subsetted DF: 206 | B_ <- select(B, cols_) 207 | ``` 208 | 209 | ```{r, include=FALSE} 210 | unpersist(B) 211 | rm(B) 212 | rm(cols_) 213 | ``` 214 | 215 | 216 | We can try to apply SparkR `rbind` operation to append `B_` to `A`, but the expression given below will result in the error: `"Union can only be performed on tables with the same number of columns, but the left table has 14 columns and" "the right has 13"` 217 | 218 | ```{r, eval=FALSE} 219 | df2 <- rbind(A, B_) 220 | ``` 221 | 222 | Two strategies to force SparkR to merge DataFrames with different column name lists are to: 223 | 224 | 1. Append by an intersection of the two sets of column names, or 225 | 2. Use `withColumn` to add columns to DF where they are missing and set each entry in the appended rows of these columns equal to `NA`. 226 | 227 | Below is a function, `rbind.intersect`, that accomplishes the first approach. Notice that, in this function, we simply take an intesection of the column names and ask SparkR to perform `rbind`, considering only this subset of (sorted) column names. 228 | 229 | ```{r, collapse=TRUE} 230 | rbind.intersect <- function(x, y) { 231 | cols <- base::intersect(colnames(x), colnames(y)) 232 | return(SparkR::rbind(x[, sort(cols)], y[, sort(cols)])) 233 | } 234 | ``` 235 | 236 | Here, we append `B_` to `A` using this function and then examine the dimensions of the resulting DF, `df2`, as well as its column names. We can see that, while the row count for `df2` is equal to that for `df`, the DF does not include the `"loan_age"` column (just as we expected!). 237 | 238 | ```{r, collapse=TRUE} 239 | df2 <- rbind.intersect(A, B_) 240 | dim(df2) 241 | colnames(df2) 242 | ``` 243 | 244 | ```{r, include=FALSE} 245 | unpersist(df2) 246 | rm(df2) 247 | ``` 248 | 249 | 250 | Accomplishing the second approach is somewhat more involved. The `rbind.fill` function, given below, identifies the outersection of the list of column names for two (2) DataFrames and adds them onto one (1) or both of the DataFrames as needed using `withColumn`. 
The function appends these columns as string dtype, and we can later recast columns as needed: 251 | 252 | ```{r, collapse=TRUE} 253 | rbind.fill <- function(x, y) { 254 | 255 | m1 <- ncol(x) 256 | m2 <- ncol(y) 257 | col_x <- colnames(x) 258 | col_y <- colnames(y) 259 | outersect <- function(x, y) {setdiff(union(x, y), intersect(x, y))} 260 | col_outer <- outersect(col_x, col_y) 261 | len <- length(col_outer) 262 | 263 | if (m2 < m1) { 264 | for (j in 1:len){ 265 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 266 | } 267 | } else { 268 | if (m2 > m1) { 269 | for (j in 1:len){ 270 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 271 | } 272 | } 273 | if (m2 == m1 & col_x != col_y) { 274 | for (j in 1:len){ 275 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 276 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 277 | } 278 | } else { } 279 | } 280 | x_sort <- x[,sort(colnames(x))] 281 | y_sort <- y[,sort(colnames(y))] 282 | return(SparkR::rbind(x_sort, y_sort)) 283 | } 284 | ``` 285 | 286 | We again append `B_` to `A`, this time using the `rbind.fill` function: 287 | 288 | ```{r, collapse=TRUE} 289 | df3 <- rbind.fill(A, B_) 290 | ``` 291 | 292 | Now, the row count for `df3` is equal to that for `df` _and_ it includes all fourteen (14) columns included in `df`: 293 | 294 | ```{r, collapse=TRUE} 295 | dim(df3) 296 | colnames(df3) 297 | ``` 298 | 299 | We know from the missing data tutorial that `df$loan_age` does not contain any `NA` or `NaN` values. By appending `B_` to `A` with the `rbind.fill` function, therefore, we should have inserted exactly `nrow(B)` many empty string entries in `df3`. Note that `"loan_age"` is currently cast as string dtype and, therefore, the column does not contain any null values and we will need to recast the column to a numerical dtype. 300 | 301 | ```{r, collapse=TRUE} 302 | df3_laEmpty <- where(df3, df3$loan_age == "") 303 | nrow(df3_laEmpty) 304 | 305 | # There are no "loan_age" null values since it is string dtype 306 | df3_laNull <- where(df3, isNull(df3$loan_age)) 307 | nrow(df3_laNull) 308 | ``` 309 | 310 | Below, we recast `"loan_age"` as integer dtype and check that the number of `"loan_age"` null values in `df3` now matches the number of entry string values in `df3` prior to recasting, as well as the number of rows in `B`: 311 | 312 | ```{r, collapse=TRUE} 313 | # Recast 314 | df3$loan_age <- cast(df3$loan_age, dataType = "integer") 315 | str(df3) 316 | 317 | # Check that values are equal 318 | 319 | df3_laNull_ <- where(df3, isNull(df3$loan_age)) 320 | nrow(df3_laEmpty) # No. of empty strings 321 | 322 | nrow(df3_laNull_) # No. of null entries 323 | 324 | nB # No. of rows in DF `B` 325 | ``` 326 | 327 | 328 | Documentation for `rbind.intersection` can be found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-intersection.R), and [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-fill.R) for `rbind.fill`. 
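As a final note on `rbind.fill`: if several columns had been appended as strings, we could recast them in a single pass. The following sketch is hypothetical - the vectors of column names and target dtypes are placeholders rather than columns that were actually filled in this example:

```{r, eval=FALSE}
filled_cols <- c("loan_age", "mths_remng")   # hypothetical list of columns appended as strings
target_types <- c("integer", "integer")      # hypothetical target dtypes
for (j in seq_along(filled_cols)) {
  # assumes withColumn replaces an existing column of the same name (Spark 2.x behavior)
  df3 <- withColumn(df3, filled_cols[j], cast(df3[[filled_cols[j]]], dataType = target_types[j]))
}
str(df3)
```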
329 | 330 | __End of tutorial__ - Next up is [Data Visualizations in SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/07_visualizations.md) -------------------------------------------------------------------------------- /rmd/07_visualizations.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Data Visualizations in SparkR' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 27, 2016" 5 | output: 6 | html_document: 7 | keep_md: true 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 - Some ggplot2.SparkR package functions do not function; package needs updating. 15 | 16 | 17 | **Objective**: In this tutorial, we illustrate various plot types that can be created in SparkR and discuss different strategies for obtaining these plots. We discuss the SparkR ggplot2 package that is in development and provide examples of plots that can be created using this package, as well as how SparkR users may develop their own functions to build visualizations. We provide examples of the following plot types: 18 | 19 | * Bar graph 20 | * Stacked or proportional bar graph 21 | * Histogram 22 | * Frequency polygon 23 | * Bivariate histogram 24 | 25 | **SparkR/R Operations Discussed**: `ggplot` (`ggplot2.SparkR`), `geom_bar` (`ggplot2.SparkR`), `geom_histogram` (`ggplot2.SparkR`), `geom_freqpoly` (`ggplot2.SparkR`), `geom_boxplot`, `geom_bivar_histogram.SparkR` (defined function) 26 | 27 | *** 28 | 29 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 30 | 31 | ```{r, include=FALSE} 32 | library(devtools) 33 | 34 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 35 | Sys.setenv(SPARK_HOME = "/home/spark") 36 | } 37 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 38 | 39 | devtools::install_github("SKKU-SKT/ggplot2.SparkR") 40 | library(ggplot2.SparkR) 41 | 42 | sparkR.session() 43 | ``` 44 | 45 | The following error indicates that you have not initiated a SparkR session: 46 | 47 | ```{r, eval=FALSE} 48 | Error in getSparkSession() : SparkSession not initialized 49 | ``` 50 | 51 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 52 | 53 | *** 54 | 55 | **Read in initial data as DataFrame (DF)**: Throughout this tutorial, we will use the diamonds data that is included in the `ggplot2` package and is frequently used in `ggplot2` examples. The data consists of prices and quality information about 54,000 diamonds. The data contains the four C’s of diamond quality, carat, cut, colour and clarity; and five physical measurements, depth, table, x, y and z. 56 | 57 | ```{r, message=F, warning=F, results='hide'} 58 | df <- read.df("s3://ui-spark-social-science-public/data/diamonds.csv", 59 | header = "true", 60 | delimiter = ",", 61 | source = "csv", 62 | inferSchema = "true", 63 | na.strings = "") 64 | cache(df) 65 | ``` 66 | 67 | We can see what the data set looks like using the `str` operation: 68 | 69 | ```{r, collapse=TRUE} 70 | str(df) 71 | ``` 72 | 73 | _Note_: The description of the `diamonds` data given above is adapted from http://ggplot2.org/book/qplot.pdf. 
74 | 75 | 76 | Introduced in the spring of 2016, the SparkR extension of Hadley Wickham's `ggplot2` package, `ggplot2.SparkR`, allows SparkR users to build visualizations by specifying a SparkR DataFrame and DF columns in ggplot expressions identically to how we would specify R data.frame components when using the `ggplot2` package, i.e. the extension package allows SparkR users to implement ggplot without having to modify the SparkR DataFrame API or to compute aggregations needed to build some plots. 77 | 78 | 79 | As of the publication date of this tutorial, the `ggplot2.SparkR` package is still nascent and has identifiable bugs, including slow processing time. However, we provide `ggplot2.SparkR` in this example for its ease of use, particularly for SparkR users wanting to build basic plots. We alternatively discuss how a SparkR user may develop their own plotting function and provide an example in which we plot a bivariate histogram. 80 | 81 | 82 | _Note_: Documentation for `ggplot2.SparkR` can be found [here](http://skku-skt.github.io/ggplot2.SparkR/), and we can view the project on GitHub [here](https://github.com/SKKU-SKT/ggplot2.SparkR). Documentation for the latest version of `ggplot2` can be found [here](http://docs.ggplot2.org/current/). 83 | 84 | *** 85 | 86 | 87 | ### Bar graph 88 | 89 | Just as we would when using `ggplot2`, the following expression plots a basic bar graph that gives frequency counts across the different levels of `"cut"` quality in the data: 90 | 91 | ```{r, collapse=TRUE} 92 | p1 <- ggplot(df, aes(x = cut)) 93 | p1 + geom_bar() 94 | ``` 95 | 96 | 97 | #### Stacked or proportional bar graph 98 | 99 | One recognized bug within `ggplot2.SparkR` is that, when specifying a `fill` column, none of the `position` specifications--`"stack"`, `"fill"` nor `"dodge"`--necessarily return plots with constant factor-level ordering across groups. For example, the following expression successfully returns a bar graph that describes proportional frequency of `"clarity"` levels (string dtype), grouped over diamond `"cut"` types (also string dtype). Note, however, that the varied color blocks representing `"clarity"` levels are not ordered similarly across different levels of `"cut"`. 
The same issue results when we specify either of the other two (2) `position` specifications: 100 | 101 | ```{r, collapse=TRUE} 102 | p2 <- ggplot(df, aes(x = cut, fill = clarity)) 103 | p2 + geom_bar(position = "fill") 104 | ``` 105 | 106 | *** 107 | 108 | 109 | ### Histogram 110 | 111 | Just as we would when using `ggplot2`, the following expression plots a histogram that gives frequency counts across binned `"price"` values in the data: 112 | 113 | ```{r, collapse=TRUE, message=F, warning=F} 114 | p3 <- ggplot(df, aes(price)) 115 | p3 + geom_histogram() 116 | ``` 117 | 118 | The preceding histogram plot assumes the `ggplot2` default, `bins = 30`, but we can change this value or override the `bins` specification by setting a `binwidth` value as we do in the following examples: 119 | 120 | ```{r, collapse=TRUE} 121 | p3 + geom_histogram(binwidth = 250) 122 | ``` 123 | 124 | ```{r, collapse=TRUE} 125 | p3 + geom_histogram(bins = 100) 126 | ``` 127 | 128 | *** 129 | 130 | 131 | ### Frequency polygon 132 | 133 | Frequency polygons provide a visual alternative to histogram plots (note that they describe equivalent aggregations), and we can fit this plot type also with `ggplot2` syntax - the following expression returns a frequency polygon that is equivalent to the first histogram plotted in the preceding section: 134 | 135 | ```{r, collapse=TRUE, message=F, warning=F} 136 | p3 + geom_freqpoly() 137 | ``` 138 | 139 | Again, we can change the class intervals by specifying `binwidth` or the number of `bins` for the frequency polygon: 140 | 141 | ```{r, collapse=TRUE} 142 | p3 + geom_freqpoly(binwidth = 250) 143 | ``` 144 | 145 | ```{r, collapse=TRUE} 146 | p3 + geom_freqpoly(bins = 100) 147 | ``` 148 | 149 | *** 150 | 151 | 152 | ### Boxplot 153 | 154 | Finally, we can create boxplots just as we would in `ggplot2`. The following expression gives a boxplot of `"price"` values across levels of `"clarity"`: 155 | 156 | ```{r, collapse=TRUE} 157 | p4 <- ggplot(df, aes(x = clarity, y = price)) 158 | p4 + geom_boxplot() 159 | ``` 160 | 161 | *** 162 | 163 | 164 | ### Additional `ggplot2.SparkR` functionality 165 | 166 | We can adapt the plot types discussed in the previous sections with the specifications given below: 167 | 168 | * Facets: `facet_grid`, `facet_wrap` and `facet_null` (default) 169 | * Coordinate systems: `coord_cartesian` and `coord_flip` 170 | * Position adjustments: `position_dodge`, `position_fill`, `position_stack` (as seen in previous example) 171 | * Scales: `scale_x_log10`, `scale_y_log10`, `labs`, `xlab`, `ylab`, `xlim` and `ylim` 172 | 173 | For example, the following expression facets our previous histogram example across the different levels of `"cut"` quality: 174 | 175 | ```{r, collapse=TRUE} 176 | p3 + geom_histogram() + facet_wrap(~cut) 177 | ``` 178 | 179 | 180 | ### Functionality gaps between `ggplot2` and SparkR extension: 181 | 182 | Below, we list several functions and plot types supported by `ggplot2` that are not currently supported by its SparkR extension package. 
The list is not exhaustive and is subject to change as the package continues to be developed: 183 | 184 | * Weighted bar graph 185 | * Weighted histogram 186 | * Strictly ordered layers for filled and stacked bar graphs (as we saw in an earlier example) 187 | * Stacked or filled histogram 188 | * Layered frequency polygon 189 | * Density plot using `geom_freqpoly` by specifying `y = ..density..` in aesthetic (note that the extension package does not support `geom_density`) 190 | 191 | *** 192 | 193 | 194 | ### Bivariate histogram 195 | 196 | In the previous examples, we relied on the `ggplot2.SparkR` package to build plots from DataFrames using syntax identical to that which we would use in a normal application of `ggplot2` on R data.frames. Given the current limitations of the extension package, we may need to develop our own function if we are interested in building a plot type that is not currently supported by `ggplot2.SparkR`. Here, we provide an example of a function that returns a bivariate histogram of two numerical DataFrame columns. 197 | 198 | 199 | When building a function in SparkR (or any other environment), we want to avoid operations that are computationally expensive and building one that returns a plot is no different. One of the most expensive operations in SparkR, `collect`, is of particular interest when building functions that return plots since collecting data locally allows us to leverage graphing tools that we use in traditional frameworks, e.g. `ggplot2`. We should `collect` data as infrequently as possible since the operation is highly memory-intensive. 200 | 201 | 202 | In the following function, we `collect` data five (5) times. Four of the times, we are collecting single values (two minimum and two maximum values), which does not require a huge amount of memory. The last `collect` that we perform collects a data.frame with three (3) columns and a row for each bin assignment pairing, which can fit in-memory on a single node (assuming we don't specify a massive value for `nbins`). When developing SparkR functions, we should only perform minor collections like the ones described. 
203 | 204 | ```{r, collapse=TRUE} 205 | geom_bivar_histogram.SparkR <- function(df, x, y, nbins){ 206 | 207 | library(ggplot2) 208 | 209 | x_min <- collect(agg(df, min(df[[x]]))) # Collect 1 210 | x_max <- collect(agg(df, max(df[[x]]))) # Collect 2 211 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 212 | 213 | y_min <- collect(agg(df, min(df[[y]]))) # Collect 3 214 | y_max <- collect(agg(df, max(df[[y]]))) # Collect 4 215 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 216 | 217 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 218 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 219 | 220 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 221 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 222 | 223 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 224 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 225 | 226 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) # Collect 5 227 | 228 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = count)) + geom_tile() 229 | 230 | return(p) 231 | } 232 | ``` 233 | 234 | Here, we evaluate the `geom_bivar_histogram.SparkR` function using `"carat"` and `"price"`: 235 | 236 | ```{r, collapse=TRUE} 237 | p5 <- geom_bivar_histogram.SparkR(df = df, x = "carat", y = "price", nbins = 100) 238 | p5 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + 239 | ylab("Price") 240 | ``` 241 | 242 | _Note_: Documentation for the `geom_bivar_histogram.SparkR` function is given [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/geom_bivar_histogram_SparkR.R). 243 | 244 | 245 | Note that the plot closely resembles a scatterplot. Bivariate histograms are one strategy for mitigating the overplotting that often occurs when attempting to visualize apparent correlation between two (2) columns in massive data sets. Furthermore, it is sometimes impossible to gather the data that is necessary to map individual points to a scatterplot onto a single node within our cluster - this is when aggregation becomes necessary rather than simply preferable. Just like plotting a univariate histogram, binning data reduces the number of points to plot and, with the appropriate choice of bin number and color scale, bivariate histograms can provide an intuitive alternative to scatterplots when working with massive data sets. 246 | 247 | 248 | For example, the following function is equivalent to our previous one, but we have changed the `fill` specification that determines the color scale from `count` to `log10(count)`. Then, we evaluate the new function with a larger `nbins` value, returning a new plot with more granular binning and a more nuanced color scale (since the breaks in the color scale are now log10-spaced). 
249 | 250 | ```{r, collapse=TRUE} 251 | geom_bivar_histogram.SparkR.log10 <- function(df, x, y, nbins){ 252 | 253 | library(ggplot2) 254 | 255 | x_min <- collect(agg(df, min(df[[x]]))) 256 | x_max <- collect(agg(df, max(df[[x]]))) 257 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 258 | 259 | y_min <- collect(agg(df, min(df[[y]]))) 260 | y_max <- collect(agg(df, max(df[[y]]))) 261 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 262 | 263 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 264 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 265 | 266 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 267 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 268 | 269 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 270 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 271 | 272 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) 273 | 274 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = log10(count))) + geom_tile() 275 | 276 | return(p) 277 | } 278 | ``` 279 | 280 | We now evaluate the `geom_bivar_histogram.SparkR.log10` function with `"carat"` and `"price"`: 281 | 282 | ```{r, collapse=TRUE} 283 | p6 <- geom_bivar_histogram.SparkR.log10(df = df, x = "carat", y = "price", nbins = 250) 284 | p6 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + 285 | ylab("Price") 286 | ``` 287 | 288 | 289 | __End of tutorial__ - Next up is [Interacting with databases using SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/08_databases-with-jdbc.md) -------------------------------------------------------------------------------- /rmd/10_timeseries-1.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Time Series I: Working with the Date Datatype & Resampling a DataFrame' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 12, 2016" 5 | output: 6 | html_document: 7 | keep_md: yes 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 15 | 16 | 17 | **Objective**: In this tutorial, we discuss how to perform several essential time series operations with SparkR. In particular, we discuss how to: 18 | 19 | * Identify and parse date datatype (dtype) DF columns, 20 | * Compute relative dates based on a specified increment of time, 21 | * Extract and modify components of a date dtype column and 22 | * Resample a time series DF to a particular unit of time frequency 23 | 24 | **SparkR/R Operations Discussed**: `unix_timestamp`, `cast`, `withColumn`, `to_date`, `last_day`, `next_day`, `add_months`, `date_add`, `date_sub`, `weekofyear`, `dayofyear`, `dayofmonth`, `datediff`, `months_between`, `year`, `month`, `hour`, `minute`, `second`, `agg`, `groupBy`, `mean` 25 | 26 | *** 27 | 28 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 
29 | 30 | ```{r, include=FALSE} 31 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 32 | Sys.setenv(SPARK_HOME = "/home/spark") 33 | } 34 | 35 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 36 | 37 | sparkR.session() 38 | ``` 39 | 40 | The following error indicates that you have not initiated a SparkR session: 41 | 42 | ```{r, eval=FALSE} 43 | Error in getSparkSession() : SparkSession not initialized 44 | ``` 45 | 46 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 47 | 48 | *** 49 | 50 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 51 | 52 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 53 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 54 | header = "false", 55 | inferSchema = "true", 56 | na.strings = "") 57 | cache(df) 58 | ``` 59 | 60 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 61 | 62 | *** 63 | 64 | 65 | ### Converting a DataFrame column to 'date' dtype 66 | 67 | 68 | As we saw in previous tutorials, there are several columns in our dataset that list dates which are helpful in determining loan performance. We will specifically consider the following columns throughout this tutorial: 69 | 70 | * `"period"` (Monthly Reporting Period): The month and year that pertain to the servicer’s cut-off period for mortgage loan information 71 | * `"dt_matr"`(Maturity Date): The month and year in which a mortgage loan is scheduled to be paid in full as defined in the mortgage loan documents 72 | * `"dt_zero_bal"`(Zero Balance Effective Date): Date on which the mortgage loan balance was reduced to zero 73 | 74 | Let's begin by reviewing the dytypes that `read.df` infers our date columns as. Note that each of our three (3) date columns were read in as strings: 75 | 76 | ```{r, collapse=TRUE} 77 | str(df) 78 | ``` 79 | 80 | While we could parse the date strings into separate year, month and day integer dtype columns, converting the columns to date dtype allows us to utilize the datetime functions available in SparkR. 81 | 82 | 83 | We can convert `"period"`, `"matr_dt"` and `"dt_zero_bal"` to date dtype with the following expressions: 84 | 85 | ```{r, collapse=TRUE} 86 | # `period` 87 | period_uts <- unix_timestamp(df$period, 'MM/dd/yyyy') # 1. Gets current Unix timestamp in seconds 88 | period_ts <- cast(period_uts, 'timestamp') # 2. Casts Unix timestamp `period_uts` as timestamp 89 | period_dt <- cast(period_ts, 'date') # 3. Casts timestamp `period_ts` as date dtype 90 | df <- withColumn(df, 'p_dt', period_dt) # 4. 
Add date dtype column `period_dt` to `df` 91 | 92 | # `dt_matr` 93 | matr_uts <- unix_timestamp(df$dt_matr, 'MM/yyyy') 94 | matr_ts <- cast(matr_uts, 'timestamp') 95 | matr_dt <- cast(matr_ts, 'date') 96 | df <- withColumn(df, 'mtr_dt', matr_dt) 97 | 98 | # `dt_zero_bal` 99 | zero_bal_uts <- unix_timestamp(df$dt_zero_bal, 'MM/yyyy') 100 | zero_bal_ts <- cast(zero_bal_uts, 'timestamp') 101 | zero_bal_dt <- cast(zero_bal_ts, 'date') 102 | df <- withColumn(df, 'zb_dt', zero_bal_dt) 103 | ``` 104 | 105 | Note that the string entries of these date DF columns are written in the formats `'MM/dd/yyyy'` and `'MM/yyyy'`. While SparkR is able to easily read a date string when it is in the default format, `'yyyy-mm-dd'`, additional steps are required for string to date conversions when the DF column entries are in a format other than the default. In order to create `"p_dt"` from `"period"`, for example, we must: 106 | 107 | 1. Define the Unix timestamp for the date string, specifying the date format that the string assumes (here, we specify `'MM/dd/yyyy'`), 108 | 2. Use the `cast` operation to convert the Unix timestamp of the string to `'timestamp'` dtype, 109 | 3. Similarly recast the `'timestamp'` form to `'date'` dtype and 110 | 4. Append the new date dtype `"p_dt"` column to `df` using the `withColumn` operation. 111 | 112 | We similarly create date dtype columns using `"dt_matr"` and `"dt_zero_bal"`. If the date string entries of these columns were in the default format, converting to date dtype would straightforward. If `"period"` was in the format `'yyyy-mm-dd'`, for example, we would be able to append `df` with a date dtype column using a simple `withColumn`/`cast` expression: `df <- withColumn(df, 'p_dt', cast(df$period, 'date'))`. We could also directly convert `"period"` to date dtype using the `to_date` operation: `df$period <- to_date(df$period)`. 113 | 114 | 115 | If we are lucky enough that our date entires are in the default format, then dtype conversion is simple and we should use either the `withColumn`/`cast` or `to_date` expressions given above. Otherwise, the longer conversion process is required. Note that, if we are maintaining our own dataset that we will use SparkR to analyze, adopting the default date format at the start will make working with date values during analysis much easier. 116 | 117 | 118 | Now that we've appended our date dtype columns to `df`, let's again look at the DF and compare the date dtype values with their associated date string values: 119 | 120 | ```{r, collapse=TRUE} 121 | str(df) 122 | ``` 123 | 124 | Note that the `"zb_dt"` entries corresponding to the missing date entries in `"dt_zero_bal"`, which were empty strings, are now nulls. 125 | 126 | *** 127 | 128 | 129 | ### Compute relative dates and measures based on a specified unit of time 130 | 131 | As we mentioned earlier, converting date strings to date dtype allows us to utilize SparkR datetime operations. In this section, we'll discuss several SparkR operations that return: 132 | 133 | * Date dtype columns, which list dates relative to a preexisting date column in the DF, and 134 | * Integer or numerical dtype columns, which list measures of time relative to a preexisting date column. 
135 | 136 | For convenience, we will review these operations using the `df_dt` DF, which includes only the date columns `"p_dt"` and `"mtr_dt"`, which we created in the preceding section: 137 | 138 | ```{r, collapse=TRUE} 139 | cols_dt <- c("p_dt", "mtr_dt") 140 | df_dt <- select(df, cols_dt) 141 | ``` 142 | 143 | 144 | #### Relative dates 145 | 146 | SparkR datetime operations that return a new date dtype column include: 147 | 148 | * `last_day`: Returns the _last_ day of the month which the given date belongs to (e.g. inputting "2013-07-27" returns "2013-07-31") 149 | * `next_day`: Returns the _first_ date which is later than the value of the date column that is on the specified day of the week 150 | * `add_months`: Returns the date that is `'numMonths'` _after_ `'startDate'` 151 | * `date_add`: Returns the date that is `'days'` days _after_ `'start'` 152 | * `date_sub`: Returns the date that is `'days'` days _before_ `'start'` 153 | 154 | Below, we create relative date columns (defining `"p_dt"` as the input date) using each of these operations and `withColumn`: 155 | 156 | ```{r, collapse=TRUE} 157 | df_dt1 <- withColumn(df_dt, 'p_ld', last_day(df_dt$p_dt)) 158 | df_dt1 <- withColumn(df_dt1, 'p_nd', next_day(df_dt$p_dt, "Sunday")) 159 | df_dt1 <- withColumn(df_dt1, 'p_addm', add_months(df_dt$p_dt, 1)) # 'startDate'="pdt", 'numMonths'=1 160 | df_dt1 <- withColumn(df_dt1, 'p_dtadd', date_add(df_dt$p_dt, 1)) # 'start'="pdt", 'days'=1 161 | df_dt1 <- withColumn(df_dt1, 'p_dtsub', date_sub(df_dt$p_dt, 1)) # 'start'="pdt", 'days'=1 162 | str(df_dt1) 163 | ``` 164 | 165 | #### Relative measures of time 166 | 167 | SparkR datetime operations that return integer or numerical dtype columns include: 168 | 169 | * `weekofyear`: Extracts the week number as an integer from a given date 170 | * `dayofyear`: Extracts the day of the year as an integer from a given date 171 | * `dayofmonth`: Extracts the day of the month as an integer from a given date 172 | * `datediff`: Returns number of months between dates 'date1' and 'date2' 173 | * `months_between`: Returns the number of days from 'start' to 'end' 174 | 175 | Here, we use `"p_dt"` and `"mtr_dt"` as inputs in the above operations. We again use `withColumn` do append the new columns to a DF: 176 | 177 | ```{r, collapse=TRUE} 178 | df_dt2 <- withColumn(df_dt, 'p_woy', weekofyear(df_dt$p_dt)) 179 | df_dt2 <- withColumn(df_dt2, 'p_doy', dayofyear(df_dt$p_dt)) 180 | df_dt2 <- withColumn(df_dt2, 'p_dom', dayofmonth(df_dt$p_dt)) 181 | df_dt2 <- withColumn(df_dt2, 'mbtw_p.mtr', months_between(df_dt$mtr_dt, df_dt$p_dt)) # 'date1'=p_dt, 'date2'=mtr_dt 182 | df_dt2 <- withColumn(df_dt2, 'dbtw_p.mtr', datediff(df_dt$mtr_dt, df_dt$p_dt)) # 'start'=p_dt, 'end'=mtr_dt 183 | str(df_dt2) 184 | ``` 185 | 186 | Note that operations that consider two different dates are sensitive to how we specify column ordering in the operation expression. For example, if we incorrectly define `"p_dt"` as `date2` and `"mtr_dt"` as `date1`, `"mbtw_p.mtr"` will consist of negative values. Similarly, `datediff` will return negative values if `start` and `end` are misspecified. 187 | 188 | *** 189 | 190 | 191 | ### Extract components of a date dtype column as integer values 192 | 193 | There are also datetime operations supported by SparkR that allow us to extract individual components of a date dtype column and return these as integers. Below, we use the `year` and `month` operations to create integer dtype columns for each of our date columns. 
Similar functions include `hour`, `minute` and `second`. 194 | 195 | ```{r, collapse=TRUE} 196 | # Year and month values for `"period_dt"` 197 | df <- withColumn(df, 'p_yr', year(df$p_dt)) 198 | df <- withColumn(df, "p_m", month(df$p_dt)) 199 | 200 | # Year value for `"matr_dt"` 201 | df <- withColumn(df, 'mtr_yr', year(df$mtr_dt)) 202 | df <- withColumn(df, "mtr_m", month(df$mtr_dt)) 203 | 204 | # Year value for `"zero_bal_dt"` 205 | df <- withColumn(df, 'zb_yr', year(df$zb_dt)) 206 | df <- withColumn(df, "zb_m", month(df$zb_dt)) 207 | ``` 208 | 209 | We can see that each of the above expressions returns a column of integer values representing the requested date value: 210 | 211 | ```{r, collapse=TRUE} 212 | str(df) 213 | ``` 214 | 215 | Note that the `NA` entries of `"zb_dt"` result in `NA` values for `"zb_yr"` and `"zb_m"`. 216 | 217 | *** 218 | 219 | 220 | ### Resample a time series DF to a particular unit of time frequency 221 | 222 | When working with time series data, we are frequently required to resample data to a different time frequency. Combing the `agg` and `groupBy` operations, as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial, is a convenient strategy for accomplishing this in SparkR. We create a new DF, `dat`, that only includes columns of numerical, integer and date dtype to use in our resampling examples: 223 | 224 | ```{r, include=FALSE} 225 | rm(df_dt) 226 | rm(df_dt1) 227 | rm(df_dt2) 228 | ``` 229 | 230 | ```{r, collapse=TRUE} 231 | cols <- c("p_yr", "p_m", "mtr_yr", "mtr_m", "zb_yr", "zb_m", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng", "aj_mths_remng") 232 | dat <- select(df, cols) 233 | 234 | unpersist(df) 235 | cache(dat) 236 | 237 | head(dat) 238 | ``` 239 | 240 | Note that, in our loan-level data, each row represents a unique loan (each made distinct by the `"loan_id"` column in `df`) and its corresponding characteristics such as `"loan_age"` and `"mths_remng"`. Note that `dat` is simply a subset `df` and, therefore, also refers to loan-level data. 241 | 242 | 243 | While we can resample the data over distinct values of any of the columns in `dat`, we will resample the loan-level data as aggregations of the DF columns by units of time since we are working with time series data. Below, we aggregate the columns of `dat` (taking the mean of the column entries) by `"p_yr"`, and then by `"p_yr"` and `"p_m"`: 244 | 245 | ```{r, collapse=TRUE} 246 | # Resample by "period_yr" 247 | dat1 <- agg(groupBy(dat, dat$p_yr), p_m = mean(dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 248 | new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 249 | mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng)) 250 | head(dat1) 251 | 252 | # Resample by "period_yr" and "period_m" 253 | dat2 <- agg(groupBy(dat, dat$p_yr, dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 254 | new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 255 | mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng)) 256 | head(arrange(dat2, dat2$p_yr, dat2$p_m), 15) # Arrange the first 15 rows of `dat2` by ascending `period_yr` and `period_m` values 257 | ``` 258 | 259 | Note that we specify the list of DF columns that we want to resample on by including it in `groupBy`. Here, we aggregated by taking the mean of each column. 
We could resample to any unit of time that we can extract from a date column, e.g. `year`, `month`, `dayofmonth`, `hour`, `minute`, `second`. Furthermore, we could have skipped the step of creating separate year- and month-level date columns - instead, we could have embedded the datetime functions directly in the `agg` expression. The following expression creates a DF that is equivalent to `dat1` in the preceding example (aside from the auto-generated name of the grouping column):

```{r, collapse=TRUE}
df2 <- agg(groupBy(df, year(df$p_dt)), p_m = mean(month(df$p_dt)), mtr_yr = mean(year(df$mtr_dt)),
           zb_yr = mean(year(df$zb_dt)), new_int_rt = mean(df$new_int_rt), act_endg_upb = mean(df$act_endg_upb),
           loan_age = mean(df$loan_age), mths_remng = mean(df$mths_remng), aj_mths_remng = mean(df$aj_mths_remng))
```


__End of tutorial__ - Next up is [Insert next tutorial]
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-10-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-10-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-11-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-11-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-12-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-12-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-13-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-13-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-15-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-15-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-17-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-17-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-9-1.png --------------------------------------------------------------------------------