├── 01_sparkr-basics-1.md ├── 02_sparkr-basics-2.md ├── 03_subsetting.md ├── 04_missing-data.md ├── 05_summary-statistics.md ├── 06_merging.md ├── 07_visualizations.md ├── 08_databases-with-jdbc.md ├── 09_glm.md ├── 10_timeseries-1.md ├── License.md ├── R ├── confint_SparkR.R ├── diff-in-diff.R ├── geom_bivar_histogram_SparkR.R ├── glm.R ├── merging.R ├── missing-data.R ├── ols1.R ├── ols2.R ├── ols2_SparkR2_test.R ├── qqnorm_SparkR.R ├── rbind-fill.R ├── rbind-intersection.R ├── sparkr-basics-1.R ├── sparkr-basics-2.R ├── subsetting.R ├── summary-statistics.R ├── time-series-1.R └── visualizations.R ├── README.md ├── glm_files └── figure-html │ ├── unnamed-chunk-10-1.png │ ├── unnamed-chunk-11-1.png │ ├── unnamed-chunk-25-1.png │ ├── unnamed-chunk-27-1.png │ ├── unnamed-chunk-29-1.png │ ├── unnamed-chunk-5-1.png │ ├── unnamed-chunk-7-1.png │ └── unnamed-chunk-9-1.png ├── rmd ├── 01_sparkr-basics-1.rmd ├── 02_sparkr-basics-2.rmd ├── 03_subsetting.rmd ├── 04_missing-data.rmd ├── 05_summary-statistics.rmd ├── 06_merging.rmd ├── 07_visualizations.rmd ├── 09_glm.rmd └── 10_timeseries-1.rmd └── visualizations_files └── figure-html ├── unnamed-chunk-10-1.png ├── unnamed-chunk-11-1.png ├── unnamed-chunk-12-1.png ├── unnamed-chunk-13-1.png ├── unnamed-chunk-15-1.png ├── unnamed-chunk-17-1.png ├── unnamed-chunk-4-1.png ├── unnamed-chunk-5-1.png ├── unnamed-chunk-6-1.png ├── unnamed-chunk-7-1.png ├── unnamed-chunk-8-1.png └── unnamed-chunk-9-1.png /03_subsetting.md: -------------------------------------------------------------------------------- 1 | # Subsetting SparkR DataFrames 2 | Sarah Armstrong, Urban Institute 3 | July 1, 2016 4 | 5 | 6 | 7 | **Last Updated**: May 23, 2017 8 | 9 | 10 | **Objective**: Now that we understand what a SparkR DataFrame (DF) really is (remember, it's not actually data!) and can write expressions using essential DataFrame operations, such as `agg`, we are ready to start subsetting DFs using more advanced transformation operations. This tutorial discusses various ways of subsetting DFs, as well as how to work with a randomly sampled subset as a local data.frame in RStudio: 11 | 12 | * Subset a DF by row 13 | * Subset a DF by a list of columns 14 | * Subset a DF by column expressions 15 | * Drop a column from a DF 16 | * Subset a DF by taking a random sample 17 | * Collect a random sample as a local R data.frame 18 | * Export a DF sample as a single .csv file to S3 19 | 20 | **SparkR/R Operations Discussed**: `filter`, `where`, `select`, `sample`, `collect`, `write.table` 21 | 22 | *** 23 | 24 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 25 | 26 | 27 | 28 | The following error indicates that you have not initiated a SparkR session: 29 | 30 | 31 | ```r 32 | Error in getSparkSession() : SparkSession not initialized 33 | ``` 34 | 35 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 36 | 37 | *** 38 | 39 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 
Note that we are __persisting__ the DataFrame since we will use it throughout this tutorial. 40 | 41 | 42 | ```r 43 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 44 | header = "false", 45 | inferSchema = "true") 46 | cache(df) 47 | ``` 48 | 49 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 50 | 51 | Let's check the dimensions our DF `df` and its column names so that we can compare the dimension sizes of `df` with those of the subsets that we will define throughout this tutorial: 52 | 53 | 54 | ```r 55 | nrow(df) 56 | ## [1] 13216516 57 | ncol(df) 58 | ## [1] 14 59 | columns(df) 60 | ## [1] "loan_id" "period" "servicer_name" "new_int_rt" 61 | ## [5] "act_endg_upb" "loan_age" "mths_remng" "aj_mths_remng" 62 | ## [9] "dt_matr" "cd_msa" "delq_sts" "flag_mod" 63 | ## [13] "cd_zero_bal" "dt_zero_bal" 64 | ``` 65 | 66 | *** 67 | 68 | 69 | ### Subset DataFrame by row: 70 | 71 | The SparkR operation `filter` allows us to subset the rows of a DF according to specified conditions. Before we begin working with `filter` to see how it works, let's print the schema of `df` since the types of subsetting conditions we are able to specify depend on the datatype of each column in the DF: 72 | 73 | 74 | ```r 75 | printSchema(df) 76 | ## root 77 | ## |-- loan_id: long (nullable = true) 78 | ## |-- period: string (nullable = true) 79 | ## |-- servicer_name: string (nullable = true) 80 | ## |-- new_int_rt: double (nullable = true) 81 | ## |-- act_endg_upb: double (nullable = true) 82 | ## |-- loan_age: integer (nullable = true) 83 | ## |-- mths_remng: integer (nullable = true) 84 | ## |-- aj_mths_remng: integer (nullable = true) 85 | ## |-- dt_matr: string (nullable = true) 86 | ## |-- cd_msa: integer (nullable = true) 87 | ## |-- delq_sts: string (nullable = true) 88 | ## |-- flag_mod: string (nullable = true) 89 | ## |-- cd_zero_bal: integer (nullable = true) 90 | ## |-- dt_zero_bal: string (nullable = true) 91 | ``` 92 | 93 | We can subset `df` into a new DF, `f1`, that includes only those loans for which JPMorgan Chase is the servicer with the expression: 94 | 95 | 96 | ```r 97 | f1 <- filter(df, df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 98 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION") 99 | nrow(f1) 100 | ## [1] 102733 101 | ``` 102 | 103 | Notice that the `filter` considers normal logical syntax (e.g. logical conditions and operations), making working with the operation very straightforward. We can specify `filter` with SQL statement strings. For example, here we have the preceding example written in SQL statement format: 104 | 105 | 106 | ```r 107 | filter(df, "servicer_name = 'JP MORGAN CHASE BANK, NA' or servicer_name = 'JPMORGAN CHASE BANK, NA' or 108 | servicer_name = 'JPMORGAN CHASE BANK, NATIONAL ASSOCIATION'") 109 | ``` 110 | 111 | Or, alternatively, in a syntax similar to how we subset data.frames by row in base R: 112 | 113 | 114 | ```r 115 | df[df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 116 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",] 117 | ``` 118 | 119 | Another example of using logical syntax with `filter` is that we can subset `df` such that the new DF only includes those loans for which the servicer name is known, i.e. 
the column `"servicer_name"` is not equal to an empty string or listed as `"OTHER"`: 120 | 121 | 122 | ```r 123 | f2 <- filter(df, df$servicer_name != "OTHER" & df$servicer_name != "") 124 | nrow(f2) 125 | ## [1] 226264 126 | ``` 127 | 128 | Or, if we wanted to only consider observations with a `"loan_age"` value of greater than 60 months (five years), we would evaluate: 129 | 130 | 131 | ```r 132 | f3 <- filter(df, df$loan_age > 60) 133 | nrow(f3) 134 | ## [1] 1714413 135 | ``` 136 | 137 | An alias for `filter` is `where`, which reads much more intuitively, particularly when `where` is embedded in a complex statement. For example, the following expression can be read as "__aggregate__ the mean loan age and count values __by__ `"servicer_name"` in `df` __where__ loan age is less than 60 months": 138 | 139 | 140 | ```r 141 | f4 <- agg(groupBy(where(df, df$loan_age < 60), where(df, df$loan_age < 60)$servicer_name), 142 | loan_age_avg = avg(where(df, df$loan_age < 60)$loan_age), 143 | count = n(where(df, df$loan_age < 60)$loan_age)) 144 | head(f4) 145 | ## servicer_name loan_age_avg count 146 | ## 1 FIRST TENNESSEE BANK, NATIONAL ASSOCIATION 23.45820 12774 147 | ## 2 BANK OF AMERICA, N.A. 20.95203 34688 148 | ## 3 WELLS FARGO BANK, N.A. 47.94743 799 149 | ## 4 GMAC MORTGAGE, LLC 21.17096 16554 150 | ## 5 FLAGSTAR BANK, FSB 42.82895 76 151 | ## 6 USAA FEDERAL SAVINGS BANK 20.35909 3080 152 | ``` 153 | 154 | *** 155 | 156 | 157 | ### Subset DataFrame by column: 158 | 159 | The operation `select` allows us to subset a DF by a specified list of columns. In the expression below, for example, we create a subsetted DF that includes only the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full (remaining maturity) and adjusted remaining maturity: 160 | 161 | 162 | ```r 163 | s1 <- select(df, "mths_remng", "aj_mths_remng") 164 | ncol(s1) 165 | ## [1] 2 166 | ``` 167 | 168 | We can also reference the column names through the DF name, i.e. `select(df, df$mths_remng, df$aj_mths_remng)`. Or, we can save a list of columns as a combination of strings. If we wanted to make a list of all columns that relate to remaining maturity, we could evaluate the expression `remng_mat <- c("mths_remng", "aj_mths_remng")` and then easily reference our list of columns later on with `select(df, remng_mat)`. 169 | 170 | 171 | Besides subsetting by a list of columns, we can also subset `df` while introducing a new column using a column expression, as we do in the example below. The DF `s2` includes the columns `"mths_remng"` and `"aj_mths_remng"` as in `s1`, but now with a column that lists the absolute value of the difference between the unadjusted and adjusted remaining maturity: 172 | 173 | 174 | ```r 175 | s2 <- select(df, df$mths_remng, df$aj_mths_remng, abs(df$aj_mths_remng - df$mths_remng)) 176 | ncol(s2) 177 | ## [1] 3 178 | head(s2) 179 | ## mths_remng aj_mths_remng abs((aj_mths_remng - mths_remng)) 180 | ## 1 293 286 7 181 | ## 2 292 283 9 182 | ## 3 291 287 4 183 | ## 4 290 287 3 184 | ## 5 289 277 12 185 | ## 6 288 277 11 186 | ``` 187 | 188 | Note that, just as we can subset by row with syntax similar to that in base R, we can similarly achieve subsetting by column. The following expressions are equivalent: 189 | 190 | 191 | ```r 192 | select(df, df$period) 193 | df[,"period"] 194 | df[,2] 195 | ``` 196 | 197 | To simultaneously subset by column and row specifications, you can simply embed a `where` expression in a `select` operation (or vice versa).
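Schematically, the two orderings look like the sketch below (a minimal example using columns that exist in `df`; note that if you `select` first, any column you then filter on must be among the selected columns):

```r
# Filter rows first, then keep only the columns of interest
sub1 <- select(where(df, df$loan_age > 60), "loan_id", "loan_age")

# Or select the columns first and then filter the smaller DF
sub2 <- select(df, "loan_id", "loan_age")
sub2 <- where(sub2, sub2$loan_age > 60)
```
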
The following expression creates a DF that lists loan age values only for observations in which servicer name is unknown: 198 | 199 | 200 | ```r 201 | s3 <- select(where(df, df$servicer_name == "" | df$servicer_name == "OTHER"), "loan_age") 202 | head(s3) 203 | ## loan_age 204 | ## 1 67 205 | ## 2 68 206 | ## 3 69 207 | ## 4 70 208 | ## 5 71 209 | ## 6 72 210 | ``` 211 | 212 | Note that we could have also written the above expression as `df[df$servicer_name == "" | df$servicer_name == "OTHER", "loan_age"]`. 213 | 214 | 215 | #### Drop a column from a DF: 216 | 217 | We can drop a column from a DF very simply by assigning `NULL` to a DF column. Below, we drop `"aj_mths_remng"` from `s1`: 218 | 219 | 220 | ```r 221 | head(s1) 222 | ## mths_remng aj_mths_remng 223 | ## 1 293 286 224 | ## 2 292 283 225 | ## 3 291 287 226 | ## 4 290 287 227 | ## 5 289 277 228 | ## 6 288 277 229 | s1$aj_mths_remng <- NULL 230 | head(s1) 231 | ## mths_remng 232 | ## 1 293 233 | ## 2 292 234 | ## 3 291 235 | ## 4 290 236 | ## 5 289 237 | ## 6 288 238 | ``` 239 | 240 | *** 241 | 242 | 243 | ### Subset a DF by taking a random sample: 244 | 245 | Perhaps the most useful subsetting operation is `sample`, which returns a randomly sampled subset of a DF. With `sample`, we can specify whether we want to sample with or without replacement, the approximate fraction of the original DF's rows that the sample should contain, and whether or not we want to define a random seed. If our initial DF is so massive that performing analysis on the entire dataset would require a more expensive cluster, we can sample the massive dataset, interactively develop our analysis in SparkR using the sample, and then run the resulting program on the initial DF, which references the entire massive dataset, only when required. This strategy helps us avoid wasting resources. 246 | 247 | Below, we take a random sample of `df`, without replacement, that is approximately equal in size to 1% of `df`. Notice that we must define a random seed in order to be able to reproduce our random sample. 248 | 249 | 250 | ```r 251 | df_samp1 <- sample(df, withReplacement = FALSE, fraction = 0.01) # Without set seed 252 | df_samp2 <- sample(df, withReplacement = FALSE, fraction = 0.01) 253 | count(df_samp1) 254 | ## [1] 132479 255 | count(df_samp2) 256 | ## [1] 132507 257 | # The row counts are different and, obviously, the DFs are not equivalent 258 | 259 | df_samp3 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) # With set seed 260 | df_samp4 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) 261 | count(df_samp3) 262 | ## [1] 131997 263 | count(df_samp4) 264 | ## [1] 131997 265 | # The row counts are equal and the DFs are equivalent 266 | ``` 267 | 268 | 269 | #### Collect a random sample as a local data.frame: 270 | 271 | An additional use of `sample` is to collect a random sample of a massive dataset as a local data.frame in R. This allows us to work with a sample in a traditional analysis environment, and because the sample is drawn from a much larger set of observations than we could normally work with, it is likely to be more representative of the population. This can be achieved by simply using `collect` to create a local data.frame: 272 | 273 | 274 | ```r 275 | typeof(df_samp4) # DFs are of class S4 276 | ## [1] "S4" 277 | dat <- collect(df_samp4) 278 | typeof(dat) 279 | ## [1] "list" 280 | ``` 281 | 282 | Note that this data.frame is _not_ local to _your_ personal computer, but rather it was gathered locally to a single node in our AWS cluster.
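Once collected, `dat` is an ordinary R data.frame, so the usual base R (or package) workflow applies directly. A minimal sketch, assuming the sample fits comfortably in memory on that node:

```r
class(dat)             # returns "data.frame"
dim(dat)               # roughly 1% of the rows of df, with all of its columns
summary(dat$loan_age)  # standard base R summaries now work without SparkR
```
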
283 | 284 | #### Export DF sample as a single .csv file to S3: 285 | 286 | If we want to export the sampled DF from RStudio as a single .csv file that we can work with in any environment, we must first coalesce the rows of `df_samp4` to a single node in our cluster using the `repartition` operation. Then, we can use the `write.df` operation as we did in the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial: 287 | 288 | 289 | ```r 290 | df_samp4_1 <- repartition(df_samp4, numPartitions = 1) 291 | write.df(df_samp4_1, path = "s3://ui-spark-social-science-public/data/hfpc_samp.csv", 292 | source = "csv", 293 | mode = "overwrite") 294 | ``` 295 | 296 | :heavy_exclamation_mark: __Warning__: We cannot collect a DF as a data.frame, nor can we repartition it to a single node, unless the DF is sufficiently small in size since it must fit onto a _single_ node! 297 | 298 | __End of tutorial__ - Next up is [Dealing with Missing Data in SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/04_missing-data.md) 299 | -------------------------------------------------------------------------------- /R/confint_SparkR.R: -------------------------------------------------------------------------------- 1 | ############################################################# 2 | ## confint.SparkR: Normal Distribution Confidence Interval ## 3 | ############################################################# 4 | # Sarah Armstrong, Urban Institute 5 | # August 31, 2016 6 | 7 | # Summary: Function that returns a confidence intervals for parameter estimates of a GLM (Gaussian distribution family, identity link function) model. 8 | 9 | # Inputs: 10 | 11 | # (*) object: a SparkR GLM model, fit with `spark.glm` operation 12 | # (*) level: level of confidence for CI 13 | 14 | # Returns: a local data.frame, detailing the CIs for each parameter estimate 15 | 16 | # ci <- confint.SparkR(object = lm, level = 0.975) 17 | # ci 18 | 19 | 20 | confint.SparkR <- function(object, level){ 21 | 22 | coef <- unname(unlist(summary(object)$coefficients[,1])) 23 | 24 | err <- unname(unlist(summary(object)$coefficients[,2])) 25 | 26 | ci <- as.data.frame(cbind(names(unlist(summary(object)$coefficients[,1])), coef - err*qt(level, summary(object)$df.null), coef + err*qt(0.975, summary(object)$df.null))) 27 | 28 | colnames(ci) <- c("","Lower Bound", "Upper Bound") 29 | 30 | return(ci) 31 | 32 | } -------------------------------------------------------------------------------- /R/diff-in-diff.R: -------------------------------------------------------------------------------- 1 | ################################################################################### 2 | ## Social Science Methodologies: Difference-in-differences (Diff-in-diff) Module ## 3 | ################################################################################### 4 | ## Objective: 5 | ## Operations discussed: 6 | 7 | ## Notes: ann overview of the Differences-in-differences method can be found at http://www.nber.org/WNE/lect_10_diffindiffs.pdf and at 8 | ## http://eml.berkeley.edu/~webfac/saez/e131_s04/diff.pdf. 9 | ## References: the SparkR code outlined below is adapted from the introductory Diff-in-Diff R code posted by Dr. Torres-Reyna at Princeton Unviersity, posted at 10 | ## http://www.princeton.edu/~otorres/DID101R.pdf. The data used in this module may be found at http://dss.princeton.edu/training/Panel101.dta. 
11 | 12 | library(foreign) 13 | library(magrittr) 14 | library(SparkR) 15 | 16 | ## Initiate SparkContext: 17 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", 18 | spark.driver.memory="1g", 19 | spark.driver.maxResultSize="1g") 20 | ,sparkPackages="com.databricks:spark-csv_2.11:1.4.0") ## Load CSV Spark Package 21 | ## AWS EMR is using Spark 2.11 so we need the associated version of spark-csv: http://spark-packages.org/package/databricks/spark-csv 22 | ## Define Spark executor memory, as well as driver memory and maxResultSize according to cluster configuration 23 | 24 | ## Initiate SparkRSQL: 25 | sqlContext <- sparkRSQL.init(sc) 26 | 27 | ## Read in example panel data from AWS S3 as a DataFrame (DF): 28 | data <- read.df(sqlContext, "s3://sparkr-tutorials/DinD_R_ex.csv", header='true', delimiter=",", source="csv", inferSchema='true') 29 | cache(data) 30 | head(data) 31 | 32 | ########################################################################################### 33 | ## (1) Create indicators for countries receiving treatment & time periods for treatment: ## 34 | ########################################################################################### 35 | 36 | ## Create an indicator variable, 'time', identifying the unit of time at which treatment began (here, the unit of time is years and the year in which treatment began is 37 | ## 1994). Therefore, 'time' at year 1994, and at subsequent years, is assigned a value of 1 and, for years preceding 1994, is given a value of 0. This indicator variable 38 | ## represents the 39 | data. <- withColumn(data, "trt_time", ifelse(data$year >= 1994, 1, 0)) # Create a new DF, 'data_', with the variable 'time' appended; note that function format given by ifelse(test, yes, no) 40 | ## Stata: gen trt_time = (year >= 1994) & !missing(year) 41 | 42 | 43 | ## Create another indicator variable, 'treatment', indicating the within sample group exposed to the treatment. Here, countries E, F and G were received the treatment, so the 44 | ## 'treatment' variable value for observations in these countries is set equal to 1, while the 'treatment' value for observations within countries A, B, C and D is set equal 45 | ## to 0. 46 | data_ <- withColumn(data., "trt_region", ifelse(data$country == "E" | data$country == "F" | data$country == "G", 1, 0)) 47 | cache(data_) 48 | unpersist(data) 49 | ## Stata: gen trt_region = (country > 4) & !missing(country) 50 | 51 | 52 | ## Rename updated DFs to 'data': 53 | head(data_) # Check the columns of updated DF to confirm DF updated properly 54 | data <- data_ # Rename 'data.' to 'data' 55 | rm(data.) 56 | rm(data_) 57 | head(data) 58 | ## Stata: rename data_ data 59 | ## Stata: drop data_ data. 60 | ## Stata: list _all in 1/5 61 | 62 | 63 | ############################################################################################## 64 | ## (2) Manually compute the treatment effect (i.e. take the difference of the differences): ## 65 | ############################################################################################## 66 | 67 | ## To mimic an experimental design with observational data, exploiting an observed natural experiment, the diff-in-diff method measures the effect of a treatment on an 68 | ## outcome by comparing the average change over time in the response variable for the treatment group and compares this to the average change over time for the control group. 
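## In symbols, this is the standard estimator (ybar denotes a group-period mean of the outcome y):
##   DiD = (ybar_treated,post - ybar_treated,pre) - (ybar_control,post - ybar_control,pre)
## which is algebraically the same as the (d - c) - (b - a) form computed manually below.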
69 | ## This diff-in-diff estimator can be computed manually, as we do immediately below, or it can be computed as the parameter estimate of the treatment indicator in a linear 70 | ## model, which we outline futher below. 71 | 72 | ## Compute the four (4) measurements required to calculate the diff-in-diff estimator: 73 | a <- collect(select(data[data$trt_time == 0 & data$trt_region == 0], mean(data$y))) 74 | b <- collect(select(data[data$trt_time == 0 & data$trt_region == 1], mean(data$y))) 75 | c <- collect(select(data[data$trt_time == 1 & data$trt_region == 0], mean(data$y))) 76 | d <- collect(select(data[data$trt_time == 1 & data$trt_region == 1], mean(data$y))) 77 | 78 | ## Now, manually calculate the diff-in-diff estimator; as you can see, we are literally calculating the difference between the differences: 79 | did_est <- (d-c)-(b-a) 80 | did_est 81 | 82 | 83 | ############################################################ 84 | ## (3) Run a simple difference-in-differences regression: ## 85 | ############################################################ 86 | 87 | ## As previously stated, the parameter estimation of the interaction term 'trt_time:trt_region' included in the below linear model is the diff-in-diff estimator. This can be 88 | ## verified by comparing the 'did_est' value, which we calculated in Section (2), with the parameter estimation for 'trt_time:trt_region'. Note that the values are equal, 89 | ## and that the interaction term 'trt_time:trt_region' is a binary variable that indicates treatment status. 90 | m1 <- glm(y ~ trt_region + trt_time + trt_region:trt_time, data = data, family = "gaussian") 91 | summary(m1) 92 | 93 | 94 | ######################################### 95 | ## (4) Check diff-in-diff assumptions: ## 96 | ######################################### 97 | 98 | ## Is a line graph possible in SparkR? Would be nice to be able to provide visualization of parallel trend assumption - traditionally necessary for causality justification! 99 | 100 | ## Include leads in regression to measure whether or not there is any evidence of an anticipatory effect (if there is no effect, then leads should be approx 0 - this supports parallel trend assumption) 101 | ## Include lags to measure direction and maginitude of effect following initatial treatment exposure 102 | ## >>> Create lead and lag and then re-run glm 103 | 104 | ## Could perhaps run an F-test on the difference in mean(y) across the treatment and control groups (here, countries) for the pre-treatment years - if parallel trend 105 | ## asumption is valid, this F-test should yield stat. insignificant result; Note: this is a necessary condition, but not a sufficient condition for validation of parallel 106 | ## trend assumption since statistical insignificance of F-test results could be due to low test power -------------------------------------------------------------------------------- /R/geom_bivar_histogram_SparkR.R: -------------------------------------------------------------------------------- 1 | ########################################## 2 | ## geom_bivar_histogram.SparkR Function ## 3 | ########################################## 4 | # Sarah Armstrong & Alex Engler, Urban Institute 5 | # July 21, 2016 6 | 7 | # Summary: Plots a two-dimensional (2-D) histogram of frequency counts for two numerical DataFrame columns over a `nbin`-by-`nbin` grid of bins. 
8 | 9 | # Inputs: 10 | # (*) df: SparkR DataFrame 11 | # (*) x, y (string): The names of two numerical-valued columns in the SparkR DataFrame df 12 | # (*) nbins (integer): The square root of the total number of bins that the frequency counts for x and y are aggregated over 13 | # (*) title (string): A string specifying the input for `ggtitle` input in `ggplot` 14 | # (*) xlab, ylab (string): A string specifying the input for `xlab` and `ylab` input in `ggplot`, respectively 15 | 16 | # Returns: 2-D histogram of frequency counts (using `geom_tile` from ggplot2 package) 17 | 18 | # Example: 19 | # p1 <- geom_bivar_histogram.SparkR(df = df, x = "carat", y = "price", nbins = 250) 20 | # p1 + scale_colour_brewer() + ggtitle("This is a title") + xlab("Carat") + ylab("Price") 21 | 22 | geom_bivar_histogram.SparkR <- function(df, x, y, nbins){ 23 | 24 | library(ggplot2) 25 | 26 | x_min <- collect(agg(df, min(df[[x]]))) 27 | x_max <- collect(agg(df, max(df[[x]]))) 28 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 29 | 30 | y_min <- collect(agg(df, min(df[[y]]))) 31 | y_max <- collect(agg(df, max(df[[y]]))) 32 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 33 | 34 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 35 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 36 | 37 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 38 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 39 | 40 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 41 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 42 | 43 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) 44 | 45 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = count)) + geom_tile() 46 | 47 | return(p) 48 | } -------------------------------------------------------------------------------- /R/merging.R: -------------------------------------------------------------------------------- 1 | ############################### 2 | ## Merging SparkR DataFrames ## 3 | ############################### 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 7, 2016 7 | ## Last Updated: August 18, 2016 8 | 9 | 10 | ## Objective: The following tutorial provides an overview of how to join SparkR DataFrames by column and by row. 
In particular, we discuss how to: 11 | 12 | ## * Merge two DFs by column condition(s) (join by row) 13 | ## * Append rows of data to a DataFrame (join by column) 14 | ## + When column name lists are equal across DFs 15 | ## + When column name lists are not equal 16 | 17 | ## **SparkR/R Operations Discussed**: `join`, `merge`, `sample`, `except`, `intersect`, `rbind`, `rbind.intersect` (defined function), `rbind.fill` (defined function) 18 | 19 | 20 | ## Initiate SparkR session: 21 | 22 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 23 | Sys.setenv(SPARK_HOME = "/home/spark") 24 | } 25 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 26 | sparkR.session() 27 | 28 | ## Read in initial data as DataFrame (DF): 29 | 30 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true", na.strings = "") 31 | cache(df) 32 | 33 | 34 | ############################################################# 35 | ## (1) Join (merge) two DataFrames by column condition(s): ## 36 | ############################################################# 37 | 38 | ## We begin by subsetting `df` by column, resulting in two (2) DataFrames that are disjoint, except for them both including the loan identification variable, `"loan_id"`: 39 | 40 | # Print the column names of df: 41 | columns(df) 42 | 43 | # Specify column lists to fit `a` and `b` on - these are disjoint sets (except for "loan_id"): 44 | cols_a <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng") 45 | cols_b <- c("loan_id", "aj_mths_remng", "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal") 46 | 47 | # Create `a` and `b` DFs with the `select` operation: 48 | a <- select(df, cols_a) 49 | b <- select(df, cols_b) 50 | 51 | # Print several rows from each subsetted DF: 52 | str(a) 53 | str(b) 54 | 55 | ## We can use the SparkR operation `join` to merge `a` and `b` by row, returning a DataFrame equivalent to `df`. The `join` operation allows us to perform most SQL join types on SparkR DFs, including: 56 | 57 | ## * `"inner"` (default): Returns rows where there is a match in both DFs 58 | ## * `"outer"`: Returns rows where there is a match in both DFs, as well as rows in both the right and left DF where there was no match 59 | ## * `"full"`, `"fullouter"`: Returns rows where there is a match in one of the DFs 60 | ## * `"left"`, `"leftouter"`, `"left_outer"`: Returns all rows from the left DF, even if there are no matches in the right DF 61 | ## * `"right"`, `"rightouter"`, `"right_outer"`: Returns all rows from the right DF, even if there are no matches in the left DF 62 | ## * Cartesian: Returns the Cartesian product of the sets of records from the two or more joined DFs - `join` will return this DF when we _do not_ specify a `joinType` _nor_ a `joinExpr` (discussed below) 63 | 64 | ## We communicate to SparkR what condition we want to join DFs on with the `joinExpr` specification in `join`. Below, we perform an `"inner"` (default) join on the DFs `a` and `b` on the condition that their `"loan_id"` values be equal: 65 | 66 | ab1 <- join(a, b, a$loan_id == b$loan_id) 67 | str(ab1) 68 | 69 | ## Note that the resulting DF includes two (2) `"loan_id"` columns. 
Unfortunately, we cannot direct SparkR to keep only one of these columns when using `join` to merge by row, and the following command (which we introduced in the subsetting tutorial) drops both `"loan_id"` columns: 70 | 71 | ab1$loan_id <- NULL 72 | 73 | ## We can avoid this by renaming one of the columns before performing `join` and then, utilizing that the columns have distinct names, tell SparkR to drop only one of the columns. For example, we could rename `"loan_id"` in `a` with the expression `a <- withColumnRenamed(a, "loan_id", "loan_id_")`, then drop this column with `ab1$loan_id_ <- NULL` after performing `join` on `a` and `b` to return `ab1`. 74 | 75 | ## The `merge` operation, alternatively, allows us to join DFs and produces two (2) _distinct_ merge columns. We can use this feature to retain the column on which we joined the DFs, but we must still perform a `withColumnRenamed` step if we want our merge column to retain its original column name. 76 | 77 | ## Rather than defining a `joinExpr`, we explictly specify the column(s) that SparkR should `merge` the DFs on with the operation parameters `by` and `by.x`/`by.y` (if the merging column is named differently across the DFs). Note that, if we do not specify `by`, SparkR will merge the DFs on the list of common column names shared by the DFs. Rather than specifying a type of join, `merge` determines how SparkR should merge DFs based on boolean values, `all.x` and `all.y`, which indicate which rows in `x` and `y` should be included in the join, respectively. We can specify `merge` type with the following parameter values: 78 | 79 | ## * `all.x = FALSE`, `all.y = FALSE`: Returns an inner join (this is the default and can be achieved by not specifying values for all.x and all.y) 80 | ## * `all.x = TRUE`, `all.y = FALSE`: Returns a left outer join 81 | ## * `all.x = FALSE`, `all.y = TRUE`: Returns a right outer join 82 | ## * `all.x = TRUE`, `all.y = TRUE`: Returns a full outer join 83 | 84 | ## The following `merge` expression is equivalent to the `join` expression in the preceding example: 85 | 86 | ab2 <- merge(a, b, by = "loan_id") 87 | str(ab2) 88 | 89 | ## Note that the two merging columns are distinct as indicated by the `_x` and `_y` name assignments performed by `merge`. We utilize this distinction in the expressions below to retain a single merge column: 90 | 91 | # Drop "loan_id" column from `b`: 92 | ab2$loan_id_y <- NULL 93 | 94 | # Rename "loan_id" column from `a`: 95 | ab2 <- withColumnRenamed(ab2, "loan_id_x", "loan_id") 96 | 97 | # Final DF with single "loan_id" column: 98 | str(ab2) 99 | 100 | rm(a) 101 | rm(b) 102 | rm(ab1) 103 | rm(ab2) 104 | rm(cols_a) 105 | rm(cols_b) 106 | 107 | ############################################# 108 | ## (2) Append rows of data to a DataFrame: ## 109 | ############################################# 110 | 111 | ## In order to discuss how we can append the rows of one DF to those of another in SparkR, we must first subset `df` into two (2) distinct DataFrames, `A` and `B`. Below, we define `A` as a random subset of `df` with a row count that is approximately equal to half the size of `nrow(df)`. We use the DF operation `except` to create `B`, which includes every row of `df`, `except` for those included in `A`: 112 | 113 | A <- sample(df, withReplacement = FALSE, fraction = 0.5, seed = 1) 114 | B <- except(df, A) 115 | 116 | ## Let's also examine the row count for each subsetted row and confirm that `A` and `B` do not share common rows. 
We can check this with the SparkR operation `intersect`, which performs the intersection set operation on two DFs: 117 | 118 | (nA <- nrow(A)) 119 | (nB <- nrow(B)) 120 | 121 | nA + nB # Equal to nrow(df) 122 | 123 | AintB <- intersect(A, B) 124 | nrow(AintB) 125 | 126 | ################################################################### 127 | ## (2i) Append rows when column name lists are equal across DFs: ## 128 | ################################################################### 129 | 130 | ## If we are certain that the two DFs have equivalent column name lists (with respect to both string values and column ordering), then appending the rows of one DF to another is straightforward. Here, we append the rows of `B` to `A` with the `rbind` operation: 131 | 132 | df1 <- rbind(A, B) 133 | 134 | nrow(df1) 135 | nrow(df) 136 | 137 | ## We can see in the results above that `df1` is equivalent to `df`. We could, alternatively, accomplish this with the `unionALL` operation (e.g. `df1 <- unionAll(A, B)`. Note that `unionAll` is not an alias for `rbind` - we can combine any number of DFs with `rbind` while `unionAll` can only consider two (2) DataFrames at a time. 138 | 139 | unpersist(df1) 140 | rm(df1) 141 | 142 | ############################################################### 143 | ## (2i) Append rows when DF column name lists are not equal: ## 144 | ############################################################### 145 | 146 | ## Before we can discuss appending rows when we do not have column name equivalency, we must first create two DataFrames that have different column names. Let's define a new DataFrame, `B_` that includes every column in `A` and `B`, excluding the column `"loan_age"`: 147 | 148 | columns(B) 149 | 150 | # Define column name list that has every column in `A` and `B`, except "loan_age": 151 | cols_ <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "mths_remng", "aj_mths_remng", 152 | "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal" ) 153 | 154 | # Define subsetted DF: 155 | B_ <- select(B, cols_) 156 | 157 | unpersist(B) 158 | rm(B) 159 | rm(cols_) 160 | 161 | ## We can try to apply SparkR `rbind` operation to append `B_` to `A`, but the expression given below will result in the error: `"Union can only be performed on tables with the same number of columns, but the left table has 14 columns and" "the right has 13"` 162 | 163 | df2 <- rbind(A, B_) 164 | 165 | ## Two strategies to force SparkR to merge DataFrames with different column name lists are to: 166 | 167 | ## 1. Append by an intersection of the two sets of column names, or 168 | ## 2. Use `withColumn` to add columns to DF where they are missing and set each entry in the appended rows of these columns equal to `NA`. 169 | 170 | ## Below is a function, `rbind.intersect`, that accomplishes the first approach. Notice that, in this function, we simply take an intesection of the column names and ask SparkR to perform `rbind`, considering only this subset of (sorted) column names. 171 | 172 | rbind.intersect <- function(x, y) { 173 | cols <- base::intersect(colnames(x), colnames(y)) 174 | return(SparkR::rbind(x[, sort(cols)], y[, sort(cols)])) 175 | } 176 | 177 | ## Here, we append `B_` to `A` using this function and then examine the dimensions of the resulting DF, `df2`, as well as its column names. We can see that, while the row count for `df2` is equal to that for `df`, the DF does not include the `"loan_age"` column (just as we expected!). 
178 | 179 | df2 <- rbind.intersect(A, B_) 180 | dim(df2) 181 | colnames(df2) 182 | 183 | unpersist(df2) 184 | rm(df2) 185 | 186 | ## Accomplishing the second approach is somewhat more involved. The `rbind.fill` function, given below, identifies the outersection of the list of column names for two (2) DataFrames and adds them onto one (1) or both of the DataFrames as needed using `withColumn`. The function appends these columns as string dtype, and we can later recast columns as needed: 187 | 188 | rbind.fill <- function(x, y) { 189 | 190 | m1 <- ncol(x) 191 | m2 <- ncol(y) 192 | col_x <- colnames(x) 193 | col_y <- colnames(y) 194 | outersect <- function(x, y) {setdiff(union(x, y), intersect(x, y))} 195 | col_outer <- outersect(col_x, col_y) 196 | len <- length(col_outer) 197 | 198 | if (m2 < m1) { 199 | for (j in 1:len){ 200 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 201 | } 202 | } else { 203 | if (m2 > m1) { 204 | for (j in 1:len){ 205 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 206 | } 207 | } 208 | if (m2 == m1 & col_x != col_y) { 209 | for (j in 1:len){ 210 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 211 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 212 | } 213 | } else { } 214 | } 215 | x_sort <- x[,sort(colnames(x))] 216 | y_sort <- y[,sort(colnames(y))] 217 | return(SparkR::rbind(x_sort, y_sort)) 218 | } 219 | 220 | ## We again append `B_` to `A`, this time using the `rbind.fill` function: 221 | 222 | df3 <- rbind.fill(A, B_) 223 | 224 | ## Now, the row count for `df3` is equal to that for `df` _and_ it includes all fourteen (14) columns included in `df`: 225 | 226 | dim(df3) 227 | colnames(df3) 228 | 229 | ## We know from the missing data tutorial that `df$loan_age` does not contain any `NA` or `NaN` values. By appending `B_` to `A` with the `rbind.fill` function, therefore, we should have inserted exactly `nrow(B)` many empty string entries in `df3`. Note that `"loan_age"` is currently cast as string dtype and, therefore, the column does not contain any null values and we will need to recast the column to a numerical dtype. 230 | 231 | df3_laEmpty <- where(df3, df3$loan_age == "") 232 | nrow(df3_laEmpty) 233 | 234 | # There are no "loan_age" null values since it is string dtype 235 | df3_laNull <- where(df3, isNull(df3$loan_age)) 236 | nrow(df3_laNull) 237 | 238 | ## Below, we recast `"loan_age"` as integer dtype and check that the number of `"loan_age"` null values in `df3` now matches the number of entry string values in `df3` prior to recasting, as well as the number of rows in `B`: 239 | 240 | # Recast 241 | df3$loan_age <- cast(df3$loan_age, dataType = "integer") 242 | str(df3) 243 | 244 | # Check that values are equal 245 | 246 | df3_laNull_ <- where(df3, isNull(df3$loan_age)) 247 | nrow(df3_laEmpty) # No. of empty strings 248 | 249 | nrow(df3_laNull_) # No. of null entries 250 | 251 | nB # No. of rows in DF `B` 252 | 253 | ## Documentation for rbind.intersection can be found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-intersection.R), and [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-fill.R) for rbind.fill. 
-------------------------------------------------------------------------------- /R/missing-data.R: -------------------------------------------------------------------------------- 1 | ######################################### 2 | ## Dealing with Missing Data in SparkR ## 3 | ######################################### 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 6, 2016 7 | ## Last Updated: August 17, 2016 8 | 9 | 10 | ## Objective: In this tutorial, we discuss general strategies for dealing with missing data in the SparkR environment. While we do not consider conceptually how and why we might impute missing values in a dataset, we do discuss logistically how we could drop rows with missing data and impute missing data with replacement values. We specifically consider the following during this tutorial: 11 | 12 | ## * Specify null values when loading data in as a DF 13 | ## * Conditional expressions on empty DF entries 14 | ## + Null and NaN indicator operations 15 | ## + Conditioning on empty string entries 16 | ## + Distribution of missing data across grouped data 17 | ## * Drop rows with missing data 18 | ## + Null value entries 19 | ## + Empty string entries 20 | ## * Fill missing data entries 21 | ## + Null value entries 22 | ## + Empty string entries 23 | 24 | ## SparkR/R Operations Discussed: `read.df` (`nullValue = ""`), `printSchema`, `nrow`, `isNull`, `isNotNull`, `isNaN`, `count`, `where`, `agg`, `groupBy`, `n`, `collect`, `dropna`, `na.omit`, `list`, `fillna` 25 | 26 | 27 | ## Initiate SparkR session: 28 | 29 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 30 | Sys.setenv(SPARK_HOME = "/home/spark") 31 | } 32 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 33 | sparkR.session() 34 | 35 | ############################################################################## 36 | ## (1) Specify null values when loading data in as a SparkR DataFrame (DF): ## 37 | ############################################################################## 38 | 39 | ## Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-1.md) tutorial. Note that we now include the `na.strings` option in the `read.df` transformation below. By setting `na.strings` equal to an empty string in `read.df`, we direct SparkR to interpret empty entries in the dataset as being equal to nulls in `df`. Therefore, any DF entries matching this string (here, set to equal an empty entry) will be set equal to a null value in `df`. 40 | 41 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true", na.strings = "") 42 | cache(df) 43 | 44 | ## We can replace this empty string with any string that we know indicates a null entry in the dataset, i.e. with `na.strings=""`. Note that SparkR only reads empty entries as null values in numerical and integer datatype (dtype) DF columns, meaning that empty entries in DF columns of string dtype will simply equal an empty string. We consider how to work with this type of observation throughout this tutorial alongside our treatment of null values. 45 | 46 | ## With `printSchema`, we can see the dtype of each column in `df` and, noting which columns are of a numerical and integer dtypes and which are string, use this to determine how we should examine missing data in each column of `df`. 
We also count the number of rows in `df` so that we can compare this value to row counts that we compute throughout this tutorial: 47 | 48 | printSchema(df) 49 | (n <- nrow(df)) 50 | 51 | ###################################################### 52 | ## (2) Conditional expressions on empty DF entries: ## 53 | ###################################################### 54 | 55 | ############################################# 56 | ## (2i) Null and NaN indicator operations: ## 57 | ############################################# 58 | 59 | ## We saw in the subsetting tutorial how to subset a DF by some conditional statement. We can extend this reasoning in order to identify missing data in a DF and to explore the distribution of missing data within a DF. SparkR operations indicating null and NaN entries in a DF are `isNull`, `isNaN` and `isNotNull`, and these can be used in conditional statements to locate or to remove DF rows with null and NaN entries. 60 | 61 | ## Below, we count the number of missing entries in `"loan_age"` and in `"mths_remng"`, which are both of integer dtype. We can see below that there are no missing or NaN entries in `"loan_age"`. Note that the `isNull` and `isNaN` count results differ for `"mths_remng"` - while there are missing values in `"mths_remng"`, there are no NaN entries (entires that are "not a number"). 62 | 63 | df_laNull <- where(df, isNull(df$loan_age)) 64 | count(df_laNull) 65 | df_laNaN <- where(df, isNaN(df$loan_age)) 66 | count(df_laNaN) 67 | 68 | df_mrNull <- where(df, isNull(df$mths_remng)) 69 | count(df_mrNull) 70 | df_mrNaN <- where(df, isNaN(df$mths_remng)) 71 | count(df_mrNaN) 72 | 73 | ################################# 74 | ## (2ii) Empty string entries: ## 75 | ################################# 76 | 77 | ## If we want to count the number of rows with missing entries for `"servicer_name"` (string dtype) we can simply use the equality logical condition (==) to direct SparkR to `count` the number of rows `where` the entries in the `"servicer_name"` column are equal to an empty string: 78 | 79 | df_snEmpty <- where(df, df$servicer_name == "") 80 | count(df_snEmpty) 81 | 82 | ############################################################## 83 | ## (2iii) Distribution of missing data across grouped data: ## 84 | ############################################################## 85 | 86 | ## We can also condition on missing data when aggregating over grouped data in order to see how missing data is distributed over a categorical variable within our data. In order to view the distribution of `"mths_remng"` observations with null values over distinct entries of `"servicer_name"`, we (1) group the entries of the DF `df_mrNull` that we created in the preceding example over `"servicer_name"` entries, (2) create the DF `mrNull_by_sn` which consists of the number of observations in `df_mrNull` by `"servicer_name"` entries and (3) collect `mrNull_by_sn` into a nicely formatted table as a local data.frame: 87 | 88 | gb_sn_mrNull <- groupBy(df_mrNull, df_mrNull$servicer_name) 89 | mrNull_by_sn <- agg(gb_sn_mrNull, Nulls = n(df_mrNull$servicer_name)) 90 | 91 | mrNull_by_sn.dat <- collect(mrNull_by_sn) 92 | mrNull_by_sn.dat 93 | # Alternatively, we could have evaluated showDF(mrNull_by_sn) to print DF 94 | 95 | ## Note that the resulting data.frame lists only nine (9) distinct string values for `"servicer_name"`. So, any row in `df` with a null entry for `"mths_remng"` has one of these strings as its corresponding `"servicer_name"` value. 
We could similarly examine the distribution of missing entries for some string dtype column across grouped data by first filtering a DF on the condition that the string column is equal to an empty string, rather than filtering with a null indicator operation (e.g. `isNull`), then performing the `groupBy` operation. 96 | 97 | ###################################### 98 | ## (3) Drop rows with missing data: ## 99 | ###################################### 100 | 101 | ############################## 102 | ## (3i) Null value entries: ## 103 | ############################## 104 | 105 | ## The SparkR operation `dropna` (or its alias `na.omit`) creates a new DF that omits rows with null value entries. We can configure `dropna` in a number of ways, including whether we want to omit rows with nulls in a specified list of DF columns or across all columns within a DF. 106 | 107 | ## If we want to drop rows with nulls for a list of columns in `df`, we can define a list of column names and then include this in `dropna` or we could embed this list directly in the operation. Below, we explicitly define a list of column names on which we condition `dropna`: 108 | 109 | mrlist <- list("mths_remng", "aj_mths_remng") 110 | df_mrNoNulls <- dropna(df, cols = mrlist) 111 | nrow(df_mrNoNulls) 112 | 113 | ## Alternatively, we could `filter` the DF using the `isNotNull` condition as follows: 114 | 115 | df_mrNoNulls_ <- filter(df, isNotNull(df$mths_remng) & isNotNull(df$aj_mths_remng)) 116 | nrow(df_mrNoNulls_) 117 | 118 | ## If we want to consider all columns in a DF when omitting rows with null values, we can use either the `how` or `minNonNulls` paramters of `dropna`. 119 | 120 | ## The parameter `how` allows us to decide whether we want to drop a row if it contains `"any"` nulls or if we want to drop a row only if `"all"` of its entries are nulls. We can see below that there are no rows in `df` in which all of its values are null, but only a small percentage of the rows in `df` have no null value entries: 121 | 122 | df_all <- dropna(df, how = "all") 123 | nrow(df_all) # Equal in value to n 124 | 125 | df_any <- dropna(df, how = "any") 126 | (n_any <- nrow(df_any)) 127 | (n_any/n)*100 128 | 129 | ## We can set a minimum number of non-null entries required for a row to remain in the DF by specifying a `minNonNulls` value. If included in `dropna`, this specification directs SparkR to drop rows that have less than `minNonNulls = ` non-null entries. Note that including `minNonNulls` overwrites the `how` specification. Below, we omit rows with that have less than 5 and 12 entries that are _not_ nulls. Note that there are no rows in `df` that have less than 5 non-null entries, and there are only approximately 8,000 rows with less than 12 non-null entries. 130 | 131 | df_5 <- dropna(df, minNonNulls = 5) 132 | nrow(df_5) # Equal in value to n 133 | 134 | df_12 <- dropna(df, minNonNulls = 12) 135 | (n_12 <- nrow(df_12)) 136 | n - n_12 137 | 138 | ################################# 139 | ## (3ii) Empty string entries: ## 140 | ################################# 141 | 142 | ## If we want to create a new DF that does not include any row with missing entries for a column of string dtype, we could also use `filter` to accomplish this. 
In order to remove observations with a missing `"servicer_name"` value, we simply filter `df` on the condition that `"servicer_name"` does not equal an empty string entry: 143 | 144 | df_snNoEmpty <- filter(df, df$servicer_name != "") 145 | nrow(df_snNoEmpty) 146 | 147 | #################################### 148 | ## (4) Fill missing data entries: ## 149 | #################################### 150 | 151 | ############################## 152 | ## (4i) Null value entries: ## 153 | ############################## 154 | 155 | ## The `fillna` operation allows us to replace null entries with some specified value. In order to replace null entries in every numerical and integer column in `df` with a value, we simply evaluate the expression `fillna(df, )`. We replace every null entry in `df` with the value 12345 below: 156 | 157 | str(df) 158 | 159 | df_ <- fillna(df, value = 12345) 160 | str(df_) 161 | rm(df_) 162 | 163 | ## If we want to replace null values within a list of DF columns, we can specify a column list just as we did in `dropna`. Here, we replace the null values in only `"act_endg_upb"` with 12345: 164 | 165 | str(df) 166 | 167 | df_ <- fillna(df, list("act_endg_upb" = 12345)) 168 | str(df_) 169 | rm(df_) 170 | 171 | ################################# 172 | ## (4ii) Empty string entries: ## 173 | ################################# 174 | 175 | ## Finally, we can replace the empty entries in string dtype columns with the `ifelse` operation, which follows the syntax `ifelse(, , )`. Here, we replace the empty entries in `"servicer_name"` with the string `"Unknown"`: 176 | 177 | str(df) 178 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 179 | str(df) -------------------------------------------------------------------------------- /R/ols1.R: -------------------------------------------------------------------------------- 1 | ############################################################################ 2 | ## Social Science Methodologies: Generalized Linear Models (GLM) Module 1 ## 3 | ############################################################################ 4 | ## Objective: 5 | ## Operations discussed: glm 6 | 7 | library(SparkR) 8 | library(ggplot2) 9 | library(reshape2) 10 | 11 | ## Initiate SparkContext: 12 | 13 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", 14 | spark.driver.memory="1g", 15 | spark.driver.maxResultSize="1g") 16 | ,sparkPackages="com.databricks:spark-csv_2.11:1.4.0") # Load CSV Spark Package 17 | 18 | ## AWS EMR is using Spark 2.11 so we need the associated version of spark-csv: http://spark-packages.org/package/databricks/spark-csv 19 | ## Define Spark executor memory, as well as driver memory and maxResultSize according to cluster configuration 20 | 21 | ## Initiate SparkRSQL: 22 | 23 | sqlContext <- sparkRSQL.init(sc) 24 | 25 | ## Create a local R data.frame: 26 | 27 | x1 <- rnorm(n=200, mean=10, sd=2) 28 | x2 <- rnorm(n=200, mean=17, sd=3) 29 | x3 <- rnorm(n=200, mean=8, sd=1) 30 | y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1) # Can see what the true values of the model parameters are 31 | dat <- cbind.data.frame(y, x1, x2, x3) 32 | 33 | ## Ordinary linear regression (OLR) model with local data.frame and print model summary: 34 | 35 | m1 <- stats::lm(y ~ x1 + x2 + x3, data = dat) # Include `stats::` to require SparkR to estimate `m1` with base R `lm` operation 36 | summary(m1) 37 | 38 | ## Compute OLR model statistics: 39 | 40 | output1 <- summary(m1) 41 | yavg1 <- mean(dat$y) 42 | yhat1 <- 
m1$fitted.values 43 | coeffs1 <- m1$coefficients 44 | r1 <- m1$resid 45 | SSR1 <- deviance(m1) 46 | Rsq1 <- output1$r.squared 47 | aRsq1 <- output1$adj.r.squared 48 | s1 <- output1$sigma 49 | covmatr1 <- s1^2*output1$cov 50 | 51 | 52 | ## Note: use `lm` function from `stats` R package to estimate ordinary linear regression model for local data.frame to easily compute Rsq and aRsq 53 | ## The `glm` operation of neither `stats` nor `SparkR` yields Rsq/aRsq, which makes sense since Rsq/aRsq are widely-accepted measures of goodness-of-fit (GOF) for ordinary 54 | ## linear regression, but not for generalized linear models. Other GOF measures are typically used when assessing GLMs since the meaning of the Rsq/aRsq values for a GLM 55 | ## become convoluted when fitting a GLM of a family and with a link function different from Gaussian and identity, respectively (in fact, there are several types of 56 | ## residuals that can be computed for GLMs!). The `glm` function in R usually prints AIC, deviance residuals and null deviance in its model summary function. Below, we 57 | ## fit an OLR model using the SparkR `glm` operation since g(Y) = Y = XB + e for the identity link function, g(Y) = Y. 58 | 59 | 60 | 61 | ## Create SparkR DataFrame (DF) from local data.frame: 62 | 63 | df <- as.DataFrame(sqlContext, dat) 64 | 65 | ## Perform OLS estimation on DF with the same specifications as our data.frame OLS estimation: 66 | 67 | m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs") 68 | summary(m2) 69 | 70 | ## Compute OLR model statistics: 71 | 72 | output2 <- summary(m2) 73 | coeffs2 <- output2$coefficients[,1] 74 | 75 | # Calculate average y value: 76 | yavg2 <- collect(agg(df, yavg_df = mean(df$y)))$yavg_df 77 | # Predict fitted values using the DF OLS model -> yields new DF 78 | yhat2_df <- predict(m2, df) 79 | head(yhat2_df) # so you can see what the prediction DF looks like 80 | # Transform the SparkR fitted values DF (yhat2_df) so that it is easier to read and includes squared residuals and squared totals & extract yhat vector (as new DF) 81 | yhat2_df <- transform(yhat2_df, sq_res2 = (yhat2_df$y - yhat2_df$prediction)^2, sq_tot2 = (yhat2_df$y - yavg2)^2) 82 | yhat2_df <- transform(yhat2_df, yhat = yhat2_df$prediction) 83 | head(select(yhat2_df, "y", "yhat", "sq_res2", "sq_tot2")) 84 | head(yhat2 <- select(yhat2_df, "yhat")) 85 | # Compute sum of squared residuals and totals, then use these values to calculate R-squared: 86 | SSR2 <- collect(agg(yhat2_df, SSR2=sum(yhat2_df$sq_res2)))$SSR2 ##### Note: collect produces a data.frame - get values out of d.f's in order to calculate aRsq and Rsq 87 | SST2 <- collect(agg(yhat2_df, SST2=sum(yhat2_df$sq_tot2)))$SST2 88 | Rsq2 <- 1-(SSR2/SST2) 89 | p <- 3 90 | N <- nrow(df) 91 | aRsq2 <- 1-(((1-Rsq2)*(N-1))/(N-p-1)) 92 | 93 | ## Iteratively fit linear regression models using SparkR `glm`, using l-bfgs for optimization, and plot resulting coefficient estimations with `lm` estimate values 94 | 95 | n <- 10 96 | b0 <- rep(0,n) 97 | b1 <- rep(0,n) 98 | b2 <- rep(0,n) 99 | b3 <- rep(0,n) 100 | for(i in 1:n){ 101 | model <- SparkR::glm(y ~ x1 + x2 + x3, data = df) 102 | b0[i] <- unname(summary(model)$coefficients[,1]["(Intercept)"]) 103 | b1[i] <- unname(summary(model)$coefficients[,1]["x1"]) 104 | b2[i] <- unname(summary(model)$coefficients[,1]["x2"]) 105 | b3[i] <- unname(summary(model)$coefficients[,1]["x3"]) 106 | } 107 | 108 | # Prepare parameter estimate lists above as data.frames to pass into ggplot: 109 | b_ests_ <- data.frame(cbind(b0 = unlist(b0), b1 = 
unlist(b1), b2 = unlist(b2), b3 = unlist(b3), Iteration = seq(1, n, by = 1))) 110 | b_ests <- melt(b_ests_, id.vars ="Iteration", measure.vars = c("b0", "b1", "b2", "b3")) 111 | names(b_ests) <- cbind("Iteration", "Variable", "Value") 112 | 113 | 114 | p <- ggplot(data = b_ests, aes(x = Iteration, y = Value, col = Variable), size = 5) + geom_point() + geom_hline(yintercept = unname(coeffs1["(Intercept)"]), linetype = 2) + geom_hline(yintercept = unname(coeffs1["x1"]), linetype = 2) + geom_hline(yintercept = unname(coeffs1["x2"]), linetype = 2) + geom_hline(yintercept = unname(coeffs1["x3"]), linetype = 2) + labs(title = "L.R. Parameters Estimated via L-BFGS") -------------------------------------------------------------------------------- /R/ols2.R: -------------------------------------------------------------------------------- 1 | ############################################################################ 2 | ## Social Science Methodologies: Generalized Linear Models (GLM) Module 2 ## 3 | ############################################################################ 4 | ## Objective: 5 | ## Operations discussed: glm 6 | 7 | library(SparkR) 8 | 9 | ## Initiate SparkContext: 10 | 11 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", 12 | spark.driver.memory="1g", 13 | spark.driver.maxResultSize="1g") 14 | ,sparkPackages="com.databricks:spark-csv_2.11:1.4.0") # Load CSV Spark Package 15 | 16 | ## AWS EMR is using Spark 2.11 so we need the associated version of spark-csv: http://spark-packages.org/package/databricks/spark-csv 17 | ## Define Spark executor memory, as well as driver memory and maxResultSize according to cluster configuration 18 | 19 | ## Initiate SparkRSQL: 20 | 21 | sqlContext <- sparkRSQL.init(sc) 22 | 23 | ## Read in loan performance example data as DataFrame (DF) 'dat': 24 | 25 | dat <- read.df(sqlContext, "s3://sparkr-tutorials/hfpc_ex", header='false', inferSchema='true') 26 | cache(dat) 27 | columns(dat) 28 | ## > columns(dat) 29 | ## [1] "loan_id" "period" "servicer_name" "new_int_rt" "act_endg_upb" "loan_age" 30 | ## [7] "mths_remng" "aj_mths_remng" "dt_matr" "cd_msa" "delq_sts" "flag_mod" 31 | ## [13] "cd_zero_bal" "dt_zero_bal" 32 | 33 | ## 'loan_id' (Loan Identifier): A unique identifier for the mortgage loan 34 | ## 'period' (Monthly Reporting Period): The month and year that pertain to the servicer’s cut-off period for mortgage loan information 35 | ## 'servicer_name' (Servicer Name): the name of the entity that serves as the primary servicer of the mortgage loan 36 | ## 'new_int_rt' (Current Interest Rate): The interest rate on a mortgage loan in effect for the periodic installment due 37 | ## 'act_endg_upb' (Current Actual Unpaid Principal Balance (UPB)): The actual outstanding unpaid principal balance of the mortgage loan (for liquidated loans, the unpaid 38 | ## principal balance of the mortgage loan at the time of liquidation) 39 | ## 'loan_age' (Loan Age): The number of calendar months since the first full month the mortgage loan accrues interest 40 | ## 'mths_remng' (Remaining Months to Maturity): The number of calendar months remaining until the borrower is expected to pay the mortgage loan in full 41 | ## 'aj_mths_remng' (Adjusted Remaining Months To Maturity): the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full 42 | ## 'dt_matr' (Maturity Date): The month and year in which a mortgage loan is scheduled to be paid in full as defined in the mortgage loan documents 43 | ## 'cd_msa' 
(Metropolitan Statistical Area (MSA)): The numeric Metropolitan Statistical Area Code for the property securing the mortgage loan 44 | ## 'delq_sts' (Current Loan Delinquent Status): The number of days, represented in months, the obligor is delinquent as determined by the governing mortgage documents 45 | ## 'flag_mod' (Modification Flag): An indicator that denotes if the mortgage loan has been modified 46 | ## 'cd_zero_bal' (Zero Balance Code): A code indicating the reason the mortgage loan's balance was reduced to zero 47 | ## 'dt_zero_bal' (Zero Balance Effective Date): Date on which the mortgage loan balance was reduced to zero 48 | 49 | ## Print the schema for the DF to see if the data types specifications in the schema make sense, given the variable descriptions above: 50 | 51 | printSchema(dat) 52 | ## > printSchema(dat) 53 | ## root 54 | ## |-- loan_id: long (nullable = true) 55 | ## |-- period: string (nullable = true) # Should be recast as a 'date' 56 | ## |-- servicer_name: string (nullable = true) 57 | ## |-- new_int_rt: double (nullable = true) 58 | ## |-- act_endg_upb: double (nullable = true) 59 | ## |-- loan_age: integer (nullable = true) 60 | ## |-- mths_remng: integer (nullable = true) 61 | ## |-- aj_mths_remng: integer (nullable = true) 62 | ## |-- dt_matr: string (nullable = true) # Should be recast as a 'date' 63 | ## |-- cd_msa: integer (nullable = true) # Should be recast as a 'string' 64 | ## |-- delq_sts: string (nullable = true) 65 | ## |-- flag_mod: string (nullable = true) 66 | ## |-- cd_zero_bal: integer (nullable = true) # Should be recast as a 'string' 67 | ## |-- dt_zero_bal: string (nullable = true) # Should be recast as a 'date' 68 | 69 | ## Preprocessing data: 70 | 71 | # Cast each of the columns noted above into the correct dtype before proceeding with specifying glms 72 | 73 | period_dt <- cast(cast(unix_timestamp(dat$period, 'dd/MM/yyyy'), 'timestamp'), 'date') 74 | dat <- withColumn(dat, 'period_dt', period_dt) # Note that we collapse this into a single step for subsequent casts to date dtype 75 | dat$period <- NULL # Drop string form of period; below, we continue to drop string forms of date dtype columns 76 | 77 | dat <- withColumn(dat, 'matr_dt', cast(cast(unix_timestamp(dat$dt_matr, 'MM/yyyy'), 'timestamp'), 'date')) 78 | dat$dt_matr <- NULL 79 | 80 | dat$cd_msa <- cast(dat$cd_msa, 'string') # We do not need to drop `cd_msa` since we can directly recast this column as a string 81 | 82 | dat$cd_zero_bal <- cast(dat$cd_zero_bal, 'string') 83 | 84 | dat <- withColumn(dat, 'zero_bal_dt', cast(cast(unix_timestamp(dat$dt_zero_bal, 'MM/yyyy'), 'timestamp'), 'date')) 85 | dat$dt_zero_bal <- NULL 86 | 87 | dat$matr_yr <- year(dat$matr_dt) # Extract year of maturity date of loan as an integer in dat DF 88 | dat$zero_bal_yr <- year(dat$zero_bal_dt) # Extract year loan set to 0 as an integer in dat DF 89 | 90 | 91 | 92 | head(dat) 93 | printSchema(dat) # We now have each DF column in the appropriate dtype 94 | 95 | # Drop rows with NAs: 96 | 97 | nrow(dat) 98 | dat_ <- dropna(dat) 99 | nrow(dat_) 100 | dat <- dat_ 101 | rm(dat_) 102 | cache(dat) 103 | 104 | ################################### 105 | ## (1) Fit a Gaussian GLM model: ## 106 | ################################### 107 | 108 | # Fit ordinary linear regression 109 | m1 <- SparkR::glm(act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, data = dat, family = "gaussian") 110 | 111 | 112 | 113 | output <- summary(m1) 114 | coeffs <- output$coefficients[,1] 115 | 116 | # Calculate 
average y value: 117 | act_endg_upb_avg <- collect(agg(dat, act_endg_upb_avg = mean(dat$act_endg_upb)))$act_endg_upb_avg 118 | # Predict fitted values using the DF OLS model -> yields new DF 119 | act_endg_upb_hat <- predict(m1, dat) 120 | cache(act_endg_upb_hat) 121 | head(act_endg_upb_hat) # so you can see what the prediction DF looks like 122 | # Transform the SparkR fitted values DF (act_endg_upb_hat) so that it is easier to read and includes squared residuals and squared totals & extract yhat vector (as new DF) 123 | act_endg_upb_hat <- transform(act_endg_upb_hat, sq_res = (act_endg_upb_hat$act_endg_upb - act_endg_upb_hat$prediction)^2, sq_tot = (act_endg_upb_hat$act_endg_upb - act_endg_upb_avg)^2) 124 | act_endg_upb_hat <- transform(act_endg_upb_hat, act_endg_upb_hat = act_endg_upb_hat$prediction) 125 | head(select(act_endg_upb_hat, "act_endg_upb", "act_endg_upb_hat", "sq_res", "sq_tot")) 126 | head(act_endg_upb_yhat <- select(act_endg_upb_hat, "act_endg_upb_hat")) # Keep the full prediction DF intact so the squared residual and total columns remain available below 127 | 128 | # Compute sum of squared residuals and totals, then use these values to calculate R-squared: 129 | SSR2 <- collect(agg(act_endg_upb_hat, SSR2 = sum(act_endg_upb_hat$sq_res))) ##### Note: produces data.frame - get values out of d.f's in order to calculate aRsq and Rsq 130 | SST2 <- collect(agg(act_endg_upb_hat, SST2 = sum(act_endg_upb_hat$sq_tot))) 131 | Rsq2 <- 1-(SSR2[[1]]/SST2[[1]]) 132 | p <- 5 # Number of regressors in the model 133 | N <- nrow(dat) 134 | aRsq2 <- 1-(((1-Rsq2)*(N-1))/(N-p-1)) 135 | 136 | 137 | n <- 10 138 | b0 <- rep(0,n) 139 | b1 <- rep(0,n) 140 | b2 <- rep(0,n) 141 | b3 <- rep(0,n) 142 | b4 <- rep(0,n) 143 | b5 <- rep(0,n) 144 | for(i in 1:n){ 145 | model <- SparkR::glm(act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, data = dat, family = "gaussian") 146 | b0[i] <- unname(summary(model)$coefficients[,1]["(Intercept)"]) 147 | b1[i] <- unname(summary(model)$coefficients[,1]["new_int_rt"]) 148 | b2[i] <- unname(summary(model)$coefficients[,1]["loan_age"]) 149 | b3[i] <- unname(summary(model)$coefficients[,1]["mths_remng"]) 150 | b4[i] <- unname(summary(model)$coefficients[,1]["matr_yr"]) 151 | b5[i] <- unname(summary(model)$coefficients[,1]["zero_bal_yr"]) 152 | } 153 | 154 | 155 | # Prepare parameter estimate lists above as data.frames to pass into ggplot: 156 | b_ests_ <- data.frame(cbind(b0 = unlist(b0), b1 = unlist(b1), b2 = unlist(b2), b3 = unlist(b3), b4 = unlist(b4), b5 = unlist(b5), Iteration = seq(1, n, by = 1))) 157 | b_ests <- melt(b_ests_, id.vars ="Iteration", measure.vars = c("b0", "b1", "b2", "b3", "b4", "b5")) 158 | names(b_ests) <- cbind("Iteration", "Variable", "Value") 159 | 160 | 161 | p <- ggplot(data = b_ests, aes(x = Iteration, y = Value, col = Variable), size = 5) + geom_point() + geom_hline(yintercept = unname(coeffs["(Intercept)"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["new_int_rt"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["loan_age"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["mths_remng"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["matr_yr"]), linetype = 2) + geom_hline(yintercept = unname(coeffs["zero_bal_yr"]), linetype = 2) + labs(title = "L.R. 
Parameters Estimated via L-BFGS") -------------------------------------------------------------------------------- /R/ols2_SparkR2_test.R: -------------------------------------------------------------------------------- 1 | # Confirm that SPARK_HOME is set in environment: set SPARK_HOME to be equal to "/home/spark" 2 | # if the size of the elements of SPARK_HOME are less than 1: 3 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 4 | Sys.setenv(SPARK_HOME = "/home/spark") 5 | } 6 | 7 | # Load the SparkR package 8 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 9 | 10 | # Call the SparkR session 11 | sparkR.session(sparkPackages="com.databricks:spark-csv_2.10:1.4.0") 12 | 13 | # Load data as DataFrame 14 | dat <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 15 | cache(dat) 16 | 17 | # Recast variables as needed 18 | period_dt <- cast(cast(unix_timestamp(dat$period, 'dd/MM/yyyy'), 'timestamp'), 'date') 19 | dat <- withColumn(dat, 'period_dt', period_dt) # Note that we collapse this into a single step for subsequent casts to date dtype 20 | dat$period <- NULL # Drop string form of period; below, we continue to drop string forms of date dtype columns 21 | 22 | dat <- withColumn(dat, 'matr_dt', cast(cast(unix_timestamp(dat$dt_matr, 'MM/yyyy'), 'timestamp'), 'date')) 23 | dat$dt_matr <- NULL 24 | 25 | dat$cd_msa <- cast(dat$cd_msa, 'string') # We do not need to drop `cd_msa` since we can directly recast this column as a string 26 | 27 | dat$cd_zero_bal <- cast(dat$cd_zero_bal, 'string') 28 | 29 | dat <- withColumn(dat, 'zero_bal_dt', cast(cast(unix_timestamp(dat$dt_zero_bal, 'MM/yyyy'), 'timestamp'), 'date')) 30 | dat$dt_zero_bal <- NULL 31 | 32 | dat$matr_yr <- year(dat$matr_dt) # Extract year of maturity date of loan as an integer in dat DF 33 | dat$zero_bal_yr <- year(dat$zero_bal_dt) 34 | 35 | # Drop nulls 36 | list <- list("act_endg_upb", "new_int_rt", "loan_age", "mths_remng", "matr_yr", "zero_bal_yr") 37 | dat <- dropna(dat, cols = list) 38 | nrow(dat) 39 | 40 | 41 | # Fit Gaussian family GLM with identity link 42 | glm.gauss <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "gaussian") 43 | 44 | # Save model summary and outputs 45 | output <- summary(glm.gauss) 46 | coeffs <- output$coefficients[,1] 47 | 48 | # Calculate average y value: 49 | act_endg_upb_avg <- collect(agg(dat, act_endg_upb_avg = mean(dat$act_endg_upb)))$act_endg_upb_avg 50 | 51 | # Predict fitted values using the DF OLS model -> yields new DF 52 | dat_pred <- predict(glm.gauss, dat) 53 | head(dat_pred) # so you can see what the prediction DF looks like 54 | 55 | # Transform the SparkR fitted values DF (dat_pred) so that it is easier to read and includes squared residuals and squared totals & extract yhat vector (as new DF) 56 | dat_pred <- transform(dat_pred, sq_res = (dat_pred$act_endg_upb - dat_pred$prediction)^2, sq_tot = (dat_pred$act_endg_upb - act_endg_upb_avg)^2) 57 | dat_pred <- transform(dat_pred, act_endg_upb_hat = dat_pred$prediction) 58 | head(dat_pred2 <- select(dat_pred, "act_endg_upb", "act_endg_upb_hat", "sq_res", "sq_tot")) 59 | head(act_endg_upb_hat <- select(dat_pred2, "act_endg_upb_hat")) 60 | 61 | # Compute sum of squared residuals and totals, then use these values to calculate R-squared: 62 | SSR <- collect(agg(dat_pred2, SSR = sum(dat_pred2$sq_res))) ##### Note: produces data.frame - get values out of d.f's in order to calculate aRsq and Rsq 63 | SST <- collect(agg(dat_pred2, SST = 
sum(dat_pred2$sq_tot))) 64 | Rsq2 <- 1-(SSR[[1]]/SST[[1]]) 65 | p <- 5 66 | N <- nrow(dat) 67 | aRsq2 <- Rsq2 - (1 - Rsq2)*((p - 1)/(N - p)) 68 | 69 | # Compare iterations of spark.glm outputs 70 | 71 | n <- 10 72 | b0 <- rep(0,n) 73 | b1 <- rep(0,n) 74 | b2 <- rep(0,n) 75 | b3 <- rep(0,n) 76 | b4 <- rep(0,n) 77 | b5 <- rep(0,n) 78 | for(i in 1:n){ 79 | model <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "gaussian") 80 | b0[i] <- unname(summary(model)$coefficients[,1]["(Intercept)"]) 81 | b1[i] <- unname(summary(model)$coefficients[,1]["new_int_rt"]) 82 | b2[i] <- unname(summary(model)$coefficients[,1]["loan_age"]) 83 | b3[i] <- unname(summary(model)$coefficients[,1]["mths_remng"]) 84 | b4[i] <- unname(summary(model)$coefficients[,1]["matr_yr"]) 85 | b5[i] <- unname(summary(model)$coefficients[,1]["zero_bal_yr"]) 86 | } 87 | 88 | # Prepare parameter estimate lists above as data.frames to pass into ggplot: 89 | library(reshape2) 90 | library(ggplot2) 91 | b_ests_ <- data.frame(cbind(b0 = unlist(b0), b1 = unlist(b1), b2 = unlist(b2), b3 = unlist(b3), b4 = unlist(b4), b5 = unlist(b5), 92 | Iteration = seq(1, n, by = 1))) 93 | b_ests <- melt(b_ests_, id.vars ="Iteration", measure.vars = c("b0", "b1", "b2", "b3", "b4", "b5")) 94 | names(b_ests) <- cbind("Iteration", "Parameter", "Value") 95 | 96 | p <- ggplot(data = b_ests, aes(x = Iteration, y = Value, col = Parameter), size = 5) 97 | p + geom_point() + labs(title = "L.R. Parameters Estimated via OWLQN") 98 | 99 | # Check functionality of other GLM families 100 | 101 | # Create binary response variable: 102 | dat <- mutate(dat, act_endg_upb_large = ifelse(dat$act_endg_upb > 122640, lit(1), lit(0))) 103 | # Create non-negative loan_age column to use as count data for Poisson 104 | dat <- mutate(dat, loan_age_pos = abs(dat$loan_age)) 105 | 106 | # binomial(link = "logit") 107 | glm.logit <- spark.glm(dat, act_endg_upb_large ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "binomial") 108 | # Gamma(link = "inverse") 109 | glm.gamma <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "Gamma") 110 | # inverse.gaussian(link = "1/mu^2") 111 | glm.invgauss <- spark.glm(dat, act_endg_upb ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "inverse.gaussian") 112 | # poisson(link = "log") 113 | glm.poisson <- spark.glm(dat, loan_age_pos ~ act_endg_upb + new_int_rt + mths_remng + matr_yr + zero_bal_yr, family = "poisson") 114 | # quasi(link = "identity", variance = "constant") 115 | glm.quasi <- spark.glm(dat, loan_age_pos ~ act_endg_upb + new_int_rt + mths_remng + matr_yr + zero_bal_yr, family = "quasi") 116 | # quasibinomial(link = "logit") 117 | glm.quasibin <- spark.glm(dat, act_endg_upb_large ~ new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "quasibinomial") 118 | # quasipoisson(link = "log") 119 | glm.quasipoiss <- spark.glm(dat, loan_age_pos ~ act_endg_upb + new_int_rt + mths_remng + matr_yr + zero_bal_yr, family = "quasipoisson") 120 | 121 | summary(glm.logit) 122 | summary(glm.invgamma) 123 | summary(glm.invgauss) 124 | summary(glm.poisson) 125 | summary(glm.quasi) 126 | summary(glm.quasibin) 127 | summary(glm.quasipoiss) 128 | 129 | ## Diamonds data 130 | 131 | dat <- read.df("s3://sparkr-tutorials/diamonds.csv", header = "true", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "") 132 | cache(dat) 133 | 134 | glm.gauss <- spark.glm(dat, act_endg_upb ~ 
new_int_rt + loan_age + mths_remng + matr_yr + zero_bal_yr, family = "gaussian") 135 | 136 | ct1 <- crosstab(dat, "clarity", "color") 137 | ct2 <- crosstab(dat, "clarity", "cut") 138 | ct3 <- crosstab(dat, "color", "cut") 139 | 140 | glm1 <- spark.glm(dat, price ~ carat + clarity, family = "gaussian") 141 | op1 <- summary(glm1) 142 | 143 | glm2 <- spark.glm(dat, price ~ carat + cut, family = "gaussian") 144 | op2 <- summary(glm2) -------------------------------------------------------------------------------- /R/qqnorm_SparkR.R: -------------------------------------------------------------------------------- 1 | ############################################################# 2 | ## qqnorm.SparkR: Normal Probability Plot of the Residuals ## 3 | ############################################################# 4 | # Sarah Armstrong, Urban Institute 5 | # August 23, 2016 6 | # Last Updated: August 24, 2016 7 | 8 | # Summary: Function that returns a quantile-quantile plot of the residual values from linear model, i.e. a plot that fits quantile values of the standardized residuals against those of a standard normal distribution. 9 | 10 | # Inputs: 11 | 12 | # (*) df: a SparkR DF 13 | # (*) residuals: the column name assigned to the residual values (a string); note: the function will standardize these during execution 14 | # (*) qn: the number of quantiles plotted (default is 100) 15 | # (*) error: relativeError value used in the `approxQuantile` SparkR operation 16 | 17 | # Returns: a ggplot object displaying the Q-Q plot, including axis labels and horizontal dashed lines, annotating the extremum values of the standardized residuals 18 | 19 | # p <- qqnorm.SparkR(df = df, residuals = "res", qn = 100, error = 0.0001) 20 | # p + ggtitle("This is a title") 21 | 22 | qqnorm.SparkR <- function(df, residuals, qn = 100, error){ 23 | 24 | resdf <- select(df, residuals) 25 | 26 | sd.res <- collect(agg(resdf, stddev(resdf[[residuals]])))[[1]] 27 | 28 | resdf <- withColumn(resdf, "stdres", resdf[[residuals]] / sd.res) 29 | 30 | probs <- seq(0, 1, length = qn) 31 | 32 | norm_quantiles <- qnorm(probs, mean = 0, sd = 1) 33 | stdres_quantiles <- unlist(approxQuantile(resdf, col = "stdres", probabilities = probs, relativeError = error)) 34 | 35 | dat <- data.frame(sort(norm_quantiles), sort(stdres_quantiles)) 36 | 37 | p_ <- ggplot(dat, aes(norm_quantiles, stdres_quantiles)) 38 | 39 | p <- p_ + geom_point(color = "#FF3333") + geom_abline(intercept = 0, slope = 1) + xlab("Normal Scores") + ylab("Standardized Residuals") + geom_hline(aes(yintercept = min(dat$sort.stdres_quantiles.), linetype = "1st & qnth Quantile Values"), show.legend = TRUE) + geom_hline(yintercept = max(dat$sort.stdres_quantiles.), linetype = "dotted") + scale_linetype_manual(values = c(name = "none", "1st & qnth Quantile Values" = "dotted")) + guides(linetype = guide_legend("")) + theme(legend.position = "bottom") 40 | 41 | return(p) 42 | 43 | } -------------------------------------------------------------------------------- /R/rbind-fill.R: -------------------------------------------------------------------------------- 1 | ######################### 2 | ## rbind.fill Function ## 3 | ######################### 4 | # Sarah Armstrong, Urban Institute 5 | # July 14, 2016 6 | 7 | # Updated: July 28, 2016 8 | 9 | # Summary: Function that allows us to append rows of one SparkR DataFrame (DF) to another, regardless of the column names for each DF. 
The function identifies the outersection of the lists of column names for two (2) DataFrames and adds the missing columns onto one (1) or both of the DataFrames as needed using `withColumn`. The function appends these columns as string dtype, and we can later recast columns as needed. 10 | 11 | # Inputs: x (a DF) and y (another DF) 12 | # Returns: DataFrame 13 | 14 | # Example: 15 | # df3 <- rbind.fill(df1, df2) 16 | # df3$col <- cast(df3$col, dataType = "integer") 17 | 18 | 19 | rbind.fill <- function(x, y) { 20 | 21 | m1 <- ncol(x) 22 | m2 <- ncol(y) 23 | col_x <- colnames(x) 24 | col_y <- colnames(y) 25 | outersect <- function(x, y) {setdiff(union(x, y), intersect(x, y))} 26 | col_outer <- outersect(col_x, col_y) 27 | len <- length(col_outer) 28 | 29 | if (m2 < m1) { 30 | for (j in 1:len){ 31 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 32 | } 33 | } else { 34 | if (m2 > m1) { 35 | for (j in 1:len){ 36 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 37 | } 38 | } 39 | if (m2 == m1 & !identical(col_x, col_y)) { 40 | for (j in 1:len){ 41 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 42 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 43 | } 44 | } else { } 45 | } 46 | x_sort <- x[,sort(colnames(x))] 47 | y_sort <- y[,sort(colnames(y))] 48 | return(SparkR::rbind(x_sort, y_sort)) 49 | } -------------------------------------------------------------------------------- /R/rbind-intersection.R: -------------------------------------------------------------------------------- 1 | ############################## 2 | ## rbind.intersect Function ## 3 | ############################## 4 | # Sarah Armstrong, Urban Institute 5 | # July 14, 2016 6 | 7 | # Summary: Function that allows us to append rows of one SparkR DataFrame (DF) to another, regardless of the column names for each DF. Takes the simple intersection of the lists of column names and performs the `rbind` SparkR operation on two (2) DFs, considering only the column names included in the intersected list of names. 
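# A brief usage sketch (df1 and df2 are hypothetical DFs that share only some of their column names):
# df3 <- rbind.intersect(df1, df2)
# ncol(df3)   # at most min(ncol(df1), ncol(df2)), since only the shared columns are kept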
8 | 9 | # Inputs: x (a DF) and y (another DF) 10 | # Returns: DataFrame 11 | 12 | rbind.intersect <- function(x, y) { 13 | cols <- base::intersect(colnames(x), colnames(y)) 14 | return(SparkR::rbind(x[, sort(cols)], y[, sort(cols)])) 15 | } -------------------------------------------------------------------------------- /R/sparkr-basics-1.R: -------------------------------------------------------------------------------- 1 | ################################################### 2 | ## SparkR Basics I: From CSV to SparkR DataFrame ## 3 | ################################################### 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## June 23, 2016 7 | ## Last Updated: August 15, 2016 8 | 9 | 10 | ## Objective: Become comfortable working with the SparkR DataFrame (DF) API; particularly, understand how to: 11 | 12 | ## * Read a .csv file into SparkR as a DF 13 | ## * Measure dimensions of a DF 14 | ## * Append a DF with additional rows 15 | ## * Rename columns of a DF 16 | ## * Print column names of a DF 17 | ## * Print a specified number of rows from a DF 18 | ## * Print the SparkR schema 19 | ## * Specify schema in `read.df` operation 20 | ## * Manually specify a schema 21 | ## * Change the data type of a column in a DF 22 | ## * Export a DF to AWS S3 as a folder of partitioned parquet files 23 | ## * Export a DF to AWS S3 as a folder of partitioned .csv files 24 | ## * Read a partitioned file from S3 into SparkR 25 | 26 | ## SparkR Operations Discussed: `read.df`, `nrow`, `ncol`, `dim`, `withColumnRenamed`, `columns`, `head`, `str`, `dtypes`, `schema`, `printSchema`, `cast`, `write.df` 27 | 28 | 29 | ## Initiate Spark session: 30 | 31 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 32 | Sys.setenv(SPARK_HOME = "/home/spark") 33 | } 34 | # Load the SparkR library 35 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 36 | # Initiate a SparkR session 37 | sparkR.session() 38 | 39 | 40 | ###################################### 41 | ## (1) Load a csv file into SparkR: ## 42 | ###################################### 43 | 44 | ## Use the operation `read.df` to load in quarterly Fannie Mae single-family loan performance data from the AWS S3 folder `"s3://sparkr-tutorials/"` as a Spark DataFrame (DF). Below, we load a single quarter (2000, Q1) into SparkR, and save it as the DF `perf`: 45 | 46 | perf <- read.df("s3://sparkr-tutorials/Performance_2000Q1.txt", header = "false", delimiter = "|", source = "csv", inferSchema = "true", na.strings = "") 47 | 48 | ## In the `read.df` operation, we give specifications typically included when reading data into Stata and SAS, such as the delimiter character for .csv files. However, we also include SparkR-specific input including `inferSchema`, which Spark uses to interpet data types for each column in the DF. We discuss this in more detail later on in this tutorial. An additional detail is that `read.df` includes the `na.strings = ""` specification because we want `read.df` to read entries of empty strings in our .csv dataset as NA in the SparkR DF, i.e. we are telling read.df to read entries equal to `""` as `NA` in the DF. We will discuss how SparkR handles empty and null entries in further detail in a subsequent tutorial. 49 | 50 | ## Note: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 51 | 52 | ## We can save the dimensions of the 'perf' DF through the following operations. 
Note that wrapping the computation with () forces SparkR/R to print the computed value: 53 | 54 | (n1 <- nrow(perf)) # Save the number of rows in 'perf' 55 | (m1 <- ncol(perf)) # Save the number of columns in 'perf' 56 | 57 | ## Update a DataFrame with new rows of data: 58 | 59 | ## Since we'll want to analyze loan performance data beyond 2000 Q1, we append the `perf` DF below with the data from subsequent quarters of the same single-family loan performance dataset. Here, we're only appending one subsequent quarter (2000 Q2) to the DF so that our analysis in these tutorials runs quickly, but the following code can be easily adapted by specifying the `a` and `b` values to reflect the quarters that we want to append to our DF. Note that the for-loop below also uses the `read.df` operation, specified here just as when we loaded the initial .csv file as a DF: 60 | 61 | a <- 2 62 | b <- 2 63 | 64 | for(q in a:b){ 65 | 66 | filename <- paste0("Performance_2000Q", q) 67 | filepath <- paste0("s3://sparkr-tutorials/", filename, ".txt") 68 | .perf <- read.df(filepath, header = "false", delimiter = "|", 69 | source = "csv", inferSchema = "true", na.strings = "") 70 | 71 | perf <- rbind(perf, .perf) 72 | } 73 | 74 | ## The result of the for-loop is an appended `perf` DF that consists of the same columns as the initial `perf` DF that we read in from S3, but now with many appended rows. We can confirm this by taking the dimensions of the new DF: 75 | 76 | (n2 <- nrow(perf)) 77 | (m2 <- ncol(perf)) 78 | 79 | 80 | ##################################### 81 | ## (2) Rename DataFrame column(s): ## 82 | ##################################### 83 | 84 | ## The `select` operation performs a by column subset of an existing DF. The columns to be returned in the new DF are specified as a list of column name strings in the `select` operation. Here, we create a new DF called `perf_lim` that includes only the first 14 columns in the `perf` DF, i.e. the DF `perf_lim` is a subset of `perf`: 85 | 86 | cols <- c("_C0","_C1","_C2","_C3","_C4","_C5","_C6","_C7","_C8","_C9","_C10","_C11","_C12","_C13") 87 | perf_lim <- select(perf, col = cols) 88 | 89 | ## We will discuss subsetting DataFrames in further detail in the "Subsetting" tutorial. For now, we will use this subsetted DF to learn how to change column names of DataFrames. 90 | 91 | ## Using a for-loop and the SparkR operation `withColumnRenamed`, we rename the columns of `perf_lim`. The operation `withColumnRenamed` renames an existing column, or columns, in a DF and returns a new DF. 
By specifying the "new" DF name as `perf_lim`, however, we simply rename the columns of `perf_lim` (we could create an entirely separate DF with new column names by specifying a different DF name for `withColumnRenamed`): 92 | 93 | old_colnames <- c("_C0","_C1","_C2","_C3","_C4","_C5","_C6","_C7","_C8","_C9","_C10","_C11","_C12","_C13") 94 | new_colnames <- c("loan_id","period","servicer_name","new_int_rt","act_endg_upb","loan_age","mths_remng", 95 | "aj_mths_remng","dt_matr","cd_msa","delq_sts","flag_mod","cd_zero_bal","dt_zero_bal") 96 | 97 | for(i in 1:14){ 98 | perf_lim <- withColumnRenamed(perf_lim, existingCol = old_colnames[i], newCol = new_colnames[i] ) 99 | } 100 | 101 | ## We can check the column names of `perf_lim` with the `columns` operation or with its alias `colnames`: 102 | 103 | columns(perf_lim) 104 | 105 | ## Additionally, we can use the `head` operation to display the first n-many rows of `perf_lim` (here, we'll take the first five (5) rows of the DF): 106 | 107 | head(perf_lim, num = 5) 108 | 109 | ## We can also use the `str` operation to return a compact visualization of the first several rows of a DF: 110 | 111 | str(perf_lim) 112 | 113 | ############################################ 114 | ## (3) Understanding data-types & schema: ## 115 | ############################################ 116 | 117 | ## We can see in the output for the command `head(perf_lim, num = 5)` that we have what appears to be several different data types (dtypes) in our DF. There are three (3) different ways to explicitly view dtype in SparkR - the operations `dtypes`, `schema` and `printSchema`. As stated above, Spark relies on a "schema" to determine what dtype to assign to each column in a DF (which is easy to remember since the English schema comes from the Greek word for shape or plan!). We can print a visual representation of the schema for a DF with the operations `schema` and `printSchema` while the `dtypes` operation prints a list of DF column names and their corresponding dtypes: 118 | 119 | dtypes(perf_lim) # Prints a list of DF column names and corresponding dtypes 120 | schema(perf_lim) # Prints the schema of the DF 121 | printSchema(perf_lim) # Prints the schema of the DF in a concise tree format 122 | 123 | ## Specifying schema in `read.df` operation & defining a custom schema: 124 | 125 | ## Remember that, when we read in our DF from the S3-hosted .csv file, we included the condition `inferSchema = "true"`. This is just one of three (3) ways to communicate to Spark how the dtypes of the DF columns should be assigned. By specifying `inferSchema = "true"` in `read.df`, we allow Spark to infer the dtype of each column in the DF. Conversely, we could specify our own schema and pass this into the load call, forcing Spark to adopt our dtype specifications for each column. Each of these approaches have their pros and cons, which determine when it is appropriate to prefer one over the other: 126 | 127 | ## * `inferSchema = "true"`: This approach minimizes programmer-driven error since we aren't required to make assertions about the dtypes of each column; however, it is comparatively computationally expensive 128 | 129 | ## * `customSchema`: While computationally more efficient, manually specifying a schema will lead to errors if incorrect dtypes are assigned to columns - if Spark is not able to interpret a column as the specified dtype, `read.df` will fill that column in the DF with NA 130 | 131 | ## Clearly, the situations in which these approaches would be helpful are starkly different. 
In the context of this tutorial, an efficient use of both approaches would be to use `inferSchema = "true"` when reading in `perf`. At this point, we could print the schema with `schema` or `printSchema`, note the dtype for each column (all 28 of them), and then write a `customSchema` with the corresponding specifications (or change them from the inferred schema as needed). We could then use this `customSchema` when appending the subsequent quarters to `perf`. While writing the customSchema may be tedious, including it in the appending for-loop would make that process much more efficient - this would be especially useful if we were appending, for example, 20 years' worth of quarterly data together. The third way to communicate to Spark how to define dtypes is to not specify any schema, i.e. to not include `inferSchema` in `read.df`. Under this condition, every column in the DF is read in as a string dtype. Below is an example of how we could specify a customSchema (here, however, we just use the same dtypes as interpreted for `inferSchema = "true"`): 132 | 133 | customSchema <- structType( 134 | structField("loan_id", type = "long"), 135 | structField("period", type = "string"), 136 | structField("servicer_name", type = "string"), 137 | structField("new_int_rt", type = "double"), 138 | structField("act_endg_upb", type = "double"), 139 | structField("loan_age", type = "integer"), 140 | structField("mths_remng", type = "integer"), 141 | structField("aj_mths_remng", type = "integer"), 142 | structField("dt_matr", type = "string"), 143 | structField("cd_msa", type = "integer"), 144 | structField("delq_sts", type = "string"), 145 | structField("flag_mod", type = "string"), 146 | structField("cd_zero_bal", type = "integer"), 147 | structField("dt_zero_bal", type = "string") 148 | ) 149 | 150 | ## Finally, dtypes can be changed after the DF has been created, using the `cast` operation. However, it is clearly more efficient to properly specify dtypes when creating the DF. A quick example of using the `cast` operation is given below: 151 | 152 | # We can see in the results from the previous printSchema output that `loan_id` is a `long` dtype; here we `cast` it 153 | # as a `string` and then call `printSchema` on this new DF 154 | perf_lim$loan_id <- cast(perf_lim$loan_id, dataType = "string") 155 | printSchema(perf_lim) 156 | 157 | # If we want our original `perf_lim` DF, we can simply recast `loan_id` as a `long` dtype 158 | perf_lim$loan_id <- cast(perf_lim$loan_id, dataType = "long") 159 | printSchema(perf_lim) 160 | 161 | 162 | ####################################### 163 | ## (4) Export DF as data file to S3: ## 164 | ####################################### 165 | 166 | ## Throughout this tutorial, we've built the Spark DataFrame `perf_lim` of quarterly loan performance data, which we'll use in several subsequent tutorials. In order to use this DF later on, we must first export it to a location that can handle large data sizes and in a data structure that works with the SparkR environment. We'll save this example data to an AWS S3 folder (`"sparkr-tutorials"`) from which we'll access other example datasets. Below, we save `perf_lim` as a collection of parquet type files into the folder `"hfpc_ex"` using the `write.df` operation: 167 | 168 | write.df(perf_lim, path = "s3://sparkr-tutorials/hfpc_ex", source = "parquet", mode = "overwrite") 169 | 170 | ## When working with the DF `perf_lim` in the analysis above, we were really accessing data that was partitioned across our cluster. 
In order to export this partitioned data, we export each partition from its node (computer) and then collect them into the folder `"hfpc_ex"`. This "file" of indiviudal, partitioned files should be treated like an indiviudal file when organizing an S3 folder, i.e. __do not__ attempt to save other DataFrames or files to this file. SparkR saves the DF in this partitioned structure to accomodate massive data. 171 | 172 | ## Consider the conditions required for us to be able to save a DataFrame as a single .csv file: the given DF would need to be able to fit onto a single node of our cluster, i.e. it would need to be able to fit onto a single computer. Any data that would necessitate using SparkR in analysis will likely not fit onto a single computer. Note that we have specified `mode = "overwrite"`, indicating that existing data in this folder is expected to be overwritten by the contents of this DF (additional mode specifications include `"error"`, `"ignore"` and `"append"`). 173 | 174 | ## The partitioned nature of `"hfpc_ex"` does not affect our ability to load it back into SparkR and perform further analysis. Below, we use the `read.df` to read in the partitioned parquet file from S3 as the DF `dat`: 175 | 176 | dat <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 177 | 178 | ## Below, we confirm that the dimensions and column names of `dat` and `perf_lim` are equal. When comparing DFs, each with a large number of columns, the following if-else statement can be adapted to check equal dimensions and column names across DFs: 179 | 180 | dim1 <- dim(perf_lim) 181 | dim2 <- dim(dat) 182 | if (dim1[1]!=dim2[1] | dim1[2]!=dim2[2]) { 183 | "Error: dimension values not equal; DataFrame did not export correctly" 184 | } else { 185 | "Dimension values are equal" 186 | } 187 | 188 | ## We can also save the DF as a folder of partitioned .csv files with syntax similar to that which we used to export the DF as partitioned parquet files. Note, however, that this does not retain the column names like saving as partitioned parquet files does. The `write.df` expression for exporting the DF as a folder of partitioned .csv files is given below: 189 | 190 | write.df(perf_lim, path = "s3://sparkr-tutorials/hfpc_ex_csv", source = "csv", mode = "overwrite") 191 | 192 | ## We can read in the .csv files as a DF with the following expression: 193 | 194 | dat2 <- read.df("s3://sparkr-tutorials/hfpc_ex_csv", source = "csv", inferSchema = "true") 195 | 196 | ## Note that the DF columns are now given generic names, but we can use the same for-loop from a previous section in this tutorial to rename the columns in our new DF: 197 | 198 | colnames(dat2) 199 | 200 | for(i in 1:14){ 201 | dat2 <- withColumnRenamed(dat2, existingCol = old_colnames[i], newCol = new_colnames[i]) 202 | } 203 | 204 | colnames(dat2) -------------------------------------------------------------------------------- /R/subsetting.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## Subsetting SparkR DataFrames ## 3 | ################################## 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 1, 2016 7 | ## Last Updated: August 17, 2016 8 | 9 | 10 | ## Objective: Now that we understand what a SparkR DataFrame (DF) really is (remember, it's not actually data!) and can write expressions using essential DataFrame operations, such as `agg`, we are ready to start subsetting DFs using more advanced transformation operations. 
This tutorial discusses various ways of subsetting DFs, as well as how to work with a randomly sampled subset as a local data.frame in RStudio: 11 | 12 | ## * Subset a DF by row 13 | ## * Subset a DF by a list of columns 14 | ## * Subset a DF by column expressions 15 | ## * Drop a column from a DF 16 | ## * Subset a DF by taking a random sample 17 | ## * Collect a random sample as a local R data.frame 18 | ## * Export a DF sample as a single .csv file to S3 19 | 20 | ## SparkR/R Operations Discussed: `filter`, `where`, `select`, `sample`, `collect`, `write.table` 21 | 22 | 23 | ## Initiate SparkR session: 24 | 25 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 26 | Sys.setenv(SPARK_HOME = "/home/spark") 27 | } 28 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 29 | sparkR.session() 30 | 31 | ## Read in example HFPC data from AWS S3 as a DataFrame (DF): 32 | 33 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 34 | cache(df) 35 | 36 | 37 | ## Let's check the dimensions our DF `df` and its column names so that we can compare the dimension sizes of `df` with those of the subsets that we will define throughout this tutorial: 38 | nrow(df) 39 | ncol(df) 40 | columns(df) 41 | 42 | ################################## 43 | ## (1) Subset DataFrame by row: ## 44 | ################################## 45 | 46 | ## The SparkR operation `filter` allows us to subset the rows of a DF according to specified conditions. Before we begin working with `filter` to see how it works, let's print the schema of `df` since the types of subsetting conditions we are able to specify depend on the datatype of each column in the DF: 47 | 48 | printSchema(df) 49 | 50 | ## We can subset `df` into a new DF, `f1`, that includes only those loans for which JPMorgan Chase is the servicer with the expression: 51 | 52 | f1 <- filter(df, df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 53 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION") 54 | nrow(f1) 55 | 56 | ## Notice that the `filter` considers normal logical syntax (e.g. logical conditions and operations), making working with the operation very straightforward. We can specify `filter` with SQL statement strings. For example, here we have the preceding example written in SQL statement format: 57 | 58 | filter(df, "servicer_name = 'JP MORGAN CHASE BANK, NA' or servicer_name = 'JPMORGAN CHASE BANK, NA' or 59 | servicer_name = 'JPMORGAN CHASE BANK, NATIONAL ASSOCIATION'") 60 | 61 | ## Or, alternatively, in a syntax similar to how we subset data.frames by row in base R: 62 | 63 | df[df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 64 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",] 65 | 66 | ## Another example of using logical syntax with `filter` is that we can subset `df` such that the new DF only includes those loans for which the servicer name is known, i.e. the column `"servicer_name"` is not equa to an empty string or listed as `"OTHER"`: 67 | 68 | f2 <- filter(df, df$servicer_name != "OTHER" & df$servicer_name != "") 69 | nrow(f2) 70 | 71 | ## Or, if we wanted to only consider observations with a `"loan_age"` value of greater than 60 months (five years), we would evaluate: 72 | 73 | f3 <- filter(df, df$loan_age > 60) 74 | nrow(f3) 75 | 76 | ## An alias for `filter` is `where`, which reads much more intuitively, particularly when `where` is embedded in a complex statement. 
For example, the following expression can be read as "__aggregate__ the mean loan age and count values __by__ `"servicer_name"` in `df` __where__ loan age is less than 60 months": 77 | 78 | f4 <- agg(groupBy(where(df, df$loan_age < 60), where(df, df$loan_age < 60)$servicer_name), 79 | loan_age_avg = avg(where(df, df$loan_age < 60)$loan_age), 80 | count = n(where(df, df$loan_age < 60)$loan_age)) 81 | head(f4) 82 | 83 | ##################################### 84 | ## (2) Subset DataFrame by column: ## 85 | ##################################### 86 | 87 | ## The operation `select` allows us to subset a DF by a specified list of columns. In the expression below, for example, we create a subsetted DF that includes only the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full (remaining maturity) and adjusted remaining maturity: 88 | 89 | s1 <- select(df, "mths_remng", "aj_mths_remng") 90 | ncol(s1) 91 | 92 | ## We can also reference the column names through the DF name, i.e. `select(df, df$mths_remng, df$aj_mths_remng)`. Or, we can save a list of columns as a vector of strings. If we wanted to make a list of all columns that relate to remaining maturity, we could evaluate the expression `remng_mat <- c("mths_remng", "aj_mths_remng")` and then easily reference our list of columns later on with `select(df, remng_mat)`. 93 | 94 | ## Besides subsetting by a list of columns, we can also subset `df` while introducing a new column using a column expression, as we do in the example below. The DF `s2` includes the columns `"mths_remng"` and `"aj_mths_remng"` as in `s1`, but now with a column that lists the absolute value of the difference between the unadjusted and adjusted remaining maturity: 95 | 96 | s2 <- select(df, df$mths_remng, df$aj_mths_remng, abs(df$aj_mths_remng - df$mths_remng)) 97 | ncol(s2) 98 | head(s2) 99 | 100 | ## Note that, just as we can subset by row with syntax similar to that in base R, we can similarly achieve subsetting by column. The following expressions are equivalent: 101 | 102 | select(df, df$period) 103 | df[,"period"] 104 | df[,2] 105 | 106 | ## To simultaneously subset by column and row specifications, you can simply embed a `where` expression in a `select` operation (or vice versa). The following expression creates a DF that lists loan age values only for observations in which servicer name is unknown: 107 | 108 | s3 <- select(where(df, df$servicer_name == "" | df$servicer_name == "OTHER"), "loan_age") 109 | head(s3) 110 | 111 | ## Note that we could have also written the above expression as: 112 | 113 | df[df$servicer_name == "" | df$servicer_name == "OTHER", "loan_age"] 114 | 115 | ################################### 116 | ## (2i) Drop a column from a DF: ## 117 | ################################### 118 | 119 | ## We can drop a column from a DF very simply by assigning `NULL` to a DF column. Below, we drop `"aj_mths_remng"` from `s1`: 120 | 121 | head(s1) 122 | s1$aj_mths_remng <- NULL 123 | head(s1) 124 | 125 | ################################################# 126 | ## (3) Subset a DF by taking a random sample: ### 127 | ################################################# 128 | 129 | ## Perhaps the most useful subsetting operation is `sample`, which returns a randomly sampled subset of a DF. With `sample`, we can specify whether we want to sample with or without replacement, the approximate size of the sample that we want the new DF to call, and whether or not we want to define a random seed. 
If our initial DF is so massive that performing analysis on the entire dataset requires a more expensive cluster, we can: sample the massive dataset, interactively develop our analysis in SparkR using our sample and then evaluate the resulting program using our initial DF, which calls the entire massive dataset, only as is required. This strategy will help us to minimize wasting resources. 130 | 131 | ## Below, we take a random sample of `df` without replacement that is, in size, approximately equal to 1% of `df`. Notice that we must define a random seed in order to be able to reproduce our random sample. 132 | 133 | df_samp1 <- sample(df, withReplacement = FALSE, fraction = 0.01) # Without set seed 134 | df_samp2 <- sample(df, withReplacement = FALSE, fraction = 0.01) 135 | count(df_samp1) 136 | count(df_samp2) 137 | # The row counts are different and, obviously, the DFs are not equivalent 138 | 139 | df_samp3 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) # With set seed 140 | df_samp4 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) 141 | count(df_samp3) 142 | count(df_samp4) 143 | # The row counts are equal and the DFs are equivalent 144 | 145 | ########################################################## 146 | ## (3i) Collect a random sample as a local data.frame: ### 147 | ########################################################## 148 | 149 | ## An additional use of `sample` is to collect a random sample of a massive dataset as a local data.frame in R. This would allow us to work with a sample dataset in a traditional analysis environment that is likely more representative of the population since we are sampling from a larger set of observations than we are normally doing so. This can be achieved by simply using `collect` to create a local data.frame: 150 | 151 | typeof(df_samp4) # DFs are of class S4 152 | dat <- collect(df_samp4) 153 | typeof(dat) 154 | 155 | ## Note that this data.frame is _not_ local to _your_ personal computer, but rather it was gathered locally to a single node in our AWS cluster. 156 | 157 | ######################################################### 158 | ## (3ii) Export DF sample as a single .csv file to S3: ## 159 | ######################################################### 160 | 161 | ## If we want to export the sampled DF from RStudio as a single .csv file that we can work with in any environment, we must first coalesce the rows of `df_samp4` to a single node in our cluster using the `repartition` operation. Then, we can use the `write.df` operation as we did in the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-1.md) tutorial: 162 | 163 | df_samp4_1 <- repartition(df_samp4, numPartitions = 1) 164 | write.df(df_samp4_1, path = "s3://sparkr-tutorials/hfpc_samp.csv", source = "csv", 165 | mode = "overwrite") 166 | 167 | ## __Warning__: We cannot collect a DF as a data.frame, nor can we repartition it to a single node, unless the DF is sufficiently small in size since it must fit onto a _single_ node! 
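## As a rough guard before gathering a sample onto a single node, we can check its row count first (an illustrative sketch; the 1e6-row threshold and the object names `n_samp` and `dat_local` are arbitrary placeholders):

n_samp <- count(df_samp4)
if (n_samp < 1e6) {
  dat_local <- collect(df_samp4)   # small enough to gather onto a single node
} else {
  print("Sample is still large; reduce `fraction` in `sample` before collecting")
}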
-------------------------------------------------------------------------------- /R/summary-statistics.R: -------------------------------------------------------------------------------- 1 | ################################## 2 | ## Summary Statistics in SparkR ## 3 | ################################## 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 8, 2016 7 | ## Last Updated: August 18, 2016 8 | 9 | 10 | ## Objective: Summary statistics and aggregations are essential means of summarizing a set of observations. In this tutorial, we discuss how to compute location, statistical dispersion, distribution and dependence measures of numerical variables in SparkR, as well as methods for examining categorical variables. In particular, we consider how to compute the following measurements and aggregations in SparkR: 11 | 12 | ## Numerical Data 13 | 14 | ## * Measures of location: 15 | ## + Mean 16 | ## + Extract summary statistics as local value 17 | ## * Measures of dispersion: 18 | ## + Range width & limits 19 | ## + Variance 20 | ## + Standard deviation 21 | ## + Quantiles 22 | ## * Measures of distribution shape: 23 | ## + Skewness 24 | ## + Kurtosis 25 | ## * Measures of Dependence: 26 | ## + Covariance 27 | ## + Correlation 28 | 29 | ## Categorical Data 30 | 31 | ## * Frequency table 32 | ## * Relative frequency table 33 | ## * Contingency table 34 | 35 | ## SparkR/R Operations Discussed: `describe`, `collect`, `showDF`, `agg`, `mean`, `typeof`, `min`, `max`, `abs`, `var`, `sd`, `skewness`, `kurtosis`, `cov`, `corr`, `count`, `n`, `groupBy`, `nrow`, `crosstab` 36 | 37 | 38 | ## Initiate SparkR session: 39 | 40 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 41 | Sys.setenv(SPARK_HOME = "/home/spark") 42 | } 43 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 44 | sparkR.session() 45 | 46 | ## Read in example HFPC data from AWS S3 as a DataFrame (DF): 47 | 48 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 49 | cache(df) 50 | 51 | #################### 52 | ## NUMERICAL DATA ## 53 | #################### 54 | #################### 55 | 56 | ## The operation `describe` (or its alias `summary`) creates a new DF that consists of several key aggregations (count, mean, max, min, standard deviation) for a specified DF or list of DF columns (note that columns must be of a numerical datatype). We can either (1) use the action operation `showDF` to print this aggregation DF or (2) save it as a local data.frame with `collect`. Here, we perform both of these actions on the aggregation DF `sumstats_mthsremng`, which returns the aggregations listed above for the column `"mths_remng"` in `df`: 57 | 58 | sumstats_mthsremng <- describe(df, "mths_remng") # Specified list of columns here consists only of "mths_remng" 59 | 60 | showDF(sumstats_mthsremng) # Print the aggregation DF 61 | 62 | sumstats_mthsremng.l <- collect(sumstats_mthsremng) # Collect aggregation DF as a local data.frame 63 | sumstats_mthsremng.l 64 | 65 | ## Note that measuring all five (5) of these aggregations at once can be computationally expensive with a massive data set, particularly if we are interested in only a subset of these measurements. Below, we outline ways to measure these aggregations individually, as well as several other key summary statistics for numerical data. 
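## For instance, a single `agg` pass can return just the measurements we need for `"mths_remng"` (a brief sketch using the individual operations covered in the sections below):

mr_stats <- agg(df, mean = mean(df$mths_remng), std_dev = sd(df$mths_remng), minimum = min(df$mths_remng), maximum = max(df$mths_remng))
showDF(mr_stats)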
66 | 67 | ############################### 68 | ## (1) Measures of Location: ## 69 | ############################### 70 | 71 | ################ 72 | ## (1i) Mean: ## 73 | ################ 74 | 75 | ## The mean is the only measure of central tendency currently supported by SparkR. The operations `mean` and `avg` can be used with the `agg` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial to measure the average of a numerical DF column. Remember that `agg` returns another DF. Therefore, we can either print the DF with `showDF` or we can save the aggregation as a local data.frame. Collecting the DF may be preferred if we want to work with the mean `"mths_remng"` value as a single value in RStudio. 76 | 77 | mths_remng.avg <- agg(df, mean = mean(df$mths_remng)) # Create an aggregation DF 78 | 79 | # DataFrame 80 | showDF(mths_remng.avg) # Print this DF 81 | typeof(mths_remng.avg) # Aggregation DF is of class S4 82 | 83 | # data.frame 84 | mths_remng.avg.l <- collect(mths_remng.avg) # Collect the DF as a local data.frame 85 | (mths_remng.avg.l <- mths_remng.avg.l[,1]) # Overwrite data.frame with numerical mean value (was entry in d.f) 86 | typeof(mths_remng.avg.l) # Object is now of a numerical dtype 87 | 88 | ################################# 89 | ## (2) Measures of dispersion: ## 90 | ################################# 91 | 92 | ################################ 93 | ## (2i) Range width & limits: ## 94 | ################################ 95 | 96 | ## We can also use `agg` to create a DF that lists the minimum and maximum values within a numerical DF column (i.e. the limits of the range of values in the column) and the width of the range. Here, we create compute these values for `"mths_remng"` and print the resulting DF with `showDF`: 97 | 98 | mr_range <- agg(df, minimum = min(df$mths_remng), maximum = max(df$mths_remng), 99 | range_width = abs(max(df$mths_remng) - min(df$mths_remng))) 100 | showDF(mr_range) 101 | 102 | ########################################## 103 | ## (2ii) Variance & standard deviation: ## 104 | ########################################## 105 | 106 | ## Again using `agg`, we compute the variance and standard deviation of `"mths_remng"` with the expressions below. Note that, here, we are computing sample variance and standard deviation (which we could also measure with their respective aliases, `variance` and `stddev`). To measure population variance and standard deviation, we would use `var_pop` and `stddev_pop`, respectively. 107 | 108 | mr_var <- agg(df, variance = var(df$mths_remng)) # Sample variance 109 | showDF(mr_var) 110 | 111 | mr_sd <- agg(df, std_dev = sd(df$mths_remng)) # Sample standard deviation 112 | showDF(mr_sd) 113 | 114 | ################################### 115 | ## (2iii) Approximate Quantiles: ## 116 | ################################### 117 | 118 | ## The operation `approxQuantile` returns approximate quantiles for a DF column. We specify the quantiles to be approximated by the operation as a vector set equal to the `probabilities` parameter, and the acceptable level of error by the `relativeError` paramter. 119 | 120 | ## If the column includes `n` rows, then `approxQuantile` will return a list of quantile values with rank values that are acceptably close to those exact values specified by `probabilities`. 
In particular, the operation assigns approximate rank values such that the computed rank, (`probabilities * n`), falls within the inequality `floor((probabilities - relativeError) * n) <= rank(x) <= ceiling((probabilities + relativeError) * n)`. 121 | 122 | ## Below, we define a new DF, `df_`, that includes only nonmissing values for `"mths_remng"` and then compute approximate Q1, Q2 and Q3 values for `"mths_remng"`: 123 | 124 | df_ <- dropna(df, cols = "mths_remng") 125 | 126 | quartiles_mr <- approxQuantile(x = df_, col = "mths_remng", probabilities = c(0.25, 0.5, 0.75), 127 | relativeError = 0.001) 128 | quartiles_mr 129 | 130 | 131 | ######################################### 132 | ## (3) Measures of distribution shape: ## 133 | ######################################### 134 | 135 | #################### 136 | ## (3i) Skewness: ## 137 | #################### 138 | 139 | ## We can measure the magnitude and direction of skew in the distribution of a numerical DF column by using the operation `skewness` with `agg`, just as we did to measure the `mean`, `variance` and `stddev` of a numerical variable. Below, we measure the `skewness` of `"mths_remng"`: 140 | 141 | mr_sk <- agg(df, skewness = skewness(df$mths_remng)) 142 | showDF(mr_sk) 143 | 144 | ##################### 145 | ## (3ii) Kurtosis: ## 146 | ##################### 147 | 148 | ## Similarly, we can meaure the magnitude of, and how sharp is, the central peak of the distribution of a numerical variable, i.e. the "peakedness" of the distribution, (relative to a standard bell curve) with the `kurtosis` operation. Here, we measure the `kurtosis` of `"mths_remng"`: 149 | 150 | mr_kr <- agg(df, kurtosis = kurtosis(df$mths_remng)) 151 | showDF(mr_kr) 152 | 153 | ################################# 154 | ## (4) Measures of dependence: ## 155 | ################################# 156 | 157 | #################################### 158 | ## (4i) Covariance & correlation: ## 159 | #################################### 160 | 161 | ## The actions `cov` and `corr` return the sample covariance and correlation measures of dependency between two DF columns, respectively. Currently, Pearson is the only supported method for calculating correlation. Here we compute the covariance and correlation of `"loan_age"` and `"mths_remng"`. Note that, in saving the covariance and correlation measures, we are not required to first `collect` locally since `cov` and `corr` return values, rather than DFs: 162 | 163 | cov_la.mr <- cov(df, "loan_age", "mths_remng") 164 | corr_la.mr <- corr(df, "loan_age", "mths_remng", method = "pearson") 165 | cov_la.mr 166 | corr_la.mr 167 | 168 | typeof(cov_la.mr) 169 | typeof(corr_la.mr) 170 | 171 | ###################### 172 | ## CATEGORICAL DATA ## 173 | ###################### 174 | 175 | ## We can compute descriptive statistics for categorical data using (1) the `groupBy` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial and (2) operations native to SparkR for this purpose. 176 | 177 | df$cd_zero_bal <- ifelse(isNull(df$cd_zero_bal), "Unknown", df$cd_zero_bal) 178 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 179 | 180 | ########################## 181 | ## (1) Frequency table: ## 182 | ########################## 183 | 184 | ## To create a frequency table for a categorical variable in SparkR, i.e. 
list the number of observations for each distinct value in a column of strings, we can simply use the `count` transformation with grouped data. Group the data by the categorical variable for which we want to return a frequency table. Here, we create a frequency table for using this approach `"cd_zero_bal"`: 185 | 186 | zb_f <- count(groupBy(df, "cd_zero_bal")) 187 | showDF(zb_f) 188 | 189 | ## We could also embed a grouping into an `agg` operation as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial to achieve the same frequency table DF, i.e. we could evaluate the expression `agg(groupBy(df, df$cd_zero_bal), count = n(df$cd_zero_bal))`. 190 | 191 | ################################### 192 | ## (2) Relative frequency table: ## 193 | ################################### 194 | 195 | ## We could similarly create a DF that consists of a relative frequency table. Here, we reproduce the frequency table from the preceding section, but now including the relative frequency for each distinct string value, labeled `"Percentage"`: 196 | 197 | n <- nrow(df) 198 | zb_rf <- agg(groupBy(df, df$cd_zero_bal), Count = n(df$cd_zero_bal), Percentage = n(df$cd_zero_bal) * (100/n)) 199 | showDF(zb_rf) 200 | 201 | ############################ 202 | ## (3) Contingency table: ## 203 | ############################ 204 | 205 | ## Finally, we can create a contingency table with the operation `crosstab`, which returns a data.frame that consists of a contingency table between two categorical DF columns. Here, we create and print a contingency table for `"servicer_name"` and `"cd_zero_bal"`: 206 | 207 | conting_sn.zb <- crosstab(df, "servicer_name", "cd_zero_bal") 208 | conting_sn.zb -------------------------------------------------------------------------------- /R/time-series-1.R: -------------------------------------------------------------------------------- 1 | ############################################################################ 2 | ## Time Series I: Working with the Date Datatype & Resampling a DataFrame ## 3 | ############################################################################ 4 | 5 | ## Sarah Armstrong, Urban Institute 6 | ## July 8, 2016 7 | ## Last Updated: August 23, 2016 8 | 9 | 10 | ## Objective: In this tutorial, we discuss how to perform several essential time series operations with SparkR. 
In particular, we discuss how to: 11 | 12 | ## * Identify and parse date datatype (dtype) DF columns, 13 | ## * Compute relative dates based on a specified increment of time, 14 | ## * Extract and modify components of a date dtype column and 15 | ## * Resample a time series DF to a particular unit of time frequency 16 | 17 | ## SparkR/R Operations Discussed: `unix_timestamp`, `cast`, `withColumn`, `to_date`, `last_day`, `next_day`, `add_months`, `date_add`, `date_sub`, `weekofyear`, `dayofyear`, `dayofmonth`, `datediff`, `months_between`, `year`, `month`, `hour`, `minute`, `second`, `agg`, `groupBy`, `mean` 18 | 19 | ## Initiate SparkR session: 20 | 21 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 22 | Sys.setenv(SPARK_HOME = "/home/spark") 23 | } 24 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 25 | sparkR.session() 26 | 27 | ## Read in example HFPC data from AWS S3 as a DataFrame (DF): 28 | 29 | df <- read.df("s3://sparkr-tutorials/hfpc_ex", header = "false", inferSchema = "true") 30 | cache(df) 31 | 32 | ######################################################## 33 | ## (1) Converting a DataFrame column to 'date' dtype: ## 34 | ######################################################## 35 | 36 | ## As we saw in previous tutorials, there are several columns in our dataset that list dates which are helpful in determining loan performance. We will specifically consider the following columns throughout this tutorial: 37 | 38 | ## * `"period"` (Monthly Reporting Period): The month and year that pertain to the servicer’s cut-off period for mortgage loan information 39 | ## * `"dt_matr"`(Maturity Date): The month and year in which a mortgage loan is scheduled to be paid in full as defined in the mortgage loan documents 40 | ## * `"dt_zero_bal"`(Zero Balance Effective Date): Date on which the mortgage loan balance was reduced to zero 41 | 42 | ## Let's begin by reviewing the dytypes that `read.df` infers our date columns as. Note that each of our three (3) date columns were read in as strings: 43 | 44 | str(df) 45 | 46 | ## While we could parse the date strings into separate year, month and day integer dtype columns, converting the columns to date dtype allows us to utilize the datetime functions available in SparkR. 47 | 48 | ## We can convert `"period"`, `"matr_dt"` and `"dt_zero_bal"` to date dtype with the following expressions: 49 | 50 | # `period` 51 | period_uts <- unix_timestamp(df$period, 'MM/dd/yyyy') # 1. Gets current Unix timestamp in seconds 52 | period_ts <- cast(period_uts, 'timestamp') # 2. Casts Unix timestamp `period_uts` as timestamp 53 | period_dt <- cast(period_ts, 'date') # 3. Casts timestamp `period_ts` as date dtype 54 | df <- withColumn(df, 'p_dt', period_dt) # 4. Add date dtype column `period_dt` to `df` 55 | 56 | # `dt_matr` 57 | matr_uts <- unix_timestamp(df$dt_matr, 'MM/yyyy') 58 | matr_ts <- cast(matr_uts, 'timestamp') 59 | matr_dt <- cast(matr_ts, 'date') 60 | df <- withColumn(df, 'mtr_dt', matr_dt) 61 | 62 | # `dt_zero_bal` 63 | zero_bal_uts <- unix_timestamp(df$dt_zero_bal, 'MM/yyyy') 64 | zero_bal_ts <- cast(zero_bal_uts, 'timestamp') 65 | zero_bal_dt <- cast(zero_bal_ts, 'date') 66 | df <- withColumn(df, 'zb_dt', zero_bal_dt) 67 | 68 | ## Note that the string entries of these date DF columns are written in the formats `'MM/dd/yyyy'` and `'MM/yyyy'`. 
While SparkR is able to easily read a date string when it is in the default format, `'yyyy-mm-dd'`, additional steps are required for string to date conversions when the DF column entries are in a format other than the default. In order to create `"p_dt"` from `"period"`, for example, we must:
 69 | 
 70 | ## 1. Define the Unix timestamp for the date string, specifying the date format that the string assumes (here, we specify `'MM/dd/yyyy'`),
 71 | ## 2. Use the `cast` operation to convert the Unix timestamp of the string to `'timestamp'` dtype,
 72 | ## 3. Similarly recast the `'timestamp'` form to `'date'` dtype and
 73 | ## 4. Append the new date dtype `"p_dt"` column to `df` using the `withColumn` operation.
 74 | 
 75 | ## We similarly create date dtype columns using `"dt_matr"` and `"dt_zero_bal"`. If the date string entries of these columns were in the default format, converting to date dtype would be straightforward. If `"period"` was in the format `'yyyy-mm-dd'`, for example, we would be able to append `df` with a date dtype column using a simple `withColumn`/`cast` expression: `df <- withColumn(df, 'p_dt', cast(df$period, 'date'))`. We could also directly convert `"period"` to date dtype using the `to_date` operation: `df$period <- to_date(df$period)`.
 76 | 
 77 | ## If we are lucky enough that our date entries are in the default format, then dtype conversion is simple and we should use either the `withColumn`/`cast` or `to_date` expressions given above. Otherwise, the longer conversion process is required. Note that, if we are maintaining our own dataset that we will use SparkR to analyze, adopting the default date format at the start will make working with date values during analysis much easier.
 78 | 
 79 | ## Now that we've appended our date dtype columns to `df`, let's again look at the DF and compare the date dtype values with their associated date string values:
 80 | 
 81 | str(df)
 82 | 
 83 | ## Note that the `"zb_dt"` entries corresponding to the missing date entries in `"dt_zero_bal"`, which were empty strings, are now nulls.
 84 | 
 85 | ################################################################################
 86 | ## (2) Compute relative dates and measures based on a specified unit of time: ##
 87 | ################################################################################
 88 | 
 89 | ## As we mentioned earlier, converting date strings to date dtype allows us to utilize SparkR datetime operations. In this section, we'll discuss several SparkR operations that return:
 90 | 
 91 | ## * Date dtype columns, which list dates relative to a preexisting date column in the DF, and
 92 | ## * Integer or numerical dtype columns, which list measures of time relative to a preexisting date column.
 93 | 
 94 | ## For convenience, we will review these operations using the `df_dt` DF, which includes only the date columns `"p_dt"` and `"mtr_dt"`, which we created in the preceding section:
 95 | 
 96 | cols_dt <- c("p_dt", "mtr_dt")
 97 | df_dt <- select(df, cols_dt)
 98 | 
 99 | ##########################
100 | ## (2i) Relative dates: ##
101 | ##########################
102 | 
103 | ## SparkR datetime operations that return a new date dtype column include:
104 | 
105 | ## * `last_day`: Returns the _last_ day of the month which the given date belongs to (e.g.
inputting "2013-07-27" returns "2013-07-31")
106 | ## * `next_day`: Returns the _first_ date which is later than the value of the date column that is on the specified day of the week
107 | ## * `add_months`: Returns the date that is `'numMonths'` _after_ `'startDate'`
108 | ## * `date_add`: Returns the date that is `'days'` days _after_ `'start'`
109 | ## * `date_sub`: Returns the date that is `'days'` days _before_ `'start'`
110 | 
111 | ## Below, we create relative date columns (defining `"p_dt"` as the input date) using each of these operations and `withColumn`:
112 | 
113 | df_dt1 <- withColumn(df_dt, 'p_ld', last_day(df_dt$p_dt))
114 | df_dt1 <- withColumn(df_dt1, 'p_nd', next_day(df_dt$p_dt, "Sunday"))
115 | df_dt1 <- withColumn(df_dt1, 'p_addm', add_months(df_dt$p_dt, 1)) # 'startDate'="p_dt", 'numMonths'=1
116 | df_dt1 <- withColumn(df_dt1, 'p_dtadd', date_add(df_dt$p_dt, 1)) # 'start'="p_dt", 'days'=1
117 | df_dt1 <- withColumn(df_dt1, 'p_dtsub', date_sub(df_dt$p_dt, 1)) # 'start'="p_dt", 'days'=1
118 | str(df_dt1)
119 | 
120 | ######################################
121 | ## (2ii) Relative measures of time: ##
122 | ######################################
123 | 
124 | ## SparkR datetime operations that return integer or numerical dtype columns include:
125 | 
126 | ## * `weekofyear`: Extracts the week number as an integer from a given date
127 | ## * `dayofyear`: Extracts the day of the year as an integer from a given date
128 | ## * `dayofmonth`: Extracts the day of the month as an integer from a given date
129 | ## * `datediff`: Returns the number of days from 'start' to 'end'
130 | ## * `months_between`: Returns the number of months between dates 'date1' and 'date2'
131 | 
132 | ## Here, we use `"p_dt"` and `"mtr_dt"` as inputs in the above operations. We again use `withColumn` to append the new columns to a DF:
133 | 
134 | df_dt2 <- withColumn(df_dt, 'p_woy', weekofyear(df_dt$p_dt))
135 | df_dt2 <- withColumn(df_dt2, 'p_doy', dayofyear(df_dt$p_dt))
136 | df_dt2 <- withColumn(df_dt2, 'p_dom', dayofmonth(df_dt$p_dt))
137 | df_dt2 <- withColumn(df_dt2, 'mbtw_p.mtr', months_between(df_dt$mtr_dt, df_dt$p_dt)) # 'date1'=mtr_dt, 'date2'=p_dt
138 | df_dt2 <- withColumn(df_dt2, 'dbtw_p.mtr', datediff(df_dt$mtr_dt, df_dt$p_dt)) # 'start'=p_dt, 'end'=mtr_dt
139 | str(df_dt2)
140 | 
141 | ## Note that operations that consider two different dates are sensitive to how we specify column ordering in the operation expression. For example, if we incorrectly define `"p_dt"` as `date1` and `"mtr_dt"` as `date2`, `"mbtw_p.mtr"` will consist of negative values. Similarly, `datediff` will return negative values if `start` and `end` are misspecified.
142 | 
143 | ######################################################################
144 | ## (3) Extract components of a date dtype column as integer values: ##
145 | ######################################################################
146 | 
147 | ## There are also datetime operations supported by SparkR that allow us to extract individual components of a date dtype column and return these as integers. Below, we use the `year` and `month` operations to create integer dtype columns for each of our date columns. Similar functions include `hour`, `minute` and `second`.
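## The `hour`, `minute` and `second` operations follow the same pattern but are meant for timestamp dtype columns. (The commented snippet below is an added illustrative sketch, not part of the original script; the names `p_ts`, `p_hour` and `p_min` are ours.) Because our date columns carry no time-of-day component, the extracted values would simply be zero, so the snippet only shows the syntax:

# p_ts <- cast(unix_timestamp(df$period, 'MM/dd/yyyy'), 'timestamp')
# df <- withColumn(df, 'p_hour', hour(p_ts))
# df <- withColumn(df, 'p_min', minute(p_ts))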
148 | 
149 | # Year and month values for `"period_dt"`
150 | df <- withColumn(df, 'p_yr', year(df$p_dt))
151 | df <- withColumn(df, "p_m", month(df$p_dt))
152 | 
153 | # Year and month values for `"matr_dt"`
154 | df <- withColumn(df, 'mtr_yr', year(df$mtr_dt))
155 | df <- withColumn(df, "mtr_m", month(df$mtr_dt))
156 | 
157 | # Year and month values for `"zero_bal_dt"`
158 | df <- withColumn(df, 'zb_yr', year(df$zb_dt))
159 | df <- withColumn(df, "zb_m", month(df$zb_dt))
160 | 
161 | ## We can see that each of the above expressions returns a column of integer values representing the requested date value:
162 | 
163 | str(df)
164 | 
165 | ## Note that the `NA` entries of `"zb_dt"` result in `NA` values for `"zb_yr"` and `"zb_m"`.
166 | 
167 | ###########################################################################
168 | ## (4) Resample a time series DF to a particular unit of time frequency: ##
169 | ###########################################################################
170 | 
171 | ## When working with time series data, we are frequently required to resample data to a different time frequency. Combining the `agg` and `groupBy` operations, as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial, is a convenient strategy for accomplishing this in SparkR. We create a new DF, `dat`, that only includes columns of numerical, integer and date dtype to use in our resampling examples:
172 | 
173 | rm(df_dt)
174 | rm(df_dt1)
175 | rm(df_dt2)
176 | 
177 | cols <- c("p_yr", "p_m", "mtr_yr", "mtr_m", "zb_yr", "zb_m", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng", "aj_mths_remng")
178 | dat <- select(df, cols)
179 | 
180 | unpersist(df)
181 | cache(dat)
182 | 
183 | head(dat)
184 | 
185 | ## Note that, in our loan-level data, each row represents a unique loan (each made distinct by the `"loan_id"` column in `df`) and its corresponding characteristics such as `"loan_age"` and `"mths_remng"`. Note that `dat` is simply a subset of `df` and, therefore, also refers to loan-level data.
186 | 
187 | ## While we can resample the data over distinct values of any of the columns in `dat`, we will resample the loan-level data as aggregations of the DF columns by units of time since we are working with time series data. Below, we aggregate the columns of `dat` (taking the mean of the column entries) by `"p_yr"`, and then by `"p_yr"` and `"p_m"`:
188 | 
189 | # Resample by "p_yr"
190 | dat1 <- agg(groupBy(dat, dat$p_yr), p_m = mean(dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 
191 |             new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 
192 |             mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng))
193 | head(dat1)
194 | 
195 | # Resample by "p_yr" and "p_m"
196 | dat2 <- agg(groupBy(dat, dat$p_yr, dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 
197 |             new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 
198 |             mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng))
199 | head(arrange(dat2, dat2$p_yr, dat2$p_m), 15) # Arrange the first 15 rows of `dat2` by ascending `p_yr` and `p_m` values
200 | 
201 | ## Note that we specify the list of DF columns that we want to resample on by including it in `groupBy`. Here, we aggregated by taking the mean of each column.
However, we could use any of the aggregation functions that `agg` is able to interpret (listed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/sparkr-basics-2.md) tutorial) and that is in line with the resampling that we are trying to achieve.
202 | 
203 | ## We could resample to any unit of time that we can extract from a date column, e.g. `year`, `month`, `day`, `hour`, `minute`, `second`. Furthermore, we could have skipped the step of creating separate year- and month-level date columns - instead, we could have embedded the datetime functions directly in the `agg` expression. The following expression creates a DF that is equivalent to `dat1` in the preceding example:
204 | 
205 | df2 <- agg(groupBy(df, year(df$p_dt)), p_m = mean(month(df$p_dt)), mtr_yr = mean(year(df$mtr_dt)), 
206 |            zb_yr = mean(year(df$zb_dt)), new_int_rt = mean(df$new_int_rt), act_endg_upb = mean(df$act_endg_upb), 
207 |            loan_age = mean(df$loan_age), mths_remng = mean(df$mths_remng), aj_mths_remng = mean(df$aj_mths_remng))
--------------------------------------------------------------------------------
/R/visualizations.R:
--------------------------------------------------------------------------------
 1 | install.packages('devtools')
 2 | library(devtools)
 3 | library(SparkR)
 4 | devtools::install_github("SKKU-SKT/ggplot2.SparkR")
 5 | library(ggplot2.SparkR)
 6 | library(ggplot2)
 7 | 
 8 | sc <- sparkR.init(sparkEnvir=list(spark.executor.memory="2g", spark.driver.memory="1g", spark.driver.maxResultSize="1g"), sparkPackages="com.databricks:spark-csv_2.11:1.4.0")
 9 | sqlContext <- sparkRSQL.init(sc)
10 | 
11 | # Throughout this tutorial, we will use the diamonds data that is included in the `ggplot2` package and is frequently used in `ggplot2` examples. The data consists of prices and quality information about 54,000 diamonds. The data contains the four C's of diamond quality, carat, cut, colour and clarity; and five physical measurements, depth, table, x, y and z.
12 | 
13 | df <- read.df(sqlContext, "s3://ui-spark-data/diamonds.csv", header='true', delimiter=",", source="com.databricks.spark.csv", inferSchema='true', nullValue="")
14 | cache(df)
15 | 
16 | # We can see what the data set looks like using the `str` operation:
17 | 
18 | str(df)
19 | 
20 | # Introduced in the spring of 2016, the SparkR extension of Hadley Wickham's `ggplot2` package, `ggplot2.SparkR`, allows SparkR users to build ggplot-type visualizations by specifying a SparkR DataFrame and DF columns in ggplot expressions identical to how we would specify R data.frame components when using the `ggplot2` package, i.e. the extension package allows SparkR users to implement ggplot without having to modify the SparkR DataFrame API.
21 | 
22 | 
23 | # As of the publication date of this tutorial (first version), the `ggplot2.SparkR` package is still nascent and has identifiable bugs. However, we provide `ggplot2.SparkR` in this example for its ease of use, particularly for SparkR users wanting to build basic plots. We alternatively discuss how a SparkR user may develop their own plotting function and provide an example in which we plot a bivariate histogram.
24 | 
25 | # The description of the `diamonds` data given above was taken from http://ggplot2.org/book/qplot.pdf.
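# (Added sketch, not part of the original tutorial.) When `ggplot2.SparkR` is unavailable, or a plot type is unsupported, a simple fallback is to aggregate in SparkR and pass the small collected result to plain `ggplot2`. The object name `cut_counts` below is our own; the `count` column is produced by SparkR's `count`/`groupBy`:

cut_counts <- collect(count(groupBy(df, "cut")))   # tiny data.frame: one row per cut level
ggplot(cut_counts, aes(x = cut, y = count)) + geom_bar(stat = "identity")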
26 | 27 | ################## 28 | ### Bar graph: ### 29 | ################## 30 | 31 | # geom_bar(mapping = NULL, data = NULL, stat = "count", position = "stack", ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 32 | 33 | # Just as we would when using `ggplot2`, the following expression plots a basic bar graph that gives frequency counts across the different levels of `"cut"` quality in the data: 34 | 35 | p1 <- ggplot(df, aes(x = cut)) 36 | p1 + geom_bar() 37 | 38 | ##### Stacked & proportional bar graphs 39 | 40 | # One recognized bug within `ggplot2.SparkR` is that, when specifying a `fill` value, none of the `position` specifications--`"stack"`, `"fill"` nor `"dodge"`--necessarily return plots with constant factor-level ordering across groups. For example, the following expression successfully returns a bar graph that gives frequency counts of `"clarity"` levels (string dtype), grouped over diamond `"cut"` types (also string dtype). Note, however, that the varied color blocks representing `"clarity"` levels are not ordered similarly across different levels of `"cut"`. The same issue results when we specify either of the other two (2) `position` specifications: 41 | 42 | p2 <- ggplot(df, aes(x = cut, fill = clarity)) 43 | p2 + geom_bar(position = "stack") 44 | 45 | ################## 46 | ### Histogram: ### 47 | ################## 48 | 49 | # geom_histogram(mapping = NULL, data = NULL, stat = "bin", position = "stack", ..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 50 | 51 | # Just as we would when using `ggplot2`, the following expression plots a histogram that gives frequency counts across binned `"price"` values in the data: 52 | 53 | p3 <- ggplot(df, aes(price)) 54 | p3 + geom_histogram() 55 | 56 | # The preceding histogram plot assumes the `ggplot2` default, `bins = 30`, but we can change this value or override the `bins` specification by setting a `binwidth` value as we do in the following examples: 57 | 58 | p3 + geom_histogram(binwidth = 250) 59 | p3 + geom_histogram(bins = 50) 60 | 61 | # Weighted histogram: 62 | 63 | # ggplot(df, aes(cut)) + geom_histogram(aes(weight = price)) + ylab("total value") NOT available in `ggplot2.SparkR` 64 | 65 | # Stacked histograms: 66 | 67 | # ggplot(df, aes(price, fill = cut)) + geom_histogram() # NOT available in `ggplot2.SparkR` 68 | # ggplot(df, aes(price, fill = cut)) + geom_histogram(position = "fill") 69 | 70 | 71 | ########################### 72 | ### Frequency Polygons: ### 73 | ########################### 74 | 75 | # geom_freqpoly(mapping = NULL, data = NULL, stat = "bin", position = "identity", ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 76 | 77 | # Frequency polygons provide a visual alternative to histogram plots (note that they describe equivalent aggregations). We can also fit frequency polygons with `ggplot2` syntax - the following expression returns a frequency polygon that is equivalent to the first histogram plotted in the preceding section: 78 | 79 | p3 + geom_freqpoly() 80 | 81 | # Again, we can change the class intervals by specifying `binwidth` or the number of `bins` for the frequency polygon: 82 | 83 | p3 + geom_freqpoly(binwidth = 250) 84 | p3 + geom_freqpoly(bins = 50) 85 | 86 | # Frequency polygons over grouped data are perhaps more easily interpreted than stacked histograms; the following is equivalent to the preceding stacked histogram. 
Note that we specify `"cut"` as `colour`, rather than `fill` as we did when using `geom_histogram`: 87 | 88 | # ggplot(df, aes(price, colour = cut)) + geom_freqpoly() NOT currently supported by `ggplot2.SparkR` 89 | 90 | ################################################################# 91 | ### Dealing with overplotting in scatterplot using `stat_sum` ### 92 | ################################################################# 93 | 94 | # stat_sum(mapping = NULL, data = NULL, geom = "point", position = "identity", ...) 95 | 96 | # NOT supported by `ggplot2.SparkR` 97 | 98 | ################ 99 | ### Boxplot: ### 100 | ################ 101 | 102 | # Finally, we can create boxplots just as we would in `ggplot2`. The following expression gives a boxplot of `"price"` values across levels of `"clarity"`: 103 | 104 | p4 <- ggplot(df, aes(x = clarity, y = price)) 105 | p4 + geom_boxplot() 106 | 107 | ################################################## 108 | ### Additional `ggplot2.SparkR` functionality: ### 109 | ################################################## 110 | 111 | # We can adapt the plot types discussed in the previous sections with the specifications given below: 112 | 113 | #+ Facets: `facet_grid`, `facet_wrap` and `facet_null` (default) 114 | #+ Coordinate systems: `coord_cartesian` and `coord_flip` 115 | #+ Position adjustments: `position_dodge`, `position_fill`, `position_stack` (as seen in previous example) 116 | #+ Scales: `scale_x_log10`, `scale_y_log10`, `labs`, `xlab`, `ylab`, `xlim` and `ylim` 117 | 118 | # For example, the following expression facets our previous histogram example across the different levels of `"cut"` quality: 119 | 120 | p3 + geom_histogram() + facet_wrap(~cut) 121 | 122 | ################################################################## 123 | ### Functionality gaps between `ggplot2` and SparkR extension: ### 124 | ################################################################## 125 | 126 | # Below, we list several operations supported by `ggplot2` that are not currently supported by its SparkR extension package. The list is not exhaustive and is subject to change as the package continues to be developed: 127 | 128 | #+ Weighted bar graph (i.e. specify `weight` in aesthetic) 129 | #+ Weighted histogram 130 | #+ Strictly ordered layers for filled and stacked bar graphs (as we saw in an earlier example) 131 | #+ Stacked or filled histograms 132 | #+ Layer frequency polygon (i.e specify `colour` in aesthetic) 133 | #+ Density plot using `geom_freqpoly` by specifying `y = ..density..` in aesthetic (note that extension package does not support `geom_density`) 134 | 135 | ############################ 136 | ### Bivariate histogram: ### 137 | ############################ 138 | 139 | # In the previous examples, we relied on the `ggplot2.SparkR` package to build plots from DataFrames using syntax identical to that which we would use in a normal application of `ggplot2` on R data.frames. Given the current limitations of the extension package, we may need to develop our own function if we are interested in building a plot type that is not currently supported by `ggplot2.SparkR`. Here, we provide an example of a function that returns a bivariate histogram of two numerical DataFrame columns. 140 | 141 | # When building a function in SparkR, we want to avoid operations that are computationally expensive and building one that returns a plot is no different. 
One of the most expensive operations in SparkR, `collect`, is of particular interest when building functions that return plots since collecting data locally allows us to leverage graphing tools that we use in traditional frameworks, e.g. `ggplot2`. We should `collect` data as infrequently as possible since the operation is highly memory-intensive. In the following function, we `collect` data five (5) times. Four of the times, we are collecting single values (two minimum and two maximum values), which does not use up a huge amount of memory. The last `collect` that we perform, collects a data.frame with three (3) columns and a row for each bin assignment pairing, which can fit in-memory on a single node (assuming we don't specify a massive value for `nbins`). When developing SparkR functions, we should only perform minor collections like the ones discussed. 142 | 143 | geom_bivar_histogram.SparkR <- function(df, x, y, nbins){ 144 | 145 | library(ggplot2) 146 | 147 | x_min <- collect(agg(df, min(df[[x]]))) # Collect 148 | x_max <- collect(agg(df, max(df[[x]]))) # Collect 149 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 150 | 151 | y_min <- collect(agg(df, min(df[[y]]))) # Collect 152 | y_max <- collect(agg(df, max(df[[y]]))) # Collect 153 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 154 | 155 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 156 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 157 | 158 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 159 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 160 | 161 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 162 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 163 | 164 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) # Collect 165 | 166 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = count)) + geom_tile() 167 | 168 | return(p) 169 | } 170 | 171 | # Here, we evaluate the `geom_bivar_histogram.SparkR` function using `"carat"` and `"price"`: 172 | 173 | p5 <- geom_bivar_histogram.SparkR(df = df, x = "carat", y = "price", nbins = 100) 174 | p5 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + ylab("Price") 175 | 176 | # _Note_: Documentation for the `geom_bivar_histogram.SparkR` function is given here: 177 | 178 | # Note that the plot closely resembles a scatterplot. Bivariate histograms are one strategy for mitigating overplotting that often occurs when attempting to visualize massive data sets. Furthermore, it is sometimes impossible to gather the data necessary to map individual points to a scatterplot onto a single node within our cluster - this is when aggregation becomes necessary rather than simply preferable. Just like plotting a univariate histogram, binning data reduces the number of points to plot and, with the appropriate choice of bin number and color scale, bivariate histograms can provide an intuitive alternative to scatterplots when working with massive data sets. 179 | 180 | # For example, the following function is equivalent to our previous one, but we have changed the `fill` specification that partially determines the color scale from `count` to `log10(count)`. Then, we evaluate the new function with a larger `nbins` value, returning a new plot with more granular binning and a more nuanced color scale (since the breaks in the color scale are now log10-spaced). 
181 | 182 | geom_bivar_histogram.SparkR.log10 <- function(df, x, y, nbins){ 183 | 184 | library(ggplot2) 185 | 186 | x_min <- collect(agg(df, min(df[[x]]))) 187 | x_max <- collect(agg(df, max(df[[x]]))) 188 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 189 | 190 | y_min <- collect(agg(df, min(df[[y]]))) 191 | y_max <- collect(agg(df, max(df[[y]]))) 192 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 193 | 194 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 195 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 196 | 197 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 198 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 199 | 200 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 201 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 202 | 203 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) 204 | 205 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = log10(count))) + geom_tile() 206 | 207 | return(p) 208 | } 209 | 210 | p6 <- geom_bivar_histogram.SparkR.log10(df = df, x = "carat", y = "price", nbins = 250) 211 | p6 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + ylab("Price") -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-10-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-10-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-11-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-11-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-25-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-25-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-27-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-27-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-29-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-29-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- 
/glm_files/figure-html/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /glm_files/figure-html/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/glm_files/figure-html/unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /rmd/03_subsetting.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Subsetting SparkR DataFrames' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 1, 2016" 5 | output: 6 | html_document: 7 | keep_md: TRUE 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 15 | 16 | 17 | **Objective**: Now that we understand what a SparkR DataFrame (DF) really is (remember, it's not actually data!) and can write expressions using essential DataFrame operations, such as `agg`, we are ready to start subsetting DFs using more advanced transformation operations. This tutorial discusses various ways of subsetting DFs, as well as how to work with a randomly sampled subset as a local data.frame in RStudio: 18 | 19 | * Subset a DF by row 20 | * Subset a DF by a list of columns 21 | * Subset a DF by column expressions 22 | * Drop a column from a DF 23 | * Subset a DF by taking a random sample 24 | * Collect a random sample as a local R data.frame 25 | * Export a DF sample as a single .csv file to S3 26 | 27 | **SparkR/R Operations Discussed**: `filter`, `where`, `select`, `sample`, `collect`, `write.table` 28 | 29 | *** 30 | 31 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 32 | 33 | ```{r, include=FALSE} 34 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 35 | Sys.setenv(SPARK_HOME = "/home/spark") 36 | } 37 | 38 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 39 | 40 | sparkR.session() 41 | ``` 42 | 43 | The following error indicates that you have not initiated a SparkR session: 44 | 45 | ```{r, eval=FALSE} 46 | Error in getSparkSession() : SparkSession not initialized 47 | ``` 48 | 49 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 50 | 51 | *** 52 | 53 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. Note that we are __persisting__ the DataFrame since we will use it throughout this tutorial. 
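As a side note (a sketch added here, not required for the tutorial): `cache(df)`, used in the next chunk, persists the DF at a default storage level. If memory is constrained, the storage level can be set explicitly with `persist` and released later with `unpersist`:

```{r, eval=FALSE}
persist(df, "MEMORY_AND_DISK")  # assumes `df` has been created, as in the chunk below
# ... work with df ...
unpersist(df)
```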
54 | 55 | ```{r, message=F, warning=F, results='hide'} 56 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 57 | header = "false", 58 | inferSchema = "true", 59 | na.strings = "") 60 | cache(df) 61 | ``` 62 | 63 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 64 | 65 | Let's check the dimensions our DF `df` and its column names so that we can compare the dimension sizes of `df` with those of the subsets that we will define throughout this tutorial: 66 | 67 | ```{r, collapse=TRUE} 68 | nrow(df) 69 | ncol(df) 70 | columns(df) 71 | ``` 72 | 73 | *** 74 | 75 | 76 | ### Subset DataFrame by row: 77 | 78 | The SparkR operation `filter` allows us to subset the rows of a DF according to specified conditions. Before we begin working with `filter` to see how it works, let's print the schema of `df` since the types of subsetting conditions we are able to specify depend on the datatype of each column in the DF: 79 | 80 | ```{r, collapse=TRUE} 81 | printSchema(df) 82 | ``` 83 | 84 | We can subset `df` into a new DF, `f1`, that includes only those loans for which JPMorgan Chase is the servicer with the expression: 85 | 86 | ```{r, collapse=TRUE} 87 | f1 <- filter(df, df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 88 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION") 89 | nrow(f1) 90 | ``` 91 | 92 | Notice that the `filter` considers normal logical syntax (e.g. logical conditions and operations), making working with the operation very straightforward. We can specify `filter` with SQL statement strings. For example, here we have the preceding example written in SQL statement format: 93 | 94 | ```{r, collapse=TRUE, eval=FALSE} 95 | filter(df, "servicer_name = 'JP MORGAN CHASE BANK, NA' or servicer_name = 'JPMORGAN CHASE BANK, NA' or 96 | servicer_name = 'JPMORGAN CHASE BANK, NATIONAL ASSOCIATION'") 97 | ``` 98 | 99 | Or, alternatively, in a syntax similar to how we subset data.frames by row in base R: 100 | 101 | ```{r, collapse=TRUE, eval=FALSE} 102 | df[df$servicer_name == "JP MORGAN CHASE BANK, NA" | df$servicer_name == "JPMORGAN CHASE BANK, NA" | 103 | df$servicer_name == "JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",] 104 | ``` 105 | 106 | Another example of using logical syntax with `filter` is that we can subset `df` such that the new DF only includes those loans for which the servicer name is known, i.e. the column `"servicer_name"` is not equa to an empty string or listed as `"OTHER"`: 107 | 108 | ```{r, collapse=TRUE} 109 | f2 <- filter(df, df$servicer_name != "OTHER" & df$servicer_name != "") 110 | nrow(f2) 111 | ``` 112 | 113 | Or, if we wanted to only consider observations with a `"loan_age"` value of greater than 60 months (five years), we would evaluate: 114 | 115 | ```{r, collapse=TRUE} 116 | f3 <- filter(df, df$loan_age > 60) 117 | nrow(f3) 118 | ``` 119 | 120 | An alias for `filter` is `where`, which reads much more intuitively, particularly when `where` is embedded in a complex statement. 
For example, the following expression can be read as "__aggregate__ the mean loan age and count values __by__ `"servicer_name"` in `df` __where__ loan age is less than 60 months": 121 | 122 | ```{r, collapse=TRUE} 123 | f4 <- agg(groupBy(where(df, df$loan_age < 60), where(df, df$loan_age < 60)$servicer_name), 124 | loan_age_avg = avg(where(df, df$loan_age < 60)$loan_age), 125 | count = n(where(df, df$loan_age < 60)$loan_age)) 126 | head(f4) 127 | ``` 128 | 129 | *** 130 | 131 | 132 | ### Subset DataFrame by column: 133 | 134 | The operation `select` allows us to subset a DF by a specified list of columns. In the expression below, for example, we create a subsetted DF that includes only the number of calendar months remaining until the borrower is expected to pay the mortgage loan in full (remaining maturity) and adjusted remaining maturity: 135 | 136 | ```{r, collapse=TRUE} 137 | s1 <- select(df, "mths_remng", "aj_mths_remng") 138 | ncol(s1) 139 | ``` 140 | 141 | We can also reference the column names through the DF name, i.e. `select(df, df$mths_remng, df$aj_mths_remng)`. Or, we can save a list of columns as a combination of strings. If we wanted to make a list of all columns that relate to remaining maturity, we could evaluate the expression `remng_mat <- c("mths_remng", "aj_mths_remng")` and then easily reference our list of columns later on with `select(df, remng_mat)`. 142 | 143 | 144 | Besides subsetting by a list of columns, we can also subset `df` while introducing a new column using a column expression, as we do in the example below. The DF `s2` includes the columns `"mths_remng"` and `"aj_mths_remng"` as in `s1`, but now with a column that lists the absolute value of the difference between the unadjusted and adjusted remaining maturity: 145 | 146 | ```{r, collapse=TRUE} 147 | s2 <- select(df, df$mths_remng, df$aj_mths_remng, abs(df$aj_mths_remng - df$mths_remng)) 148 | ncol(s2) 149 | head(s2) 150 | ``` 151 | 152 | Note that, just as we can subset by row with syntax similar to that in base R, we can similarly acheive subsetting by column. The following expressions are equivalent: 153 | 154 | ```{r, collapse=TRUE, eval=FALSE} 155 | select(df, df$period) 156 | df[,"period"] 157 | df[,2] 158 | ``` 159 | 160 | To simultaneously subset by column and row specifications, you can simply embed a `where` expression in a `select` operation (or vice versa). The following expression creates a DF that lists loan age values only for observations in which servicer name is unknown: 161 | 162 | ```{r, collapse=TRUE} 163 | s3 <- select(where(df, df$servicer_name == "" | df$servicer_name == "OTHER"), "loan_age") 164 | head(s3) 165 | ``` 166 | 167 | Note that we could have also written the above expression as `df[df$servicer_name == "" | df$servicer_name == "OTHER", "loan_age"]`. 168 | 169 | 170 | #### Drop a column from a DF: 171 | 172 | We can drop a column from a DF very simply by assigning `NULL` to a DF column. Below, we drop `"aj_mths_remng"` from `s1`: 173 | 174 | ```{r, collapse=TRUE} 175 | head(s1) 176 | s1$aj_mths_remng <- NULL 177 | head(s1) 178 | ``` 179 | 180 | *** 181 | 182 | 183 | ### Subset a DF by taking a random sample: 184 | 185 | Perhaps the most useful subsetting operation is `sample`, which returns a randomly sampled subset of a DF. With `subset`, we can specify whether we want to sample with or without replace, the approximate size of the sample that we want the new DF to call and whether or not we want to define a random seed. 
If our initial DF is so massive that performing analysis on the entire dataset requires a more expensive cluster, we can: sample the massive dataset, interactively develop our analysis in SparkR using our sample and then evaluate the resulting program using our initial DF, which calls the entire massive dataset, only as is required. This strategy will help us to minimize wasting resources. 186 | 187 | Below, we take a random sample of `df` without replacement that is, in size, approximately equal to 1% of `df`. Notice that we must define a random seed in order to be able to reproduce our random sample. 188 | 189 | ```{r, collapse=TRUE} 190 | df_samp1 <- sample(df, withReplacement = FALSE, fraction = 0.01) # Without set seed 191 | df_samp2 <- sample(df, withReplacement = FALSE, fraction = 0.01) 192 | count(df_samp1) 193 | count(df_samp2) 194 | # The row counts are different and, obviously, the DFs are not equivalent 195 | 196 | df_samp3 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) # With set seed 197 | df_samp4 <- sample(df, withReplacement = FALSE, fraction = 0.01, seed = 0) 198 | count(df_samp3) 199 | count(df_samp4) 200 | # The row counts are equal and the DFs are equivalent 201 | ``` 202 | 203 | 204 | #### Collect a random sample as a local data.frame: 205 | 206 | An additional use of `sample` is to collect a random sample of a massive dataset as a local data.frame in R. This would allow us to work with a sample dataset in a traditional analysis environment that is likely more representative of the population since we are sampling from a larger set of observations than we are normally doing so. This can be achieved by simply using `collect` to create a local data.frame: 207 | 208 | ```{r, collapse=TRUE} 209 | typeof(df_samp4) # DFs are of class S4 210 | dat <- collect(df_samp4) 211 | typeof(dat) 212 | ``` 213 | 214 | Note that this data.frame is _not_ local to _your_ personal computer, but rather it was gathered locally to a single node in our AWS cluster. 215 | 216 | #### Export DF sample as a single .csv file to S3: 217 | 218 | If we want to export the sampled DF from RStudio as a single .csv file that we can work with in any environment, we must first coalesce the rows of `df_samp4` to a single node in our cluster using the `repartition` operation. Then, we can use the `write.df` operation as we did in the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial: 219 | 220 | ```{r, eval=FALSE, collapse=TRUE} 221 | df_samp4_1 <- repartition(df_samp4, numPartitions = 1) 222 | write.df(df_samp4_1, path = "s3://ui-spark-social-science-public/data/hfpc_samp.csv", 223 | source = "csv", 224 | mode = "overwrite") 225 | ``` 226 | 227 | :heavy_exclamation_mark: __Warning__: We cannot collect a DF as a data.frame, nor can we repartition it to a single node, unless the DF is sufficiently small in size since it must fit onto a _single_ node! 
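If the sample is too large for either operation, two hedged workarounds (our own sketch, with an arbitrary threshold and an illustrative output path) are to guard the `collect` with a row count check, or to skip `repartition` and let `write.df` emit one part-file per partition:

```{r, eval=FALSE}
# Rough guard before collecting locally (the threshold is illustrative only)
if (count(df_samp4) < 1e6) {
  dat <- collect(df_samp4)
}

# Or write the sample as multiple .csv part-files, one per partition
write.df(df_samp4, path = "s3://ui-spark-social-science-public/data/hfpc_samp",
         source = "csv", mode = "overwrite")
```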
228 | 229 | __End of tutorial__ - Next up is [Dealing with Missing Data in SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/04_missing-data.md) -------------------------------------------------------------------------------- /rmd/04_missing-data.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Dealing with Missing Data in SparkR' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 8, 2016" 5 | output: 6 | html_document: 7 | keep_md: true 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | options(knitr.table.format = 'markdown') 13 | ``` 14 | 15 | **Last Updated**: May 23, 2017 16 | 17 | 18 | **Objective**: In this tutorial, we discuss general strategies for dealing with missing data in the SparkR environment. While we do not consider conceptually how and why we might impute missing values in a dataset, we do discuss logistically how we could drop rows with missing data and impute missing data with replacement values. We specifically consider the following during this tutorial: 19 | 20 | * Specify null values when loading data in as a DF 21 | * Conditional expressions on empty DF entries 22 | + Null and NaN indicator operations 23 | + Conditioning on empty string entries 24 | + Distribution of missing data across grouped data 25 | * Drop rows with missing data 26 | + Null value entries 27 | + Empty string entries 28 | * Fill missing data entries 29 | + Null value entries 30 | + Empty string entries 31 | 32 | **SparkR/R Operations Discussed**: `read.df` (`nullValue = ""`), `printSchema`, `nrow`, `isNull`, `isNotNull`, `isNaN`, `count`, `where`, `agg`, `groupBy`, `n`, `collect`, `dropna`, `na.omit`, `list`, `fillna` 33 | 34 | *** 35 | 36 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 37 | 38 | ```{r, include=FALSE} 39 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 40 | Sys.setenv(SPARK_HOME = "/home/spark") 41 | } 42 | 43 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 44 | 45 | sparkR.session() 46 | ``` 47 | 48 | The following error indicates that you have not initiated a SparkR session: 49 | 50 | ```{r, eval=FALSE} 51 | Error in getSparkSession() : SparkSession not initialized 52 | ``` 53 | 54 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 55 | 56 | *** 57 | 58 | ### Specify null values when loading data in as a SparkR DataFrame (DF) 59 | 60 | Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. Note that we now include the `na.strings` option in the `read.df` transformation below. By setting `na.strings` equal to an empty string in `read.df`, we direct SparkR to interpret empty entries in the dataset as being equal to nulls in `df`. Therefore, any DF entries matching this string (here, set to equal an empty entry) will be set equal to a null value in `df`. 
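If a dataset instead marks missing values with a sentinel string rather than an empty field, the same `na.strings` argument applies. For example (using a hypothetical token, not one that appears in this dataset):

```{r, eval=FALSE}
df_alt <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex",
                  header = "false",
                  inferSchema = "true",
                  na.strings = "NULL")  # read the literal string "NULL" as a null value
```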
61 | 62 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 63 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 64 | header = "false", 65 | inferSchema = "true", 66 | na.strings = "") 67 | cache(df) 68 | ``` 69 | 70 | We can replace this empty string with any string that we know indicates a null entry in the dataset, i.e. with `na.strings=""`. Note that SparkR only reads empty entries as null values in numerical and integer datatype (dtype) DF columns, meaning that empty entries in DF columns of string dtype will simply equal an empty string. We consider how to work with this type of observation throughout this tutorial alongside our treatment of null values. 71 | 72 | 73 | With `printSchema`, we can see the dtype of each column in `df` and, noting which columns are of a numerical and integer dtypes and which are string, use this to determine how we should examine missing data in each column of `df`. We also count the number of rows in `df` so that we can compare this value to row counts that we compute throughout this tutorial: 74 | 75 | ```{r, collapse=TRUE} 76 | printSchema(df) 77 | (n <- nrow(df)) 78 | ``` 79 | 80 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 81 | 82 | *** 83 | 84 | 85 | ### Conditional expressions on empty DF entries 86 | 87 | 88 | #### Null and NaN indicator operations 89 | 90 | We saw in the subsetting tutorial how to subset a DF by some conditional statement. We can extend this reasoning in order to identify missing data in a DF and to explore the distribution of missing data within a DF. SparkR operations indicating null and NaN entries in a DF are `isNull`, `isNaN` and `isNotNull`, and these can be used in conditional statements to locate or to remove DF rows with null and NaN entries. 91 | 92 | 93 | Below, we count the number of missing entries in `"loan_age"` and in `"mths_remng"`, which are both of integer dtype. We can see below that there are no missing or NaN entries in `"loan_age"`. Note that the `isNull` and `isNaN` count results differ for `"mths_remng"` - while there are missing values in `"mths_remng"`, there are no NaN entries (entires that are "not a number"). 94 | 95 | ```{r, collapse=TRUE} 96 | df_laNull <- where(df, isNull(df$loan_age)) 97 | count(df_laNull) 98 | df_laNaN <- where(df, isNaN(df$loan_age)) 99 | count(df_laNaN) 100 | 101 | df_mrNull <- where(df, isNull(df$mths_remng)) 102 | count(df_mrNull) 103 | df_mrNaN <- where(df, isNaN(df$mths_remng)) 104 | count(df_mrNaN) 105 | ``` 106 | 107 | 108 | #### Empty string entries 109 | 110 | If we want to count the number of rows with missing entries for `"servicer_name"` (string dtype) we can simply use the equality logical condition (==) to direct SparkR to `count` the number of rows `where` the entries in the `"servicer_name"` column are equal to an empty string: 111 | 112 | ```{r, collapse=TRUE} 113 | df_snEmpty <- where(df, df$servicer_name == "") 114 | count(df_snEmpty) 115 | ``` 116 | 117 | 118 | #### Distribution of missing data across grouped data 119 | 120 | We can also condition on missing data when aggregating over grouped data in order to see how missing data is distributed over a categorical variable within our data. 
In order to view the distribution of `"mths_remng"` observations with null values over distinct entries of `"servicer_name"`, we (1) group the entries of the DF `df_mrNull` that we created in the preceding example over `"servicer_name"` entries, (2) create the DF `mrNull_by_sn` which consists of the number of observations in `df_mrNull` by `"servicer_name"` entries and (3) collect `mrNull_by_sn` into a nicely formatted table as a local data.frame: 121 | 122 | ```{r, collapse=TRUE} 123 | gb_sn_mrNull <- groupBy(df_mrNull, df_mrNull$servicer_name) 124 | mrNull_by_sn <- agg(gb_sn_mrNull, Nulls = n(df_mrNull$servicer_name)) 125 | 126 | mrNull_by_sn.dat <- collect(mrNull_by_sn) 127 | mrNull_by_sn.dat 128 | # Alternatively, we could have evaluated showDF(mrNull_by_sn) to print DF 129 | ``` 130 | 131 | Note that the resulting data.frame lists only nine (9) distinct string values for `"servicer_name"`. So, any row in `df` with a null entry for `"mths_remng"` has one of these strings as its corresponding `"servicer_name"` value. We could similarly examine the distribution of missing entries for some string dtype column across grouped data by first filtering a DF on the condition that the string column is equal to an empty string, rather than filtering with a null indicator operation (e.g. `isNull`), then performing the `groupBy` operation. 132 | 133 | *** 134 | 135 | 136 | ### Drop rows with missing data 137 | 138 | 139 | #### Null value entries 140 | 141 | The SparkR operation `dropna` (or its alias `na.omit`) creates a new DF that omits rows with null value entries. We can configure `dropna` in a number of ways, including whether we want to omit rows with nulls in a specified list of DF columns or across all columns within a DF. 142 | 143 | 144 | If we want to drop rows with nulls for a list of columns in `df`, we can define a list of column names and then include this in `dropna` or we could embed this list directly in the operation. Below, we explicitly define a list of column names on which we condition `dropna`: 145 | 146 | ```{r, collapse=TRUE} 147 | mrlist <- list("mths_remng", "aj_mths_remng") 148 | df_mrNoNulls <- dropna(df, cols = mrlist) 149 | nrow(df_mrNoNulls) 150 | ``` 151 | 152 | Alternatively, we could `filter` the DF using the `isNotNull` condition as follows: 153 | 154 | ```{r, collapse=TRUE} 155 | df_mrNoNulls_ <- filter(df, isNotNull(df$mths_remng) & isNotNull(df$aj_mths_remng)) 156 | nrow(df_mrNoNulls_) 157 | ``` 158 | 159 | If we want to consider all columns in a DF when omitting rows with null values, we can use either the `how` or `minNonNulls` paramters of `dropna`. 160 | 161 | 162 | The parameter `how` allows us to decide whether we want to drop a row if it contains `"any"` nulls or if we want to drop a row only if `"all"` of its entries are nulls. We can see below that there are no rows in `df` in which all of its values are null, but only a small percentage of the rows in `df` have no null value entries: 163 | 164 | ```{r, collapse=TRUE} 165 | df_all <- dropna(df, how = "all") 166 | nrow(df_all) # Equal in value to n 167 | 168 | df_any <- dropna(df, how = "any") 169 | (n_any <- nrow(df_any)) 170 | (n_any/n)*100 171 | ``` 172 | 173 | We can set a minimum number of non-null entries required for a row to remain in the DF by specifying a `minNonNulls` value. If included in `dropna`, this specification directs SparkR to drop rows that have less than `minNonNulls = ` non-null entries. Note that including `minNonNulls` overwrites the `how` specification. 
Below, we omit rows with that have less than 5 and 12 entries that are _not_ nulls. Note that there are no rows in `df` that have less than 5 non-null entries, and there are only approximately 8,000 rows with less than 12 non-null entries. 174 | 175 | ```{r, collapse=TRUE} 176 | df_5 <- dropna(df, minNonNulls = 5) 177 | nrow(df_5) # Equal in value to n 178 | 179 | df_12 <- dropna(df, minNonNulls = 12) 180 | (n_12 <- nrow(df_12)) 181 | n - n_12 182 | ``` 183 | 184 | 185 | #### Empty string entries 186 | 187 | If we want to create a new DF that does not include any row with missing entries for a column of string dtype, we could also use `filter` to accomplish this. In order to remove observations with a missing `"servicer_name"` value, we simply filter `df` on the condition that `"servicer_name"` does not equal an empty string entry: 188 | 189 | ```{r, collapse=TRUE} 190 | df_snNoEmpty <- filter(df, df$servicer_name != "") 191 | nrow(df_snNoEmpty) 192 | ``` 193 | 194 | *** 195 | 196 | 197 | ### Fill missing data entries 198 | 199 | 200 | #### Null value entries 201 | 202 | The `fillna` operation allows us to replace null entries with some specified value. In order to replace null entries in every numerical and integer column in `df` with a value, we simply evaluate the expression `fillna(df, )`. We replace every null entry in `df` with the value 12345 below: 203 | 204 | ```{r, collapse=TRUE} 205 | str(df) 206 | 207 | df_ <- fillna(df, value = 12345) 208 | str(df_) 209 | rm(df_) 210 | ``` 211 | 212 | If we want to replace null values within a list of DF columns, we can specify a column list just as we did in `dropna`. Here, we replace the null values in only `"act_endg_upb"` with 12345: 213 | 214 | ```{r, collapse=TRUE} 215 | str(df) 216 | 217 | df_ <- fillna(df, list("act_endg_upb" = 12345)) 218 | str(df_) 219 | rm(df_) 220 | ``` 221 | 222 | 223 | #### Empty string entries 224 | 225 | Finally, we can replace the empty entries in string dtype columns with the `ifelse` operation, which follows the syntax `ifelse(, , )`. Here, we replace the empty entries in `"servicer_name"` with the string `"Unknown"`: 226 | 227 | ```{r, collapse=TRUE} 228 | str(df) 229 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 230 | str(df) 231 | ``` 232 | 233 | 234 | __End of tutorial__ - Next up is [Computing Summary Statistics with SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/05_summary-statistics.md) -------------------------------------------------------------------------------- /rmd/05_summary-statistics.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Computing Summary Statistics with SparkR' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 8, 2016" 5 | output: 6 | html_document: 7 | keep_md: true 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | options(knitr.table.format = 'markdown') 13 | ``` 14 | 15 | **Last Updated**: May 23, 2017 16 | 17 | 18 | **Objective**: Summary statistics and aggregations are essential means of summarizing a set of observations. In this tutorial, we discuss how to compute location, statistical dispersion, distribution and dependence measures of numerical variables in SparkR, as well as methods for examining categorical variables. 
In particular, we consider how to compute the following measurements and aggregations in SparkR: 19 | 20 | _Numerical Data_ 21 | 22 | * Measures of location: 23 | + Mean 24 | + Extract summary statistics as local value 25 | * Measures of dispersion: 26 | + Range width & limits 27 | + Variance 28 | + Standard deviation 29 | + Quantiles 30 | * Measures of distribution shape: 31 | + Skewness 32 | + Kurtosis 33 | * Measures of Dependence: 34 | + Covariance 35 | + Correlation 36 | 37 | _Categorical Data_ 38 | 39 | * Frequency table 40 | * Relative frequency table 41 | * Contingency table 42 | 43 | **SparkR/R Operations Discussed**: `describe`, `collect`, `showDF`, `agg`, `mean`, `typeof`, `min`, `max`, `abs`, `var`, `sd`, `skewness`, `kurtosis`, `cov`, `corr`, `count`, `n`, `groupBy`, `nrow`, `crosstab` 44 | 45 | *** 46 | 47 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 48 | 49 | ```{r, include=FALSE} 50 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 51 | Sys.setenv(SPARK_HOME = "/home/spark") 52 | } 53 | 54 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 55 | 56 | sparkR.session() 57 | ``` 58 | 59 | The following error indicates that you have not initiated a SparkR session: 60 | 61 | ```{r, eval=FALSE} 62 | Error in getSparkSession() : SparkSession not initialized 63 | ``` 64 | 65 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 66 | 67 | *** 68 | 69 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 70 | 71 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 72 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 73 | header = "false", 74 | inferSchema = "true", 75 | na.strings = "") 76 | cache(df) 77 | ``` 78 | 79 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 80 | 81 | *** 82 | 83 | 84 | ## Numerical Data 85 | 86 | The operation `describe` (or its alias `summary`) creates a new DF that consists of several key aggregations (count, mean, max, mean, standard deviation) for a specified DF or list of DF columns (note that columns must be of a numerical datatype). We can either (1) use the action operation `showDF` to print this aggregation DF or (2) save it as a local data.frame with `collect`. 
Here, we perform both of these actions on the aggregation DF `sumstats_mthsremng`, which returns the aggregations listed above for the column `"mths_remng"` in `df`: 87 | 88 | ```{r, collapse=TRUE} 89 | sumstats_mthsremng <- describe(df, "mths_remng") # Specified list of columns here consists only of "mths_remng" 90 | 91 | showDF(sumstats_mthsremng) # Print the aggregation DF 92 | 93 | sumstats_mthsremng.l <- collect(sumstats_mthsremng) # Collect aggregation DF as a local data.frame 94 | sumstats_mthsremng.l 95 | ``` 96 | 97 | Note that measuring all five (5) of these aggregations at once can be computationally expensive with a massive data set, particularly if we are interested in only a subset of these measurements. Below, we outline ways to measure these aggregations individually, as well as several other key summary statistics for numerical data. 98 | 99 | *** 100 | 101 | 102 | ### Measures of Location 103 | 104 | 105 | #### Mean 106 | 107 | The mean is the only measure of central tendency currently supported by SparkR. The operations `mean` and `avg` can be used with the `agg` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial to measure the average of a numerical DF column. Remember that `agg` returns another DF. Therefore, we can either print the DF with `showDF` or we can save the aggregation as a local data.frame. Collecting the DF may be preferred if we want to work with the mean `"mths_remng"` value as a single value in RStudio. 108 | 109 | ```{r, collapse=TRUE} 110 | mths_remng.avg <- agg(df, mean = mean(df$mths_remng)) # Create an aggregation DF 111 | 112 | # DataFrame 113 | showDF(mths_remng.avg) # Print this DF 114 | typeof(mths_remng.avg) # Aggregation DF is of class S4 115 | 116 | # data.frame 117 | mths_remng.avg.l <- collect(mths_remng.avg) # Collect the DF as a local data.frame 118 | (mths_remng.avg.l <- mths_remng.avg.l[,1]) # Overwrite data.frame with numerical mean value (was entry in d.f) 119 | typeof(mths_remng.avg.l) # Object is now of a numerical dtype 120 | ``` 121 | 122 | *** 123 | 124 | 125 | ### Measures of dispersion 126 | 127 | 128 | #### Range width & limits 129 | 130 | We can also use `agg` to create a DF that lists the minimum and maximum values within a numerical DF column (i.e. the limits of the range of values in the column) and the width of the range. Here, we create compute these values for `"mths_remng"` and print the resulting DF with `showDF`: 131 | 132 | ```{r, collapse=TRUE} 133 | mr_range <- agg(df, minimum = min(df$mths_remng), maximum = max(df$mths_remng), 134 | range_width = abs(max(df$mths_remng) - min(df$mths_remng))) 135 | showDF(mr_range) 136 | ``` 137 | 138 | 139 | #### Variance & standard deviation 140 | 141 | Again using `agg`, we compute the variance and standard deviation of `"mths_remng"` with the expressions below. Note that, here, we are computing sample variance and standard deviation (which we could also measure with their respective aliases, `variance` and `stddev`). To measure population variance and standard deviation, we would use `var_pop` and `stddev_pop`, respectively. 
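As a quick, non-evaluated sketch, the population versions follow the same `agg` pattern as the sample statistics computed below:

```{r, eval=FALSE}
# Population variance and standard deviation of "mths_remng"
mr_pop <- agg(df, pop_variance = var_pop(df$mths_remng), pop_std_dev = stddev_pop(df$mths_remng))
showDF(mr_pop)
```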
142 | 143 | ```{r, collapse=TRUE} 144 | mr_var <- agg(df, variance = var(df$mths_remng)) # Sample variance 145 | showDF(mr_var) 146 | 147 | mr_sd <- agg(df, std_dev = sd(df$mths_remng)) # Sample standard deviation 148 | showDF(mr_sd) 149 | ``` 150 | 151 | 152 | #### Approximate Quantiles 153 | 154 | The operation `approxQuantile` returns approximate quantiles for a DF column. We specify the quantiles to be approximated by the operation as a vector set equal to the `probabilities` parameter, and the acceptable level of error by the `relativeError` paramter. 155 | 156 | If the column includes `n` rows, then `approxQuantile` will return a list of quantile values with rank values that are acceptably close to those exact values specified by `probabilities`. In particular, the operation assigns approximate rank values such that the computed rank, (`probabilities * n`), falls within the inequality: 157 | 158 | 159 | `floor((probabilities - relativeError) * n) <= rank(x) <= ceiling((probabilities + relativeError) * n)` 160 | 161 | 162 | Below, we define a new DF, `df_`, that includes only nonmissing values for `"mths_remng"` and then compute approximate Q1, Q2 and Q3 values for `"mths_remng"`: 163 | 164 | ```{r, collapse=TRUE} 165 | df_ <- dropna(df, cols = "mths_remng") 166 | 167 | quartiles_mr <- approxQuantile(x = df_, col = "mths_remng", probabilities = c(0.25, 0.5, 0.75), 168 | relativeError = 0.001) 169 | quartiles_mr 170 | ``` 171 | 172 | 173 | *** 174 | 175 | 176 | ### Measures of distribution shape 177 | 178 | 179 | #### Skewness 180 | 181 | We can measure the magnitude and direction of skew in the distribution of a numerical DF column by using the operation `skewness` with `agg`, just as we did to measure the `mean`, `variance` and `stddev` of a numerical variable. Below, we measure the `skewness` of `"mths_remng"`: 182 | 183 | ```{r, collapse=TRUE} 184 | mr_sk <- agg(df, skewness = skewness(df$mths_remng)) 185 | showDF(mr_sk) 186 | ``` 187 | 188 | 189 | #### Kurtosis 190 | 191 | Similarly, we can meaure the magnitude of, and how sharp is, the central peak of the distribution of a numerical variable, i.e. the "peakedness" of the distribution, (relative to a standard bell curve) with the `kurtosis` operation. Here, we measure the `kurtosis` of `"mths_remng"`: 192 | 193 | ```{r, collapse=TRUE} 194 | mr_kr <- agg(df, kurtosis = kurtosis(df$mths_remng)) 195 | showDF(mr_kr) 196 | ``` 197 | 198 | *** 199 | 200 | 201 | ### Measures of dependence 202 | 203 | #### Covariance & correlation 204 | 205 | The actions `cov` and `corr` return the sample covariance and correlation measures of dependency between two DF columns, respectively. Currently, Pearson is the only supported method for calculating correlation. Here we compute the covariance and correlation of `"loan_age"` and `"mths_remng"`. 
Note that, in saving the covariance and correlation measures, we are not required to first `collect` locally since `cov` and `corr` return values, rather than DFs: 206 | 207 | ```{r, collapse=TRUE} 208 | cov_la.mr <- cov(df, "loan_age", "mths_remng") 209 | corr_la.mr <- corr(df, "loan_age", "mths_remng", method = "pearson") 210 | cov_la.mr 211 | corr_la.mr 212 | 213 | typeof(cov_la.mr) 214 | typeof(corr_la.mr) 215 | ``` 216 | 217 | *** 218 | 219 | 220 | 221 | ## Categorical Data 222 | 223 | 224 | We can compute descriptive statistics for categorical data using (1) the `groupBy` operation that we discussed in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial and (2) operations native to SparkR for this purpose. 225 | 226 | ```{r, include=FALSE} 227 | df$cd_zero_bal <- ifelse(isNull(df$cd_zero_bal), "Unknown", df$cd_zero_bal) 228 | df$servicer_name <- ifelse(df$servicer_name == "", "Unknown", df$servicer_name) 229 | ``` 230 | 231 | 232 | #### Frequency table 233 | 234 | To create a frequency table for a categorical variable in SparkR, i.e. list the number of observations for each distinct value in a column of strings, we can simply use the `count` transformation with grouped data. Group the data by the categorical variable for which we want to return a frequency table. Here, we create a frequency table for using this approach `"cd_zero_bal"`: 235 | 236 | ```{r, collapse=TRUE} 237 | zb_f <- count(groupBy(df, "cd_zero_bal")) 238 | showDF(zb_f) 239 | ``` 240 | 241 | We could also embed a grouping into an `agg` operation as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial to achieve the same frequency table DF, i.e. we could evaluate the expression `agg(groupBy(df, df$cd_zero_bal), count = n(df$cd_zero_bal))`. 242 | 243 | #### Relative frequency table 244 | 245 | We could similarly create a DF that consists of a relative frequency table. Here, we reproduce the frequency table from the preceding section, but now including the relative frequency for each distinct string value, labeled `"Percentage"`: 246 | 247 | ```{r, collapse=TRUE} 248 | n <- nrow(df) 249 | zb_rf <- agg(groupBy(df, df$cd_zero_bal), Count = n(df$cd_zero_bal), Percentage = n(df$cd_zero_bal) * (100/n)) 250 | showDF(zb_rf) 251 | ``` 252 | 253 | #### Contingency table 254 | 255 | Finally, we can create a contingency table with the operation `crosstab`, which returns a data.frame that consists of a contingency table between two categorical DF columns. 
Here, we create and print a contingency table for `"servicer_name"` and `"cd_zero_bal"`: 256 | 257 | ```{r, eval=FALSE} 258 | conting_sn.zb <- crosstab(df, "servicer_name", "cd_zero_bal") 259 | conting_sn.zb 260 | ``` 261 | 262 | Here, is the contingency table (the output of `crosstab`) in a formatted table: 263 | 264 | ```{r kable, echo=FALSE} 265 | conting_sn.zb <- crosstab(df, "servicer_name", "cd_zero_bal") 266 | library(knitr) 267 | kable(conting_sn.zb) 268 | ``` 269 | 270 | __End of tutorial__ - Next up is [Merging SparkR DataFrames](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/06_merging.md) -------------------------------------------------------------------------------- /rmd/06_merging.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Merging SparkR DataFrames' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 12, 2016" 5 | output: 6 | html_document: 7 | keep_md: yes 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 15 | 16 | 17 | **Objective**: The following tutorial provides an overview of how to join SparkR DataFrames by column and by row. In particular, we discuss how to: 18 | 19 | * Merge two DFs by column condition(s) (join by row) 20 | * Append rows of data to a DataFrame (join by column) 21 | + When column name lists are equal across DFs 22 | + When column name lists are not equal 23 | 24 | **SparkR/R Operations Discussed**: `join`, `merge`, `sample`, `except`, `intersect`, `rbind`, `rbind.intersect` (defined function), `rbind.fill` (defined function) 25 | 26 | *** 27 | 28 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 29 | 30 | ```{r, include=FALSE} 31 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 32 | Sys.setenv(SPARK_HOME = "/home/spark") 33 | } 34 | 35 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 36 | 37 | sparkR.session() 38 | ``` 39 | 40 | The following error indicates that you have not initiated a SparkR session: 41 | 42 | ```{r, eval=FALSE} 43 | Error in getSparkSession() : SparkSession not initialized 44 | ``` 45 | 46 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 47 | 48 | *** 49 | 50 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 51 | 52 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 53 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 54 | header = "false", 55 | inferSchema = "true", 56 | na.strings = "") 57 | cache(df) 58 | ``` 59 | 60 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 
61 | 62 | *** 63 | 64 | 65 | ### Join (merge) two DataFrames by column condition(s) 66 | 67 | We begin by subsetting `df` by column, resulting in two (2) DataFrames that are disjoint, except for them both including the loan identification variable, `"loan_id"`: 68 | 69 | ```{r, collapse=TRUE} 70 | # Print the column names of df: 71 | columns(df) 72 | 73 | # Specify column lists to fit `a` and `b` on - these are disjoint sets (except for "loan_id"): 74 | cols_a <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng") 75 | cols_b <- c("loan_id", "aj_mths_remng", "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal") 76 | 77 | # Create `a` and `b` DFs with the `select` operation: 78 | a <- select(df, cols_a) 79 | b <- select(df, cols_b) 80 | 81 | # Print several rows from each subsetted DF: 82 | str(a) 83 | str(b) 84 | ``` 85 | 86 | We can use the SparkR operation `join` to merge `a` and `b` by row, returning a DataFrame equivalent to `df`. The `join` operation allows us to perform most SQL join types on SparkR DFs, including: 87 | 88 | * `"inner"` (default): Returns rows where there is a match in both DFs 89 | * `"outer"`: Returns rows where there is a match in both DFs, as well as rows in both the right and left DF where there was no match 90 | * `"full"`, `"fullouter"`: Returns rows where there is a match in one of the DFs 91 | * `"left"`, `"leftouter"`, `"left_outer"`: Returns all rows from the left DF, even if there are no matches in the right DF 92 | * `"right"`, `"rightouter"`, `"right_outer"`: Returns all rows from the right DF, even if there are no matches in the left DF 93 | * Cartesian: Returns the Cartesian product of the sets of records from the two or more joined DFs - `join` will return this DF when we _do not_ specify a `joinType` _nor_ a `joinExpr` (discussed below) 94 | 95 | We communicate to SparkR what condition we want to join DFs on with the `joinExpr` specification in `join`. Below, we perform an `"inner"` (default) join on the DFs `a` and `b` on the condition that their `"loan_id"` values be equal: 96 | 97 | ```{r, collapse=TRUE} 98 | ab1 <- join(a, b, a$loan_id == b$loan_id) 99 | str(ab1) 100 | ``` 101 | 102 | Note that the resulting DF includes two (2) `"loan_id"` columns. Unfortunately, we cannot direct SparkR to keep only one of these columns when using `join` to merge by row, and the following command (which we introduced in the subsetting tutorial) drops both `"loan_id"` columns: 103 | 104 | ```{r, collapse=TRUE} 105 | ab1$loan_id <- NULL 106 | ``` 107 | 108 | We can avoid this by renaming one of the columns before performing `join` and then, utilizing that the columns have distinct names, tell SparkR to drop only one of the columns. For example, we could rename `"loan_id"` in `a` with the expression `a <- withColumnRenamed(a, "loan_id", "loan_id_")`, then drop this column with `ab1$loan_id_ <- NULL` after performing `join` on `a` and `b` to return `ab1`. 109 | 110 | 111 | The `merge` operation, alternatively, allows us to join DFs and produces two (2) _distinct_ merge columns. We can use this feature to retain the column on which we joined the DFs, but we must still perform a `withColumnRenamed` step if we want our merge column to retain its original column name. 112 | 113 | 114 | Rather than defining a `joinExpr`, we explictly specify the column(s) that SparkR should `merge` the DFs on with the operation parameters `by` and `by.x`/`by.y` (if the merging column is named differently across the DFs). 
Note that, if we do not specify `by`, SparkR will merge the DFs on the list of common column names shared by the DFs. Rather than specifying a type of join, `merge` determines how SparkR should merge DFs based on boolean values, `all.x` and `all.y`, which indicate which rows in `x` and `y` should be included in the join, respectively. We can specify `merge` type with the following parameter values: 115 | 116 | * `all.x = FALSE`, `all.y = FALSE`: Returns an inner join (this is the default and can be achieved by not specifying values for all.x and all.y) 117 | * `all.x = TRUE`, `all.y = FALSE`: Returns a left outer join 118 | * `all.x = FALSE`, `all.y = TRUE`: Returns a right outer join 119 | * `all.x = TRUE`, `all.y = TRUE`: Returns a full outer join 120 | 121 | The following `merge` expression is equivalent to the `join` expression in the preceding example: 122 | 123 | ```{r, collapse=TRUE} 124 | ab2 <- merge(a, b, by = "loan_id") 125 | str(ab2) 126 | ``` 127 | 128 | Note that the two merging columns are distinct as indicated by the `_x` and `_y` name assignments performed by `merge`. We utilize this distinction in the expressions below to retain a single merge column: 129 | 130 | ```{r, collapse=TRUE} 131 | # Drop "loan_id" column from `b`: 132 | ab2$loan_id_y <- NULL 133 | 134 | # Rename "loan_id" column from `a`: 135 | ab2 <- withColumnRenamed(ab2, "loan_id_x", "loan_id") 136 | 137 | # Final DF with single "loan_id" column: 138 | str(ab2) 139 | ``` 140 | 141 | ```{r, include=FALSE} 142 | rm(a) 143 | rm(b) 144 | rm(ab1) 145 | rm(ab2) 146 | rm(cols_a) 147 | rm(cols_b) 148 | ``` 149 | 150 | 151 | *** 152 | 153 | 154 | ### Append rows of data to a DataFrame 155 | 156 | In order to discuss how we can append the rows of one DF to those of another in SparkR, we must first subset `df` into two (2) distinct DataFrames, `A` and `B`. Below, we define `A` as a random subset of `df` with a row count that is approximately equal to half the size of `nrow(df)`. We use the DF operation `except` to create `B`, which includes every row of `df`, `except` for those included in `A`: 157 | 158 | ```{r, collapse=TRUE} 159 | A <- sample(df, withReplacement = FALSE, fraction = 0.5, seed = 1) 160 | B <- except(df, A) 161 | ``` 162 | 163 | Let's also examine the row count for each subsetted row and confirm that `A` and `B` do not share common rows. We can check this with the SparkR operation `intersect`, which performs the intersection set operation on two DFs: 164 | 165 | ```{r, collapse=TRUE} 166 | (nA <- nrow(A)) 167 | (nB <- nrow(B)) 168 | 169 | nA + nB # Equal to nrow(df) 170 | 171 | AintB <- intersect(A, B) 172 | nrow(AintB) 173 | ``` 174 | 175 | #### Append rows when column name lists are equal across DFs 176 | 177 | If we are certain that the two DFs have equivalent column name lists (with respect to both string values and column ordering), then appending the rows of one DF to another is straightforward. Here, we append the rows of `B` to `A` with the `rbind` operation: 178 | 179 | ```{r, collapse=TRUE} 180 | df1 <- rbind(A, B) 181 | 182 | nrow(df1) 183 | nrow(df) 184 | ``` 185 | 186 | We can see in the results above that `df1` is equivalent to `df`. We could, alternatively, accomplish this with the `unionALL` operation (e.g. `df1 <- unionAll(A, B)`. Note that `unionAll` is not an alias for `rbind` - we can combine any number of DFs with `rbind` while `unionAll` can only consider two (2) DataFrames at a time. 
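To make that distinction concrete, here is a hedged sketch; the third DataFrame `C` is hypothetical and assumed to have the same column names and ordering as `A` and `B`:

```{r, eval=FALSE}
# `rbind` can stack any number of DFs in a single call ...
stacked1 <- rbind(A, B, C)
# ... while `unionAll` must be chained pairwise to combine more than two DFs
stacked2 <- unionAll(unionAll(A, B), C)
```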
187 | 188 | ```{r, include=FALSE} 189 | unpersist(df1) 190 | rm(df1) 191 | ``` 192 | 193 | 194 | #### Append rows when DF column name lists are not equal 195 | 196 | Before we can discuss appending rows when we do not have column name equivalency, we must first create two DataFrames that have different column names. Let's define a new DataFrame, `B_` that includes every column in `A` and `B`, excluding the column `"loan_age"`: 197 | 198 | ```{r, collapse=TRUE} 199 | columns(B) 200 | 201 | # Define column name list that has every column in `A` and `B`, except "loan_age": 202 | cols_ <- c("loan_id", "period", "servicer_name", "new_int_rt", "act_endg_upb", "mths_remng", "aj_mths_remng", 203 | "dt_matr", "cd_msa", "delq_sts", "flag_mod", "cd_zero_bal", "dt_zero_bal" ) 204 | 205 | # Define subsetted DF: 206 | B_ <- select(B, cols_) 207 | ``` 208 | 209 | ```{r, include=FALSE} 210 | unpersist(B) 211 | rm(B) 212 | rm(cols_) 213 | ``` 214 | 215 | 216 | We can try to apply SparkR `rbind` operation to append `B_` to `A`, but the expression given below will result in the error: `"Union can only be performed on tables with the same number of columns, but the left table has 14 columns and" "the right has 13"` 217 | 218 | ```{r, eval=FALSE} 219 | df2 <- rbind(A, B_) 220 | ``` 221 | 222 | Two strategies to force SparkR to merge DataFrames with different column name lists are to: 223 | 224 | 1. Append by an intersection of the two sets of column names, or 225 | 2. Use `withColumn` to add columns to DF where they are missing and set each entry in the appended rows of these columns equal to `NA`. 226 | 227 | Below is a function, `rbind.intersect`, that accomplishes the first approach. Notice that, in this function, we simply take an intesection of the column names and ask SparkR to perform `rbind`, considering only this subset of (sorted) column names. 228 | 229 | ```{r, collapse=TRUE} 230 | rbind.intersect <- function(x, y) { 231 | cols <- base::intersect(colnames(x), colnames(y)) 232 | return(SparkR::rbind(x[, sort(cols)], y[, sort(cols)])) 233 | } 234 | ``` 235 | 236 | Here, we append `B_` to `A` using this function and then examine the dimensions of the resulting DF, `df2`, as well as its column names. We can see that, while the row count for `df2` is equal to that for `df`, the DF does not include the `"loan_age"` column (just as we expected!). 237 | 238 | ```{r, collapse=TRUE} 239 | df2 <- rbind.intersect(A, B_) 240 | dim(df2) 241 | colnames(df2) 242 | ``` 243 | 244 | ```{r, include=FALSE} 245 | unpersist(df2) 246 | rm(df2) 247 | ``` 248 | 249 | 250 | Accomplishing the second approach is somewhat more involved. The `rbind.fill` function, given below, identifies the outersection of the list of column names for two (2) DataFrames and adds them onto one (1) or both of the DataFrames as needed using `withColumn`. 
The function appends these columns as string dtype, and we can later recast columns as needed: 251 | 252 | ```{r, collapse=TRUE} 253 | rbind.fill <- function(x, y) { 254 | 255 | m1 <- ncol(x) 256 | m2 <- ncol(y) 257 | col_x <- colnames(x) 258 | col_y <- colnames(y) 259 | outersect <- function(x, y) {setdiff(union(x, y), intersect(x, y))} 260 | col_outer <- outersect(col_x, col_y) 261 | len <- length(col_outer) 262 | 263 | if (m2 < m1) { 264 | for (j in 1:len){ 265 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 266 | } 267 | } else { 268 | if (m2 > m1) { 269 | for (j in 1:len){ 270 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 271 | } 272 | } 273 | if (m2 == m1 & col_x != col_y) { 274 | for (j in 1:len){ 275 | x <- withColumn(x, col_outer[j], cast(lit(""), "string")) 276 | y <- withColumn(y, col_outer[j], cast(lit(""), "string")) 277 | } 278 | } else { } 279 | } 280 | x_sort <- x[,sort(colnames(x))] 281 | y_sort <- y[,sort(colnames(y))] 282 | return(SparkR::rbind(x_sort, y_sort)) 283 | } 284 | ``` 285 | 286 | We again append `B_` to `A`, this time using the `rbind.fill` function: 287 | 288 | ```{r, collapse=TRUE} 289 | df3 <- rbind.fill(A, B_) 290 | ``` 291 | 292 | Now, the row count for `df3` is equal to that for `df` _and_ it includes all fourteen (14) columns included in `df`: 293 | 294 | ```{r, collapse=TRUE} 295 | dim(df3) 296 | colnames(df3) 297 | ``` 298 | 299 | We know from the missing data tutorial that `df$loan_age` does not contain any `NA` or `NaN` values. By appending `B_` to `A` with the `rbind.fill` function, therefore, we should have inserted exactly `nrow(B)` many empty string entries in `df3`. Note that `"loan_age"` is currently cast as string dtype and, therefore, the column does not contain any null values and we will need to recast the column to a numerical dtype. 300 | 301 | ```{r, collapse=TRUE} 302 | df3_laEmpty <- where(df3, df3$loan_age == "") 303 | nrow(df3_laEmpty) 304 | 305 | # There are no "loan_age" null values since it is string dtype 306 | df3_laNull <- where(df3, isNull(df3$loan_age)) 307 | nrow(df3_laNull) 308 | ``` 309 | 310 | Below, we recast `"loan_age"` as integer dtype and check that the number of `"loan_age"` null values in `df3` now matches the number of entry string values in `df3` prior to recasting, as well as the number of rows in `B`: 311 | 312 | ```{r, collapse=TRUE} 313 | # Recast 314 | df3$loan_age <- cast(df3$loan_age, dataType = "integer") 315 | str(df3) 316 | 317 | # Check that values are equal 318 | 319 | df3_laNull_ <- where(df3, isNull(df3$loan_age)) 320 | nrow(df3_laEmpty) # No. of empty strings 321 | 322 | nrow(df3_laNull_) # No. of null entries 323 | 324 | nB # No. of rows in DF `B` 325 | ``` 326 | 327 | 328 | Documentation for `rbind.intersection` can be found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-intersection.R), and [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/rbind-fill.R) for `rbind.fill`. 
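As a final note on `rbind.fill`: if several columns had been appended as strings, we could recast them in a single pass. The following sketch is hypothetical - the vectors of column names and target dtypes are placeholders rather than columns that were actually filled in this example:

```{r, eval=FALSE}
filled_cols <- c("loan_age", "mths_remng")   # hypothetical list of columns appended as strings
target_types <- c("integer", "integer")      # hypothetical target dtypes
for (j in seq_along(filled_cols)) {
  # assumes withColumn replaces an existing column of the same name (Spark 2.x behavior)
  df3 <- withColumn(df3, filled_cols[j], cast(df3[[filled_cols[j]]], dataType = target_types[j]))
}
str(df3)
```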
329 | 330 | __End of tutorial__ - Next up is [Data Visualizations in SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/07_visualizations.md) -------------------------------------------------------------------------------- /rmd/07_visualizations.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Data Visualizations in SparkR' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 27, 2016" 5 | output: 6 | html_document: 7 | keep_md: true 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 - Some ggplot2.SparkR package functions do not function; package needs updating. 15 | 16 | 17 | **Objective**: In this tutorial, we illustrate various plot types that can be created in SparkR and discuss different strategies for obtaining these plots. We discuss the SparkR ggplot2 package that is in development and provide examples of plots that can be created using this package, as well as how SparkR users may develop their own functions to build visualizations. We provide examples of the following plot types: 18 | 19 | * Bar graph 20 | * Stacked or proportional bar graph 21 | * Histogram 22 | * Frequency polygon 23 | * Bivariate histogram 24 | 25 | **SparkR/R Operations Discussed**: `ggplot` (`ggplot2.SparkR`), `geom_bar` (`ggplot2.SparkR`), `geom_histogram` (`ggplot2.SparkR`), `geom_freqpoly` (`ggplot2.SparkR`), `geom_boxplot`, `geom_bivar_histogram.SparkR` (defined function) 26 | 27 | *** 28 | 29 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 30 | 31 | ```{r, include=FALSE} 32 | library(devtools) 33 | 34 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 35 | Sys.setenv(SPARK_HOME = "/home/spark") 36 | } 37 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 38 | 39 | devtools::install_github("SKKU-SKT/ggplot2.SparkR") 40 | library(ggplot2.SparkR) 41 | 42 | sparkR.session() 43 | ``` 44 | 45 | The following error indicates that you have not initiated a SparkR session: 46 | 47 | ```{r, eval=FALSE} 48 | Error in getSparkSession() : SparkSession not initialized 49 | ``` 50 | 51 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 52 | 53 | *** 54 | 55 | **Read in initial data as DataFrame (DF)**: Throughout this tutorial, we will use the diamonds data that is included in the `ggplot2` package and is frequently used in `ggplot2` examples. The data consists of prices and quality information about 54,000 diamonds. The data contains the four C’s of diamond quality, carat, cut, colour and clarity; and five physical measurements, depth, table, x, y and z. 56 | 57 | ```{r, message=F, warning=F, results='hide'} 58 | df <- read.df("s3://ui-spark-social-science-public/data/diamonds.csv", 59 | header = "true", 60 | delimiter = ",", 61 | source = "csv", 62 | inferSchema = "true", 63 | na.strings = "") 64 | cache(df) 65 | ``` 66 | 67 | We can see what the data set looks like using the `str` operation: 68 | 69 | ```{r, collapse=TRUE} 70 | str(df) 71 | ``` 72 | 73 | _Note_: The description of the `diamonds` data given above is adapted from http://ggplot2.org/book/qplot.pdf. 
74 | 75 | 76 | Introduced in the spring of 2016, the SparkR extension of Hadley Wickham's `ggplot2` package, `ggplot2.SparkR`, allows SparkR users to build visualizations by specifying a SparkR DataFrame and DF columns in ggplot expressions identically to how we would specify R data.frame components when using the `ggplot2` package, i.e. the extension package allows SparkR users to implement ggplot without having to modify the SparkR DataFrame API or to compute aggregations needed to build some plots. 77 | 78 | 79 | As of the publication date of this tutorial, the `ggplot2.SparkR` package is still nascent and has identifiable bugs, including slow processing time. However, we provide `ggplot2.SparkR` in this example for its ease of use, particularly for SparkR users wanting to build basic plots. We alternatively discuss how a SparkR user may develop their own plotting function and provide an example in which we plot a bivariate histogram. 80 | 81 | 82 | _Note_: Documentation for `ggplot2.SparkR` can be found [here](http://skku-skt.github.io/ggplot2.SparkR/), and we can view the project on GitHub [here](https://github.com/SKKU-SKT/ggplot2.SparkR). Documentation for the latest version of `ggplot2` can be found [here](http://docs.ggplot2.org/current/). 83 | 84 | *** 85 | 86 | 87 | ### Bar graph 88 | 89 | Just as we would when using `ggplot2`, the following expression plots a basic bar graph that gives frequency counts across the different levels of `"cut"` quality in the data: 90 | 91 | ```{r, collapse=TRUE} 92 | p1 <- ggplot(df, aes(x = cut)) 93 | p1 + geom_bar() 94 | ``` 95 | 96 | 97 | #### Stacked or proportional bar graph 98 | 99 | One recognized bug within `ggplot2.SparkR` is that, when specifying a `fill` column, none of the `position` specifications--`"stack"`, `"fill"` nor `"dodge"`--necessarily return plots with constant factor-level ordering across groups. For example, the following expression successfully returns a bar graph that describes proportional frequency of `"clarity"` levels (string dtype), grouped over diamond `"cut"` types (also string dtype). Note, however, that the varied color blocks representing `"clarity"` levels are not ordered similarly across different levels of `"cut"`. 
The same issue results when we specify either of the other two (2) `position` specifications: 100 | 101 | ```{r, collapse=TRUE} 102 | p2 <- ggplot(df, aes(x = cut, fill = clarity)) 103 | p2 + geom_bar(position = "fill") 104 | ``` 105 | 106 | *** 107 | 108 | 109 | ### Histogram 110 | 111 | Just as we would when using `ggplot2`, the following expression plots a histogram that gives frequency counts across binned `"price"` values in the data: 112 | 113 | ```{r, collapse=TRUE, message=F, warning=F} 114 | p3 <- ggplot(df, aes(price)) 115 | p3 + geom_histogram() 116 | ``` 117 | 118 | The preceding histogram plot assumes the `ggplot2` default, `bins = 30`, but we can change this value or override the `bins` specification by setting a `binwidth` value as we do in the following examples: 119 | 120 | ```{r, collapse=TRUE} 121 | p3 + geom_histogram(binwidth = 250) 122 | ``` 123 | 124 | ```{r, collapse=TRUE} 125 | p3 + geom_histogram(bins = 100) 126 | ``` 127 | 128 | *** 129 | 130 | 131 | ### Frequency polygon 132 | 133 | Frequency polygons provide a visual alternative to histogram plots (note that they describe equivalent aggregations), and we can fit this plot type also with `ggplot2` syntax - the following expression returns a frequency polygon that is equivalent to the first histogram plotted in the preceding section: 134 | 135 | ```{r, collapse=TRUE, message=F, warning=F} 136 | p3 + geom_freqpoly() 137 | ``` 138 | 139 | Again, we can change the class intervals by specifying `binwidth` or the number of `bins` for the frequency polygon: 140 | 141 | ```{r, collapse=TRUE} 142 | p3 + geom_freqpoly(binwidth = 250) 143 | ``` 144 | 145 | ```{r, collapse=TRUE} 146 | p3 + geom_freqpoly(bins = 100) 147 | ``` 148 | 149 | *** 150 | 151 | 152 | ### Boxplot 153 | 154 | Finally, we can create boxplots just as we would in `ggplot2`. The following expression gives a boxplot of `"price"` values across levels of `"clarity"`: 155 | 156 | ```{r, collapse=TRUE} 157 | p4 <- ggplot(df, aes(x = clarity, y = price)) 158 | p4 + geom_boxplot() 159 | ``` 160 | 161 | *** 162 | 163 | 164 | ### Additional `ggplot2.SparkR` functionality 165 | 166 | We can adapt the plot types discussed in the previous sections with the specifications given below: 167 | 168 | * Facets: `facet_grid`, `facet_wrap` and `facet_null` (default) 169 | * Coordinate systems: `coord_cartesian` and `coord_flip` 170 | * Position adjustments: `position_dodge`, `position_fill`, `position_stack` (as seen in previous example) 171 | * Scales: `scale_x_log10`, `scale_y_log10`, `labs`, `xlab`, `ylab`, `xlim` and `ylim` 172 | 173 | For example, the following expression facets our previous histogram example across the different levels of `"cut"` quality: 174 | 175 | ```{r, collapse=TRUE} 176 | p3 + geom_histogram() + facet_wrap(~cut) 177 | ``` 178 | 179 | 180 | ### Functionality gaps between `ggplot2` and SparkR extension: 181 | 182 | Below, we list several functions and plot types supported by `ggplot2` that are not currently supported by its SparkR extension package. 
The list is not exhaustive and is subject to change as the package continues to be developed: 183 | 184 | * Weighted bar graph 185 | * Weighted histogram 186 | * Strictly ordered layers for filled and stacked bar graphs (as we saw in an earlier example) 187 | * Stacked or filled histogram 188 | * Layered frequency polygon 189 | * Density plot using `geom_freqpoly` by specifying `y = ..density..` in aesthetic (note that the extension package does not support `geom_density`) 190 | 191 | *** 192 | 193 | 194 | ### Bivariate histogram 195 | 196 | In the previous examples, we relied on the `ggplot2.SparkR` package to build plots from DataFrames using syntax identical to that which we would use in a normal application of `ggplot2` on R data.frames. Given the current limitations of the extension package, we may need to develop our own function if we are interested in building a plot type that is not currently supported by `ggplot2.SparkR`. Here, we provide an example of a function that returns a bivariate histogram of two numerical DataFrame columns. 197 | 198 | 199 | When building a function in SparkR (or any other environment), we want to avoid operations that are computationally expensive and building one that returns a plot is no different. One of the most expensive operations in SparkR, `collect`, is of particular interest when building functions that return plots since collecting data locally allows us to leverage graphing tools that we use in traditional frameworks, e.g. `ggplot2`. We should `collect` data as infrequently as possible since the operation is highly memory-intensive. 200 | 201 | 202 | In the following function, we `collect` data five (5) times. Four of the times, we are collecting single values (two minimum and two maximum values), which does not require a huge amount of memory. The last `collect` that we perform collects a data.frame with three (3) columns and a row for each bin assignment pairing, which can fit in-memory on a single node (assuming we don't specify a massive value for `nbins`). When developing SparkR functions, we should only perform minor collections like the ones described. 
203 | 204 | ```{r, collapse=TRUE} 205 | geom_bivar_histogram.SparkR <- function(df, x, y, nbins){ 206 | 207 | library(ggplot2) 208 | 209 | x_min <- collect(agg(df, min(df[[x]]))) # Collect 1 210 | x_max <- collect(agg(df, max(df[[x]]))) # Collect 2 211 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 212 | 213 | y_min <- collect(agg(df, min(df[[y]]))) # Collect 3 214 | y_max <- collect(agg(df, max(df[[y]]))) # Collect 4 215 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 216 | 217 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 218 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 219 | 220 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 221 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 222 | 223 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 224 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 225 | 226 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) # Collect 5 227 | 228 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = count)) + geom_tile() 229 | 230 | return(p) 231 | } 232 | ``` 233 | 234 | Here, we evaluate the `geom_bivar_histogram.SparkR` function using `"carat"` and `"price"`: 235 | 236 | ```{r, collapse=TRUE} 237 | p5 <- geom_bivar_histogram.SparkR(df = df, x = "carat", y = "price", nbins = 100) 238 | p5 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + 239 | ylab("Price") 240 | ``` 241 | 242 | _Note_: Documentation for the `geom_bivar_histogram.SparkR` function is given [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/R/geom_bivar_histogram_SparkR.R). 243 | 244 | 245 | Note that the plot closely resembles a scatterplot. Bivariate histograms are one strategy for mitigating the overplotting that often occurs when attempting to visualize apparent correlation between two (2) columns in massive data sets. Furthermore, it is sometimes impossible to gather the data that is necessary to map individual points to a scatterplot onto a single node within our cluster - this is when aggregation becomes necessary rather than simply preferable. Just like plotting a univariate histogram, binning data reduces the number of points to plot and, with the appropriate choice of bin number and color scale, bivariate histograms can provide an intuitive alternative to scatterplots when working with massive data sets. 246 | 247 | 248 | For example, the following function is equivalent to our previous one, but we have changed the `fill` specification that determines the color scale from `count` to `log10(count)`. Then, we evaluate the new function with a larger `nbins` value, returning a new plot with more granular binning and a more nuanced color scale (since the breaks in the color scale are now log10-spaced). 
249 | 250 | ```{r, collapse=TRUE} 251 | geom_bivar_histogram.SparkR.log10 <- function(df, x, y, nbins){ 252 | 253 | library(ggplot2) 254 | 255 | x_min <- collect(agg(df, min(df[[x]]))) 256 | x_max <- collect(agg(df, max(df[[x]]))) 257 | x.bin <- seq(floor(x_min[[1]]), ceiling(x_max[[1]]), length = nbins) 258 | 259 | y_min <- collect(agg(df, min(df[[y]]))) 260 | y_max <- collect(agg(df, max(df[[y]]))) 261 | y.bin <- seq(floor(y_min[[1]]), ceiling(y_max[[1]]), length = nbins) 262 | 263 | x.bin.w <- x.bin[[2]]-x.bin[[1]] 264 | y.bin.w <- y.bin[[2]]-y.bin[[1]] 265 | 266 | df_ <- withColumn(df, "x_bin_", ceiling((df[[x]] - x_min[[1]]) / x.bin.w)) 267 | df_ <- withColumn(df_, "y_bin_", ceiling((df[[y]] - y_min[[1]]) / y.bin.w)) 268 | 269 | df_ <- mutate(df_, x_bin = ifelse(df_$x_bin_ == 0, 1, df_$x_bin_)) 270 | df_ <- mutate(df_, y_bin = ifelse(df_$y_bin_ == 0, 1, df_$y_bin_)) 271 | 272 | dat <- collect(agg(groupBy(df_, "x_bin", "y_bin"), count = n(df_$x_bin))) 273 | 274 | p <- ggplot(dat, aes(x = x_bin, y = y_bin, fill = log10(count))) + geom_tile() 275 | 276 | return(p) 277 | } 278 | ``` 279 | 280 | We now evaluate the `geom_bivar_histogram.SparkR.log10` function with `"carat"` and `"price"`: 281 | 282 | ```{r, collapse=TRUE} 283 | p6 <- geom_bivar_histogram.SparkR.log10(df = df, x = "carat", y = "price", nbins = 250) 284 | p6 + scale_colour_brewer(palette = "Blues", type = "seq") + ggtitle("This is a title") + xlab("Carat") + 285 | ylab("Price") 286 | ``` 287 | 288 | 289 | __End of tutorial__ - Next up is [Interacting with databases using SparkR](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/08_databases-with-jdbc.md) -------------------------------------------------------------------------------- /rmd/10_timeseries-1.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Time Series I: Working with the Date Datatype & Resampling a DataFrame' 3 | author: "Sarah Armstrong, Urban Institute" 4 | date: "July 12, 2016" 5 | output: 6 | html_document: 7 | keep_md: yes 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | **Last Updated**: May 23, 2017 15 | 16 | 17 | **Objective**: In this tutorial, we discuss how to perform several essential time series operations with SparkR. In particular, we discuss how to: 18 | 19 | * Identify and parse date datatype (dtype) DF columns, 20 | * Compute relative dates based on a specified increment of time, 21 | * Extract and modify components of a date dtype column and 22 | * Resample a time series DF to a particular unit of time frequency 23 | 24 | **SparkR/R Operations Discussed**: `unix_timestamp`, `cast`, `withColumn`, `to_date`, `last_day`, `next_day`, `add_months`, `date_add`, `date_sub`, `weekofyear`, `dayofyear`, `dayofmonth`, `datediff`, `months_between`, `year`, `month`, `hour`, `minute`, `second`, `agg`, `groupBy`, `mean` 25 | 26 | *** 27 | 28 | :heavy_exclamation_mark: **Warning**: Before beginning this tutorial, please visit the SparkR Tutorials README file (found [here](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md)) in order to load the SparkR library and subsequently initiate a SparkR session. 
29 | 30 | ```{r, include=FALSE} 31 | if (nchar(Sys.getenv("SPARK_HOME")) < 1) { 32 | Sys.setenv(SPARK_HOME = "/home/spark") 33 | } 34 | 35 | library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 36 | 37 | sparkR.session() 38 | ``` 39 | 40 | The following error indicates that you have not initiated a SparkR session: 41 | 42 | ```{r, eval=FALSE} 43 | Error in getSparkSession() : SparkSession not initialized 44 | ``` 45 | 46 | If you receive this message, return to the SparkR tutorials [README](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/README.md) for guidance. 47 | 48 | *** 49 | 50 | **Read in initial data as DF**: Throughout this tutorial, we will use the loan performance example dataset that we exported at the conclusion of the [SparkR Basics I](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/01_sparkr-basics-1.md) tutorial. 51 | 52 | ```{r, message=F, warning=F, results='hide', collapse=TRUE} 53 | df <- read.df("s3://ui-spark-social-science-public/data/hfpc_ex", 54 | header = "false", 55 | inferSchema = "true", 56 | na.strings = "") 57 | cache(df) 58 | ``` 59 | 60 | _Note_: documentation for the quarterly loan performance data can be found at http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html. 61 | 62 | *** 63 | 64 | 65 | ### Converting a DataFrame column to 'date' dtype 66 | 67 | 68 | As we saw in previous tutorials, there are several columns in our dataset that list dates which are helpful in determining loan performance. We will specifically consider the following columns throughout this tutorial: 69 | 70 | * `"period"` (Monthly Reporting Period): The month and year that pertain to the servicer’s cut-off period for mortgage loan information 71 | * `"dt_matr"`(Maturity Date): The month and year in which a mortgage loan is scheduled to be paid in full as defined in the mortgage loan documents 72 | * `"dt_zero_bal"`(Zero Balance Effective Date): Date on which the mortgage loan balance was reduced to zero 73 | 74 | Let's begin by reviewing the dytypes that `read.df` infers our date columns as. Note that each of our three (3) date columns were read in as strings: 75 | 76 | ```{r, collapse=TRUE} 77 | str(df) 78 | ``` 79 | 80 | While we could parse the date strings into separate year, month and day integer dtype columns, converting the columns to date dtype allows us to utilize the datetime functions available in SparkR. 81 | 82 | 83 | We can convert `"period"`, `"matr_dt"` and `"dt_zero_bal"` to date dtype with the following expressions: 84 | 85 | ```{r, collapse=TRUE} 86 | # `period` 87 | period_uts <- unix_timestamp(df$period, 'MM/dd/yyyy') # 1. Gets current Unix timestamp in seconds 88 | period_ts <- cast(period_uts, 'timestamp') # 2. Casts Unix timestamp `period_uts` as timestamp 89 | period_dt <- cast(period_ts, 'date') # 3. Casts timestamp `period_ts` as date dtype 90 | df <- withColumn(df, 'p_dt', period_dt) # 4. 
Add date dtype column `period_dt` to `df` 91 | 92 | # `dt_matr` 93 | matr_uts <- unix_timestamp(df$dt_matr, 'MM/yyyy') 94 | matr_ts <- cast(matr_uts, 'timestamp') 95 | matr_dt <- cast(matr_ts, 'date') 96 | df <- withColumn(df, 'mtr_dt', matr_dt) 97 | 98 | # `dt_zero_bal` 99 | zero_bal_uts <- unix_timestamp(df$dt_zero_bal, 'MM/yyyy') 100 | zero_bal_ts <- cast(zero_bal_uts, 'timestamp') 101 | zero_bal_dt <- cast(zero_bal_ts, 'date') 102 | df <- withColumn(df, 'zb_dt', zero_bal_dt) 103 | ``` 104 | 105 | Note that the string entries of these date DF columns are written in the formats `'MM/dd/yyyy'` and `'MM/yyyy'`. While SparkR is able to easily read a date string when it is in the default format, `'yyyy-mm-dd'`, additional steps are required for string to date conversions when the DF column entries are in a format other than the default. In order to create `"p_dt"` from `"period"`, for example, we must: 106 | 107 | 1. Define the Unix timestamp for the date string, specifying the date format that the string assumes (here, we specify `'MM/dd/yyyy'`), 108 | 2. Use the `cast` operation to convert the Unix timestamp of the string to `'timestamp'` dtype, 109 | 3. Similarly recast the `'timestamp'` form to `'date'` dtype and 110 | 4. Append the new date dtype `"p_dt"` column to `df` using the `withColumn` operation. 111 | 112 | We similarly create date dtype columns using `"dt_matr"` and `"dt_zero_bal"`. If the date string entries of these columns were in the default format, converting to date dtype would straightforward. If `"period"` was in the format `'yyyy-mm-dd'`, for example, we would be able to append `df` with a date dtype column using a simple `withColumn`/`cast` expression: `df <- withColumn(df, 'p_dt', cast(df$period, 'date'))`. We could also directly convert `"period"` to date dtype using the `to_date` operation: `df$period <- to_date(df$period)`. 113 | 114 | 115 | If we are lucky enough that our date entires are in the default format, then dtype conversion is simple and we should use either the `withColumn`/`cast` or `to_date` expressions given above. Otherwise, the longer conversion process is required. Note that, if we are maintaining our own dataset that we will use SparkR to analyze, adopting the default date format at the start will make working with date values during analysis much easier. 116 | 117 | 118 | Now that we've appended our date dtype columns to `df`, let's again look at the DF and compare the date dtype values with their associated date string values: 119 | 120 | ```{r, collapse=TRUE} 121 | str(df) 122 | ``` 123 | 124 | Note that the `"zb_dt"` entries corresponding to the missing date entries in `"dt_zero_bal"`, which were empty strings, are now nulls. 125 | 126 | *** 127 | 128 | 129 | ### Compute relative dates and measures based on a specified unit of time 130 | 131 | As we mentioned earlier, converting date strings to date dtype allows us to utilize SparkR datetime operations. In this section, we'll discuss several SparkR operations that return: 132 | 133 | * Date dtype columns, which list dates relative to a preexisting date column in the DF, and 134 | * Integer or numerical dtype columns, which list measures of time relative to a preexisting date column. 
135 | 136 | For convenience, we will review these operations using the `df_dt` DF, which includes only the date columns `"p_dt"` and `"mtr_dt"`, which we created in the preceding section: 137 | 138 | ```{r, collapse=TRUE} 139 | cols_dt <- c("p_dt", "mtr_dt") 140 | df_dt <- select(df, cols_dt) 141 | ``` 142 | 143 | 144 | #### Relative dates 145 | 146 | SparkR datetime operations that return a new date dtype column include: 147 | 148 | * `last_day`: Returns the _last_ day of the month which the given date belongs to (e.g. inputting "2013-07-27" returns "2013-07-31") 149 | * `next_day`: Returns the _first_ date which is later than the value of the date column that is on the specified day of the week 150 | * `add_months`: Returns the date that is `'numMonths'` _after_ `'startDate'` 151 | * `date_add`: Returns the date that is `'days'` days _after_ `'start'` 152 | * `date_sub`: Returns the date that is `'days'` days _before_ `'start'` 153 | 154 | Below, we create relative date columns (defining `"p_dt"` as the input date) using each of these operations and `withColumn`: 155 | 156 | ```{r, collapse=TRUE} 157 | df_dt1 <- withColumn(df_dt, 'p_ld', last_day(df_dt$p_dt)) 158 | df_dt1 <- withColumn(df_dt1, 'p_nd', next_day(df_dt$p_dt, "Sunday")) 159 | df_dt1 <- withColumn(df_dt1, 'p_addm', add_months(df_dt$p_dt, 1)) # 'startDate'="pdt", 'numMonths'=1 160 | df_dt1 <- withColumn(df_dt1, 'p_dtadd', date_add(df_dt$p_dt, 1)) # 'start'="pdt", 'days'=1 161 | df_dt1 <- withColumn(df_dt1, 'p_dtsub', date_sub(df_dt$p_dt, 1)) # 'start'="pdt", 'days'=1 162 | str(df_dt1) 163 | ``` 164 | 165 | #### Relative measures of time 166 | 167 | SparkR datetime operations that return integer or numerical dtype columns include: 168 | 169 | * `weekofyear`: Extracts the week number as an integer from a given date 170 | * `dayofyear`: Extracts the day of the year as an integer from a given date 171 | * `dayofmonth`: Extracts the day of the month as an integer from a given date 172 | * `datediff`: Returns number of months between dates 'date1' and 'date2' 173 | * `months_between`: Returns the number of days from 'start' to 'end' 174 | 175 | Here, we use `"p_dt"` and `"mtr_dt"` as inputs in the above operations. We again use `withColumn` do append the new columns to a DF: 176 | 177 | ```{r, collapse=TRUE} 178 | df_dt2 <- withColumn(df_dt, 'p_woy', weekofyear(df_dt$p_dt)) 179 | df_dt2 <- withColumn(df_dt2, 'p_doy', dayofyear(df_dt$p_dt)) 180 | df_dt2 <- withColumn(df_dt2, 'p_dom', dayofmonth(df_dt$p_dt)) 181 | df_dt2 <- withColumn(df_dt2, 'mbtw_p.mtr', months_between(df_dt$mtr_dt, df_dt$p_dt)) # 'date1'=p_dt, 'date2'=mtr_dt 182 | df_dt2 <- withColumn(df_dt2, 'dbtw_p.mtr', datediff(df_dt$mtr_dt, df_dt$p_dt)) # 'start'=p_dt, 'end'=mtr_dt 183 | str(df_dt2) 184 | ``` 185 | 186 | Note that operations that consider two different dates are sensitive to how we specify column ordering in the operation expression. For example, if we incorrectly define `"p_dt"` as `date2` and `"mtr_dt"` as `date1`, `"mbtw_p.mtr"` will consist of negative values. Similarly, `datediff` will return negative values if `start` and `end` are misspecified. 187 | 188 | *** 189 | 190 | 191 | ### Extract components of a date dtype column as integer values 192 | 193 | There are also datetime operations supported by SparkR that allow us to extract individual components of a date dtype column and return these as integers. Below, we use the `year` and `month` operations to create integer dtype columns for each of our date columns. 
Similar functions include `hour`, `minute` and `second`. 194 | 195 | ```{r, collapse=TRUE} 196 | # Year and month values for `"period_dt"` 197 | df <- withColumn(df, 'p_yr', year(df$p_dt)) 198 | df <- withColumn(df, "p_m", month(df$p_dt)) 199 | 200 | # Year value for `"matr_dt"` 201 | df <- withColumn(df, 'mtr_yr', year(df$mtr_dt)) 202 | df <- withColumn(df, "mtr_m", month(df$mtr_dt)) 203 | 204 | # Year value for `"zero_bal_dt"` 205 | df <- withColumn(df, 'zb_yr', year(df$zb_dt)) 206 | df <- withColumn(df, "zb_m", month(df$zb_dt)) 207 | ``` 208 | 209 | We can see that each of the above expressions returns a column of integer values representing the requested date value: 210 | 211 | ```{r, collapse=TRUE} 212 | str(df) 213 | ``` 214 | 215 | Note that the `NA` entries of `"zb_dt"` result in `NA` values for `"zb_yr"` and `"zb_m"`. 216 | 217 | *** 218 | 219 | 220 | ### Resample a time series DF to a particular unit of time frequency 221 | 222 | When working with time series data, we are frequently required to resample data to a different time frequency. Combing the `agg` and `groupBy` operations, as we saw in the [SparkR Basics II](https://github.com/UrbanInstitute/sparkr-tutorials/blob/master/02_sparkr-basics-2.md) tutorial, is a convenient strategy for accomplishing this in SparkR. We create a new DF, `dat`, that only includes columns of numerical, integer and date dtype to use in our resampling examples: 223 | 224 | ```{r, include=FALSE} 225 | rm(df_dt) 226 | rm(df_dt1) 227 | rm(df_dt2) 228 | ``` 229 | 230 | ```{r, collapse=TRUE} 231 | cols <- c("p_yr", "p_m", "mtr_yr", "mtr_m", "zb_yr", "zb_m", "new_int_rt", "act_endg_upb", "loan_age", "mths_remng", "aj_mths_remng") 232 | dat <- select(df, cols) 233 | 234 | unpersist(df) 235 | cache(dat) 236 | 237 | head(dat) 238 | ``` 239 | 240 | Note that, in our loan-level data, each row represents a unique loan (each made distinct by the `"loan_id"` column in `df`) and its corresponding characteristics such as `"loan_age"` and `"mths_remng"`. Note that `dat` is simply a subset `df` and, therefore, also refers to loan-level data. 241 | 242 | 243 | While we can resample the data over distinct values of any of the columns in `dat`, we will resample the loan-level data as aggregations of the DF columns by units of time since we are working with time series data. Below, we aggregate the columns of `dat` (taking the mean of the column entries) by `"p_yr"`, and then by `"p_yr"` and `"p_m"`: 244 | 245 | ```{r, collapse=TRUE} 246 | # Resample by "period_yr" 247 | dat1 <- agg(groupBy(dat, dat$p_yr), p_m = mean(dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 248 | new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 249 | mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng)) 250 | head(dat1) 251 | 252 | # Resample by "period_yr" and "period_m" 253 | dat2 <- agg(groupBy(dat, dat$p_yr, dat$p_m), mtr_yr = mean(dat$mtr_yr), zb_yr = mean(dat$zb_yr), 254 | new_int_rt = mean(dat$new_int_rt), act_endg_upb = mean(dat$act_endg_upb), loan_age = mean(dat$loan_age), 255 | mths_remng = mean(dat$mths_remng), aj_mths_remng = mean(dat$aj_mths_remng)) 256 | head(arrange(dat2, dat2$p_yr, dat2$p_m), 15) # Arrange the first 15 rows of `dat2` by ascending `period_yr` and `period_m` values 257 | ``` 258 | 259 | Note that we specify the list of DF columns that we want to resample on by including it in `groupBy`. Here, we aggregated by taking the mean of each column. 
We could resample to any unit of time that we can extract from a date column, e.g. `year`, `month`, `dayofmonth`, `hour`, `minute`, `second`. Furthermore, we could have skipped the step of creating separate year- and month-level date columns - instead, we could have embedded the datetime functions directly in the `agg` expression. The following expression creates a DF that is equivalent to `dat1` in the preceding example (aside from the auto-generated name of the grouping column):

```{r, collapse=TRUE}
df2 <- agg(groupBy(df, year(df$p_dt)), p_m = mean(month(df$p_dt)), mtr_yr = mean(year(df$mtr_dt)),
           zb_yr = mean(year(df$zb_dt)), new_int_rt = mean(df$new_int_rt), act_endg_upb = mean(df$act_endg_upb),
           loan_age = mean(df$loan_age), mths_remng = mean(df$mths_remng), aj_mths_remng = mean(df$aj_mths_remng))
```


__End of tutorial__ - Next up is [Insert next tutorial]
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-10-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-10-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-11-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-11-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-12-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-12-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-13-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-13-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-15-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-15-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-17-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-17-1.png
--------------------------------------------------------------------------------
/visualizations_files/figure-html/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-7-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-7-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /visualizations_files/figure-html/unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UrbanInstitute/sparkr-tutorials/a4dabf38c81d8635a70158fe97ecb7b1c7dd08d0/visualizations_files/figure-html/unnamed-chunk-9-1.png --------------------------------------------------------------------------------